diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-06 01:02:30 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-06 01:02:30 +0000 |
commit | 76cb841cb886eef6b3bee341a2266c76578724ad (patch) | |
tree | f5892e5ba6cc11949952a6ce4ecbe6d516d6ce58 /Documentation/networking | |
parent | Initial commit. (diff) | |
download | linux-76cb841cb886eef6b3bee341a2266c76578724ad.tar.xz linux-76cb841cb886eef6b3bee341a2266c76578724ad.zip |
Adding upstream version 4.19.249.upstream/4.19.249upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'Documentation/networking')
160 files changed, 41568 insertions, 0 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX new file mode 100644 index 000000000..02a323c43 --- /dev/null +++ b/Documentation/networking/00-INDEX @@ -0,0 +1,234 @@ +00-INDEX + - this file +3c509.txt + - information on the 3Com Etherlink III Series Ethernet cards. +6pack.txt + - info on the 6pack protocol, an alternative to KISS for AX.25 +LICENSE.qla3xxx + - GPLv2 for QLogic Linux Networking HBA Driver +LICENSE.qlge + - GPLv2 for QLogic Linux qlge NIC Driver +LICENSE.qlcnic + - GPLv2 for QLogic Linux qlcnic NIC Driver +PLIP.txt + - PLIP: The Parallel Line Internet Protocol device driver +README.ipw2100 + - README for the Intel PRO/Wireless 2100 driver. +README.ipw2200 + - README for the Intel PRO/Wireless 2915ABG and 2200BG driver. +README.sb1000 + - info on General Instrument/NextLevel SURFboard1000 cable modem. +altera_tse.txt + - Altera Triple-Speed Ethernet controller. +arcnet-hardware.txt + - tons of info on ARCnet, hubs, jumper settings for ARCnet cards, etc. +arcnet.txt + - info on the using the ARCnet driver itself. +atm.txt + - info on where to get ATM programs and support for Linux. +ax25.txt + - info on using AX.25 and NET/ROM code for Linux +baycom.txt + - info on the driver for Baycom style amateur radio modems +bonding.txt + - Linux Ethernet Bonding Driver HOWTO: link aggregation in Linux. +bridge.txt + - where to get user space programs for ethernet bridging with Linux. +cdc_mbim.txt + - 3G/LTE USB modem (Mobile Broadband Interface Model) +checksum-offloads.txt + - Explanation of checksum offloads; LCO, RCO +cops.txt + - info on the COPS LocalTalk Linux driver +cs89x0.txt + - the Crystal LAN (CS8900/20-based) Ethernet ISA adapter driver +cxacru.txt + - Conexant AccessRunner USB ADSL Modem +cxacru-cf.py + - Conexant AccessRunner USB ADSL Modem configuration file parser +cxgb.txt + - Release Notes for the Chelsio N210 Linux device driver. +dccp.txt + - the Datagram Congestion Control Protocol (DCCP) (RFC 4340..42). +dctcp.txt + - DataCenter TCP congestion control +de4x5.txt + - the Digital EtherWORKS DE4?? and DE5?? PCI Ethernet driver +decnet.txt + - info on using the DECnet networking layer in Linux. +dl2k.txt + - README for D-Link DL2000-based Gigabit Ethernet Adapters (dl2k.ko). +dm9000.txt + - README for the Simtec DM9000 Network driver. +dmfe.txt + - info on the Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver. +dns_resolver.txt + - The DNS resolver module allows kernel servies to make DNS queries. +driver.txt + - Softnet driver issues. +ena.txt + - info on Amazon's Elastic Network Adapter (ENA) +e100.txt + - info on Intel's EtherExpress PRO/100 line of 10/100 boards +e1000.txt + - info on Intel's E1000 line of gigabit ethernet boards +e1000e.txt + - README for the Intel Gigabit Ethernet Driver (e1000e). +eql.txt + - serial IP load balancing +fib_trie.txt + - Level Compressed Trie (LC-trie) notes: a structure for routing. +filter.txt + - Linux Socket Filtering +fore200e.txt + - FORE Systems PCA-200E/SBA-200E ATM NIC driver info. +framerelay.txt + - info on using Frame Relay/Data Link Connection Identifier (DLCI). +gen_stats.txt + - Generic networking statistics for netlink users. +generic-hdlc.txt + - The generic High Level Data Link Control (HDLC) layer. +generic_netlink.txt + - info on Generic Netlink +gianfar.txt + - Gianfar Ethernet Driver. +i40e.txt + - README for the Intel Ethernet Controller XL710 Driver (i40e). +i40evf.txt + - Short note on the Driver for the Intel(R) XL710 X710 Virtual Function +ieee802154.txt + - Linux IEEE 802.15.4 implementation, API and drivers +igb.txt + - README for the Intel Gigabit Ethernet Driver (igb). +igbvf.txt + - README for the Intel Gigabit Ethernet Driver (igbvf). +ip-sysctl.txt + - /proc/sys/net/ipv4/* variables +ip_dynaddr.txt + - IP dynamic address hack e.g. for auto-dialup links +ipddp.txt + - AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation +iphase.txt + - Interphase PCI ATM (i)Chip IA Linux driver info. +ipsec.txt + - Note on not compressing IPSec payload and resulting failed policy check. +ipv6.txt + - Options to the ipv6 kernel module. +ipvs-sysctl.txt + - Per-inode explanation of the /proc/sys/net/ipv4/vs interface. +irda.txt + - where to get IrDA (infrared) utilities and info for Linux. +ixgb.txt + - README for the Intel 10 Gigabit Ethernet Driver (ixgb). +ixgbe.txt + - README for the Intel 10 Gigabit Ethernet Driver (ixgbe). +ixgbevf.txt + - README for the Intel Virtual Function (VF) Driver (ixgbevf). +l2tp.txt + - User guide to the L2TP tunnel protocol. +lapb-module.txt + - programming information of the LAPB module. +ltpc.txt + - the Apple or Farallon LocalTalk PC card driver +mac80211-auth-assoc-deauth.txt + - authentication and association / deauth-disassoc with max80211 +mac80211-injection.txt + - HOWTO use packet injection with mac80211 +multiqueue.txt + - HOWTO for multiqueue network device support. +netconsole.txt + - The network console module netconsole.ko: configuration and notes. +netdev-features.txt + - Network interface features API description. +netdevices.txt + - info on network device driver functions exported to the kernel. +netif-msg.txt + - Design of the network interface message level setting (NETIF_MSG_*). +netlink_mmap.txt + - memory mapped I/O with netlink +nf_conntrack-sysctl.txt + - list of netfilter-sysctl knobs. +nfc.txt + - The Linux Near Field Communication (NFS) subsystem. +openvswitch.txt + - Open vSwitch developer documentation. +operstates.txt + - Overview of network interface operational states. +packet_mmap.txt + - User guide to memory mapped packet socket rings (PACKET_[RT]X_RING). +phonet.txt + - The Phonet packet protocol used in Nokia cellular modems. +phy.txt + - The PHY abstraction layer. +pktgen.txt + - User guide to the kernel packet generator (pktgen.ko). +policy-routing.txt + - IP policy-based routing +ppp_generic.txt + - Information about the generic PPP driver. +proc_net_tcp.txt + - Per inode overview of the /proc/net/tcp and /proc/net/tcp6 interfaces. +radiotap-headers.txt + - Background on radiotap headers. +ray_cs.txt + - Raylink Wireless LAN card driver info. +rds.txt + - Background on the reliable, ordered datagram delivery method RDS. +regulatory.txt + - Overview of the Linux wireless regulatory infrastructure. +rxrpc.txt + - Guide to the RxRPC protocol. +s2io.txt + - Release notes for Neterion Xframe I/II 10GbE driver. +scaling.txt + - Explanation of network scaling techniques: RSS, RPS, RFS, aRFS, XPS. +sctp.txt + - Notes on the Linux kernel implementation of the SCTP protocol. +secid.txt + - Explanation of the secid member in flow structures. +skfp.txt + - SysKonnect FDDI (SK-5xxx, Compaq Netelligent) driver info. +smc9.txt + - the driver for SMC's 9000 series of Ethernet cards +spider_net.txt + - README for the Spidernet Driver (as found in PS3 / Cell BE). +stmmac.txt + - README for the STMicro Synopsys Ethernet driver. +tc-actions-env-rules.txt + - rules for traffic control (tc) actions. +timestamping.txt + - overview of network packet timestamping variants. +tcp.txt + - short blurb on how TCP output takes place. +tcp-thin.txt + - kernel tuning options for low rate 'thin' TCP streams. +team.txt + - pointer to information for ethernet teaming devices. +tlan.txt + - ThunderLAN (Compaq Netelligent 10/100, Olicom OC-2xxx) driver info. +tproxy.txt + - Transparent proxy support user guide. +tuntap.txt + - TUN/TAP device driver, allowing user space Rx/Tx of packets. +udplite.txt + - UDP-Lite protocol (RFC 3828) introduction. +vortex.txt + - info on using 3Com Vortex (3c590, 3c592, 3c595, 3c597) Ethernet cards. +vxge.txt + - README for the Neterion X3100 PCIe Server Adapter. +vxlan.txt + - Virtual extensible LAN overview +x25.txt + - general info on X.25 development. +x25-iface.txt + - description of the X.25 Packet Layer to LAPB device interface. +xfrm_device.txt + - description of XFRM offload API +xfrm_proc.txt + - description of the statistics package for XFRM. +xfrm_sync.txt + - sync patches for XFRM enable migration of an SA between hosts. +xfrm_sysctl.txt + - description of the XFRM configuration options. +z8530drv.txt + - info about Linux driver for Z8530 based HDLC cards for AX.25 diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/3c509.txt new file mode 100644 index 000000000..fbf722e15 --- /dev/null +++ b/Documentation/networking/3c509.txt @@ -0,0 +1,213 @@ +Linux and the 3Com EtherLink III Series Ethercards (driver v1.18c and higher) +---------------------------------------------------------------------------- + +This file contains the instructions and caveats for v1.18c and higher versions +of the 3c509 driver. You should not use the driver without reading this file. + +release 1.0 +28 February 2002 +Current maintainer (corrections to): + David Ruggiero <jdr@farfalle.com> + +---------------------------------------------------------------------------- + +(0) Introduction + +The following are notes and information on using the 3Com EtherLink III series +ethercards in Linux. These cards are commonly known by the most widely-used +card's 3Com model number, 3c509. They are all 10mb/s ISA-bus cards and shouldn't +be (but sometimes are) confused with the similarly-numbered PCI-bus "3c905" +(aka "Vortex" or "Boomerang") series. Kernel support for the 3c509 family is +provided by the module 3c509.c, which has code to support all of the following +models: + + 3c509 (original ISA card) + 3c509B (later revision of the ISA card; supports full-duplex) + 3c589 (PCMCIA) + 3c589B (later revision of the 3c589; supports full-duplex) + 3c579 (EISA) + +Large portions of this documentation were heavily borrowed from the guide +written the original author of the 3c509 driver, Donald Becker. The master +copy of that document, which contains notes on older versions of the driver, +currently resides on Scyld web server: http://www.scyld.com/. + + +(1) Special Driver Features + +Overriding card settings + +The driver allows boot- or load-time overriding of the card's detected IOADDR, +IRQ, and transceiver settings, although this capability shouldn't generally be +needed except to enable full-duplex mode (see below). An example of the syntax +for LILO parameters for doing this: + + ether=10,0x310,3,0x3c509,eth0 + +This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and +transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts +with other card types when overriding the I/O address. When the driver is +loaded as a module, only the IRQ may be overridden. For example, +setting two cards to IRQ10 and IRQ11 is done by using the irq module +option: + + options 3c509 irq=10,11 + + +(2) Full-duplex mode + +The v1.18c driver added support for the 3c509B's full-duplex capabilities. +In order to enable and successfully use full-duplex mode, three conditions +must be met: + +(a) You must have a Etherlink III card model whose hardware supports full- +duplex operations. Currently, the only members of the 3c509 family that are +positively known to support full-duplex are the 3c509B (ISA bus) and 3c589B +(PCMCIA) cards. Cards without the "B" model designation do *not* support +full-duplex mode; these include the original 3c509 (no "B"), the original +3c589, the 3c529 (MCA bus), and the 3c579 (EISA bus). + +(b) You must be using your card's 10baseT transceiver (i.e., the RJ-45 +connector), not its AUI (thick-net) or 10base2 (thin-net/coax) interfaces. +AUI and 10base2 network cabling is physically incapable of full-duplex +operation. + +(c) Most importantly, your 3c509B must be connected to a link partner that is +itself full-duplex capable. This is almost certainly one of two things: a full- +duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on +another system that's connected directly to the 3c509B via a crossover cable. + +Full-duplex mode can be enabled using 'ethtool'. + +/////Extremely important caution concerning full-duplex mode///// +Understand that the 3c509B's hardware's full-duplex support is much more +limited than that provide by more modern network interface cards. Although +at the physical layer of the network it fully supports full-duplex operation, +the card was designed before the current Ethernet auto-negotiation (N-way) +spec was written. This means that the 3c509B family ***cannot and will not +auto-negotiate a full-duplex connection with its link partner under any +circumstances, no matter how it is initialized***. If the full-duplex mode +of the 3c509B is enabled, its link partner will very likely need to be +independently _forced_ into full-duplex mode as well; otherwise various nasty +failures will occur - at the very least, you'll see massive numbers of packet +collisions. This is one of very rare circumstances where disabling auto- +negotiation and forcing the duplex mode of a network interface card or switch +would ever be necessary or desirable. + + +(3) Available Transceiver Types + +For versions of the driver v1.18c and above, the available transceiver types are: + +0 transceiver type from EEPROM config (normally 10baseT); force half-duplex +1 AUI (thick-net / DB15 connector) +2 (undefined) +3 10base2 (thin-net == coax / BNC connector) +4 10baseT (RJ-45 connector); force half-duplex mode +8 transceiver type and duplex mode taken from card's EEPROM config settings +12 10baseT (RJ-45 connector); force full-duplex mode + +Prior to driver version 1.18c, only transceiver codes 0-4 were supported. Note +that the new transceiver codes 8 and 12 are the *only* ones that will enable +full-duplex mode, no matter what the card's detected EEPROM settings might be. +This insured that merely upgrading the driver from an earlier version would +never automatically enable full-duplex mode in an existing installation; +it must always be explicitly enabled via one of these code in order to be +activated. + +The transceiver type can be changed using 'ethtool'. + + +(4a) Interpretation of error messages and common problems + +Error Messages + +eth0: Infinite loop in interrupt, status 2011. +These are "mostly harmless" message indicating that the driver had too much +work during that interrupt cycle. With a status of 0x2011 you are receiving +packets faster than they can be removed from the card. This should be rare +or impossible in normal operation. Possible causes of this error report are: + + - a "green" mode enabled that slows the processor down when there is no + keyboard activity. + + - some other device or device driver hogging the bus or disabling interrupts. + Check /proc/interrupts for excessive interrupt counts. The timer tick + interrupt should always be incrementing faster than the others. + +No received packets +If a 3c509, 3c562 or 3c589 can successfully transmit packets, but never +receives packets (as reported by /proc/net/dev or 'ifconfig') you likely +have an interrupt line problem. Check /proc/interrupts to verify that the +card is actually generating interrupts. If the interrupt count is not +increasing you likely have a physical conflict with two devices trying to +use the same ISA IRQ line. The common conflict is with a sound card on IRQ10 +or IRQ5, and the easiest solution is to move the 3c509 to a different +interrupt line. If the device is receiving packets but 'ping' doesn't work, +you have a routing problem. + +Tx Carrier Errors Reported in /proc/net/dev +If an EtherLink III appears to transmit packets, but the "Tx carrier errors" +field in /proc/net/dev increments as quickly as the Tx packet count, you +likely have an unterminated network or the incorrect media transceiver selected. + +3c509B card is not detected on machines with an ISA PnP BIOS. +While the updated driver works with most PnP BIOS programs, it does not work +with all. This can be fixed by disabling PnP support using the 3Com-supplied +setup program. + +3c509 card is not detected on overclocked machines +Increase the delay time in id_read_eeprom() from the current value, 500, +to an absurdly high value, such as 5000. + + +(4b) Decoding Status and Error Messages + +The bits in the main status register are: + +value description +0x01 Interrupt latch +0x02 Tx overrun, or Rx underrun +0x04 Tx complete +0x08 Tx FIFO room available +0x10 A complete Rx packet has arrived +0x20 A Rx packet has started to arrive +0x40 The driver has requested an interrupt +0x80 Statistics counter nearly full + +The bits in the transmit (Tx) status word are: + +value description +0x02 Out-of-window collision. +0x04 Status stack overflow (normally impossible). +0x08 16 collisions. +0x10 Tx underrun (not enough PCI bus bandwidth). +0x20 Tx jabber. +0x40 Tx interrupt requested. +0x80 Status is valid (this should always be set). + + +When a transmit error occurs the driver produces a status message such as + + eth0: Transmit error, Tx status register 82 + +The two values typically seen here are: + +0x82 +Out of window collision. This typically occurs when some other Ethernet +host is incorrectly set to full duplex on a half duplex network. + +0x88 +16 collisions. This typically occurs when the network is exceptionally busy +or when another host doesn't correctly back off after a collision. If this +error is mixed with 0x82 errors it is the result of a host incorrectly set +to full duplex (see above). + +Both of these errors are the result of network problems that should be +corrected. They do not represent driver malfunction. + + +(5) Revision history (this file) + +28Feb02 v1.0 DR New; major portions based on Becker original 3c509 docs + diff --git a/Documentation/networking/6lowpan.txt b/Documentation/networking/6lowpan.txt new file mode 100644 index 000000000..2e5a939d7 --- /dev/null +++ b/Documentation/networking/6lowpan.txt @@ -0,0 +1,50 @@ + +Netdev private dataroom for 6lowpan interfaces: + +All 6lowpan able net devices, means all interfaces with ARPHRD_6LOWPAN, +must have "struct lowpan_priv" placed at beginning of netdev_priv. + +The priv_size of each interface should be calculate by: + + dev->priv_size = LOWPAN_PRIV_SIZE(LL_6LOWPAN_PRIV_DATA); + +Where LL_PRIV_6LOWPAN_DATA is sizeof linklayer 6lowpan private data struct. +To access the LL_PRIV_6LOWPAN_DATA structure you can cast: + + lowpan_priv(dev)-priv; + +to your LL_6LOWPAN_PRIV_DATA structure. + +Before registering the lowpan netdev interface you must run: + + lowpan_netdev_setup(dev, LOWPAN_LLTYPE_FOOBAR); + +wheres LOWPAN_LLTYPE_FOOBAR is a define for your 6LoWPAN linklayer type of +enum lowpan_lltypes. + +Example to evaluate the private usually you can do: + +static inline struct lowpan_priv_foobar * +lowpan_foobar_priv(struct net_device *dev) +{ + return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv; +} + +switch (dev->type) { +case ARPHRD_6LOWPAN: + lowpan_priv = lowpan_priv(dev); + /* do great stuff which is ARPHRD_6LOWPAN related */ + switch (lowpan_priv->lltype) { + case LOWPAN_LLTYPE_FOOBAR: + /* do 802.15.4 6LoWPAN handling here */ + lowpan_foobar_priv(dev)->bar = foo; + break; + ... + } + break; +... +} + +In case of generic 6lowpan branch ("net/6lowpan") you can remove the check +on ARPHRD_6LOWPAN, because you can be sure that these function are called +by ARPHRD_6LOWPAN interfaces. diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt new file mode 100644 index 000000000..8f339428f --- /dev/null +++ b/Documentation/networking/6pack.txt @@ -0,0 +1,175 @@ +This is the 6pack-mini-HOWTO, written by + +Andreas Könsgen DG3KQ +Internet: ajk@comnets.uni-bremen.de +AMPR-net: dg3kq@db0pra.ampr.org +AX.25: dg3kq@db0ach.#nrw.deu.eu + +Last update: April 7, 1998 + +1. What is 6pack, and what are the advantages to KISS? + +6pack is a transmission protocol for data exchange between the PC and +the TNC over a serial line. It can be used as an alternative to KISS. + +6pack has two major advantages: +- The PC is given full control over the radio + channel. Special control data is exchanged between the PC and the TNC so + that the PC knows at any time if the TNC is receiving data, if a TNC + buffer underrun or overrun has occurred, if the PTT is + set and so on. This control data is processed at a higher priority than + normal data, so a data stream can be interrupted at any time to issue an + important event. This helps to improve the channel access and timing + algorithms as everything is computed in the PC. It would even be possible + to experiment with something completely different from the known CSMA and + DAMA channel access methods. + This kind of real-time control is especially important to supply several + TNCs that are connected between each other and the PC by a daisy chain + (however, this feature is not supported yet by the Linux 6pack driver). + +- Each packet transferred over the serial line is supplied with a checksum, + so it is easy to detect errors due to problems on the serial line. + Received packets that are corrupt are not passed on to the AX.25 layer. + Damaged packets that the TNC has received from the PC are not transmitted. + +More details about 6pack are described in the file 6pack.ps that is located +in the doc directory of the AX.25 utilities package. + +2. Who has developed the 6pack protocol? + +The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech +DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and +Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet. +They have also written a firmware for TNCs to perform the 6pack +protocol (see section 4 below). + +3. Where can I get the latest version of 6pack for LinuX? + +At the moment, the 6pack stuff can obtained via anonymous ftp from +db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq, +there is a file named 6pack.tgz. + +4. Preparing the TNC for 6pack operation + +To be able to use 6pack, a special firmware for the TNC is needed. The EPROM +of a newly bought TNC does not contain 6pack, so you will have to +program an EPROM yourself. The image file for 6pack EPROMs should be +available on any packet radio box where PC/FlexNet can be found. The name of +the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet +team. It can be used under the terms of the license that comes along +with PC/FlexNet. Please do not ask me about the internals of this file as I +don't know anything about it. I used a textual description of the 6pack +protocol to program the Linux driver. + +TNCs contain a 64kByte EPROM, the lower half of which is used for +the firmware/KISS. The upper half is either empty or is sometimes +programmed with software called TAPR. In the latter case, the TNC +is supplied with a DIP switch so you can easily change between the +two systems. When programming a new EPROM, one of the systems is replaced +by 6pack. It is useful to replace TAPR, as this software is rarely used +nowadays. If your TNC is not equipped with the switch mentioned above, you +can build in one yourself that switches over the highest address pin +of the EPROM between HIGH and LOW level. After having inserted the new EPROM +and switched to 6pack, apply power to the TNC for a first test. The connect +and the status LED are lit for about a second if the firmware initialises +the TNC correctly. + +5. Building and installing the 6pack driver + +The driver has been tested with kernel version 2.1.90. Use with older +kernels may lead to a compilation error because the interface to a kernel +function has been changed in the 2.1.8x kernels. + +How to turn on 6pack support: + +- In the linux kernel configuration program, select the code maturity level + options menu and turn on the prompting for development drivers. + +- Select the amateur radio support menu and turn on the serial port 6pack + driver. + +- Compile and install the kernel and the modules. + +To use the driver, the kissattach program delivered with the AX.25 utilities +has to be modified. + +- Do a cd to the directory that holds the kissattach sources. Edit the + kissattach.c file. At the top, insert the following lines: + + #ifndef N_6PACK + #define N_6PACK (N_AX25+1) + #endif + + Then find the line + + int disc = N_AX25; + + and replace N_AX25 by N_6PACK. + +- Recompile kissattach. Rename it to spattach to avoid confusions. + +Installing the driver: + +- Do an insmod 6pack. Look at your /var/log/messages file to check if the + module has printed its initialization message. + +- Do a spattach as you would launch kissattach when starting a KISS port. + Check if the kernel prints the message '6pack: TNC found'. + +- From here, everything should work as if you were setting up a KISS port. + The only difference is that the network device that represents + the 6pack port is called sp instead of sl or ax. So, sp0 would be the + first 6pack port. + +Although the driver has been tested on various platforms, I still declare it +ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module +and spattaching. Watch out if your computer behaves strangely. Read section +6 of this file about known problems. + +Note that the connect and status LEDs of the TNC are controlled in a +different way than they are when the TNC is used with PC/FlexNet. When using +FlexNet, the connect LED is on if there is a connection; the status LED is +on if there is data in the buffer of the PC's AX.25 engine that has to be +transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer, +so the 6pack driver doesn't know anything about connects or data that +has not yet been transmitted. Therefore the LEDs are controlled +as they are in KISS mode: The connect LED is turned on if data is transferred +from the PC to the TNC over the serial line, the status LED if data is +sent to the PC. + +6. Known problems + +When testing the driver with 2.0.3x kernels and +operating with data rates on the radio channel of 9600 Baud or higher, +the driver may, on certain systems, sometimes print the message '6pack: +bad checksum', which is due to data loss if the other station sends two +or more subsequent packets. I have been told that this is due to a problem +with the serial driver of 2.0.3x kernels. I don't know yet if the problem +still exists with 2.1.x kernels, as I have heard that the serial driver +code has been changed with 2.1.x. + +When shutting down the sp interface with ifconfig, the kernel crashes if +there is still an AX.25 connection left over which an IP connection was +running, even if that IP connection is already closed. The problem does not +occur when there is a bare AX.25 connection still running. I don't know if +this is a problem of the 6pack driver or something else in the kernel. + +The driver has been tested as a module, not yet as a kernel-builtin driver. + +The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is +connected to one serial port of the PC. This feature is not implemented +and at least at the moment I won't be able to do it because I do not have +the opportunity to build a TNC daisy-chain and test it. + +Some of the comments in the source code are inaccurate. They are left from +the SLIP/KISS driver, from which the 6pack driver has been derived. +I haven't modified or removed them yet -- sorry! The code itself needs +some cleaning and optimizing. This will be done in a later release. + +If you encounter a bug or if you have a question or suggestion concerning the +driver, feel free to mail me, using the addresses given at the beginning of +this file. + +Have fun! + +Andreas diff --git a/Documentation/networking/LICENSE.qla3xxx b/Documentation/networking/LICENSE.qla3xxx new file mode 100644 index 000000000..2f2077e34 --- /dev/null +++ b/Documentation/networking/LICENSE.qla3xxx @@ -0,0 +1,46 @@ +Copyright (c) 2003-2006 QLogic Corporation +QLogic Linux Networking HBA Driver + +This program includes a device driver for Linux 2.6 that may be +distributed with QLogic hardware specific firmware binary file. +You may modify and redistribute the device driver code under the +GNU General Public License as published by the Free Software +Foundation (version 2 or a later version). + +You may redistribute the hardware specific firmware binary file +under the following terms: + + 1. Redistribution of source code (only if applicable), + must retain the above copyright notice, this list of + conditions and the following disclaimer. + + 2. Redistribution in binary form must reproduce the above + copyright notice, this list of conditions and the + following disclaimer in the documentation and/or other + materials provided with the distribution. + + 3. The name of QLogic Corporation may not be used to + endorse or promote products derived from this software + without specific prior written permission + +REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, +THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED +TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON +ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT +CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR +OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, +TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN +ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN +COMBINATION WITH THIS PROGRAM. + diff --git a/Documentation/networking/LICENSE.qlcnic b/Documentation/networking/LICENSE.qlcnic new file mode 100644 index 000000000..2ae3b6498 --- /dev/null +++ b/Documentation/networking/LICENSE.qlcnic @@ -0,0 +1,288 @@ +Copyright (c) 2009-2013 QLogic Corporation +QLogic Linux qlcnic NIC Driver + +You may modify and redistribute the device driver code under the +GNU General Public License (a copy of which is attached hereto as +Exhibit A) published by the Free Software Foundation (version 2). + + +EXHIBIT A + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. diff --git a/Documentation/networking/LICENSE.qlge b/Documentation/networking/LICENSE.qlge new file mode 100644 index 000000000..ce64e4d15 --- /dev/null +++ b/Documentation/networking/LICENSE.qlge @@ -0,0 +1,288 @@ +Copyright (c) 2003-2011 QLogic Corporation +QLogic Linux qlge NIC Driver + +You may modify and redistribute the device driver code under the +GNU General Public License (a copy of which is attached hereto as +Exhibit A) published by the Free Software Foundation (version 2). + + +EXHIBIT A + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. diff --git a/Documentation/networking/PLIP.txt b/Documentation/networking/PLIP.txt new file mode 100644 index 000000000..ad7e3f7c3 --- /dev/null +++ b/Documentation/networking/PLIP.txt @@ -0,0 +1,215 @@ +PLIP: The Parallel Line Internet Protocol Device + +Donald Becker (becker@super.org) +I.D.A. Supercomputing Research Center, Bowie MD 20715 + +At some point T. Thorn will probably contribute text, +Tommy Thorn (tthorn@daimi.aau.dk) + +PLIP Introduction +----------------- + +This document describes the parallel port packet pusher for Net/LGX. +This device interface allows a point-to-point connection between two +parallel ports to appear as a IP network interface. + +What is PLIP? +============= + +PLIP is Parallel Line IP, that is, the transportation of IP packages +over a parallel port. In the case of a PC, the obvious choice is the +printer port. PLIP is a non-standard, but [can use] uses the standard +LapLink null-printer cable [can also work in turbo mode, with a PLIP +cable]. [The protocol used to pack IP packages, is a simple one +initiated by Crynwr.] + +Advantages of PLIP +================== + +It's cheap, it's available everywhere, and it's easy. + +The PLIP cable is all that's needed to connect two Linux boxes, and it +can be built for very few bucks. + +Connecting two Linux boxes takes only a second's decision and a few +minutes' work, no need to search for a [supported] netcard. This might +even be especially important in the case of notebooks, where netcards +are not easily available. + +Not requiring a netcard also means that apart from connecting the +cables, everything else is software configuration [which in principle +could be made very easy.] + +Disadvantages of PLIP +===================== + +Doesn't work over a modem, like SLIP and PPP. Limited range, 15 m. +Can only be used to connect three (?) Linux boxes. Doesn't connect to +an existing Ethernet. Isn't standard (not even de facto standard, like +SLIP). + +Performance +=========== + +PLIP easily outperforms Ethernet cards....(ups, I was dreaming, but +it *is* getting late. EOB) + +PLIP driver details +------------------- + +The Linux PLIP driver is an implementation of the original Crynwr protocol, +that uses the parallel port subsystem of the kernel in order to properly +share parallel ports between PLIP and other services. + +IRQs and trigger timeouts +========================= + +When a parallel port used for a PLIP driver has an IRQ configured to it, the +PLIP driver is signaled whenever data is sent to it via the cable, such that +when no data is available, the driver isn't being used. + +However, on some machines it is hard, if not impossible, to configure an IRQ +to a certain parallel port, mainly because it is used by some other device. +On these machines, the PLIP driver can be used in IRQ-less mode, where +the PLIP driver would constantly poll the parallel port for data waiting, +and if such data is available, process it. This mode is less efficient than +the IRQ mode, because the driver has to check the parallel port many times +per second, even when no data at all is sent. Some rough measurements +indicate that there isn't a noticeable performance drop when using IRQ-less +mode as compared to IRQ mode as far as the data transfer speed is involved. +There is a performance drop on the machine hosting the driver. + +When the PLIP driver is used in IRQ mode, the timeout used for triggering a +data transfer (the maximal time the PLIP driver would allow the other side +before announcing a timeout, when trying to handshake a transfer of some +data) is, by default, 500usec. As IRQ delivery is more or less immediate, +this timeout is quite sufficient. + +When in IRQ-less mode, the PLIP driver polls the parallel port HZ times +per second (where HZ is typically 100 on most platforms, and 1024 on an +Alpha, as of this writing). Between two such polls, there are 10^6/HZ usecs. +On an i386, for example, 10^6/100 = 10000usec. It is easy to see that it is +quite possible for the trigger timeout to expire between two such polls, as +the timeout is only 500usec long. As a result, it is required to change the +trigger timeout on the *other* side of a PLIP connection, to about +10^6/HZ usecs. If both sides of a PLIP connection are used in IRQ-less mode, +this timeout is required on both sides. + +It appears that in practice, the trigger timeout can be shorter than in the +above calculation. It isn't an important issue, unless the wire is faulty, +in which case a long timeout would stall the machine when, for whatever +reason, bits are dropped. + +A utility that can perform this change in Linux is plipconfig, which is part +of the net-tools package (its location can be found in the +Documentation/Changes file). An example command would be +'plipconfig plipX trigger 10000', where plipX is the appropriate +PLIP device. + +PLIP hardware interconnection +----------------------------- + +PLIP uses several different data transfer methods. The first (and the +only one implemented in the early version of the code) uses a standard +printer "null" cable to transfer data four bits at a time using +data bit outputs connected to status bit inputs. + +The second data transfer method relies on both machines having +bi-directional parallel ports, rather than output-only ``printer'' +ports. This allows byte-wide transfers and avoids reconstructing +nibbles into bytes, leading to much faster transfers. + +Parallel Transfer Mode 0 Cable +============================== + +The cable for the first transfer mode is a standard +printer "null" cable which transfers data four bits at a time using +data bit outputs of the first port (machine T) connected to the +status bit inputs of the second port (machine R). There are five +status inputs, and they are used as four data inputs and a clock (data +strobe) input, arranged so that the data input bits appear as contiguous +bits with standard status register implementation. + +A cable that implements this protocol is available commercially as a +"Null Printer" or "Turbo Laplink" cable. It can be constructed with +two DB-25 male connectors symmetrically connected as follows: + + STROBE output 1* + D0->ERROR 2 - 15 15 - 2 + D1->SLCT 3 - 13 13 - 3 + D2->PAPOUT 4 - 12 12 - 4 + D3->ACK 5 - 10 10 - 5 + D4->BUSY 6 - 11 11 - 6 + D5,D6,D7 are 7*, 8*, 9* + AUTOFD output 14* + INIT output 16* + SLCTIN 17 - 17 + extra grounds are 18*,19*,20*,21*,22*,23*,24* + GROUND 25 - 25 +* Do not connect these pins on either end + +If the cable you are using has a metallic shield it should be +connected to the metallic DB-25 shell at one end only. + +Parallel Transfer Mode 1 +======================== + +The second data transfer method relies on both machines having +bi-directional parallel ports, rather than output-only ``printer'' +ports. This allows byte-wide transfers, and avoids reconstructing +nibbles into bytes. This cable should not be used on unidirectional +``printer'' (as opposed to ``parallel'') ports or when the machine +isn't configured for PLIP, as it will result in output driver +conflicts and the (unlikely) possibility of damage. + +The cable for this transfer mode should be constructed as follows: + + STROBE->BUSY 1 - 11 + D0->D0 2 - 2 + D1->D1 3 - 3 + D2->D2 4 - 4 + D3->D3 5 - 5 + D4->D4 6 - 6 + D5->D5 7 - 7 + D6->D6 8 - 8 + D7->D7 9 - 9 + INIT -> ACK 16 - 10 + AUTOFD->PAPOUT 14 - 12 + SLCT->SLCTIN 13 - 17 + GND->ERROR 18 - 15 + extra grounds are 19*,20*,21*,22*,23*,24* + GROUND 25 - 25 +* Do not connect these pins on either end + +Once again, if the cable you are using has a metallic shield it should +be connected to the metallic DB-25 shell at one end only. + +PLIP Mode 0 transfer protocol +============================= + +The PLIP driver is compatible with the "Crynwr" parallel port transfer +standard in Mode 0. That standard specifies the following protocol: + + send header nibble '0x8' + count-low octet + count-high octet + ... data octets + checksum octet + +Each octet is sent as + <wait for rx. '0x1?'> <send 0x10+(octet&0x0F)> + <wait for rx. '0x0?'> <send 0x00+((octet>>4)&0x0F)> + +To start a transfer the transmitting machine outputs a nibble 0x08. +That raises the ACK line, triggering an interrupt in the receiving +machine. The receiving machine disables interrupts and raises its own ACK +line. + +Restated: + +(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise) +Send_Byte: + OUT := low nibble, OUT.4 := 1 + WAIT FOR IN.4 = 1 + OUT := high nibble, OUT.4 := 0 + WAIT FOR IN.4 = 0 diff --git a/Documentation/networking/README.ipw2100 b/Documentation/networking/README.ipw2100 new file mode 100644 index 000000000..6f85e1d06 --- /dev/null +++ b/Documentation/networking/README.ipw2100 @@ -0,0 +1,293 @@ + +Intel(R) PRO/Wireless 2100 Driver for Linux in support of: + +Intel(R) PRO/Wireless 2100 Network Connection + +Copyright (C) 2003-2006, Intel Corporation + +README.ipw2100 + +Version: git-1.1.5 +Date : January 25, 2006 + +Index +----------------------------------------------- +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +1. Introduction +2. Release git-1.1.5 Current Features +3. Command Line Parameters +4. Sysfs Helper Files +5. Radio Kill Switch +6. Dynamic Firmware +7. Power Management +8. Support +9. License + + +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +----------------------------------------------- + +Important Notice FOR ALL USERS OR DISTRIBUTORS!!!! + +Intel wireless LAN adapters are engineered, manufactured, tested, and +quality checked to ensure that they meet all necessary local and +governmental regulatory agency requirements for the regions that they +are designated and/or marked to ship into. Since wireless LANs are +generally unlicensed devices that share spectrum with radars, +satellites, and other licensed and unlicensed devices, it is sometimes +necessary to dynamically detect, avoid, and limit usage to avoid +interference with these devices. In many instances Intel is required to +provide test data to prove regional and local compliance to regional and +governmental regulations before certification or approval to use the +product is granted. Intel's wireless LAN's EEPROM, firmware, and +software driver are designed to carefully control parameters that affect +radio operation and to ensure electromagnetic compliance (EMC). These +parameters include, without limitation, RF power, spectrum usage, +channel scanning, and human exposure. + +For these reasons Intel cannot permit any manipulation by third parties +of the software provided in binary format with the wireless WLAN +adapters (e.g., the EEPROM and firmware). Furthermore, if you use any +patches, utilities, or code with the Intel wireless LAN adapters that +have been manipulated by an unauthorized party (i.e., patches, +utilities, or code (including open source code modifications) which have +not been validated by Intel), (i) you will be solely responsible for +ensuring the regulatory compliance of the products, (ii) Intel will bear +no liability, under any theory of liability for any issues associated +with the modified products, including without limitation, claims under +the warranty and/or issues arising from regulatory non-compliance, and +(iii) Intel will not provide or be required to assist in providing +support to any third parties for such modified products. + +Note: Many regulatory agencies consider Wireless LAN adapters to be +modules, and accordingly, condition system-level regulatory approval +upon receipt and review of test data documenting that the antennas and +system configuration do not cause the EMC and radio operation to be +non-compliant. + +The drivers available for download from SourceForge are provided as a +part of a development project. Conformance to local regulatory +requirements is the responsibility of the individual developer. As +such, if you are interested in deploying or shipping a driver as part of +solution intended to be used for purposes other than development, please +obtain a tested driver from Intel Customer Support at: + +http://www.intel.com/support/wireless/sb/CS-006408.htm + +1. Introduction +----------------------------------------------- + +This document provides a brief overview of the features supported by the +IPW2100 driver project. The main project website, where the latest +development version of the driver can be found, is: + + http://ipw2100.sourceforge.net + +There you can find the not only the latest releases, but also information about +potential fixes and patches, as well as links to the development mailing list +for the driver project. + + +2. Release git-1.1.5 Current Supported Features +----------------------------------------------- +- Managed (BSS) and Ad-Hoc (IBSS) +- WEP (shared key and open) +- Wireless Tools support +- 802.1x (tested with XSupplicant 1.0.1) + +Enabled (but not supported) features: +- Monitor/RFMon mode +- WPA/WPA2 + +The distinction between officially supported and enabled is a reflection +on the amount of validation and interoperability testing that has been +performed on a given feature. + + +3. Command Line Parameters +----------------------------------------------- + +If the driver is built as a module, the following optional parameters are used +by entering them on the command line with the modprobe command using this +syntax: + + modprobe ipw2100 [<option>=<VAL1><,VAL2>...] + +For example, to disable the radio on driver loading, enter: + + modprobe ipw2100 disable=1 + +The ipw2100 driver supports the following module parameters: + +Name Value Example: +debug 0x0-0xffffffff debug=1024 +mode 0,1,2 mode=1 /* AdHoc */ +channel int channel=3 /* Only valid in AdHoc or Monitor */ +associate boolean associate=0 /* Do NOT auto associate */ +disable boolean disable=1 /* Do not power the HW */ + + +4. Sysfs Helper Files +--------------------------- +----------------------------------------------- + +There are several ways to control the behavior of the driver. Many of the +general capabilities are exposed through the Wireless Tools (iwconfig). There +are a few capabilities that are exposed through entries in the Linux Sysfs. + + +----- Driver Level ------ +For the driver level files, look in /sys/bus/pci/drivers/ipw2100/ + + debug_level + + This controls the same global as the 'debug' module parameter. For + information on the various debugging levels available, run the 'dvals' + script found in the driver source directory. + + NOTE: 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turn + on. + +----- Device Level ------ +For the device level files look in + + /sys/bus/pci/drivers/ipw2100/{PCI-ID}/ + +For example: + /sys/bus/pci/drivers/ipw2100/0000:02:01.0 + +For the device level files, see /sys/bus/pci/drivers/ipw2100: + + rf_kill + read - + 0 = RF kill not enabled (radio on) + 1 = SW based RF kill active (radio off) + 2 = HW based RF kill active (radio off) + 3 = Both HW and SW RF kill active (radio off) + write - + 0 = If SW based RF kill active, turn the radio back on + 1 = If radio is on, activate SW based RF kill + + NOTE: If you enable the SW based RF kill and then toggle the HW + based RF kill from ON -> OFF -> ON, the radio will NOT come back on + + +5. Radio Kill Switch +----------------------------------------------- +Most laptops provide the ability for the user to physically disable the radio. +Some vendors have implemented this as a physical switch that requires no +software to turn the radio off and on. On other laptops, however, the switch +is controlled through a button being pressed and a software driver then making +calls to turn the radio off and on. This is referred to as a "software based +RF kill switch" + +See the Sysfs helper file 'rf_kill' for determining the state of the RF switch +on your system. + + +6. Dynamic Firmware +----------------------------------------------- +As the firmware is licensed under a restricted use license, it can not be +included within the kernel sources. To enable the IPW2100 you will need a +firmware image to load into the wireless NIC's processors. + +You can obtain these images from <http://ipw2100.sf.net/firmware.php>. + +See INSTALL for instructions on installing the firmware. + + +7. Power Management +----------------------------------------------- +The IPW2100 supports the configuration of the Power Save Protocol +through a private wireless extension interface. The IPW2100 supports +the following different modes: + + off No power management. Radio is always on. + on Automatic power management + 1-5 Different levels of power management. The higher the + number the greater the power savings, but with an impact to + packet latencies. + +Power management works by powering down the radio after a certain +interval of time has passed where no packets are passed through the +radio. Once powered down, the radio remains in that state for a given +period of time. For higher power savings, the interval between last +packet processed to sleep is shorter and the sleep period is longer. + +When the radio is asleep, the access point sending data to the station +must buffer packets at the AP until the station wakes up and requests +any buffered packets. If you have an AP that does not correctly support +the PSP protocol you may experience packet loss or very poor performance +while power management is enabled. If this is the case, you will need +to try and find a firmware update for your AP, or disable power +management (via `iwconfig eth1 power off`) + +To configure the power level on the IPW2100 you use a combination of +iwconfig and iwpriv. iwconfig is used to turn power management on, off, +and set it to auto. + + iwconfig eth1 power off Disables radio power down + iwconfig eth1 power on Enables radio power management to + last set level (defaults to AUTO) + iwpriv eth1 set_power 0 Sets power level to AUTO and enables + power management if not previously + enabled. + iwpriv eth1 set_power 1-5 Set the power level as specified, + enabling power management if not + previously enabled. + +You can view the current power level setting via: + + iwpriv eth1 get_power + +It will return the current period or timeout that is configured as a string +in the form of xxxx/yyyy (z) where xxxx is the timeout interval (amount of +time after packet processing), yyyy is the period to sleep (amount of time to +wait before powering the radio and querying the access point for buffered +packets), and z is the 'power level'. If power management is turned off the +xxxx/yyyy will be replaced with 'off' -- the level reported will be the active +level if `iwconfig eth1 power on` is invoked. + + +8. Support +----------------------------------------------- + +For general development information and support, +go to: + + http://ipw2100.sf.net/ + +The ipw2100 1.1.0 driver and firmware can be downloaded from: + + http://support.intel.com + +For installation support on the ipw2100 1.1.0 driver on Linux kernels +2.6.8 or greater, email support is available from: + + http://supportmail.intel.com + +9. License +----------------------------------------------- + + Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License (version 2) as + published by the Free Software Foundation. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., 59 + Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + The full GNU General Public License is included in this distribution in the + file called LICENSE. + + License Contact Information: + James P. Ketrenos <ipw2100-admin@linux.intel.com> + Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497 + diff --git a/Documentation/networking/README.ipw2200 b/Documentation/networking/README.ipw2200 new file mode 100644 index 000000000..b7658bed4 --- /dev/null +++ b/Documentation/networking/README.ipw2200 @@ -0,0 +1,472 @@ + +Intel(R) PRO/Wireless 2915ABG Driver for Linux in support of: + +Intel(R) PRO/Wireless 2200BG Network Connection +Intel(R) PRO/Wireless 2915ABG Network Connection + +Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R) +PRO/Wireless 2200BG Driver for Linux is a unified driver that works on +both hardware adapters listed above. In this document the Intel(R) +PRO/Wireless 2915ABG Driver for Linux will be used to reference the +unified driver. + +Copyright (C) 2004-2006, Intel Corporation + +README.ipw2200 + +Version: 1.1.2 +Date : March 30, 2006 + + +Index +----------------------------------------------- +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +1. Introduction +1.1. Overview of features +1.2. Module parameters +1.3. Wireless Extension Private Methods +1.4. Sysfs Helper Files +1.5. Supported channels +2. Ad-Hoc Networking +3. Interacting with Wireless Tools +3.1. iwconfig mode +3.2. iwconfig sens +4. About the Version Numbers +5. Firmware installation +6. Support +7. License + + +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +----------------------------------------------- + +Important Notice FOR ALL USERS OR DISTRIBUTORS!!!! + +Intel wireless LAN adapters are engineered, manufactured, tested, and +quality checked to ensure that they meet all necessary local and +governmental regulatory agency requirements for the regions that they +are designated and/or marked to ship into. Since wireless LANs are +generally unlicensed devices that share spectrum with radars, +satellites, and other licensed and unlicensed devices, it is sometimes +necessary to dynamically detect, avoid, and limit usage to avoid +interference with these devices. In many instances Intel is required to +provide test data to prove regional and local compliance to regional and +governmental regulations before certification or approval to use the +product is granted. Intel's wireless LAN's EEPROM, firmware, and +software driver are designed to carefully control parameters that affect +radio operation and to ensure electromagnetic compliance (EMC). These +parameters include, without limitation, RF power, spectrum usage, +channel scanning, and human exposure. + +For these reasons Intel cannot permit any manipulation by third parties +of the software provided in binary format with the wireless WLAN +adapters (e.g., the EEPROM and firmware). Furthermore, if you use any +patches, utilities, or code with the Intel wireless LAN adapters that +have been manipulated by an unauthorized party (i.e., patches, +utilities, or code (including open source code modifications) which have +not been validated by Intel), (i) you will be solely responsible for +ensuring the regulatory compliance of the products, (ii) Intel will bear +no liability, under any theory of liability for any issues associated +with the modified products, including without limitation, claims under +the warranty and/or issues arising from regulatory non-compliance, and +(iii) Intel will not provide or be required to assist in providing +support to any third parties for such modified products. + +Note: Many regulatory agencies consider Wireless LAN adapters to be +modules, and accordingly, condition system-level regulatory approval +upon receipt and review of test data documenting that the antennas and +system configuration do not cause the EMC and radio operation to be +non-compliant. + +The drivers available for download from SourceForge are provided as a +part of a development project. Conformance to local regulatory +requirements is the responsibility of the individual developer. As +such, if you are interested in deploying or shipping a driver as part of +solution intended to be used for purposes other than development, please +obtain a tested driver from Intel Customer Support at: + +http://support.intel.com + + +1. Introduction +----------------------------------------------- +The following sections attempt to provide a brief introduction to using +the Intel(R) PRO/Wireless 2915ABG Driver for Linux. + +This document is not meant to be a comprehensive manual on +understanding or using wireless technologies, but should be sufficient +to get you moving without wires on Linux. + +For information on building and installing the driver, see the INSTALL +file. + + +1.1. Overview of Features +----------------------------------------------- +The current release (1.1.2) supports the following features: + ++ BSS mode (Infrastructure, Managed) ++ IBSS mode (Ad-Hoc) ++ WEP (OPEN and SHARED KEY mode) ++ 802.1x EAP via wpa_supplicant and xsupplicant ++ Wireless Extension support ++ Full B and G rate support (2200 and 2915) ++ Full A rate support (2915 only) ++ Transmit power control ++ S state support (ACPI suspend/resume) + +The following features are currently enabled, but not officially +supported: + ++ WPA ++ long/short preamble support ++ Monitor mode (aka RFMon) + +The distinction between officially supported and enabled is a reflection +on the amount of validation and interoperability testing that has been +performed on a given feature. + + + +1.2. Command Line Parameters +----------------------------------------------- + +Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless +2915ABG Driver for Linux allows configuration options to be provided +as module parameters. The most common way to specify a module parameter +is via the command line. + +The general form is: + +% modprobe ipw2200 parameter=value + +Where the supported parameter are: + + associate + Set to 0 to disable the auto scan-and-associate functionality of the + driver. If disabled, the driver will not attempt to scan + for and associate to a network until it has been configured with + one or more properties for the target network, for example configuring + the network SSID. Default is 0 (do not auto-associate) + + Example: % modprobe ipw2200 associate=0 + + auto_create + Set to 0 to disable the auto creation of an Ad-Hoc network + matching the channel and network name parameters provided. + Default is 1. + + channel + channel number for association. The normal method for setting + the channel would be to use the standard wireless tools + (i.e. `iwconfig eth1 channel 10`), but it is useful sometimes + to set this while debugging. Channel 0 means 'ANY' + + debug + If using a debug build, this is used to control the amount of debug + info is logged. See the 'dvals' and 'load' script for more info on + how to use this (the dvals and load scripts are provided as part + of the ipw2200 development snapshot releases available from the + SourceForge project at http://ipw2200.sf.net) + + led + Can be used to turn on experimental LED code. + 0 = Off, 1 = On. Default is 1. + + mode + Can be used to set the default mode of the adapter. + 0 = Managed, 1 = Ad-Hoc, 2 = Monitor + + +1.3. Wireless Extension Private Methods +----------------------------------------------- + +As an interface designed to handle generic hardware, there are certain +capabilities not exposed through the normal Wireless Tool interface. As +such, a provision is provided for a driver to declare custom, or +private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux +defines several of these to configure various settings. + +The general form of using the private wireless methods is: + + % iwpriv $IFNAME method parameters + +Where $IFNAME is the interface name the device is registered with +(typically eth1, customized via one of the various network interface +name managers, such as ifrename) + +The supported private methods are: + + get_mode + Can be used to report out which IEEE mode the driver is + configured to support. Example: + + % iwpriv eth1 get_mode + eth1 get_mode:802.11bg (6) + + set_mode + Can be used to configure which IEEE mode the driver will + support. + + Usage: + % iwpriv eth1 set_mode {mode} + Where {mode} is a number in the range 1-7: + 1 802.11a (2915 only) + 2 802.11b + 3 802.11ab (2915 only) + 4 802.11g + 5 802.11ag (2915 only) + 6 802.11bg + 7 802.11abg (2915 only) + + get_preamble + Can be used to report configuration of preamble length. + + set_preamble + Can be used to set the configuration of preamble length: + + Usage: + % iwpriv eth1 set_preamble {mode} + Where {mode} is one of: + 1 Long preamble only + 0 Auto (long or short based on connection) + + +1.4. Sysfs Helper Files: +----------------------------------------------- + +The Linux kernel provides a pseudo file system that can be used to +access various components of the operating system. The Intel(R) +PRO/Wireless 2915ABG Driver for Linux exposes several configuration +parameters through this mechanism. + +An entry in the sysfs can support reading and/or writing. You can +typically query the contents of a sysfs entry through the use of cat, +and can set the contents via echo. For example: + +% cat /sys/bus/pci/drivers/ipw2200/debug_level + +Will report the current debug level of the driver's logging subsystem +(only available if CONFIG_IPW2200_DEBUG was configured when the driver +was built). + +You can set the debug level via: + +% echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level + +Where $VALUE would be a number in the case of this sysfs entry. The +input to sysfs files does not have to be a number. For example, the +firmware loader used by hotplug utilizes sysfs entries for transferring +the firmware image from user space into the driver. + +The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries +at two levels -- driver level, which apply to all instances of the driver +(in the event that there are more than one device installed) and device +level, which applies only to the single specific instance. + + +1.4.1 Driver Level Sysfs Helper Files +----------------------------------------------- + +For the driver level files, look in /sys/bus/pci/drivers/ipw2200/ + + debug_level + + This controls the same global as the 'debug' module parameter + + + +1.4.2 Device Level Sysfs Helper Files +----------------------------------------------- + +For the device level files, look in + + /sys/bus/pci/drivers/ipw2200/{PCI-ID}/ + +For example: + /sys/bus/pci/drivers/ipw2200/0000:02:01.0 + +For the device level files, see /sys/bus/pci/drivers/ipw2200: + + rf_kill + read - + 0 = RF kill not enabled (radio on) + 1 = SW based RF kill active (radio off) + 2 = HW based RF kill active (radio off) + 3 = Both HW and SW RF kill active (radio off) + write - + 0 = If SW based RF kill active, turn the radio back on + 1 = If radio is on, activate SW based RF kill + + NOTE: If you enable the SW based RF kill and then toggle the HW + based RF kill from ON -> OFF -> ON, the radio will NOT come back on + + ucode + read-only access to the ucode version number + + led + read - + 0 = LED code disabled + 1 = LED code enabled + write - + 0 = Disable LED code + 1 = Enable LED code + + NOTE: The LED code has been reported to hang some systems when + running ifconfig and is therefore disabled by default. + + +1.5. Supported channels +----------------------------------------------- + +Upon loading the Intel(R) PRO/Wireless 2915ABG Driver for Linux, a +message stating the detected geography code and the number of 802.11 +channels supported by the card will be displayed in the log. + +The geography code corresponds to a regulatory domain as shown in the +table below. + + Supported channels +Code Geography 802.11bg 802.11a + +--- Restricted 11 0 +ZZF Custom US/Canada 11 8 +ZZD Rest of World 13 0 +ZZA Custom USA & Europe & High 11 13 +ZZB Custom NA & Europe 11 13 +ZZC Custom Japan 11 4 +ZZM Custom 11 0 +ZZE Europe 13 19 +ZZJ Custom Japan 14 4 +ZZR Rest of World 14 0 +ZZH High Band 13 4 +ZZG Custom Europe 13 4 +ZZK Europe 13 24 +ZZL Europe 11 13 + + +2. Ad-Hoc Networking +----------------------------------------------- + +When using a device in an Ad-Hoc network, it is useful to understand the +sequence and requirements for the driver to be able to create, join, or +merge networks. + +The following attempts to provide enough information so that you can +have a consistent experience while using the driver as a member of an +Ad-Hoc network. + +2.1. Joining an Ad-Hoc Network +----------------------------------------------- + +The easiest way to get onto an Ad-Hoc network is to join one that +already exists. + +2.2. Creating an Ad-Hoc Network +----------------------------------------------- + +An Ad-Hoc networks is created using the syntax of the Wireless tool. + +For Example: +iwconfig eth1 mode ad-hoc essid testing channel 2 + +2.3. Merging Ad-Hoc Networks +----------------------------------------------- + + +3. Interaction with Wireless Tools +----------------------------------------------- + +3.1 iwconfig mode +----------------------------------------------- + +When configuring the mode of the adapter, all run-time configured parameters +are reset to the value used when the module was loaded. This includes +channels, rates, ESSID, etc. + +3.2 iwconfig sens +----------------------------------------------- + +The 'iwconfig ethX sens XX' command will not set the signal sensitivity +threshold, as described in iwconfig documentation, but rather the number +of consecutive missed beacons that will trigger handover, i.e. roaming +to another access point. At the same time, it will set the disassociation +threshold to 3 times the given value. + + +4. About the Version Numbers +----------------------------------------------- + +Due to the nature of open source development projects, there are +frequently changes being incorporated that have not gone through +a complete validation process. These changes are incorporated into +development snapshot releases. + +Releases are numbered with a three level scheme: + + major.minor.development + +Any version where the 'development' portion is 0 (for example +1.0.0, 1.1.0, etc.) indicates a stable version that will be made +available for kernel inclusion. + +Any version where the 'development' portion is not a 0 (for +example 1.0.1, 1.1.5, etc.) indicates a development version that is +being made available for testing and cutting edge users. The stability +and functionality of the development releases are not know. We make +efforts to try and keep all snapshots reasonably stable, but due to the +frequency of their release, and the desire to get those releases +available as quickly as possible, unknown anomalies should be expected. + +The major version number will be incremented when significant changes +are made to the driver. Currently, there are no major changes planned. + +5. Firmware installation +---------------------------------------------- + +The driver requires a firmware image, download it and extract the +files under /lib/firmware (or wherever your hotplug's firmware.agent +will look for firmware files) + +The firmware can be downloaded from the following URL: + + http://ipw2200.sf.net/ + + +6. Support +----------------------------------------------- + +For direct support of the 1.0.0 version, you can contact +http://supportmail.intel.com, or you can use the open source project +support. + +For general information and support, go to: + + http://ipw2200.sf.net/ + + +7. License +----------------------------------------------- + + Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License version 2 as + published by the Free Software Foundation. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., 59 + Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + The full GNU General Public License is included in this distribution in the + file called LICENSE. + + Contact Information: + James P. Ketrenos <ipw2100-admin@linux.intel.com> + Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497 + diff --git a/Documentation/networking/README.sb1000 b/Documentation/networking/README.sb1000 new file mode 100644 index 000000000..f92c2aac5 --- /dev/null +++ b/Documentation/networking/README.sb1000 @@ -0,0 +1,207 @@ +sb1000 is a module network device driver for the General Instrument (also known +as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card +which is used by a number of cable TV companies to provide cable modem access. +It's a one-way downstream-only cable modem, meaning that your upstream net link +is provided by your regular phone modem. + +This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves +a great deal of thanks for this wonderful piece of code! + +----------------------------------------------------------------------------- + +Support for this device is now a part of the standard Linux kernel. The +driver source code file is drivers/net/sb1000.c. In addition to this +you will need: + +1.) The "cmconfig" program. This is a utility which supplements "ifconfig" +to configure the cable modem and network interface (usually called "cm0"); +and + +2.) Several PPP scripts which live in /etc/ppp to make connecting via your +cable modem easy. + + These utilities can be obtained from: + + http://www.jacksonville.net/~fventuri/ + + in Franco's original source code distribution .tar.gz file. Support for + the sb1000 driver can be found at: + + http://web.archive.org/web/*/http://home.adelphia.net/~siglercm/sb1000.html + http://web.archive.org/web/*/http://linuxpower.cx/~cable/ + + along with these utilities. + +3.) The standard isapnp tools. These are necessary to configure your SB1000 +card at boot time (or afterwards by hand) since it's a PnP card. + + If you don't have these installed as a standard part of your Linux + distribution, you can find them at: + + http://www.roestock.demon.co.uk/isapnptools/ + + or check your Linux distribution binary CD or their web site. For help with + isapnp, pnpdump, or /etc/isapnp.conf, go to: + + http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html + +----------------------------------------------------------------------------- + +To make the SB1000 card work, follow these steps: + +1.) Run `make config', or `make menuconfig', or `make xconfig', whichever +you prefer, in the top kernel tree directory to set up your kernel +configuration. Make sure to say "Y" to "Prompt for development drivers" +and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard +networking questions to get TCP/IP and PPP networking support. + +2.) *BEFORE* you build the kernel, edit drivers/net/sb1000.c. Make sure +to redefine the value of READ_DATA_PORT to match the I/O address used +by isapnp to access your PnP cards. This is the value of READPORT in +/etc/isapnp.conf or given by the output of pnpdump. + +3.) Build and install the kernel and modules as usual. + +4.) Boot your new kernel following the usual procedures. + +5.) Set up to configure the new SB1000 PnP card by capturing the output +of "pnpdump" to a file and editing this file to set the correct I/O ports, +IRQ, and DMA settings for all your PnP cards. Make sure none of the settings +conflict with one another. Then test this configuration by running the +"isapnp" command with your new config file as the input. Check for +errors and fix as necessary. (As an aside, I use I/O ports 0x110 and +0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.) +Then save the finished config file as /etc/isapnp.conf for proper configuration +on subsequent reboots. + +6.) Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of +the others referenced above. As root, unpack it into a temporary directory and +do a `make cmconfig' and then `install -c cmconfig /usr/local/sbin'. Don't do +`make install' because it expects to find all the utilities built and ready for +installation, not just cmconfig. + +7.) As root, copy all the files under the ppp/ subdirectory in Franco's +tar file into /etc/ppp, being careful not to overwrite any files that are +already in there. Then modify ppp@gi-on to set the correct login name, +phone number, and frequency for the cable modem. Also edit pap-secrets +to specify your login name and password and any site-specific information +you need. + +8.) Be sure to modify /etc/ppp/firewall to use ipchains instead of +the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to +convert ipfwadm commands to ipchains commands: + + http://users.dhp.com/~whisper/ipfwadm2ipchains/ + +You may also wish to modify the firewall script to implement a different +firewalling scheme. + +9.) Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be +root to do this. It's better to use a utility like sudo to execute +frequently used commands like this with root permissions if possible. If you +connect successfully the cable modem interface will come up and you'll see a +driver message like this at the console: + + cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11. + sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net) + +The "ifconfig" command should show two new interfaces, ppp0 and cm0. +The command "cmconfig cm0" will give you information about the cable modem +interface. + +10.) Try pinging a site via `ping -c 5 www.yahoo.com', for example. You should +see packets received. + +11.) If you can't get site names (like www.yahoo.com) to resolve into +IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file +has no syntax errors and has the right nameserver IP addresses in it. +If this doesn't help, try something like `ping -c 5 204.71.200.67' to +see if the networking is running but the DNS resolution is where the +problem lies. + +12.) If you still have problems, go to the support web sites mentioned above +and read the information and documentation there. + +----------------------------------------------------------------------------- + +Common problems: + +1.) Packets go out on the ppp0 interface but don't come back on the cm0 +interface. It looks like I'm connected but I can't even ping any +numerical IP addresses. (This happens predominantly on Debian systems due +to a default boot-time configuration script.) + +Solution -- As root `echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter' so it +can share the same IP address as the ppp0 interface. Note that this +command should probably be added to the /etc/ppp/cablemodem script +*right*between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands. +You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well. +If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot +(in rc.local or some such) then any interfaces can share the same IP +addresses. + +2.) I get "unresolved symbol" error messages on executing `insmod sb1000.o'. + +Solution -- You probably have a non-matching kernel source tree and +/usr/include/linux and /usr/include/asm header files. Make sure you +install the correct versions of the header files in these two directories. +Then rebuild and reinstall the kernel. + +3.) When isapnp runs it reports an error, and my SB1000 card isn't working. + +Solution -- There's a problem with later versions of isapnp using the "(CHECK)" +option in the lines that allocate the two I/O addresses for the SB1000 card. +This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses. +Make sure they don't conflict with any other pieces of hardware first! Then +rerun isapnp and go from there. + +4.) I can't execute the /etc/ppp/ppp@gi-on file. + +Solution -- As root do `chmod ug+x /etc/ppp/ppp@gi-on'. + +5.) The firewall script isn't working (with 2.2.x and higher kernels). + +Solution -- Use the ipfwadm2ipchains script referenced above to convert the +/etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains. + +6.) I'm getting *tons* of firewall deny messages in the /var/kern.log, +/var/messages, and/or /var/syslog files, and they're filling up my /var +partition!!! + +Solution -- First, tell your ISP that you're receiving DoS (Denial of Service) +and/or portscanning (UDP connection attempts) attacks! Look over the deny +messages to figure out what the attack is and where it's coming from. Next, +edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on +to the "cmconfig" command (uncomment that line). If you're not receiving these +denied packets on your broadcast interface (IP address xxx.yyy.zzz.255 +typically), then someone is attacking your machine in particular. Be careful +out there.... + +7.) Everything seems to work fine but my computer locks up after a while +(and typically during a lengthy download through the cable modem)! + +Solution -- You may need to add a short delay in the driver to 'slow down' the +SURFboard because your PC might not be able to keep up with the transfer rate +of the SB1000. To do this, it's probably best to download Franco's +sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll +want to edit the 'Makefile' and look for the 'SB1000_DELAY' +define. Uncomment those 'CFLAGS' lines (and comment out the default ones) +and try setting the delay to something like 60 microseconds with: +'-DSB1000_DELAY=60'. Then do `make' and as root `make install' and try +it out. If it still doesn't work or you like playing with the driver, you may +try other numbers. Remember though that the higher the delay, the slower the +driver (which slows down the rest of the PC too when it is actively +used). Thanks to Ed Daiga for this tip! + +----------------------------------------------------------------------------- + +Credits: This README came from Franco Venturi's original README file which is +still supplied with his driver .tar.gz archive. I and all other sb1000 users +owe Franco a tremendous "Thank you!" Additional thanks goes to Carl Patten +and Ralph Bonnell who are now managing the Linux SB1000 web site, and to +the SB1000 users who reported and helped debug the common problems listed +above. + + + Clemmitt Sigler + csigler@vt.edu diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst new file mode 100644 index 000000000..ff929cfab --- /dev/null +++ b/Documentation/networking/af_xdp.rst @@ -0,0 +1,312 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +AF_XDP +====== + +Overview +======== + +AF_XDP is an address family that is optimized for high performance +packet processing. + +This document assumes that the reader is familiar with BPF and XDP. If +not, the Cilium project has an excellent reference guide at +http://cilium.readthedocs.io/en/latest/bpf/. + +Using the XDP_REDIRECT action from an XDP program, the program can +redirect ingress frames to other XDP enabled netdevs, using the +bpf_redirect_map() function. AF_XDP sockets enable the possibility for +XDP programs to redirect frames to a memory buffer in a user-space +application. + +An AF_XDP socket (XSK) is created with the normal socket() +syscall. Associated with each XSK are two rings: the RX ring and the +TX ring. A socket can receive packets on the RX ring and it can send +packets on the TX ring. These rings are registered and sized with the +setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory +to have at least one of these rings for each socket. An RX or TX +descriptor ring points to a data buffer in a memory area called a +UMEM. RX and TX can share the same UMEM so that a packet does not have +to be copied between RX and TX. Moreover, if a packet needs to be kept +for a while due to a possible retransmit, the descriptor that points +to that packet can be changed to point to another and reused right +away. This again avoids copying data. + +The UMEM consists of a number of equally sized chunks. A descriptor in +one of the rings references a frame by referencing its addr. The addr +is simply an offset within the entire UMEM region. The user space +allocates memory for this UMEM using whatever means it feels is most +appropriate (malloc, mmap, huge pages, etc). This memory area is then +registered with the kernel using the new setsockopt XDP_UMEM_REG. The +UMEM also has two rings: the FILL ring and the COMPLETION ring. The +fill ring is used by the application to send down addr for the kernel +to fill in with RX packet data. References to these frames will then +appear in the RX ring once each packet has been received. The +completion ring, on the other hand, contains frame addr that the +kernel has transmitted completely and can now be used again by user +space, for either TX or RX. Thus, the frame addrs appearing in the +completion ring are addrs that were previously transmitted using the +TX ring. In summary, the RX and FILL rings are used for the RX path +and the TX and COMPLETION rings are used for the TX path. + +The socket is then finally bound with a bind() call to a device and a +specific queue id on that device, and it is not until bind is +completed that traffic starts to flow. + +The UMEM can be shared between processes, if desired. If a process +wants to do this, it simply skips the registration of the UMEM and its +corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind +call and submits the XSK of the process it would like to share UMEM +with as well as its own newly created XSK socket. The new process will +then receive frame addr references in its own RX ring that point to +this shared UMEM. Note that since the ring structures are +single-consumer / single-producer (for performance reasons), the new +process has to create its own socket with associated RX and TX rings, +since it cannot share this with the other process. This is also the +reason that there is only one set of FILL and COMPLETION rings per +UMEM. It is the responsibility of a single process to handle the UMEM. + +How is then packets distributed from an XDP program to the XSKs? There +is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The +user-space application can place an XSK at an arbitrary place in this +map. The XDP program can then redirect a packet to a specific index in +this map and at this point XDP validates that the XSK in that map was +indeed bound to that device and ring number. If not, the packet is +dropped. If the map is empty at that index, the packet is also +dropped. This also means that it is currently mandatory to have an XDP +program loaded (and one XSK in the XSKMAP) to be able to get any +traffic to user space through the XSK. + +AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the +driver does not have support for XDP, or XDP_SKB is explicitly chosen +when loading the XDP program, XDP_SKB mode is employed that uses SKBs +together with the generic XDP support and copies out the data to user +space. A fallback mode that works for any network device. On the other +hand, if the driver has support for XDP, it will be used by the AF_XDP +code to provide better performance, but there is still a copy of the +data into user space. + +Concepts +======== + +In order to use an AF_XDP socket, a number of associated objects need +to be setup. + +Jonathan Corbet has also written an excellent article on LWN, +"Accelerating networking with AF_XDP". It can be found at +https://lwn.net/Articles/750845/. + +UMEM +---- + +UMEM is a region of virtual contiguous memory, divided into +equal-sized frames. An UMEM is associated to a netdev and a specific +queue id of that netdev. It is created and configured (chunk size, +headroom, start address and size) by using the XDP_UMEM_REG setsockopt +system call. A UMEM is bound to a netdev and queue id, via the bind() +system call. + +An AF_XDP is socket linked to a single UMEM, but one UMEM can have +multiple AF_XDP sockets. To share an UMEM created via one socket A, +the next socket B can do this by setting the XDP_SHARED_UMEM flag in +struct sockaddr_xdp member sxdp_flags, and passing the file descriptor +of A to struct sockaddr_xdp member sxdp_shared_umem_fd. + +The UMEM has two single-producer/single-consumer rings, that are used +to transfer ownership of UMEM frames between the kernel and the +user-space application. + +Rings +----- + +There are a four different kind of rings: Fill, Completion, RX and +TX. All rings are single-producer/single-consumer, so the user-space +application need explicit synchronization of multiple +processes/threads are reading/writing to them. + +The UMEM uses two rings: Fill and Completion. Each socket associated +with the UMEM must have an RX queue, TX queue or both. Say, that there +is a setup with four sockets (all doing TX and RX). Then there will be +one Fill ring, one Completion ring, four TX rings and four RX rings. + +The rings are head(producer)/tail(consumer) based rings. A producer +writes the data ring at the index pointed out by struct xdp_ring +producer member, and increasing the producer index. A consumer reads +the data ring at the index pointed out by struct xdp_ring consumer +member, and increasing the consumer index. + +The rings are configured and created via the _RING setsockopt system +calls and mmapped to user-space using the appropriate offset to mmap() +(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and +XDP_UMEM_PGOFF_COMPLETION_RING). + +The size of the rings need to be of size power of two. + +UMEM Fill Ring +~~~~~~~~~~~~~~ + +The Fill ring is used to transfer ownership of UMEM frames from +user-space to kernel-space. The UMEM addrs are passed in the ring. As +an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has +16 chunks and can pass addrs between 0 and 64k. + +Frames passed to the kernel are used for the ingress path (RX rings). + +The user application produces UMEM addrs to this ring. Note that the +kernel will mask the incoming addr. E.g. for a chunk size of 2k, the +log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 +and 3000 refers to the same chunk. + + +UMEM Completetion Ring +~~~~~~~~~~~~~~~~~~~~~~ + +The Completion Ring is used transfer ownership of UMEM frames from +kernel-space to user-space. Just like the Fill ring, UMEM indicies are +used. + +Frames passed from the kernel to user-space are frames that has been +sent (TX ring) and can be used by user-space again. + +The user application consumes UMEM addrs from this ring. + + +RX Ring +~~~~~~~ + +The RX ring is the receiving side of a socket. Each entry in the ring +is a struct xdp_desc descriptor. The descriptor contains UMEM offset +(addr) and the length of the data (len). + +If no frames have been passed to kernel via the Fill ring, no +descriptors will (or can) appear on the RX ring. + +The user application consumes struct xdp_desc descriptors from this +ring. + +TX Ring +~~~~~~~ + +The TX ring is used to send frames. The struct xdp_desc descriptor is +filled (index, length and offset) and passed into the ring. + +To start the transfer a sendmsg() system call is required. This might +be relaxed in the future. + +The user application produces struct xdp_desc descriptors to this +ring. + +XSKMAP / BPF_MAP_TYPE_XSKMAP +---------------------------- + +On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that +is used in conjunction with bpf_redirect_map() to pass the ingress +frame to a socket. + +The user application inserts the socket into the map, via the bpf() +system call. + +Note that if an XDP program tries to redirect to a socket that does +not match the queue configuration and netdev, the frame will be +dropped. E.g. an AF_XDP socket is bound to netdev eth0 and +queue 17. Only the XDP program executing for eth0 and queue 17 will +successfully pass data to the socket. Please refer to the sample +application (samples/bpf/) in for an example. + +Usage +===== + +In order to use AF_XDP sockets there are two parts needed. The +user-space application and the XDP program. For a complete setup and +usage example, please refer to the sample application. The user-space +side is xdpsock_user.c and the XDP side xdpsock_kern.c. + +Naive ring dequeue and enqueue could look like this:: + + // struct xdp_rxtx_ring { + // __u32 *producer; + // __u32 *consumer; + // struct xdp_desc *desc; + // }; + + // struct xdp_umem_ring { + // __u32 *producer; + // __u32 *consumer; + // __u64 *desc; + // }; + + // typedef struct xdp_rxtx_ring RING; + // typedef struct xdp_umem_ring RING; + + // typedef struct xdp_desc RING_TYPE; + // typedef __u64 RING_TYPE; + + int dequeue_one(RING *ring, RING_TYPE *item) + { + __u32 entries = *ring->producer - *ring->consumer; + + if (entries == 0) + return -1; + + // read-barrier! + + *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; + (*ring->consumer)++; + return 0; + } + + int enqueue_one(RING *ring, const RING_TYPE *item) + { + u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); + + if (free_entries == 0) + return -1; + + ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; + + // write-barrier! + + (*ring->producer)++; + return 0; + } + + +For a more optimized version, please refer to the sample application. + +Sample application +================== + +There is a xdpsock benchmarking/test application included that +demonstrates how to use AF_XDP sockets with both private and shared +UMEMs. Say that you would like your UDP traffic from port 4242 to end +up in queue 16, that we will enable AF_XDP on. Here, we use ethtool +for this:: + + ethtool -N p3p2 rx-flow-hash udp4 fn + ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ + action 16 + +Running the rxdrop benchmark in XDP_DRV mode can then be done +using:: + + samples/bpf/xdpsock -i p3p2 -q 16 -r -N + +For XDP_SKB mode, use the switch "-S" instead of "-N" and all options +can be displayed with "-h", as usual. + +Credits +======= + +- Björn Töpel (AF_XDP core) +- Magnus Karlsson (AF_XDP core) +- Alexander Duyck +- Alexei Starovoitov +- Daniel Borkmann +- Jesper Dangaard Brouer +- John Fastabend +- Jonathan Corbet (LWN coverage) +- Michael S. Tsirkin +- Qi Z Zhang +- Willem de Bruijn + diff --git a/Documentation/networking/alias.rst b/Documentation/networking/alias.rst new file mode 100644 index 000000000..af7c5ee92 --- /dev/null +++ b/Documentation/networking/alias.rst @@ -0,0 +1,49 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +IP-Aliasing +=========== + +IP-aliases are an obsolete way to manage multiple IP-addresses/masks +per interface. Newer tools such as iproute2 support multiple +address/prefixes per interface, but aliases are still supported +for backwards compatibility. + +An alias is formed by adding a colon and a string when running ifconfig. +This string is usually numeric, but this is not a must. + + +Alias creation +============== + +Alias creation is done by 'magic' interface naming: eg. to create a +200.1.1.1 alias for eth0 ... +:: + + # ifconfig eth0:0 200.1.1.1 etc,etc.... + ~~ -> request alias #0 creation (if not yet exists) for eth0 + +The corresponding route is also set up by this command. Please note: +The route always points to the base interface. + + +Alias deletion +============== + +The alias is removed by shutting the alias down:: + + # ifconfig eth0:0 down + ~~~~~~~~~~ -> will delete alias + + +Alias (re-)configuring +====================== + +Aliases are not real devices, but programs should be able to configure +and refer to them as usual (ifconfig, route, etc). + + +Relationship with main device +============================= + +If the base device is shut down the added aliases will be deleted too. diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt new file mode 100644 index 000000000..50b8589d1 --- /dev/null +++ b/Documentation/networking/altera_tse.txt @@ -0,0 +1,263 @@ + Altera Triple-Speed Ethernet MAC driver + +Copyright (C) 2008-2014 Altera Corporation + +This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers +using the SGDMA and MSGDMA soft DMA IP components. The driver uses the +platform bus to obtain component resources. The designs used to test this +driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, +and tested with ARM and NIOS processor hosts separately. The anticipated use +cases are simple communications between an embedded system and an external peer +for status and simple configuration of the embedded system. + +For more information visit www.altera.com and www.rocketboards.org. Support +forums for the driver may be found on www.rocketboards.org, and a design used +to test this driver may be found there as well. Support is also available from +the maintainer of this driver, found in MAINTAINERS. + +The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP +components that can be assembled and built into an FPGA using the Altera +Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that +this driver was tested against. The sopc2dts tool is used to create the +device tree for the driver, and may be found at rocketboards.org. + +The driver probe function examines the device tree and determines if the +Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The +probe function then installs the appropriate set of DMA routines to +initialize, setup transmits, receives, and interrupt handling primitives for +the respective configurations. + +The SGDMA component is to be deprecated in the near future (over the next 1-2 +years as of this writing in early 2014) in favor of the MSGDMA component. +SGDMA support is included for existing designs and reference in case a +developer wishes to support their own soft DMA logic and driver support. Any +new designs should not use the SGDMA. + +The SGDMA supports only a single transmit or receive operation at a time, and +therefore will not perform as well compared to the MSGDMA soft IP. Please +visit www.altera.com for known, documented SGDMA errata. + +Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. +Scatter-gather DMA will be added to a future maintenance update to this +driver. + +Jumbo frames are not supported at this time. + +The driver limits PHY operations to 10/100Mbps, and has not yet been fully +tested for 1Gbps. This support will be added in a future maintenance update. + +1) Kernel Configuration +The kernel configuration option is ALTERA_TSE: + Device Drivers ---> Network device support ---> Ethernet driver support ---> + Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) + +2) Driver parameters list: + debug: message level (0: no output, 16: all); + dma_rx_num: Number of descriptors in the RX list (default is 64); + dma_tx_num: Number of descriptors in the TX list (default is 64). + +3) Command line options +Driver parameters can be also passed in command line by using: + altera_tse=dma_rx_num:128,dma_tx_num:512 + +4) Driver information and notes + +4.1) Transmit process +When the driver's transmit routine is called by the kernel, it sets up a +transmit descriptor by calling the underlying DMA transmit routine (SGDMA or +MSGDMA), and initiates a transmit operation. Once the transmit is complete, an +interrupt is driven by the transmit DMA logic. The driver handles the transmit +completion in the context of the interrupt handling chain by recycling +resource required to send and track the requested transmit operation. + +4.2) Receive process +The driver will post receive buffers to the receive DMA logic during driver +initialization. Receive buffers may or may not be queued depending upon the +underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able +to queue receive buffers to the SGDMA receive logic). When a packet is +received, the DMA logic generates an interrupt. The driver handles a receive +interrupt by obtaining the DMA receive logic status, reaping receive +completions until no more receive completions are available. + +4.3) Interrupt Mitigation +The driver is able to mitigate the number of its DMA interrupts +using NAPI for receive operations. Interrupt mitigation is not yet supported +for transmit operations, but will be added in a future maintenance release. + +4.4) Ethtool support +Ethtool is supported. Driver statistics and internal errors can be taken using: +ethtool -S ethX command. It is possible to dump registers etc. + +4.5) PHY Support +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.7) List of source files: + o Kconfig + o Makefile + o altera_tse_main.c: main network device driver + o altera_tse_ethtool.c: ethtool support + o altera_tse.h: private driver structure and common definitions + o altera_msgdma.h: MSGDMA implementation function definitions + o altera_sgdma.h: SGDMA implementation function definitions + o altera_msgdma.c: MSGDMA implementation + o altera_sgdma.c: SGDMA implementation + o altera_sgdmahw.h: SGDMA register and descriptor definitions + o altera_msgdmahw.h: MSGDMA register and descriptor definitions + o altera_utils.c: Driver utility functions + o altera_utils.h: Driver utility function definitions + +5) Debug Information + +The driver exports debug information such as internal statistics, +debug information, MAC and DMA registers etc. + +A user may use the ethtool support to get statistics: +e.g. using: ethtool -S ethX (that shows the statistics counters) +or sees the MAC registers: e.g. using: ethtool -d ethX + +The developer can also use the "debug" module parameter to get +further debug information. + +6) Statistics Support + +The controller and driver support a mix of IEEE standard defined statistics, +RFC defined statistics, and driver or Altera defined statistics. The four +specifications containing the standard definitions for these statistics are +as follows: + + o IEEE 802.3-2012 - IEEE Standard for Ethernet. + o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. + o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. + o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com + +The statistics supported by the TSE and the device driver are as follows: + +"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.2. This statistics is the count of frames that are successfully +transmitted. + +"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.5. This statistic is the count of frames that are successfully +received. This count does not include any error packets such as CRC errors, +length errors, or alignment errors. + +"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE +802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are +an integral number of bytes in length and do not pass the CRC test as the frame +is received. + +"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, +Section 5.2.2.1.7. This statistic is the count of frames that are not an +integral number of bytes in length and do not pass the CRC test as the frame is +received. + +"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.8. This statistic is the count of data and pad bytes +successfully transmitted from the interface. + +"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.14. This statistic is the count of data and pad bytes +successfully received by the controller. + +"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE +802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames +transmitted from the network controller. + +"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE +802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames +received by the network controller. + +"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is +a count of the number of packets received containing errors that prevented the +packet from being delivered to a higher level protocol. + +"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic +is a count of the number of packets that could not be transmitted due to errors. + +"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were not addressed +to the broadcast address or a multicast group. + +"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +a multicast address group. + +"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +the broadcast address. + +"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This +statistic is the number of outbound packets not transmitted even though an +error was not detected. An example of a reason this might occur is to free up +internal buffer space. + +"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were not addressed to +a multicast group or broadcast address. + +"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +multicast group. + +"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +broadcast address. + +"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. +This statistic counts the number of packets dropped due to lack of internal +controller resources. + +"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. +This statistic counts the total number of bytes received by the controller, +including error and discarded packets. + +"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. +This statistic counts the total number of packets received by the controller, +including error, discarded, unicast, multicast, and broadcast packets. + +"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets received less +than 64 bytes long. + +"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets greater than 1518 +bytes long. + +"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. +This statistic counts the total number of packets received that were 64 octets +in length. + +"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC +2819. This statistic counts the total number of packets received that were +between 65 and 127 octets in length inclusive. + +"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 128 and 255 octets in length inclusive. + +"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 256 and 511 octets in length inclusive. + +"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 512 and 1023 octets in length inclusive. + +"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define +in RFC 2819. This statistic is the total number of packets received that were +between 1024 and 1518 octets in length inclusive. + +"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the +Altera TSE. This statistics counts the number of received good and errored +frames between the length of 1519 and the maximum frame length configured +in the frm_length register. See the Altera TSE User Guide for More details. + +"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This +statistic is the total number of packets received that were longer than 1518 +octets, and had either a bad CRC with an integral number of octets (CRC Error) +or a bad CRC with a non-integral number of octets (Alignment Error). + +"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This +statistic is the total number of packets received that were less than 64 octets +in length and had either a bad CRC with an integral number of octets (CRC +error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/arcnet-hardware.txt b/Documentation/networking/arcnet-hardware.txt new file mode 100644 index 000000000..731de4115 --- /dev/null +++ b/Documentation/networking/arcnet-hardware.txt @@ -0,0 +1,3133 @@ + +----------------------------------------------------------------------------- +1) This file is a supplement to arcnet.txt. Please read that for general + driver configuration help. +----------------------------------------------------------------------------- +2) This file is no longer Linux-specific. It should probably be moved out of + the kernel sources. Ideas? +----------------------------------------------------------------------------- + +Because so many people (myself included) seem to have obtained ARCnet cards +without manuals, this file contains a quick introduction to ARCnet hardware, +some cabling tips, and a listing of all jumper settings I can find. Please +e-mail apenwarr@worldvisions.ca with any settings for your particular card, +or any other information you have! + + +INTRODUCTION TO ARCNET +---------------------- + +ARCnet is a network type which works in a way similar to popular Ethernet +networks but which is also different in some very important ways. + +First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps +(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact, +there are others as well, but these are less common. The different hardware +types, as far as I'm aware, are not compatible and so you cannot wire a +100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does +work with 100 Mbps cards, but I haven't been able to verify this myself, +since I only have the 2.5 Mbps variety. It is probably not going to saturate +your 100 Mbps card. Stop complaining. :) + +You also cannot connect an ARCnet card to any kind of Ethernet card and +expect it to work. + +There are two "types" of ARCnet - STAR topology and BUS topology. This +refers to how the cards are meant to be wired together. According to most +available documentation, you can only connect STAR cards to STAR cards and +BUS cards to BUS cards. That makes sense, right? Well, it's not quite +true; see below under "Cabling." + +Once you get past these little stumbling blocks, ARCnet is actually quite a +well-designed standard. It uses something called "modified token passing" +which makes it completely incompatible with so-called "Token Ring" cards, +but which makes transfers much more reliable than Ethernet does. In fact, +ARCnet will guarantee that a packet arrives safely at the destination, and +even if it can't possibly be delivered properly (ie. because of a cable +break, or because the destination computer does not exist) it will at least +tell the sender about it. + +Because of the carefully defined action of the "token", it will always make +a pass around the "ring" within a maximum length of time. This makes it +useful for realtime networks. + +In addition, all known ARCnet cards have an (almost) identical programming +interface. This means that with one ARCnet driver you can support any +card, whereas with Ethernet each manufacturer uses what is sometimes a +completely different programming interface, leading to a lot of different, +sometimes very similar, Ethernet drivers. Of course, always using the same +programming interface also means that when high-performance hardware +facilities like PCI bus mastering DMA appear, it's hard to take advantage of +them. Let's not go into that. + +One thing that makes ARCnet cards difficult to program for, however, is the +limit on their packet sizes; standard ARCnet can only send packets that are +up to 508 bytes in length. This is smaller than the Internet "bare minimum" +of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra +level of encapsulation is defined by RFC1201, which I call "packet +splitting," that allows "virtual packets" to grow as large as 64K each, +although they are generally kept down to the Ethernet-style 1500 bytes. + +For more information on the advantages and disadvantages (mostly the +advantages) of ARCnet networks, you might try the "ARCnet Trade Association" +WWW page: + http://www.arcnet.com + + +CABLING ARCNET NETWORKS +----------------------- + +This section was rewritten by + Vojtech Pavlik <vojtech@suse.cz> +using information from several people, including: + Avery Pennraun <apenwarr@worldvisions.ca> + Stephen A. Wood <saw@hallc1.cebaf.gov> + John Paul Morrison <jmorriso@bogomips.ee.ubc.ca> + Joachim Koenig <jojo@repas.de> +and Avery touched it up a bit, at Vojtech's request. + +ARCnet (the classic 2.5 Mbps version) can be connected by two different +types of cabling: coax and twisted pair. The other ARCnet-type networks +(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of +cabling (Type1, Fiber, C1, C4, C5). + +For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables +also work fine, because ARCnet is a very stable network. I personally use 75 +Ohm TV antenna cable. + +Cards for coax cabling are shipped in two different variants: for BUS and +STAR network topologies. They are mostly the same. The only difference +lies in the hybrid chip installed. BUS cards use high impedance output, +while STAR use low impedance. Low impedance card (STAR) is electrically +equal to a high impedance one with a terminator installed. + +Usually, the ARCnet networks are built up from STAR cards and hubs. There +are two types of hubs - active and passive. Passive hubs are small boxes +with four BNC connectors containing four 47 Ohm resistors: + + | | wires + R + junction +-R-+-R- R 47 Ohm resistors + R + | + +The shielding is connected together. Active hubs are much more complicated; +they are powered and contain electronics to amplify the signal and send it +to other segments of the net. They usually have eight connectors. Active +hubs come in two variants - dumb and smart. The dumb variant just +amplifies, but the smart one decodes to digital and encodes back all packets +coming through. This is much better if you have several hubs in the net, +since many dumb active hubs may worsen the signal quality. + +And now to the cabling. What you can connect together: + +1. A card to a card. This is the simplest way of creating a 2-computer + network. + +2. A card to a passive hub. Remember that all unused connectors on the hub + must be properly terminated with 93 Ohm (or something else if you don't + have the right ones) terminators. + (Avery's note: oops, I didn't know that. Mine (TV cable) works + anyway, though.) + +3. A card to an active hub. Here is no need to terminate the unused + connectors except some kind of aesthetic feeling. But, there may not be + more than eleven active hubs between any two computers. That of course + doesn't limit the number of active hubs on the network. + +4. An active hub to another. + +5. An active hub to passive hub. + +Remember that you cannot connect two passive hubs together. The power loss +implied by such a connection is too high for the net to operate reliably. + +An example of a typical ARCnet network: + + R S - STAR type card + S------H--------A-------S R - Terminator + | | H - Hub + | | A - Active hub + | S----H----S + S | + | + S + +The BUS topology is very similar to the one used by Ethernet. The only +difference is in cable and terminators: they should be 93 Ohm. Ethernet +uses 50 Ohm impedance. You use T connectors to put the computers on a single +line of cable, the bus. You have to put terminators at both ends of the +cable. A typical BUS ARCnet network looks like: + + RT----T------T------T------T------TR + B B B B B B + + B - BUS type card + R - Terminator + T - T connector + +But that is not all! The two types can be connected together. According to +the official documentation the only way of connecting them is using an active +hub: + + A------T------T------TR + | B B B + S---H---S + | + S + +The official docs also state that you can use STAR cards at the ends of +BUS network in place of a BUS card and a terminator: + + S------T------T------S + B B + +But, according to my own experiments, you can simply hang a BUS type card +anywhere in middle of a cable in a STAR topology network. And more - you +can use the bus card in place of any star card if you use a terminator. Then +you can build very complicated networks fulfilling all your needs! An +example: + + S + | + RT------T-------T------H------S + B B B | + | R + S------A------T-------T-------A-------H------TR + | B B | | B + | S BT | + | | | S----A-----S + S------H---A----S | | + | | S------T----H---S | + S S B R S + +A basically different cabling scheme is used with Twisted Pair cabling. Each +of the TP cards has two RJ (phone-cord style) connectors. The cards are +then daisy-chained together using a cable connecting every two neighboring +cards. The ends are terminated with RJ 93 Ohm terminators which plug into +the empty connectors of cards on the ends of the chain. An example: + + ___________ ___________ + _R_|_ _|_|_ _|_R_ + | | | | | | + |Card | |Card | |Card | + |_____| |_____| |_____| + + +There are also hubs for the TP topology. There is nothing difficult +involved in using them; you just connect a TP chain to a hub on any end or +even at both. This way you can create almost any network configuration. +The maximum of 11 hubs between any two computers on the net applies here as +well. An example: + + RP-------P--------P--------H-----P------P-----PR + | + RP-----H--------P--------H-----P------PR + | | + PR PR + + R - RJ Terminator + P - TP Card + H - TP Hub + +Like any network, ARCnet has a limited cable length. These are the maximum +cable lengths between two active ends (an active end being an active hub or +a STAR card). + + RG-62 93 Ohm up to 650 m + RG-59/U 75 Ohm up to 457 m + RG-11/U 75 Ohm up to 533 m + IBM Type 1 150 Ohm up to 200 m + IBM Type 3 100 Ohm up to 100 m + +The maximum length of all cables connected to a passive hub is limited to 65 +meters for RG-62 cabling; less for others. You can see that using passive +hubs in a large network is a bad idea. The maximum length of a single "BUS +Trunk" is about 300 meters for RG-62. The maximum distance between the two +most distant points of the net is limited to 3000 meters. The maximum length +of a TP cable between two cards/hubs is 650 meters. + + +SETTING THE JUMPERS +------------------- + +All ARCnet cards should have a total of four or five different settings: + + - the I/O address: this is the "port" your ARCnet card is on. Probed + values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If + your card has additional ones, which is possible, please tell me.) This + should not be the same as any other device on your system. According to + a doc I got from Novell, MS Windows prefers values of 0x300 or more, + eating net connections on my system (at least) otherwise. My guess is + this may be because, if your card is at 0x2E0, probing for a serial port + at 0x2E8 will reset the card and probably mess things up royally. + - Avery's favourite: 0x300. + + - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7. + on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15. + + Make sure this is different from any other card on your system. Note + that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can + "cat /proc/interrupts" for a somewhat complete list of which ones are in + use at any given time. Here is a list of common usages from Vojtech + Pavlik <vojtech@suse.cz>: + ("Not on bus" means there is no way for a card to generate this + interrupt) + IRQ 0 - Timer 0 (Not on bus) + IRQ 1 - Keyboard (Not on bus) + IRQ 2 - IRQ Controller 2 (Not on bus, nor does interrupt the CPU) + IRQ 3 - COM2 + IRQ 4 - COM1 + IRQ 5 - FREE (LPT2 if you have it; sometimes COM3; maybe PLIP) + IRQ 6 - Floppy disk controller + IRQ 7 - FREE (LPT1 if you don't use the polling driver; PLIP) + IRQ 8 - Realtime Clock Interrupt (Not on bus) + IRQ 9 - FREE (VGA vertical sync interrupt if enabled) + IRQ 10 - FREE + IRQ 11 - FREE + IRQ 12 - FREE + IRQ 13 - Numeric Coprocessor (Not on bus) + IRQ 14 - Fixed Disk Controller + IRQ 15 - FREE (Fixed Disk Controller 2 if you have it) + + Note: IRQ 9 is used on some video cards for the "vertical retrace" + interrupt. This interrupt would have been handy for things like + video games, as it occurs exactly once per screen refresh, but + unfortunately IBM cancelled this feature starting with the original + VGA and thus many VGA/SVGA cards do not support it. For this + reason, no modern software uses this interrupt and it can almost + always be safely disabled, if your video card supports it at all. + + If your card for some reason CANNOT disable this IRQ (usually there + is a jumper), one solution would be to clip the printed circuit + contact on the board: it's the fourth contact from the left on the + back side. I take no responsibility if you try this. + + - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though. + + - the memory address: Unlike most cards, ARCnets use "shared memory" for + copying buffers around. Make SURE it doesn't conflict with any other + used memory in your system! + A0000 - VGA graphics memory (ok if you don't have VGA) + B0000 - Monochrome text mode + C0000 \ One of these is your VGA BIOS - usually C0000. + E0000 / + F0000 - System BIOS + + Anything less than 0xA0000 is, well, a BAD idea since it isn't above + 640k. + - Avery's favourite: 0xD0000 + + - the station address: Every ARCnet card has its own "unique" network + address from 0 to 255. Unlike Ethernet, you can set this address + yourself with a jumper or switch (or on some cards, with special + software). Since it's only 8 bits, you can only have 254 ARCnet cards + on a network. DON'T use 0 or 255, since these are reserved (although + neat stuff will probably happen if you DO use them). By the way, if you + haven't already guessed, don't set this the same as any other ARCnet on + your network! + - Avery's favourite: 3 and 4. Not that it matters. + + - There may be ETS1 and ETS2 settings. These may or may not make a + difference on your card (many manuals call them "reserved"), but are + used to change the delays used when powering up a computer on the + network. This is only necessary when wiring VERY long range ARCnet + networks, on the order of 4km or so; in any case, the only real + requirement here is that all cards on the network with ETS1 and ETS2 + jumpers have them in the same position. Chris Hindy <chrish@io.org> + sent in a chart with actual values for this: + ET1 ET2 Response Time Reconfiguration Time + --- --- ------------- -------------------- + open open 74.7us 840us + open closed 283.4us 1680us + closed open 561.8us 1680us + closed closed 1118.6us 1680us + + Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your + network. + +Also, on many cards (not mine, though) there are red and green LED's. +Vojtech Pavlik <vojtech@suse.cz> tells me this is what they mean: + GREEN RED Status + ----- --- ------ + OFF OFF Power off + OFF Short flashes Cabling problems (broken cable or not + terminated) + OFF (short) ON Card init + ON ON Normal state - everything OK, nothing + happens + ON Long flashes Data transfer + ON OFF Never happens (maybe when wrong ID) + + +The following is all the specific information people have sent me about +their own particular ARCnet cards. It is officially a mess, and contains +huge amounts of duplicated information. I have no time to fix it. If you +want to, PLEASE DO! Just send me a 'diff -u' of all your changes. + +The model # is listed right above specifics for that card, so you should be +able to use your text viewer's "search" function to find the entry you want. +If you don't KNOW what kind of card you have, try looking through the +various diagrams to see if you can tell. + +If your model isn't listed and/or has different settings, PLEASE PLEASE +tell me. I had to figure mine out without the manual, and it WASN'T FUN! + +Even if your ARCnet model isn't listed, but has the same jumpers as another +model that is, please e-mail me to say so. + +Cards Listed in this file (in this order, mostly): + + Manufacturer Model # Bits + ------------ ------- ---- + SMC PC100 8 + SMC PC110 8 + SMC PC120 8 + SMC PC130 8 + SMC PC270E 8 + SMC PC500 16 + SMC PC500Longboard 16 + SMC PC550Longboard 16 + SMC PC600 16 + SMC PC710 8 + SMC? LCS-8830(-T) 8/16 + Puredata PDI507 8 + CNet Tech CN120-Series 8 + CNet Tech CN160-Series 16 + Lantech? UM9065L chipset 8 + Acer 5210-003 8 + Datapoint? LAN-ARC-8 8 + Topware TA-ARC/10 8 + Thomas-Conrad 500-6242-0097 REV A 8 + Waterloo? (C)1985 Waterloo Micro. 8 + No Name -- 8/16 + No Name Taiwan R.O.C? 8 + No Name Model 9058 8 + Tiara Tiara Lancard? 8 + + +** SMC = Standard Microsystems Corp. +** CNet Tech = CNet Technology, Inc. + + +Unclassified Stuff +------------------ + - Please send any other information you can find. + + - And some other stuff (more info is welcome!): + From: root@ultraworld.xs4all.nl (Timo Hilbrink) + To: apenwarr@foxnet.net (Avery Pennarun) + Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT) + Reply-To: timoh@xs4all.nl + + [...parts deleted...] + + About the jumpers: On my PC130 there is one more jumper, located near the + cable-connector and it's for changing to star or bus topology; + closed: star - open: bus + On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI + and another with ALE,LA17,LA18,LA19 these are undocumented.. + + [...more parts deleted...] + + --- CUT --- + + +** Standard Microsystems Corp (SMC) ** +PC100, PC110, PC120, PC130 (8-bit cards) +PC500, PC600 (16-bit cards) +--------------------------------- + - mainly from Avery Pennarun <apenwarr@worldvisions.ca>. Values depicted + are from Avery's setup. + - special thanks to Timo Hilbrink <timoh@xs4all.nl> for noting that PC120, + 130, 500, and 600 all have the same switches as Avery's PC100. + PC500/600 have several extra, undocumented pins though. (?) + - PC110 settings were verified by Stephen A. Wood <saw@cebaf.gov> + - Also, the JP- and S-numbers probably don't match your card exactly. Try + to find jumpers/switches with the same number of settings - it's + probably more reliable. + + + JP5 [|] : : : : +(IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7 + Put exactly one jumper on exactly one set of pins. + + + 1 2 3 4 5 6 7 8 9 10 + S1 /----------------------------------\ +(I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 | + addresses) \----------------------------------/ + |--| |--------| |--------| + (a) (b) (m) + + WARNING. It's very important when setting these which way + you're holding the card, and which way you think is '1'! + + If you suspect that your settings are not being made + correctly, try reversing the direction or inverting the + switch positions. + + a: The first digit of the I/O address. + Setting Value + ------- ----- + 00 0 + 01 1 + 10 2 + 11 3 + + b: The second digit of the I/O address. + Setting Value + ------- ----- + 0000 0 + 0001 1 + 0010 2 + ... ... + 1110 E + 1111 F + + The I/O address is in the form ab0. For example, if + a is 0x2 and b is 0xE, the address will be 0x2E0. + + DO NOT SET THIS LESS THAN 0x200!!!!! + + + m: The first digit of the memory address. + Setting Value + ------- ----- + 0000 0 + 0001 1 + 0010 2 + ... ... + 1110 E + 1111 F + + The memory address is in the form m0000. For example, if + m is D, the address will be 0xD0000. + + DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000! + + 1 2 3 4 5 6 7 8 + S2 /--------------------------\ +(Station Address) | 1 1 0 0 0 0 0 0 | + \--------------------------/ + + Setting Value + ------- ----- + 00000000 00 + 10000000 01 + 01000000 02 + ... + 01111111 FE + 11111111 FF + + Note that this is binary with the digits reversed! + + DO NOT SET THIS TO 0 OR 255 (0xFF)! + + +***************************************************************************** + +** Standard Microsystems Corp (SMC) ** +PC130E/PC270E (8-bit cards) +--------------------------- + - from Juergen Seifert <seifert@htwm.de> + + +STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET(R)-PC130E/PC270E +=============================================================== + +This description has been written by Juergen Seifert <seifert@htwm.de> +using information from the following Original SMC Manual + + "Configuration Guide for + ARCNET(R)-PC130E/PC270 + Network Controller Boards + Pub. # 900.044A + June, 1989" + +ARCNET is a registered trademark of the Datapoint Corporation +SMC is a registered trademark of the Standard Microsystems Corporation + +The PC130E is an enhanced version of the PC130 board, is equipped with a +standard BNC female connector for connection to RG-62/U coax cable. +Since this board is designed both for point-to-point connection in star +networks and for connection to bus networks, it is downwardly compatible +with all the other standard boards designed for coax networks (that is, +the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and +PC200 bus topology boards). + +The PC270E is an enhanced version of the PC260 board, is equipped with two +modular RJ11-type jacks for connection to twisted pair wiring. +It can be used in a star or a daisy-chained network. + + + 8 7 6 5 4 3 2 1 + ________________________________________________________________ + | | S1 | | + | |_________________| | + | Offs|Base |I/O Addr | + | RAM Addr | ___| + | ___ ___ CR3 |___| + | | \/ | CR4 |___| + | | PROM | ___| + | | | N | | 8 + | | SOCKET | o | | 7 + | |________| d | | 6 + | ___________________ e | | 5 + | | | A | S | 4 + | |oo| EXT2 | | d | 2 | 3 + | |oo| EXT1 | SMC | d | | 2 + | |oo| ROM | 90C63 | r |___| 1 + | |oo| IRQ7 | | |o| _____| + | |oo| IRQ5 | | |o| | J1 | + | |oo| IRQ4 | | STAR |_____| + | |oo| IRQ3 | | | J2 | + | |oo| IRQ2 |___________________| |_____| + |___ ______________| + | | + |_____________________________________________| + +Legend: + +SMC 90C63 ARCNET Controller / Transceiver /Logic +S1 1-3: I/O Base Address Select + 4-6: Memory Base Address Select + 7-8: RAM Offset Select +S2 1-8: Node ID Select +EXT Extended Timeout Select +ROM ROM Enable Select +STAR Selected - Star Topology (PC130E only) + Deselected - Bus Topology (PC130E only) +CR3/CR4 Diagnostic LEDs +J1 BNC RG62/U Connector (PC130E only) +J1 6-position Telephone Jack (PC270E only) +J2 6-position Telephone Jack (PC270E only) + +Setting one of the switches to Off/Open means "1", On/Closed means "0". + + +Setting the Node ID +------------------- + +The eight switches in group S2 are used to set the node ID. +These switches work in a way similar to the PC100-series cards; see that +entry for more information. + + +Setting the I/O Base Address +---------------------------- + +The first three switches in switch group S1 are used to select one +of eight possible I/O Base addresses using the following table + + + Switch | Hex I/O + 1 2 3 | Address + -------|-------- + 0 0 0 | 260 + 0 0 1 | 290 + 0 1 0 | 2E0 (Manufacturer's default) + 0 1 1 | 2F0 + 1 0 0 | 300 + 1 0 1 | 350 + 1 1 0 | 380 + 1 1 1 | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 4-6 of switch group S1 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 7 and 8 of group S1. + + Switch | Hex RAM | Hex ROM + 4 5 6 7 8 | Address | Address *) + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + +*) To enable the 8K Boot PROM install the jumper ROM. + The default is jumper ROM not installed. + + +Setting the Timeouts and Interrupt +---------------------------------- + +The jumpers labeled EXT1 and EXT2 are used to determine the timeout +parameters. These two jumpers are normally left open. + +To select a hardware interrupt level set one (only one!) of the jumpers +IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2. + + +Configuring the PC130E for Star or Bus Topology +----------------------------------------------- + +The single jumper labeled STAR is used to configure the PC130E board for +star or bus topology. +When the jumper is installed, the board may be used in a star network, when +it is removed, the board can be used in a bus topology. + + +Diagnostic LEDs +--------------- + +Two diagnostic LEDs are visible on the rear bracket of the board. +The green LED monitors the network activity: the red one shows the +board activity: + + Green | Status Red | Status + -------|------------------- ---------|------------------- + on | normal activity flash/on | data transfer + blink | reconfiguration off | no data transfer; + off | defective board or | incorrect memory or + | node ID is zero | I/O address + + +***************************************************************************** + +** Standard Microsystems Corp (SMC) ** +PC500/PC550 Longboard (16-bit cards) +------------------------------------- + - from Juergen Seifert <seifert@htwm.de> + + +STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET-PC500/PC550 Long Board +===================================================================== + +Note: There is another Version of the PC500 called Short Version, which + is different in hard- and software! The most important differences + are: + - The long board has no Shared memory. + - On the long board the selection of the interrupt is done by binary + coded switch, on the short board directly by jumper. + +[Avery's note: pay special attention to that: the long board HAS NO SHARED +MEMORY. This means the current Linux-ARCnet driver can't use these cards. +I have obtained a PC500Longboard and will be doing some experiments on it in +the future, but don't hold your breath. Thanks again to Juergen Seifert for +his advice about this!] + +This description has been written by Juergen Seifert <seifert@htwm.de> +using information from the following Original SMC Manual + + "Configuration Guide for + SMC ARCNET-PC500/PC550 + Series Network Controller Boards + Pub. # 900.033 Rev. A + November, 1989" + +ARCNET is a registered trademark of the Datapoint Corporation +SMC is a registered trademark of the Standard Microsystems Corporation + +The PC500 is equipped with a standard BNC female connector for connection +to RG-62/U coax cable. +The board is designed both for point-to-point connection in star networks +and for connection to bus networks. + +The PC550 is equipped with two modular RJ11-type jacks for connection +to twisted pair wiring. +It can be used in a star or a daisy-chained (BUS) network. + + 1 + 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1 + ____________________________________________________________________ + < | SW1 | | SW2 | | + > |_____________________| |_____________| | + < IRQ |I/O Addr | + > ___| + < CR4 |___| + > CR3 |___| + < ___| + > N | | 8 + < o | | 7 + > d | S | 6 + < e | W | 5 + > A | 3 | 4 + < d | | 3 + > d | | 2 + < r |___| 1 + > |o| _____| + < |o| | J1 | + > 3 1 JP6 |_____| + < |o|o| JP2 | J2 | + > |o|o| |_____| + < 4 2__ ______________| + > | | | + <____| |_____________________________________________| + +Legend: + +SW1 1-6: I/O Base Address Select + 7-10: Interrupt Select +SW2 1-6: Reserved for Future Use +SW3 1-8: Node ID Select +JP2 1-4: Extended Timeout Select +JP6 Selected - Star Topology (PC500 only) + Deselected - Bus Topology (PC500 only) +CR3 Green Monitors Network Activity +CR4 Red Monitors Board Activity +J1 BNC RG62/U Connector (PC500 only) +J1 6-position Telephone Jack (PC550 only) +J2 6-position Telephone Jack (PC550 only) + +Setting one of the switches to Off/Open means "1", On/Closed means "0". + + +Setting the Node ID +------------------- + +The eight switches in group SW3 are used to set the node ID. Each node +attached to the network must have an unique node ID which must be +different from 0. +Switch 1 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Some Examples: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +---------------------------- + +The first six switches in switch group SW1 are used to select one +of 32 possible I/O Base addresses using the following table + + Switch | Hex I/O + 6 5 4 3 2 1 | Address + -------------|-------- + 0 1 0 0 0 0 | 200 + 0 1 0 0 0 1 | 210 + 0 1 0 0 1 0 | 220 + 0 1 0 0 1 1 | 230 + 0 1 0 1 0 0 | 240 + 0 1 0 1 0 1 | 250 + 0 1 0 1 1 0 | 260 + 0 1 0 1 1 1 | 270 + 0 1 1 0 0 0 | 280 + 0 1 1 0 0 1 | 290 + 0 1 1 0 1 0 | 2A0 + 0 1 1 0 1 1 | 2B0 + 0 1 1 1 0 0 | 2C0 + 0 1 1 1 0 1 | 2D0 + 0 1 1 1 1 0 | 2E0 (Manufacturer's default) + 0 1 1 1 1 1 | 2F0 + 1 1 0 0 0 0 | 300 + 1 1 0 0 0 1 | 310 + 1 1 0 0 1 0 | 320 + 1 1 0 0 1 1 | 330 + 1 1 0 1 0 0 | 340 + 1 1 0 1 0 1 | 350 + 1 1 0 1 1 0 | 360 + 1 1 0 1 1 1 | 370 + 1 1 1 0 0 0 | 380 + 1 1 1 0 0 1 | 390 + 1 1 1 0 1 0 | 3A0 + 1 1 1 0 1 1 | 3B0 + 1 1 1 1 0 0 | 3C0 + 1 1 1 1 0 1 | 3D0 + 1 1 1 1 1 0 | 3E0 + 1 1 1 1 1 1 | 3F0 + + +Setting the Interrupt +--------------------- + +Switches seven through ten of switch group SW1 are used to select the +interrupt level. The interrupt level is binary coded, so selections +from 0 to 15 would be possible, but only the following eight values will +be supported: 3, 4, 5, 7, 9, 10, 11, 12. + + Switch | IRQ + 10 9 8 7 | + ---------|-------- + 0 0 1 1 | 3 + 0 1 0 0 | 4 + 0 1 0 1 | 5 + 0 1 1 1 | 7 + 1 0 0 1 | 9 (=2) (default) + 1 0 1 0 | 10 + 1 0 1 1 | 11 + 1 1 0 0 | 12 + + +Setting the Timeouts +-------------------- + +The two jumpers JP2 (1-4) are used to determine the timeout parameters. +These two jumpers are normally left open. +Refer to the COM9026 Data Sheet for alternate configurations. + + +Configuring the PC500 for Star or Bus Topology +---------------------------------------------- + +The single jumper labeled JP6 is used to configure the PC500 board for +star or bus topology. +When the jumper is installed, the board may be used in a star network, when +it is removed, the board can be used in a bus topology. + + +Diagnostic LEDs +--------------- + +Two diagnostic LEDs are visible on the rear bracket of the board. +The green LED monitors the network activity: the red one shows the +board activity: + + Green | Status Red | Status + -------|------------------- ---------|------------------- + on | normal activity flash/on | data transfer + blink | reconfiguration off | no data transfer; + off | defective board or | incorrect memory or + | node ID is zero | I/O address + + +***************************************************************************** + +** SMC ** +PC710 (8-bit card) +------------------ + - from J.S. van Oosten <jvoosten@compiler.tdcnet.nl> + +Note: this data is gathered by experimenting and looking at info of other +cards. However, I'm sure I got 99% of the settings right. + +The SMC710 card resembles the PC270 card, but is much more basic (i.e. no +LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing: + + _______________________________________ + | +---------+ +---------+ |____ + | | S2 | | S1 | | + | +---------+ +---------+ | + | | + | +===+ __ | + | | R | | | X-tal ###___ + | | O | |__| ####__'| + | | M | || ### + | +===+ | + | | + | .. JP1 +----------+ | + | .. | big chip | | + | .. | 90C63 | | + | .. | | | + | .. +----------+ | + ------- ----------- + ||||||||||||||||||||| + +The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes +labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM, +IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) ) + +S1 and S2 perform the same function as on the PC270, only their numbers +are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address). + +I know it works when connected to a PC110 type ARCnet board. + + +***************************************************************************** + +** Possibly SMC ** +LCS-8830(-T) (8 and 16-bit cards) +--------------------------------- + - from Mathias Katzer <mkatzer@HRZ.Uni-Bielefeld.DE> + - Marek Michalkiewicz <marekm@i17linuxb.ists.pwr.wroc.pl> says the + LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS + only (the JP0 jumper is hardwired), and BNC only. + +This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC, +nowhere else, not even on the few Xeroxed sheets from the manual). + +SMC ARCnet Board Type LCS-8830-T + + ------------------------------------ + | | + | JP3 88 8 JP2 | + | ##### | \ | + | ##### ET1 ET2 ###| + | 8 ###| + | U3 SW 1 JP0 ###| Phone Jacks + | -- ###| + | | | | + | | | SW2 | + | | | | + | | | ##### | + | -- ##### #### BNC Connector + | #### + | 888888 JP1 | + | 234567 | + -- ------- + ||||||||||||||||||||||||||| + -------------------------- + + +SW1: DIP-Switches for Station Address +SW2: DIP-Switches for Memory Base and I/O Base addresses + +JP0: If closed, internal termination on (default open) +JP1: IRQ Jumpers +JP2: Boot-ROM enabled if closed +JP3: Jumpers for response timeout + +U3: Boot-ROM Socket + + +ET1 ET2 Response Time Idle Time Reconfiguration Time + + 78 86 840 + X 285 316 1680 + X 563 624 1680 + X X 1130 1237 1680 + +(X means closed jumper) + +(DIP-Switch downwards means "0") + +The station address is binary-coded with SW1. + +The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2: + +Switches Base +678 Address +000 260-26f +100 290-29f +010 2e0-2ef +110 2f0-2ff +001 300-30f +101 350-35f +011 380-38f +111 3e0-3ef + + +DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range: + +Switches RAM ROM +12345 Address Range Address Range +00000 C:0000-C:07ff C:2000-C:3fff +10000 C:0800-C:0fff +01000 C:1000-C:17ff +11000 C:1800-C:1fff +00100 C:4000-C:47ff C:6000-C:7fff +10100 C:4800-C:4fff +01100 C:5000-C:57ff +11100 C:5800-C:5fff +00010 C:C000-C:C7ff C:E000-C:ffff +10010 C:C800-C:Cfff +01010 C:D000-C:D7ff +11010 C:D800-C:Dfff +00110 D:0000-D:07ff D:2000-D:3fff +10110 D:0800-D:0fff +01110 D:1000-D:17ff +11110 D:1800-D:1fff +00001 D:4000-D:47ff D:6000-D:7fff +10001 D:4800-D:4fff +01001 D:5000-D:57ff +11001 D:5800-D:5fff +00101 D:8000-D:87ff D:A000-D:bfff +10101 D:8800-D:8fff +01101 D:9000-D:97ff +11101 D:9800-D:9fff +00011 D:C000-D:c7ff D:E000-D:ffff +10011 D:C800-D:cfff +01011 D:D000-D:d7ff +11011 D:D800-D:dfff +00111 E:0000-E:07ff E:2000-E:3fff +10111 E:0800-E:0fff +01111 E:1000-E:17ff +11111 E:1800-E:1fff + + +***************************************************************************** + +** PureData Corp ** +PDI507 (8-bit card) +-------------------- + - from Mark Rejhon <mdrejhon@magi.com> (slight modifications by Avery) + - Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards) + are mostly the same as this. PDI508Plus cards appear to be mainly + software-configured. + +Jumpers: + There is a jumper array at the bottom of the card, near the edge + connector. This array is labelled J1. They control the IRQs and + something else. Put only one jumper on the IRQ pins. + + ETS1, ETS2 are for timing on very long distance networks. See the + more general information near the top of this file. + + There is a J2 jumper on two pins. A jumper should be put on them, + since it was already there when I got the card. I don't know what + this jumper is for though. + + There is a two-jumper array for J3. I don't know what it is for, + but there were already two jumpers on it when I got the card. It's + a six pin grid in a two-by-three fashion. The jumpers were + configured as follows: + + .-------. + o | o o | + :-------: ------> Accessible end of card with connectors + o | o o | in this direction -------> + `-------' + +Carl de Billy <CARL@carainfo.com> explains J3 and J4: + + J3 Diagram: + + .-------. + o | o o | + :-------: TWIST Technology + o | o o | + `-------' + .-------. + | o o | o + :-------: COAX Technology + | o o | o + `-------' + + - If using coax cable in a bus topology the J4 jumper must be removed; + place it on one pin. + + - If using bus topology with twisted pair wiring move the J3 + jumpers so they connect the middle pin and the pins closest to the RJ11 + Connectors. Also the J4 jumper must be removed; place it on one pin of + J4 jumper for storage. + + - If using star topology with twisted pair wiring move the J3 + jumpers so they connect the middle pin and the pins closest to the RJ11 + connectors. + + +DIP Switches: + + The DIP switches accessible on the accessible end of the card while + it is installed, is used to set the ARCnet address. There are 8 + switches. Use an address from 1 to 254. + + Switch No. + 12345678 ARCnet address + ----------------------------------------- + 00000000 FF (Don't use this!) + 00000001 FE + 00000010 FD + .... + 11111101 2 + 11111110 1 + 11111111 0 (Don't use this!) + + There is another array of eight DIP switches at the top of the + card. There are five labelled MS0-MS4 which seem to control the + memory address, and another three labelled IO0-IO2 which seem to + control the base I/O address of the card. + + This was difficult to test by trial and error, and the I/O addresses + are in a weird order. This was tested by setting the DIP switches, + rebooting the computer, and attempting to load ARCETHER at various + addresses (mostly between 0x200 and 0x400). The address that caused + the red transmit LED to blink, is the one that I thought works. + + Also, the address 0x3D0 seem to have a special meaning, since the + ARCETHER packet driver loaded fine, but without the red LED + blinking. I don't know what 0x3D0 is for though. I recommend using + an address of 0x300 since Windows may not like addresses below + 0x300. + + IO Switch No. + 210 I/O address + ------------------------------- + 111 0x260 + 110 0x290 + 101 0x2E0 + 100 0x2F0 + 011 0x300 + 010 0x350 + 001 0x380 + 000 0x3E0 + + The memory switches set a reserved address space of 0x1000 bytes + (0x100 segment units, or 4k). For example if I set an address of + 0xD000, it will use up addresses 0xD000 to 0xD100. + + The memory switches were tested by booting using QEMM386 stealth, + and using LOADHI to see what address automatically became excluded + from the upper memory regions, and then attempting to load ARCETHER + using these addresses. + + I recommend using an ARCnet memory address of 0xD000, and putting + the EMS page frame at 0xC000 while using QEMM stealth mode. That + way, you get contiguous high memory from 0xD100 almost all the way + the end of the megabyte. + + Memory Switch 0 (MS0) didn't seem to work properly when set to OFF + on my card. It could be malfunctioning on my card. Experiment with + it ON first, and if it doesn't work, set it to OFF. (It may be a + modifier for the 0x200 bit?) + + MS Switch No. + 43210 Memory address + -------------------------------- + 00001 0xE100 (guessed - was not detected by QEMM) + 00011 0xE000 (guessed - was not detected by QEMM) + 00101 0xDD00 + 00111 0xDC00 + 01001 0xD900 + 01011 0xD800 + 01101 0xD500 + 01111 0xD400 + 10001 0xD100 + 10011 0xD000 + 10101 0xCD00 + 10111 0xCC00 + 11001 0xC900 (guessed - crashes tested system) + 11011 0xC800 (guessed - crashes tested system) + 11101 0xC500 (guessed - crashes tested system) + 11111 0xC400 (guessed - crashes tested system) + + +***************************************************************************** + +** CNet Technology Inc. ** +120 Series (8-bit cards) +------------------------ + - from Juergen Seifert <seifert@htwm.de> + + +CNET TECHNOLOGY INC. (CNet) ARCNET 120A SERIES +============================================== + +This description has been written by Juergen Seifert <seifert@htwm.de> +using information from the following Original CNet Manual + + "ARCNET + USER'S MANUAL + for + CN120A + CN120AB + CN120TP + CN120ST + CN120SBT + P/N:12-01-0007 + Revision 3.00" + +ARCNET is a registered trademark of the Datapoint Corporation + +P/N 120A ARCNET 8 bit XT/AT Star +P/N 120AB ARCNET 8 bit XT/AT Bus +P/N 120TP ARCNET 8 bit XT/AT Twisted Pair +P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair +P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair + + __________________________________________________________________ + | | + | ___| + | LED |___| + | ___| + | N | | ID7 + | o | | ID6 + | d | S | ID5 + | e | W | ID4 + | ___________________ A | 2 | ID3 + | | | d | | ID2 + | | | 1 2 3 4 5 6 7 8 d | | ID1 + | | | _________________ r |___| ID0 + | | 90C65 || SW1 | ____| + | JP 8 7 | ||_________________| | | + | |o|o| JP1 | | | J2 | + | |o|o| |oo| | | JP 1 1 1 | | + | ______________ | | 0 1 2 |____| + | | PROM | |___________________| |o|o|o| _____| + | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 | + | |______________| |o|o|o|o|o| |o|o|o| |_____| + |_____ |o|o|o|o|o| ______________| + | | + |_____________________________________________| + +Legend: + +90C65 ARCNET Probe +S1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select +S2 1-8: Node ID Select (ID0-ID7) +JP1 ROM Enable Select +JP2 IRQ2 +JP3 IRQ3 +JP4 IRQ4 +JP5 IRQ5 +JP6 IRQ7 +JP7/JP8 ET1, ET2 Timeout Parameters +JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only) +JP12 Terminator Select (CN120AB/ST/SBT only) +J1 BNC RG62/U Connector (all except CN120TP) +J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only) + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +------------------- + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must be different from 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +---------------------------- + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 8K or memory base + 0x2000. +Switches 1-5 of switch block SW1 select the Memory Base address. + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + +*) To enable the Boot ROM install the jumper JP1 + +Note: Since the switches 1 and 2 are always set to ON it may be possible + that they can be used to add an offset of 2K, 4K or 6K to the base + address, but this feature is not documented in the manual and I + haven't tested it yet. + + +Setting the Interrupt Line +-------------------------- + +To select a hardware interrupt level install one (only one!) of the jumpers +JP2, JP3, JP4, JP5, JP6. JP2 is the default. + + Jumper | IRQ + -------|----- + 2 | 2 + 3 | 3 + 4 | 4 + 5 | 5 + 6 | 7 + + +Setting the Internal Terminator on CN120AB/TP/SBT +-------------------------------------------------- + +The jumper JP12 is used to enable the internal terminator. + + ----- + 0 | 0 | + ----- ON | | ON + | 0 | | 0 | + | | OFF ----- OFF + | 0 | 0 + ----- + Terminator Terminator + disabled enabled + + +Selecting the Connector Type on CN120ST/SBT +------------------------------------------- + + JP10 JP11 JP10 JP11 + ----- ----- + 0 0 | 0 | | 0 | + ----- ----- | | | | + | 0 | | 0 | | 0 | | 0 | + | | | | ----- ----- + | 0 | | 0 | 0 0 + ----- ----- + Coaxial Cable Twisted Pair Cable + (Default) + + +Setting the Timeout Parameters +------------------------------ + +The jumpers labeled EXT1 and EXT2 are used to determine the timeout +parameters. These two jumpers are normally left open. + + + +***************************************************************************** + +** CNet Technology Inc. ** +160 Series (16-bit cards) +------------------------- + - from Juergen Seifert <seifert@htwm.de> + +CNET TECHNOLOGY INC. (CNet) ARCNET 160A SERIES +============================================== + +This description has been written by Juergen Seifert <seifert@htwm.de> +using information from the following Original CNet Manual + + "ARCNET + USER'S MANUAL + for + CN160A + CN160AB + CN160TP + P/N:12-01-0006 + Revision 3.00" + +ARCNET is a registered trademark of the Datapoint Corporation + +P/N 160A ARCNET 16 bit XT/AT Star +P/N 160AB ARCNET 16 bit XT/AT Bus +P/N 160TP ARCNET 16 bit XT/AT Twisted Pair + + ___________________________________________________________________ + < _________________________ ___| + > |oo| JP2 | | LED |___| + < |oo| JP1 | 9026 | LED |___| + > |_________________________| ___| + < N | | ID7 + > 1 o | | ID6 + < 1 2 3 4 5 6 7 8 9 0 d | S | ID5 + > _______________ _____________________ e | W | ID4 + < | PROM | | SW1 | A | 2 | ID3 + > > SOCKET | |_____________________| d | | ID2 + < |_______________| | IO-Base | MEM | d | | ID1 + > r |___| ID0 + < ____| + > | | + < | J1 | + > | | + < |____| + > 1 1 1 1 | + < 3 4 5 6 7 JP 8 9 0 1 2 3 | + > |o|o|o|o|o| |o|o|o|o|o|o| | + < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________| + > | | | + <____________| |_______________________________________| + +Legend: + +9026 ARCNET Probe +SW1 1-6: Base I/O Address Select + 7-10: Base Memory Address Select +SW2 1-8: Node ID Select (ID0-ID7) +JP1/JP2 ET1, ET2 Timeout Parameters +JP3-JP13 Interrupt Select +J1 BNC RG62/U Connector (CN160A/AB only) +J1 Two 6-position Telephone Jack (CN160TP only) +LED + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +------------------- + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must be different from 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +---------------------------- + +The first six switches in switch block SW1 are used to select the I/O Base +address using the following table: + + Switch | Hex I/O + 1 2 3 4 5 6 | Address + ------------------------|-------- + OFF ON ON OFF OFF ON | 260 + OFF ON OFF ON ON OFF | 290 + OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default) + OFF ON OFF OFF OFF OFF | 2F0 + OFF OFF ON ON ON ON | 300 + OFF OFF ON OFF ON OFF | 350 + OFF OFF OFF ON ON ON | 380 + OFF OFF OFF OFF OFF ON | 3E0 + +Note: Other IO-Base addresses seem to be selectable, but only the above + combinations are documented. + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The switches 7-10 of switch block SW1 are used to select the Memory +Base address of the RAM (2K) and the PROM. + + Switch | Hex RAM | Hex ROM + 7 8 9 10 | Address | Address + ----------------|---------|----------- + OFF OFF ON ON | C0000 | C8000 + OFF OFF ON OFF | D0000 | D8000 (Default) + OFF OFF OFF ON | E0000 | E8000 + +Note: Other MEM-Base addresses seem to be selectable, but only the above + combinations are documented. + + +Setting the Interrupt Line +-------------------------- + +To select a hardware interrupt level install one (only one!) of the jumpers +JP3 through JP13 using the following table: + + Jumper | IRQ + -------|----------------- + 3 | 14 + 4 | 15 + 5 | 12 + 6 | 11 + 7 | 10 + 8 | 3 + 9 | 4 + 10 | 5 + 11 | 6 + 12 | 7 + 13 | 2 (=9) Default! + +Note: - Do not use JP11=IRQ6, it may conflict with your Floppy Disk + Controller + - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL- + Hard Disk, it may conflict with their controllers + + +Setting the Timeout Parameters +------------------------------ + +The jumpers labeled JP1 and JP2 are used to determine the timeout +parameters. These two jumpers are normally left open. + + +***************************************************************************** + +** Lantech ** +8-bit card, unknown model +------------------------- + - from Vlad Lungu <vlungu@ugal.ro> - his e-mail address seemed broken at + the time I tried to reach him. Sorry Vlad, if you didn't get my reply. + + ________________________________________________________________ + | 1 8 | + | ___________ __| + | | SW1 | LED |__| + | |__________| | + | ___| + | _____________________ |S | 8 + | | | |W | + | | | |2 | + | | | |__| 1 + | | UM9065L | |o| JP4 ____|____ + | | | |o| | CN | + | | | |________| + | | | | + | |___________________| | + | | + | | + | _____________ | + | | | | + | | PROM | |ooooo| JP6 | + | |____________| |ooooo| | + |_____________ _ _| + |____________________________________________| |__| + + +UM9065L : ARCnet Controller + +SW 1 : Shared Memory Address and I/O Base + + ON=0 + + 12345|Memory Address + -----|-------------- + 00001| D4000 + 00010| CC000 + 00110| D0000 + 01110| D1000 + 01101| D9000 + 10010| CC800 + 10011| DC800 + 11110| D1800 + +It seems that the bits are considered in reverse order. Also, you must +observe that some of those addresses are unusual and I didn't probe them; I +used a memory dump in DOS to identify them. For the 00000 configuration and +some others that I didn't write here the card seems to conflict with the +video card (an S3 GENDAC). I leave the full decoding of those addresses to +you. + + 678| I/O Address + ---|------------ + 000| 260 + 001| failed probe + 010| 2E0 + 011| 380 + 100| 290 + 101| 350 + 110| failed probe + 111| 3E0 + +SW 2 : Node ID (binary coded) + +JP 4 : Boot PROM enable CLOSE - enabled + OPEN - disabled + +JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6) + + +***************************************************************************** + +** Acer ** +8-bit card, Model 5210-003 +-------------------------- + - from Vojtech Pavlik <vojtech@suse.cz> using portions of the existing + arcnet-hardware file. + +This is a 90C26 based card. Its configuration seems similar to the SMC +PC100, but has some additional jumpers I don't know the meaning of. + + __ + | | + ___________|__|_________________________ + | | | | + | | BNC | | + | |______| ___| + | _____________________ |___ + | | | | + | | Hybrid IC | | + | | | o|o J1 | + | |_____________________| 8|8 | + | 8|8 J5 | + | o|o | + | 8|8 | + |__ 8|8 | + (|__| LED o|o | + | 8|8 | + | 8|8 J15 | + | | + | _____ | + | | | _____ | + | | | | | ___| + | | | | | | + | _____ | ROM | | UFS | | + | | | | | | | | + | | | ___ | | | | | + | | | | | |__.__| |__.__| | + | | NCR | |XTL| _____ _____ | + | | | |___| | | | | | + | |90C26| | | | | | + | | | | RAM | | UFS | | + | | | J17 o|o | | | | | + | | | J16 o|o | | | | | + | |__.__| |__.__| |__.__| | + | ___ | + | | |8 | + | |SW2| | + | | | | + | |___|1 | + | ___ | + | | |10 J18 o|o | + | | | o|o | + | |SW1| o|o | + | | | J21 o|o | + | |___|1 | + | | + |____________________________________| + + +Legend: + +90C26 ARCNET Chip +XTL 20 MHz Crystal +SW1 1-6 Base I/O Address Select + 7-10 Memory Address Select +SW2 1-8 Node ID Select (ID0-ID7) +J1-J5 IRQ Select +J6-J21 Unknown (Probably extra timeouts & ROM enable ...) +LED1 Activity LED +BNC Coax connector (STAR ARCnet) +RAM 2k of SRAM +ROM Boot ROM socket +UFS Unidentified Flying Sockets + + +Setting the Node ID +------------------- + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +Setting one of the switches to OFF means "1", ON means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Don't set this to 0 or 255; these values are reserved. + + +Setting the I/O Base Address +---------------------------- + +The switches 1 to 6 of switch block SW1 are used to select one +of 32 possible I/O Base addresses using the following tables + + | Hex + Switch | Value + -------|------- + 1 | 200 + 2 | 100 + 3 | 80 + 4 | 40 + 5 | 20 + 6 | 10 + +The I/O address is sum of all switches set to "1". Remember that +the I/O address space bellow 0x200 is RESERVED for mainboard, so +switch 1 should be ALWAYS SET TO OFF. + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of sixteen positions. However, the addresses below +A0000 are likely to cause system hang because there's main RAM. + +Jumpers 7-10 of switch block SW1 select the Memory Base address. + + Switch | Hex RAM + 7 8 9 10 | Address + ----------------|--------- + OFF OFF OFF OFF | F0000 (conflicts with main BIOS) + OFF OFF OFF ON | E0000 + OFF OFF ON OFF | D0000 + OFF OFF ON ON | C0000 (conflicts with video BIOS) + OFF ON OFF OFF | B0000 (conflicts with mono video) + OFF ON OFF ON | A0000 (conflicts with graphics) + + +Setting the Interrupt Line +-------------------------- + +Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means +shorted, OFF means open. + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 7 + OFF ON OFF OFF OFF | 5 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 3 + OFF OFF OFF OFF ON | 2 + + +Unknown jumpers & sockets +------------------------- + +I know nothing about these. I just guess that J16&J17 are timeout +jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and +J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't +guess the purpose. + + +***************************************************************************** + +** Datapoint? ** +LAN-ARC-8, an 8-bit card +------------------------ + - from Vojtech Pavlik <vojtech@suse.cz> + +This is another SMC 90C65-based ARCnet card. I couldn't identify the +manufacturer, but it might be DataPoint, because the card has the +original arcNet logo in its upper right corner. + + _______________________________________________________ + | _________ | + | | SW2 | ON arcNet | + | |_________| OFF ___| + | _____________ 1 ______ 8 | | 8 + | | | SW1 | XTAL | ____________ | S | + | > RAM (2k) | |______|| | | W | + | |_____________| | H | | 3 | + | _________|_____ y | |___| 1 + | _________ | | |b | | + | |_________| | | |r | | + | | SMC | |i | | + | | 90C65| |d | | + | _________ | | | | | + | | SW1 | ON | | |I | | + | |_________| OFF |_________|_____/C | _____| + | 1 8 | | | |___ + | ______________ | | | BNC |___| + | | | |____________| |_____| + | > EPROM SOCKET | _____________ | + | |______________| |_____________| | + | ______________| + | | + |________________________________________| + +Legend: + +90C65 ARCNET Chip +SW1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select +SW2 1-8: Node ID Select +SW3 1-5: IRQ Select + 6-7: Extra Timeout + 8 : ROM Enable +BNC Coax connector +XTAL 20 MHz Crystal + + +Setting the Node ID +------------------- + +The eight switches in SW3 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + + +Setting the I/O Base Address +---------------------------- + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. +Jumpers 3-5 of switch block SW1 select the Memory Base address. + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + +*) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON. + +The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address. + + +Setting the Interrupt Line +-------------------------- + +Switches 1-5 of the switch block SW3 control the IRQ level. + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 3 + OFF ON OFF OFF OFF | 4 + OFF OFF ON OFF OFF | 5 + OFF OFF OFF ON OFF | 7 + OFF OFF OFF OFF ON | 2 + + +Setting the Timeout Parameters +------------------------------ + +The switches 6-7 of the switch block SW3 are used to determine the timeout +parameters. These two switches are normally left in the OFF position. + + +***************************************************************************** + +** Topware ** +8-bit card, TA-ARC/10 +------------------------- + - from Vojtech Pavlik <vojtech@suse.cz> + +This is another very similar 90C65 card. Most of the switches and jumpers +are the same as on other clones. + + _____________________________________________________________________ +| ___________ | | ______ | +| |SW2 NODE ID| | | | XTAL | | +| |___________| | Hybrid IC | |______| | +| ___________ | | __| +| |SW1 MEM+I/O| |_________________________| LED1|__|) +| |___________| 1 2 | +| J3 |o|o| TIMEOUT ______| +| ______________ |o|o| | | +| | | ___________________ | RJ | +| > EPROM SOCKET | | \ |------| +|J2 |______________| | | | | +||o| | | |______| +||o| ROM ENABLE | SMC | _________ | +| _____________ | 90C65 | |_________| _____| +| | | | | | |___ +| > RAM (2k) | | | | BNC |___| +| |_____________| | | |_____| +| |____________________| | +| ________ IRQ 2 3 4 5 7 ___________ | +||________| |o|o|o|o|o| |___________| | +|________ J1|o|o|o|o|o| ______________| + | | + |_____________________________________________| + +Legend: + +90C65 ARCNET Chip +XTAL 20 MHz Crystal +SW1 1-5 Base Memory Address Select + 6-8 Base I/O Address Select +SW2 1-8 Node ID Select (ID0-ID7) +J1 IRQ Select +J2 ROM Enable +J3 Extra Timeout +LED1 Activity LED +BNC Coax connector (BUS ARCnet) +RJ Twisted Pair Connector (daisy chain) + + +Setting the Node ID +------------------- + +The eight switches in SW2 are used to set the node ID. Each node attached to +the network must have an unique node ID which must not be 0. Switch 1 (ID0) +serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Setting the I/O Base Address +---------------------------- + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table: + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 (Manufacturer's default) + OFF ON ON | 290 + ON OFF ON | 2E0 + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. +Jumpers 3-5 of switch block SW1 select the Memory Base address. + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default) + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + +*) To enable the Boot ROM short the jumper J2. + +The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address. + + +Setting the Interrupt Line +-------------------------- + +Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means +shorted, OFF means open. + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 2 + OFF ON OFF OFF OFF | 3 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 5 + OFF OFF OFF OFF ON | 7 + + +Setting the Timeout Parameters +------------------------------ + +The jumpers J3 are used to set the timeout parameters. These two +jumpers are normally left open. + + +***************************************************************************** + +** Thomas-Conrad ** +Model #500-6242-0097 REV A (8-bit card) +--------------------------------------- + - from Lars Karlsson <100617.3473@compuserve.com> + + ________________________________________________________ + | ________ ________ |_____ + | |........| |........| | + | |________| |________| ___| + | SW 3 SW 1 | | + | Base I/O Base Addr. Station | | + | address | | + | ______ switch | | + | | | | | + | | | |___| + | | | ______ |___._ + | |______| |______| ____| BNC + | Jumper- _____| Connector + | Main chip block _ __| ' + | | | | RJ Connector + | |_| | with 110 Ohm + | |__ Terminator + | ___________ __| + | |...........| | RJ-jack + | |...........| _____ | (unused) + | |___________| |_____| |__ + | Boot PROM socket IRQ-jumpers |_ Diagnostic + |________ __ _| LED (red) + | | | | | | | | | | | | | | | | | | | | | | + | | | | | | | | | | | | | | | | | | | | |________| + | + | + +And here are the settings for some of the switches and jumpers on the cards. + + + I/O + + 1 2 3 4 5 6 7 8 + +2E0----- 0 0 0 1 0 0 0 1 +2F0----- 0 0 0 1 0 0 0 0 +300----- 0 0 0 0 1 1 1 1 +350----- 0 0 0 0 1 1 1 0 + +"0" in the above example means switch is off "1" means that it is on. + + + ShMem address. + + 1 2 3 4 5 6 7 8 + +CX00--0 0 1 1 | | | +DX00--0 0 1 0 | +X000--------- 1 1 | +X400--------- 1 0 | +X800--------- 0 1 | +XC00--------- 0 0 +ENHANCED----------- 1 +COMPATIBLE--------- 0 + + + IRQ + + + 3 4 5 7 2 + . . . . . + . . . . . + + +There is a DIP-switch with 8 switches, used to set the shared memory address +to be used. The first 6 switches set the address, the 7th doesn't have any +function, and the 8th switch is used to select "compatible" or "enhanced". +When I got my two cards, one of them had this switch set to "enhanced". That +card didn't work at all, it wasn't even recognized by the driver. The other +card had this switch set to "compatible" and it behaved absolutely normally. I +guess that the switch on one of the cards, must have been changed accidentally +when the card was taken out of its former host. The question remains +unanswered, what is the purpose of the "enhanced" position? + +[Avery's note: "enhanced" probably either disables shared memory (use IO +ports instead) or disables IO ports (use memory addresses instead). This +varies by the type of card involved. I fail to see how either of these +enhance anything. Send me more detailed information about this mode, or +just use "compatible" mode instead.] + + +***************************************************************************** + +** Waterloo Microsystems Inc. ?? ** +8-bit card (C) 1985 +------------------- + - from Robert Michael Best <rmb117@cs.usask.ca> + +[Avery's note: these don't work with my driver for some reason. These cards +SEEM to have settings similar to the PDI508Plus, which is +software-configured and doesn't work with my driver either. The "Waterloo +chip" is a boot PROM, probably designed specifically for the University of +Waterloo. If you have any further information about this card, please +e-mail me.] + +The probe has not been able to detect the card on any of the J2 settings, +and I tried them again with the "Waterloo" chip removed. + + _____________________________________________________________________ +| \/ \/ ___ __ __ | +| C4 C4 |^| | M || ^ ||^| | +| -- -- |_| | 5 || || | C3 | +| \/ \/ C10 |___|| ||_| | +| C4 C4 _ _ | | ?? | +| -- -- | \/ || | | +| | || | | +| | || C1 | | +| | || | \/ _____| +| | C6 || | C9 | |___ +| | || | -- | BNC |___| +| | || | >C7| |_____| +| | || | | +| __ __ |____||_____| 1 2 3 6 | +|| ^ | >C4| |o|o|o|o|o|o| J2 >C4| | +|| | |o|o|o|o|o|o| | +|| C2 | >C4| >C4| | +|| | >C8| | +|| | 2 3 4 5 6 7 IRQ >C4| | +||_____| |o|o|o|o|o|o| J3 | +|_______ |o|o|o|o|o|o| _______________| + | | + |_____________________________________________| + +C1 -- "COM9026 + SMC 8638" + In a chip socket. + +C2 -- "@Copyright + Waterloo Microsystems Inc. + 1985" + In a chip Socket with info printed on a label covering a round window + showing the circuit inside. (The window indicates it is an EPROM chip.) + +C3 -- "COM9032 + SMC 8643" + In a chip socket. + +C4 -- "74LS" + 9 total no sockets. + +M5 -- "50006-136 + 20.000000 MHZ + MTQ-T1-S3 + 0 M-TRON 86-40" + Metallic case with 4 pins, no socket. + +C6 -- "MOSTEK@TC8643 + MK6116N-20 + MALAYSIA" + No socket. + +C7 -- No stamp or label but in a 20 pin chip socket. + +C8 -- "PAL10L8CN + 8623" + In a 20 pin socket. + +C9 -- "PAl16R4A-2CN + 8641" + In a 20 pin socket. + +C10 -- "M8640 + NMC + 9306N" + In an 8 pin socket. + +?? -- Some components on a smaller board and attached with 20 pins all + along the side closest to the BNC connector. The are coated in a dark + resin. + +On the board there are two jumper banks labeled J2 and J3. The +manufacturer didn't put a J1 on the board. The two boards I have both +came with a jumper box for each bank. + +J2 -- Numbered 1 2 3 4 5 6. + 4 and 5 are not stamped due to solder points. + +J3 -- IRQ 2 3 4 5 6 7 + +The board itself has a maple leaf stamped just above the irq jumpers +and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986 +CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector. +Below that "MADE IN CANADA" + + +***************************************************************************** + +** No Name ** +8-bit cards, 16-bit cards +------------------------- + - from Juergen Seifert <seifert@htwm.de> + +NONAME 8-BIT ARCNET +=================== + +I have named this ARCnet card "NONAME", since there is no name of any +manufacturer on the Installation manual nor on the shipping box. The only +hint to the existence of a manufacturer at all is written in copper, +it is "Made in Taiwan" + +This description has been written by Juergen Seifert <seifert@htwm.de> +using information from the Original + "ARCnet Installation Manual" + + + ________________________________________________________________ + | |STAR| BUS| T/P| | + | |____|____|____| | + | _____________________ | + | | | | + | | | | + | | | | + | | SMC | | + | | | | + | | COM90C65 | | + | | | | + | | | | + | |__________-__________| | + | _____| + | _______________ | CN | + | | PROM | |_____| + | > SOCKET | | + | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 | + | _______________ _______________ | + | |o|o|o|o|o|o|o|o| | SW1 || SW2 || + | |o|o|o|o|o|o|o|o| |_______________||_______________|| + |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____| + | \ IRQ / T T O | + |__________________1_2_M______________________| + +Legend: + +COM90C65: ARCnet Probe +S1 1-8: Node ID Select +S2 1-3: I/O Base Address Select + 4-6: Memory Base Address Select + 7-8: RAM Offset Select +ET1, ET2 Extended Timeout Select +ROM ROM Enable Select +CN RG62 Coax Connector +STAR| BUS | T/P Three fields for placing a sign (colored circle) + indicating the topology of the card + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +------------------- + +The eight switches in group SW1 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 8 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Value + -------|------- + 8 | 1 + 7 | 2 + 6 | 4 + 5 | 8 + 4 | 16 + 3 | 32 + 2 | 64 + 1 | 128 + +Some Examples: + + Switch | Hex | Decimal + 1 2 3 4 5 6 7 8 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +---------------------------- + +The first three switches in switch group SW2 are used to select one +of eight possible I/O Base addresses using the following table + + Switch | Hex I/O + 1 2 3 | Address + ------------|-------- + ON ON ON | 260 + ON ON OFF | 290 + ON OFF ON | 2E0 (Manufacturer's default) + ON OFF OFF | 2F0 + OFF ON ON | 300 + OFF ON OFF | 350 + OFF OFF ON | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 4-6 of switch group SW2 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 7 and 8 of group SW2. + + Switch | Hex RAM | Hex ROM + 4 5 6 7 8 | Address | Address *) + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + +*) To enable the 8K Boot PROM install the jumper ROM. + The default is jumper ROM not installed. + + +Setting Interrupt Request Lines (IRQ) +------------------------------------- + +To select a hardware interrupt level set one (only one!) of the jumpers +IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2. + + +Setting the Timeouts +-------------------- + +The two jumpers labeled ET1 and ET2 are used to determine the timeout +parameters (response and reconfiguration time). Every node in a network +must be set to the same timeout values. + + ET1 ET2 | Response Time (us) | Reconfiguration Time (ms) + --------|--------------------|-------------------------- + Off Off | 78 | 840 (Default) + Off On | 285 | 1680 + On Off | 563 | 1680 + On On | 1130 | 1680 + +On means jumper installed, Off means jumper not installed + + +NONAME 16-BIT ARCNET +==================== + +The manual of my 8-Bit NONAME ARCnet Card contains another description +of a 16-Bit Coax / Twisted Pair Card. This description is incomplete, +because there are missing two pages in the manual booklet. (The table +of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside +the booklet there is a different way of counting ... 2-9, 2-10, A-1, +(empty page), 3-1, ..., 3-18, A-1 (again), A-2) +Also the picture of the board layout is not as good as the picture of +8-Bit card, because there isn't any letter like "SW1" written to the +picture. +Should somebody have such a board, please feel free to complete this +description or to send a mail to me! + +This description has been written by Juergen Seifert <seifert@htwm.de> +using information from the Original + "ARCnet Installation Manual" + + + ___________________________________________________________________ + < _________________ _________________ | + > | SW? || SW? | | + < |_________________||_________________| | + > ____________________ | + < | | | + > | | | + < | | | + > | | | + < | | | + > | | | + < | | | + > |____________________| | + < ____| + > ____________________ | | + < | | | J1 | + > | < | | + < |____________________| ? ? ? ? ? ? |____| + > |o|o|o|o|o|o| | + < |o|o|o|o|o|o| | + > | + < __ ___________| + > | | | + <____________| |_______________________________________| + + +Setting one of the switches to Off means "1", On means "0". + + +Setting the Node ID +------------------- + +The eight switches in group SW2 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 8 serves as the least significant bit (LSB). + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Value + -------|------- + 8 | 1 + 7 | 2 + 6 | 4 + 5 | 8 + 4 | 16 + 3 | 32 + 2 | 64 + 1 | 128 + +Some Examples: + + Switch | Hex | Decimal + 1 2 3 4 5 6 7 8 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +---------------------------- + +The first three switches in switch group SW1 are used to select one +of eight possible I/O Base addresses using the following table + + Switch | Hex I/O + 3 2 1 | Address + ------------|-------- + ON ON ON | 260 + ON ON OFF | 290 + ON OFF ON | 2E0 (Manufacturer's default) + ON OFF OFF | 2F0 + OFF ON ON | 300 + OFF ON OFF | 350 + OFF OFF ON | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 6-8 of switch group SW1 select the Base of the 16K block. +Within that 16K address space, the buffer may be assigned any one of four +positions, determined by the offset, switches 4 and 5 of group SW1. + + Switch | Hex RAM | Hex ROM + 8 7 6 5 4 | Address | Address + -----------|---------|----------- + 0 0 0 0 0 | C0000 | C2000 + 0 0 0 0 1 | C0800 | C2000 + 0 0 0 1 0 | C1000 | C2000 + 0 0 0 1 1 | C1800 | C2000 + | | + 0 0 1 0 0 | C4000 | C6000 + 0 0 1 0 1 | C4800 | C6000 + 0 0 1 1 0 | C5000 | C6000 + 0 0 1 1 1 | C5800 | C6000 + | | + 0 1 0 0 0 | CC000 | CE000 + 0 1 0 0 1 | CC800 | CE000 + 0 1 0 1 0 | CD000 | CE000 + 0 1 0 1 1 | CD800 | CE000 + | | + 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default) + 0 1 1 0 1 | D0800 | D2000 + 0 1 1 1 0 | D1000 | D2000 + 0 1 1 1 1 | D1800 | D2000 + | | + 1 0 0 0 0 | D4000 | D6000 + 1 0 0 0 1 | D4800 | D6000 + 1 0 0 1 0 | D5000 | D6000 + 1 0 0 1 1 | D5800 | D6000 + | | + 1 0 1 0 0 | D8000 | DA000 + 1 0 1 0 1 | D8800 | DA000 + 1 0 1 1 0 | D9000 | DA000 + 1 0 1 1 1 | D9800 | DA000 + | | + 1 1 0 0 0 | DC000 | DE000 + 1 1 0 0 1 | DC800 | DE000 + 1 1 0 1 0 | DD000 | DE000 + 1 1 0 1 1 | DD800 | DE000 + | | + 1 1 1 0 0 | E0000 | E2000 + 1 1 1 0 1 | E0800 | E2000 + 1 1 1 1 0 | E1000 | E2000 + 1 1 1 1 1 | E1800 | E2000 + + +Setting Interrupt Request Lines (IRQ) +------------------------------------- + +?????????????????????????????????????? + + +Setting the Timeouts +-------------------- + +?????????????????????????????????????? + + +***************************************************************************** + +** No Name ** +8-bit cards ("Made in Taiwan R.O.C.") +----------- + - from Vojtech Pavlik <vojtech@suse.cz> + +I have named this ARCnet card "NONAME", since I got only the card with +no manual at all and the only text identifying the manufacturer is +"MADE IN TAIWAN R.O.C" printed on the card. + + ____________________________________________________________ + | 1 2 3 4 5 6 7 8 | + | |o|o| JP1 o|o|o|o|o|o|o|o| ON | + | + o|o|o|o|o|o|o|o| ___| + | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7 + | | | SW1 | | | | ID6 + | > RAM (2k) | ____________________ | H | | S | ID5 + | |_____________| | || y | | W | ID4 + | | || b | | 2 | ID3 + | | || r | | | ID2 + | | || i | | | ID1 + | | 90C65 || d | |___| ID0 + | SW3 | || | | + | |o|o|o|o|o|o|o|o| ON | || I | | + | |o|o|o|o|o|o|o|o| | || C | | + | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____| + | 1 2 3 4 5 6 7 8 | | | |___ + | ______________ | | | BNC |___| + | | | |_____| |_____| + | > EPROM SOCKET | | + | |______________| | + | ______________| + | | + |_____________________________________________| + +Legend: + +90C65 ARCNET Chip +SW1 1-5: Base Memory Address Select + 6-8: Base I/O Address Select +SW2 1-8: Node ID Select (ID0-ID7) +SW3 1-5: IRQ Select + 6-7: Extra Timeout + 8 : ROM Enable +JP1 Led connector +BNC Coax connector + +Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not +switches. + +Setting the jumpers to ON means connecting the upper two pins, off the bottom +two - or - in case of IRQ setting, connecting none of them at all. + +Setting the Node ID +------------------- + +The eight switches in SW2 are used to set the node ID. Each node attached +to the network must have an unique node ID which must not be 0. +Switch 1 (ID0) serves as the least significant bit (LSB). + +Setting one of the switches to Off means "1", On means "0". + +The node ID is the sum of the values of all switches set to "1" +These values are: + + Switch | Label | Value + -------|-------|------- + 1 | ID0 | 1 + 2 | ID1 | 2 + 3 | ID2 | 4 + 4 | ID3 | 8 + 5 | ID4 | 16 + 6 | ID5 | 32 + 7 | ID6 | 64 + 8 | ID7 | 128 + +Some Examples: + + Switch | Hex | Decimal + 8 7 6 5 4 3 2 1 | Node ID | Node ID + ----------------|---------|--------- + 0 0 0 0 0 0 0 0 | not allowed + 0 0 0 0 0 0 0 1 | 1 | 1 + 0 0 0 0 0 0 1 0 | 2 | 2 + 0 0 0 0 0 0 1 1 | 3 | 3 + . . . | | + 0 1 0 1 0 1 0 1 | 55 | 85 + . . . | | + 1 0 1 0 1 0 1 0 | AA | 170 + . . . | | + 1 1 1 1 1 1 0 1 | FD | 253 + 1 1 1 1 1 1 1 0 | FE | 254 + 1 1 1 1 1 1 1 1 | FF | 255 + + +Setting the I/O Base Address +---------------------------- + +The last three switches in switch block SW1 are used to select one +of eight possible I/O Base addresses using the following table + + + Switch | Hex I/O + 6 7 8 | Address + ------------|-------- + ON ON ON | 260 + OFF ON ON | 290 + ON OFF ON | 2E0 (Manufacturer's default) + OFF OFF ON | 2F0 + ON ON OFF | 300 + OFF ON OFF | 350 + ON OFF OFF | 380 + OFF OFF OFF | 3E0 + + +Setting the Base Memory (RAM) buffer Address +-------------------------------------------- + +The memory buffer (RAM) requires 2K. The base of this buffer can be +located in any of eight positions. The address of the Boot Prom is +memory base + 0x2000. +Jumpers 3-5 of jumper block SW1 select the Memory Base address. + + Switch | Hex RAM | Hex ROM + 1 2 3 4 5 | Address | Address *) + --------------------|---------|----------- + ON ON ON ON ON | C0000 | C2000 + ON ON OFF ON ON | C4000 | C6000 + ON ON ON OFF ON | CC000 | CE000 + ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default) + ON ON ON ON OFF | D4000 | D6000 + ON ON OFF ON OFF | D8000 | DA000 + ON ON ON OFF OFF | DC000 | DE000 + ON ON OFF OFF OFF | E0000 | E2000 + +*) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON. + +The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders. + +Setting the Interrupt Line +-------------------------- + +Jumpers 1-5 of the jumper block SW3 control the IRQ level. + + Jumper | IRQ + 1 2 3 4 5 | + ---------------------------- + ON OFF OFF OFF OFF | 2 + OFF ON OFF OFF OFF | 3 + OFF OFF ON OFF OFF | 4 + OFF OFF OFF ON OFF | 5 + OFF OFF OFF OFF ON | 7 + + +Setting the Timeout Parameters +------------------------------ + +The jumpers 6-7 of the jumper block SW3 are used to determine the timeout +parameters. These two jumpers are normally left in the OFF position. + + +***************************************************************************** + +** No Name ** +(Generic Model 9058) +-------------------- + - from Andrew J. Kroll <ag784@freenet.buffalo.edu> + - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a + year!) + _____ + | < + | .---' + ________________________________________________________________ | | + | | SW2 | | | + | ___________ |_____________| | | + | | | 1 2 3 4 5 6 ___| | + | > 6116 RAM | _________ 8 | | | + | |___________| |20MHzXtal| 7 | | | + | |_________| __________ 6 | S | | + | 74LS373 | |- 5 | W | | + | _________ | E |- 4 | | | + | >_______| ______________|..... P |- 3 | 3 | | + | | | : O |- 2 | | | + | | | : X |- 1 |___| | + | ________________ | | : Y |- | | + | | SW1 | | SL90C65 | : |- | | + | |________________| | | : B |- | | + | 1 2 3 4 5 6 7 8 | | : O |- | | + | |_________o____|..../ A |- _______| | + | ____________________ | R |- | |------, + | | | | D |- | BNC | # | + | > 2764 PROM SOCKET | |__________|- |_______|------' + | |____________________| _________ | | + | >________| <- 74LS245 | | + | | | + |___ ______________| | + |H H H H H H H H H H H H H H H H H H H H H H H| | | + |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | | + \| +Legend: + +SL90C65 ARCNET Controller / Transceiver /Logic +SW1 1-5: IRQ Select + 6: ET1 + 7: ET2 + 8: ROM ENABLE +SW2 1-3: Memory Buffer/PROM Address + 3-6: I/O Address Map +SW3 1-8: Node ID Select +BNC BNC RG62/U Connection + *I* have had success using RG59B/U with *NO* terminators! + What gives?! + +SW1: Timeouts, Interrupt and ROM +--------------------------------- + +To select a hardware interrupt level set one (only one!) of the dip switches +up (on) SW1...(switches 1-5) +IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2. + +The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7) +are used to determine the timeout parameters. These two dip switches +are normally left off (down). + + To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM. + The default is jumper ROM not installed. + + +Setting the I/O Base Address +---------------------------- + +The last three switches in switch group SW2 are used to select one +of eight possible I/O Base addresses using the following table + + + Switch | Hex I/O + 4 5 6 | Address + -------|-------- + 0 0 0 | 260 + 0 0 1 | 290 + 0 1 0 | 2E0 (Manufacturer's default) + 0 1 1 | 2F0 + 1 0 0 | 300 + 1 0 1 | 350 + 1 1 0 | 380 + 1 1 1 | 3E0 + + +Setting the Base Memory Address (RAM & ROM) +------------------------------------------- + +The memory buffer requires 2K of a 16K block of RAM. The base of this +16K block can be located in any of eight positions. +Switches 1-3 of switch group SW2 select the Base of the 16K block. +(0 = DOWN, 1 = UP) +I could, however, only verify two settings... + + Switch| Hex RAM | Hex ROM + 1 2 3 | Address | Address + ------|---------|----------- + 0 0 0 | E0000 | E2000 + 0 0 1 | D0000 | D2000 (Manufacturer's default) + 0 1 0 | ????? | ????? + 0 1 1 | ????? | ????? + 1 0 0 | ????? | ????? + 1 0 1 | ????? | ????? + 1 1 0 | ????? | ????? + 1 1 1 | ????? | ????? + + +Setting the Node ID +------------------- + +The eight switches in group SW3 are used to set the node ID. +Each node attached to the network must have an unique node ID which +must be different from 0. +Switch 1 serves as the least significant bit (LSB). +switches in the DOWN position are OFF (0) and in the UP position are ON (1) + +The node ID is the sum of the values of all switches set to "1" +These values are: + Switch | Value + -------|------- + 1 | 1 + 2 | 2 + 3 | 4 + 4 | 8 + 5 | 16 + 6 | 32 + 7 | 64 + 8 | 128 + +Some Examples: + + Switch# | Hex | Decimal +8 7 6 5 4 3 2 1 | Node ID | Node ID +----------------|---------|--------- +0 0 0 0 0 0 0 0 | not allowed <-. +0 0 0 0 0 0 0 1 | 1 | 1 | +0 0 0 0 0 0 1 0 | 2 | 2 | +0 0 0 0 0 0 1 1 | 3 | 3 | + . . . | | | +0 1 0 1 0 1 0 1 | 55 | 85 | + . . . | | + Don't use 0 or 255! +1 0 1 0 1 0 1 0 | AA | 170 | + . . . | | | +1 1 1 1 1 1 0 1 | FD | 253 | +1 1 1 1 1 1 1 0 | FE | 254 | +1 1 1 1 1 1 1 1 | FF | 255 <-' + + +***************************************************************************** + +** Tiara ** +(model unknown) +------------------------- + - from Christoph Lameter <christoph@lameter.com> + + +Here is information about my card as far as I could figure it out: +----------------------------------------------- tiara +Tiara LanCard of Tiara Computer Systems. + ++----------------------------------------------+ +! ! Transmitter Unit ! ! +! +------------------+ ------- +! MEM Coax Connector +! ROM 7654321 <- I/O ------- +! : : +--------+ ! +! : : ! 90C66LJ! +++ +! : : ! ! !D Switch to set +! : : ! ! !I the Nodenumber +! : : +--------+ !P +! !++ +! 234567 <- IRQ ! ++------------!!!!!!!!!!!!!!!!!!!!!!!!--------+ + !!!!!!!!!!!!!!!!!!!!!!!! + +0 = Jumper Installed +1 = Open + +Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O + +Settings for Memory Location (Top Jumper Line) +456 Address selected +000 C0000 +001 C4000 +010 CC000 +011 D0000 +100 D4000 +101 D8000 +110 DC000 +111 E0000 + +Settings for I/O Address (Top Jumper Line) +123 Port +000 260 +001 290 +010 2E0 +011 2F0 +100 300 +101 350 +110 380 +111 3E0 + +Settings for IRQ Selection (Lower Jumper Line) +234567 +011111 IRQ 2 +101111 IRQ 3 +110111 IRQ 4 +111011 IRQ 5 +111110 IRQ 7 + +***************************************************************************** + + +Other Cards +----------- + +I have no information on other models of ARCnet cards at the moment. Please +send any and all info to: + apenwarr@worldvisions.ca + +Thanks. diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt new file mode 100644 index 000000000..aff97f47c --- /dev/null +++ b/Documentation/networking/arcnet.txt @@ -0,0 +1,556 @@ +---------------------------------------------------------------------------- +NOTE: See also arcnet-hardware.txt in this directory for jumper-setting +and cabling information if you're like many of us and didn't happen to get a +manual with your ARCnet card. +---------------------------------------------------------------------------- + +Since no one seems to listen to me otherwise, perhaps a poem will get your +attention: + This driver's getting fat and beefy, + But my cat is still named Fifi. + +Hmm, I think I'm allowed to call that a poem, even though it's only two +lines. Hey, I'm in Computer Science, not English. Give me a break. + +The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if +you test this and get it working. Or if you don't. Or anything. + +ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was +nice, but after that even FEWER people started writing to me because they +didn't even have to install the patch. <sigh> + +Come on, be a sport! Send me a success report! + +(hey, that was even better than my original poem... this is getting bad!) + + +-------- +WARNING: +-------- + +If you don't e-mail me about your success/failure soon, I may be forced to +start SINGING. And we don't want that, do we? + +(You know, it might be argued that I'm pushing this point a little too much. +If you think so, why not flame me in a quick little e-mail? Please also +include the type of card(s) you're using, software, size of network, and +whether it's working or not.) + +My e-mail address is: apenwarr@worldvisions.ca + + +--------------------------------------------------------------------------- + + +These are the ARCnet drivers for Linux. + + +This new release (2.91) has been put together by David Woodhouse +<dwmw2@infradead.org>, in an attempt to tidy up the driver after adding support +for yet another chipset. Now the generic support has been separated from the +individual chipset drivers, and the source files aren't quite so packed with +#ifdefs! I've changed this file a bit, but kept it in the first person from +Avery, because I didn't want to completely rewrite it. + +The previous release resulted from many months of on-and-off effort from me +(Avery Pennarun), many bug reports/fixes and suggestions from others, and in +particular a lot of input and coding from Tomasz Motylewski. Starting with +ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been +included and seems to be working fine! + + +Where do I discuss these drivers? +--------------------------------- + +Tomasz has been so kind as to set up a new and improved mailing list. +Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR +REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the +list, mail to linux-arcnet@tichy.ch.uj.edu.pl. + +There are archives of the mailing list at: + http://epistolary.org/mailman/listinfo.cgi/arcnet + +The people on linux-net@vger.kernel.org (now defunct, replaced by +netdev@vger.kernel.org) have also been known to be very helpful, especially +when we're talking about ALPHA Linux kernels that may or may not work right +in the first place. + + +Other Drivers and Info +---------------------- + +You can try my ARCNET page on the World Wide Web at: + http://www.qis.net/~jschmitz/arcnet/ + +Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you +might be interested in, which includes several drivers for various cards +including ARCnet. Try: + http://www.smc.com/ + +Performance Technologies makes various network software that supports +ARCnet: + http://www.perftech.com/ or ftp to ftp.perftech.com. + +Novell makes a networking stack for DOS which includes ARCnet drivers. Try +FTPing to ftp.novell.com. + +You can get the Crynwr packet driver collection (including arcether.com, the +one you'll want to use with ARCnet cards) from +oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+ +without patches, though, and also doesn't like several cards. Fixed +versions are available on my WWW page, or via e-mail if you don't have WWW +access. + + +Installing the Driver +--------------------- + +All you will need to do in order to install the driver is: + make config + (be sure to choose ARCnet in the network devices + and at least one chipset driver.) + make clean + make zImage + +If you obtained this ARCnet package as an upgrade to the ARCnet driver in +your current kernel, you will need to first copy arcnet.c over the one in +the linux/drivers/net directory. + +You will know the driver is installed properly if you get some ARCnet +messages when you reboot into the new Linux kernel. + +There are four chipset options: + + 1. Standard ARCnet COM90xx chipset. + +This is the normal ARCnet card, which you've probably got. This is the only +chipset driver which will autoprobe if not told where the card is. +It following options on the command line: + com90xx=[<io>[,<irq>[,<shmem>]]][,<name>] | <name> + +If you load the chipset support as a module, the options are: + io=<io> irq=<irq> shmem=<shmem> device=<name> + +To disable the autoprobe, just specify "com90xx=" on the kernel command line. +To specify the name alone, but allow autoprobe, just put "com90xx=<name>" + + 2. ARCnet COM20020 chipset. + +This is the new chipset from SMC with support for promiscuous mode (packet +sniffing), extra diagnostic information, etc. Unfortunately, there is no +sensible method of autoprobing for these cards. You must specify the I/O +address on the kernel command line. +The command line options are: + com20020=<io>[,<irq>[,<node_ID>[,backplane[,CKP[,timeout]]]]][,name] + +If you load the chipset support as a module, the options are: + io=<io> irq=<irq> node=<node_ID> backplane=<backplane> clock=<CKP> + timeout=<timeout> device=<name> + +The COM20020 chipset allows you to set the node ID in software, overriding the +default which is still set in DIP switches on the card. If you don't have the +COM20020 data sheets, and you don't know what the other three options refer +to, then they won't interest you - forget them. + + 3. ARCnet COM90xx chipset in IO-mapped mode. + +This will also work with the normal ARCnet cards, but doesn't use the shared +memory. It performs less well than the above driver, but is provided in case +you have a card which doesn't support shared memory, or (strangely) in case +you have so many ARCnet cards in your machine that you run out of shmem slots. +If you don't give the IO address on the kernel command line, then the driver +will not find the card. +The command line options are: + com90io=<io>[,<irq>][,<name>] + +If you load the chipset support as a module, the options are: + io=<io> irq=<irq> device=<name> + + 4. ARCnet RIM I cards. + +These are COM90xx chips which are _completely_ memory mapped. The support for +these is not tested. If you have one, please mail the author with a success +report. All options must be specified, except the device name. +Command line options: + arcrimi=<shmem>,<irq>,<node_ID>[,<name>] + +If you load the chipset support as a module, the options are: + shmem=<shmem> irq=<irq> node=<node_ID> device=<name> + + +Loadable Module Support +----------------------- + +Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet +support" and to support for your ARCnet chipset if you want to use the +loadable module. You can also say 'y' to "Generic ARCnet support" and 'm' +to the chipset support if you wish. + + make config + make clean + make zImage + make modules + +If you're using a loadable module, you need to use insmod to load it, and +you can specify various characteristics of your card on the command +line. (In recent versions of the driver, autoprobing is much more reliable +and works as a module, so most of this is now unnecessary.) + +For example: + cd /usr/src/linux/modules + insmod arcnet.o + insmod com90xx.o + insmod com20020.o io=0x2e0 device=eth1 + + +Using the Driver +---------------- + +If you build your kernel with ARCnet COM90xx support included, it should +probe for your card automatically when you boot. If you use a different +chipset driver complied into the kernel, you must give the necessary options +on the kernel command line, as detailed above. + +Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be +available where you picked up this driver. Think of your ARCnet as a +souped-up (or down, as the case may be) Ethernet card. + +By the way, be sure to change all references from "eth0" to "arc0" in the +HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name +is DIFFERENT. + + +Multiple Cards in One Computer +------------------------------ + +Linux has pretty good support for this now, but since I've been busy, the +ARCnet driver has somewhat suffered in this respect. COM90xx support, if +compiled into the kernel, will (try to) autodetect all the installed cards. + +If you have other cards, with support compiled into the kernel, then you can +just repeat the options on the kernel command line, e.g.: +LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260 + +If you have the chipset support built as a loadable module, then you need to +do something like this: + insmod -o arc0 com90xx + insmod -o arc1 com20020 io=0x2e0 + insmod -o arc2 com90xx +The ARCnet drivers will now sort out their names automatically. + + +How do I get it to work with...? +-------------------------------- + +NFS: Should be fine linux->linux, just pretend you're using Ethernet cards. + oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There + is also a DOS-based NFS server called SOSS. It doesn't multitask + quite the way Linux does (actually, it doesn't multitask AT ALL) but + you never know what you might need. + + With AmiTCP (and possibly others), you may need to set the following + options in your Amiga nfstab: MD 1024 MR 1024 MW 1024 + (Thanks to Christian Gottschling <ferksy@indigo.tng.oche.de> + for this.) + + Probably these refer to maximum NFS data/read/write block sizes. I + don't know why the defaults on the Amiga didn't work; write to me if + you know more. + +DOS: If you're using the freeware arcether.com, you might want to install + the driver patch from my web page. It helps with PC/TCP, and also + can get arcether to load if it timed out too quickly during + initialization. In fact, if you use it on a 386+ you REALLY need + the patch, really. + +Windows: See DOS :) Trumpet Winsock works fine with either the Novell or + Arcether client, assuming you remember to load winpkt of course. + +LAN Manager and Windows for Workgroups: These programs use protocols that + are incompatible with the Internet standard. They try to pretend + the cards are Ethernet, and confuse everyone else on the network. + + However, v2.00 and higher of the Linux ARCnet driver supports this + protocol via the 'arc0e' device. See the section on "Multiprotocol + Support" for more information. + + Using the freeware Samba server and clients for Linux, you can now + interface quite nicely with TCP/IP-based WfWg or Lan Manager + networks. + +Windows 95: Tools are included with Win95 that let you use either the LANMAN + style network drivers (NDIS) or Novell drivers (ODI) to handle your + ARCnet packets. If you use ODI, you'll need to use the 'arc0' + device with Linux. If you use NDIS, then try the 'arc0e' device. + See the "Multiprotocol Support" section below if you need arc0e, + you're completely insane, and/or you need to build some kind of + hybrid network that uses both encapsulation types. + +OS/2: I've been told it works under Warp Connect with an ARCnet driver from + SMC. You need to use the 'arc0e' interface for this. If you get + the SMC driver to work with the TCP/IP stuff included in the + "normal" Warp Bonus Pack, let me know. + + ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client + which should use the same protocol as WfWg does. I had no luck + installing it under Warp, however. Please mail me with any results. + +NetBSD/AmiTCP: These use an old version of the Internet standard ARCnet + protocol (RFC1051) which is compatible with the Linux driver v2.10 + ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet" + below.) ** Newer versions of NetBSD apparently support RFC1201. + + +Using Multiprotocol ARCnet +-------------------------- + +The ARCnet driver v2.10 ALPHA supports three protocols, each on its own +"virtual network device": + + arc0 - RFC1201 protocol, the official Internet standard which just + happens to be 100% compatible with Novell's TRXNET driver. + Version 1.00 of the ARCnet driver supported _only_ this + protocol. arc0 is the fastest of the three protocols (for + whatever reason), and allows larger packets to be used + because it supports RFC1201 "packet splitting" operations. + Unless you have a specific need to use a different protocol, + I strongly suggest that you stick with this one. + + arc0e - "Ethernet-Encapsulation" which sends packets over ARCnet + that are actually a lot like Ethernet packets, including the + 6-byte hardware addresses. This protocol is compatible with + Microsoft's NDIS ARCnet driver, like the one in WfWg and + LANMAN. Because the MTU of 493 is actually smaller than the + one "required" by TCP/IP (576), there is a chance that some + network operations will not function properly. The Linux + TCP/IP layer can compensate in most cases, however, by + automatically fragmenting the TCP/IP packets to make them + fit. arc0e also works slightly more slowly than arc0, for + reasons yet to be determined. (Probably it's the smaller + MTU that does it.) + + arc0s - The "[s]imple" RFC1051 protocol is the "previous" Internet + standard that is completely incompatible with the new + standard. Some software today, however, continues to + support the old standard (and only the old standard) + including NetBSD and AmiTCP. RFC1051 also does not support + RFC1201's packet splitting, and the MTU of 507 is still + smaller than the Internet "requirement," so it's quite + possible that you may run into problems. It's also slower + than RFC1201 by about 25%, for the same reason as arc0e. + + The arc0s support was contributed by Tomasz Motylewski + and modified somewhat by me. Bugs are probably my fault. + +You can choose not to compile arc0e and arc0s into the driver if you want - +this will save you a bit of memory and avoid confusion when eg. trying to +use the "NFS-root" stuff in recent Linux kernels. + +The arc0e and arc0s devices are created automatically when you first +ifconfig the arc0 device. To actually use them, though, you need to also +ifconfig the other virtual devices you need. There are a number of ways you +can set up your network then: + + +1. Single Protocol. + + This is the simplest way to configure your network: use just one of the + two available protocols. As mentioned above, it's a good idea to use + only arc0 unless you have a good reason (like some other software, ie. + WfWg, that only works with arc0e). + + If you need only arc0, then the following commands should get you going: + ifconfig arc0 MY.IP.ADD.RESS + route add MY.IP.ADD.RESS arc0 + route add -net SUB.NET.ADD.RESS arc0 + [add other local routes here] + + If you need arc0e (and only arc0e), it's a little different: + ifconfig arc0 MY.IP.ADD.RESS + ifconfig arc0e MY.IP.ADD.RESS + route add MY.IP.ADD.RESS arc0e + route add -net SUB.NET.ADD.RESS arc0e + + arc0s works much the same way as arc0e. + + +2. More than one protocol on the same wire. + + Now things start getting confusing. To even try it, you may need to be + partly crazy. Here's what *I* did. :) Note that I don't include arc0s in + my home network; I don't have any NetBSD or AmiTCP computers, so I only + use arc0s during limited testing. + + I have three computers on my home network; two Linux boxes (which prefer + RFC1201 protocol, for reasons listed above), and one XT that can't run + Linux but runs the free Microsoft LANMAN Client instead. + + Worse, one of the Linux computers (freedom) also has a modem and acts as + a router to my Internet provider. The other Linux box (insight) also has + its own IP address and needs to use freedom as its default gateway. The + XT (patience), however, does not have its own Internet IP address and so + I assigned it one on a "private subnet" (as defined by RFC1597). + + To start with, take a simple network with just insight and freedom. + Insight needs to: + - talk to freedom via RFC1201 (arc0) protocol, because I like it + more and it's faster. + - use freedom as its Internet gateway. + + That's pretty easy to do. Set up insight like this: + ifconfig arc0 insight + route add insight arc0 + route add freedom arc0 /* I would use the subnet here (like I said + to to in "single protocol" above), + but the rest of the subnet + unfortunately lies across the PPP + link on freedom, which confuses + things. */ + route add default gw freedom + + And freedom gets configured like so: + ifconfig arc0 freedom + route add freedom arc0 + route add insight arc0 + /* and default gateway is configured by pppd */ + + Great, now insight talks to freedom directly on arc0, and sends packets + to the Internet through freedom. If you didn't know how to do the above, + you should probably stop reading this section now because it only gets + worse. + + Now, how do I add patience into the network? It will be using LANMAN + Client, which means I need the arc0e device. It needs to be able to talk + to both insight and freedom, and also use freedom as a gateway to the + Internet. (Recall that patience has a "private IP address" which won't + work on the Internet; that's okay, I configured Linux IP masquerading on + freedom for this subnet). + + So patience (necessarily; I don't have another IP number from my + provider) has an IP address on a different subnet than freedom and + insight, but needs to use freedom as an Internet gateway. Worse, most + DOS networking programs, including LANMAN, have braindead networking + schemes that rely completely on the netmask and a 'default gateway' to + determine how to route packets. This means that to get to freedom or + insight, patience WILL send through its default gateway, regardless of + the fact that both freedom and insight (courtesy of the arc0e device) + could understand a direct transmission. + + I compensate by giving freedom an extra IP address - aliased 'gatekeeper' + - that is on my private subnet, the same subnet that patience is on. I + then define gatekeeper to be the default gateway for patience. + + To configure freedom (in addition to the commands above): + ifconfig arc0e gatekeeper + route add gatekeeper arc0e + route add patience arc0e + + This way, freedom will send all packets for patience through arc0e, + giving its IP address as gatekeeper (on the private subnet). When it + talks to insight or the Internet, it will use its "freedom" Internet IP + address. + + You will notice that we haven't configured the arc0e device on insight. + This would work, but is not really necessary, and would require me to + assign insight another special IP number from my private subnet. Since + both insight and patience are using freedom as their default gateway, the + two can already talk to each other. + + It's quite fortunate that I set things up like this the first time (cough + cough) because it's really handy when I boot insight into DOS. There, it + runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet. + In this mode it would be impossible for insight to communicate directly + with patience, since the Novell stack is incompatible with Microsoft's + Ethernet-Encap. Without changing any settings on freedom or patience, I + simply set freedom as the default gateway for insight (now in DOS, + remember) and all the forwarding happens "automagically" between the two + hosts that would normally not be able to communicate at all. + + For those who like diagrams, I have created two "virtual subnets" on the + same physical ARCnet wire. You can picture it like this: + + + [RFC1201 NETWORK] [ETHER-ENCAP NETWORK] + (registered Internet subnet) (RFC1597 private subnet) + + (IP Masquerade) + /---------------\ * /---------------\ + | | * | | + | +-Freedom-*-Gatekeeper-+ | + | | | * | | + \-------+-------/ | * \-------+-------/ + | | | + Insight | Patience + (Internet) + + + +It works: what now? +------------------- + +Send mail describing your setup, preferably including driver version, kernel +version, ARCnet card model, CPU type, number of systems on your network, and +list of software in use to me at the following address: + apenwarr@worldvisions.ca + +I do send (sometimes automated) replies to all messages I receive. My email +can be weird (and also usually gets forwarded all over the place along the +way to me), so if you don't get a reply within a reasonable time, please +resend. + + +It doesn't work: what now? +-------------------------- + +Do the same as above, but also include the output of the ifconfig and route +commands, as well as any pertinent log entries (ie. anything that starts +with "arcnet:" and has shown up since the last reboot) in your mail. + +If you want to try fixing it yourself (I strongly recommend that you mail me +about the problem first, since it might already have been solved) you may +want to try some of the debug levels available. For heavy testing on +D_DURING or more, it would be a REALLY good idea to kill your klogd daemon +first! D_DURING displays 4-5 lines for each packet sent or received. D_TX, +D_RX, and D_SKB actually DISPLAY each packet as it is sent or received, +which is obviously quite big. + +Starting with v2.40 ALPHA, the autoprobe routines have changed +significantly. In particular, they won't tell you why the card was not +found unless you turn on the D_INIT_REASONS debugging flag. + +Once the driver is running, you can run the arcdump shell script (available +from me or in the full ARCnet package, if you have it) as root to list the +contents of the arcnet buffers at any time. To make any sense at all out of +this, you should grab the pertinent RFCs. (some are listed near the top of +arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the +script. + +Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending. +Ping-pong buffers are implemented both ways. + +If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY, +the buffers are cleared to a constant value of 0x42 every time the card is +reset (which should only happen when you do an ifconfig up, or when Linux +decides that the driver is broken). During a transmit, unused parts of the +buffer will be cleared to 0x42 as well. This is to make it easier to figure +out which bytes are being used by a packet. + +You can change the debug level without recompiling the kernel by typing: + ifconfig arc0 down metric 1xxx + /etc/rc.d/rc.inet1 +where "xxx" is the debug level you want. For example, "metric 1015" would put +you at debug level 15. Debug level 7 is currently the default. + +Note that the debug level is (starting with v1.90 ALPHA) a binary +combination of different debug flags; so debug level 7 is really 1+2+4 or +D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this, +resulting in debug level 23. + +If you don't understand that, you probably don't want to know anyway. +E-mail me about your problem. + + +I want to send money: what now? +------------------------------- + +Go take a nap or something. You'll feel better in the morning. diff --git a/Documentation/networking/atm.txt b/Documentation/networking/atm.txt new file mode 100644 index 000000000..82921cee7 --- /dev/null +++ b/Documentation/networking/atm.txt @@ -0,0 +1,8 @@ +In order to use anything but the most primitive functions of ATM, +several user-mode programs are required to assist the kernel. These +programs and related material can be found via the ATM on Linux Web +page at http://linux-atm.sourceforge.net/ + +If you encounter problems with ATM, please report them on the ATM +on Linux mailing list. Subscription information, archives, etc., +can be found on http://linux-atm.sourceforge.net/ diff --git a/Documentation/networking/ax25.txt b/Documentation/networking/ax25.txt new file mode 100644 index 000000000..8257dbf9b --- /dev/null +++ b/Documentation/networking/ax25.txt @@ -0,0 +1,10 @@ +To use the amateur radio protocols within Linux you will need to get a +suitable copy of the AX.25 Utilities. More detailed information about +AX.25, NET/ROM and ROSE, associated programs and and utilities can be +found on http://www.linux-ax25.org. + +There is an active mailing list for discussing Linux amateur radio matters +called linux-hams@vger.kernel.org. To subscribe to it, send a message to +majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body +of the message, the subject field is ignored. You don't need to be +subscribed to post but of course that means you might miss an answer. diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst new file mode 100644 index 000000000..245fb6c0a --- /dev/null +++ b/Documentation/networking/batman-adv.rst @@ -0,0 +1,222 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +batman-adv +========== + +Batman advanced is a new approach to wireless networking which does no longer +operate on the IP basis. Unlike the batman daemon, which exchanges information +using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI +Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It +emulates a virtual network switch of all nodes participating. Therefore all +nodes appear to be link local, thus all higher operating protocols won't be +affected by any changes within the network. You can run almost any protocol +above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX. + +Batman advanced was implemented as a Linux kernel driver to reduce the overhead +to a minimum. It does not depend on any (other) network driver, and can be used +on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style +layer 2). + + +Configuration +============= + +Load the batman-adv module into your kernel:: + + $ insmod batman-adv.ko + +The module is now waiting for activation. You must add some interfaces on which +batman can operate. After loading the module batman advanced will scan your +systems interfaces to search for compatible interfaces. Once found, it will +create subfolders in the ``/sys`` directories of each supported interface, +e.g.:: + + $ ls /sys/class/net/eth0/batman_adv/ + elp_interval iface_status mesh_iface throughput_override + +If an interface does not have the ``batman_adv`` subfolder, it probably is not +supported. Not supported interfaces are: loopback, non-ethernet and batman's +own interfaces. + +Note: After the module was loaded it will continuously watch for new +interfaces to verify the compatibility. There is no need to reload the module +if you plug your USB wifi adapter into your machine after batman advanced was +initially loaded. + +The batman-adv soft-interface can be created using the iproute2 tool ``ip``:: + + $ ip link add name bat0 type batadv + +To activate a given interface simply attach it to the ``bat0`` interface:: + + $ ip link set dev eth0 master bat0 + +Repeat this step for all interfaces you wish to add. Now batman starts +using/broadcasting on this/these interface(s). + +By reading the "iface_status" file you can check its status:: + + $ cat /sys/class/net/eth0/batman_adv/iface_status + active + +To deactivate an interface you have to detach it from the "bat0" interface:: + + $ ip link set dev eth0 nomaster + + +All mesh wide settings can be found in batman's own interface folder:: + + $ ls /sys/class/net/bat0/mesh/ + aggregated_ogms fragmentation isolation_mark routing_algo + ap_isolation gw_bandwidth log_level vlan0 + bonding gw_mode multicast_mode + bridge_loop_avoidance gw_sel_class network_coding + distributed_arp_table hop_penalty orig_interval + +There is a special folder for debugging information:: + + $ ls /sys/kernel/debug/batman_adv/bat0/ + bla_backbone_table log neighbors transtable_local + bla_claim_table mcast_flags originators + dat_cache nc socket + gateways nc_nodes transtable_global + +Some of the files contain all sort of status information regarding the mesh +network. For example, you can view the table of originators (mesh +participants) with:: + + $ cat /sys/kernel/debug/batman_adv/bat0/originators + +Other files allow to change batman's behaviour to better fit your requirements. +For instance, you can check the current originator interval (value in +milliseconds which determines how often batman sends its broadcast packets):: + + $ cat /sys/class/net/bat0/mesh/orig_interval + 1000 + +and also change its value:: + + $ echo 3000 > /sys/class/net/bat0/mesh/orig_interval + +In very mobile scenarios, you might want to adjust the originator interval to a +lower value. This will make the mesh more responsive to topology changes, but +will also increase the overhead. + + +Usage +===== + +To make use of your newly created mesh, batman advanced provides a new +interface "bat0" which you should use from this point on. All interfaces added +to batman advanced are not relevant any longer because batman handles them for +you. Basically, one "hands over" the data by using the batman interface and +batman will make sure it reaches its destination. + +The "bat0" interface can be used like any other regular interface. It needs an +IP address which can be either statically configured or dynamically (by using +DHCP or similar services):: + + NodeA: ip link set up dev bat0 + NodeA: ip addr add 192.168.0.1/24 dev bat0 + + NodeB: ip link set up dev bat0 + NodeB: ip addr add 192.168.0.2/24 dev bat0 + NodeB: ping 192.168.0.1 + +Note: In order to avoid problems remove all IP addresses previously assigned to +interfaces now used by batman advanced, e.g.:: + + $ ip addr flush dev eth0 + + +Logging/Debugging +================= + +All error messages, warnings and information messages are sent to the kernel +log. Depending on your operating system distribution this can be read in one of +a number of ways. Try using the commands: ``dmesg``, ``logread``, or looking in +the files ``/var/log/kern.log`` or ``/var/log/syslog``. All batman-adv messages +are prefixed with "batman-adv:" So to see just these messages try:: + + $ dmesg | grep batman-adv + +When investigating problems with your mesh network, it is sometimes necessary to +see more detail debug messages. This must be enabled when compiling the +batman-adv module. When building batman-adv as part of kernel, use "make +menuconfig" and enable the option ``B.A.T.M.A.N. debugging`` +(``CONFIG_BATMAN_ADV_DEBUG=y``). + +Those additional debug messages can be accessed using a special file in +debugfs:: + + $ cat /sys/kernel/debug/batman_adv/bat0/log + +The additional debug output is by default disabled. It can be enabled during +run time. Following log_levels are defined: + +.. flat-table:: + + * - 0 + - All debug output disabled + * - 1 + - Enable messages related to routing / flooding / broadcasting + * - 2 + - Enable messages related to route added / changed / deleted + * - 4 + - Enable messages related to translation table operations + * - 8 + - Enable messages related to bridge loop avoidance + * - 16 + - Enable messages related to DAT, ARP snooping and parsing + * - 32 + - Enable messages related to network coding + * - 64 + - Enable messages related to multicast + * - 128 + - Enable messages related to throughput meter + * - 255 + - Enable all messages + +The debug output can be changed at runtime using the file +``/sys/class/net/bat0/mesh/log_level``. e.g.:: + + $ echo 6 > /sys/class/net/bat0/mesh/log_level + +will enable debug messages for when routes change. + +Counters for different types of packets entering and leaving the batman-adv +module are available through ethtool:: + + $ ethtool --statistics bat0 + + +batctl +====== + +As batman advanced operates on layer 2, all hosts participating in the virtual +switch are completely transparent for all protocols above layer 2. Therefore +the common diagnosis tools do not work as expected. To overcome these problems, +batctl was created. At the moment the batctl contains ping, traceroute, tcpdump +and interfaces to the kernel module settings. + +For more information, please see the manpage (``man batctl``). + +batctl is available on https://www.open-mesh.org/ + + +Contact +======= + +Please send us comments, experiences, questions, anything :) + +IRC: + #batman on irc.freenode.org +Mailing-list: + b.a.t.m.a.n@open-mesh.org (optional subscription at + https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n) + +You can also contact the Authors: + +* Marek Lindner <mareklindner@neomailbox.ch> +* Simon Wunderlich <sw@simonwunderlich.de> diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/baycom.txt new file mode 100644 index 000000000..688f18fd4 --- /dev/null +++ b/Documentation/networking/baycom.txt @@ -0,0 +1,158 @@ + LINUX DRIVERS FOR BAYCOM MODEMS + + Thomas M. Sailer, HB9JNX/AE4WA, <sailer@ife.ee.ethz.ch> + +!!NEW!! (04/98) The drivers for the baycom modems have been split into +separate drivers as they did not share any code, and the driver +and device names have changed. + +This document describes the Linux Kernel Drivers for simple Baycom style +amateur radio modems. + +The following drivers are available: + +baycom_ser_fdx: + This driver supports the SER12 modems either full or half duplex. + Its baud rate may be changed via the `baud' module parameter, + therefore it supports just about every bit bang modem on a + serial port. Its devices are called bcsf0 through bcsf3. + This is the recommended driver for SER12 type modems, + however if you have a broken UART clone that does not have working + delta status bits, you may try baycom_ser_hdx. + +baycom_ser_hdx: + This is an alternative driver for SER12 type modems. + It only supports half duplex, and only 1200 baud. Its devices + are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx + does not work with your UART. + +baycom_par: + This driver supports the par96 and picpar modems. + Its devices are called bcp0 through bcp3. + +baycom_epp: + This driver supports the EPP modem. + Its devices are called bce0 through bce3. + This driver is work-in-progress. + +The following modems are supported: + +ser12: This is a very simple 1200 baud AFSK modem. The modem consists only + of a modulator/demodulator chip, usually a TI TCM3105. The computer + is responsible for regenerating the receiver bit clock, as well as + for handling the HDLC protocol. The modem connects to a serial port, + hence the name. Since the serial port is not used as an async serial + port, the kernel driver for serial ports cannot be used, and this + driver only supports standard serial hardware (8250, 16450, 16550) + +par96: This is a modem for 9600 baud FSK compatible to the G3RUH standard. + The modem does all the filtering and regenerates the receiver clock. + Data is transferred from and to the PC via a shift register. + The shift register is filled with 16 bits and an interrupt is signalled. + The PC then empties the shift register in a burst. This modem connects + to the parallel port, hence the name. The modem leaves the + implementation of the HDLC protocol and the scrambler polynomial to + the PC. + +picpar: This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem + is protocol compatible to par96, but uses only three low power ICs + and can therefore be fed from the parallel port and does not require + an additional power supply. Furthermore, it incorporates a carrier + detect circuitry. + +EPP: This is a high-speed modem adaptor that connects to an enhanced parallel port. + Its target audience is users working over a high speed hub (76.8kbit/s). + +eppfpga: This is a redesign of the EPP adaptor. + + + +All of the above modems only support half duplex communications. However, +the driver supports the KISS (see below) fullduplex command. It then simply +starts to send as soon as there's a packet to transmit and does not care +about DCD, i.e. it starts to send even if there's someone else on the channel. +This command is required by some implementations of the DAMA channel +access protocol. + + +The Interface of the drivers + +Unlike previous drivers, these drivers are no longer character devices, +but they are now true kernel network interfaces. Installation is therefore +simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available. +sethdlc from the ax25 utilities may be used to set driver states etc. +Users of userland AX.25 stacks may use the net2kiss utility (also available +in the ax25 utilities package) to convert packets of a network interface +to a KISS stream on a pseudo tty. There's also a patch available from +me for WAMPES which allows attaching a kernel network interface directly. + + +Configuring the driver + +Every time a driver is inserted into the kernel, it has to know which +modems it should access at which ports. This can be done with the setbaycom +utility. If you are only using one modem, you can also configure the +driver from the insmod command line (or by means of an option line in +/etc/modprobe.d/*.conf). + +Examples: + modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 + sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4 + +Both lines configure the first port to drive a ser12 modem at the first +serial port (COM1 under DOS). The * in the mode parameter instructs the driver to use +the software DCD algorithm (see below). + + insmod baycom_par mode="picpar" iobase=0x378 + sethdlc -i bcp0 -p mode "picpar" io 0x378 + +Both lines configure the first port to drive a picpar modem at the +first parallel port (LPT1 under DOS). (Note: picpar implies +hardware DCD, par96 implies software DCD). + +The channel access parameters can be set with sethdlc -a or kissparms. +Note that both utilities interpret the values slightly differently. + + +Hardware DCD versus Software DCD + +To avoid collisions on the air, the driver must know when the channel is +busy. This is the task of the DCD circuitry/software. The driver may either +utilise a software DCD algorithm (options=1) or use a DCD signal from +the hardware (options=0). + +ser12: if software DCD is utilised, the radio's squelch should always be + open. It is highly recommended to use the software DCD algorithm, + as it is much faster than most hardware squelch circuitry. The + disadvantage is a slightly higher load on the system. + +par96: the software DCD algorithm for this type of modem is rather poor. + The modem simply does not provide enough information to implement + a reasonable DCD algorithm in software. Therefore, if your radio + feeds the DCD input of the PAR96 modem, the use of the hardware + DCD circuitry is recommended. + +picpar: the picpar modem features a builtin DCD hardware, which is highly + recommended. + + + +Compatibility with the rest of the Linux kernel + +The serial driver and the baycom serial drivers compete +for the same hardware resources. Of course only one driver can access a given +interface at a time. The serial driver grabs all interfaces it can find at +startup time. Therefore the baycom drivers subsequently won't be able to +access a serial port. You might therefore find it necessary to release +a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where +# is the number of the interface. The baycom drivers do not reserve any +ports at startup, unless one is specified on the 'insmod' command line. Another +method to solve the problem is to compile all drivers as modules and +leave it to kmod to load the correct driver depending on the application. + +The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem +to arbitrate the ports between different client drivers. + +vy 73s de +Tom Sailer, sailer@ife.ee.ethz.ch +hb9jnx @ hb9w.ampr.org diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt new file mode 100644 index 000000000..4035a495c --- /dev/null +++ b/Documentation/networking/bonding.txt @@ -0,0 +1,2828 @@ + + Linux Ethernet Bonding Driver HOWTO + + Latest update: 27 April 2011 + +Initial release : Thomas Davis <tadavis at lbl.gov> +Corrections, HA extensions : 2000/10/03-15 : + - Willy Tarreau <willy at meta-x.org> + - Constantine Gavrilov <const-g at xpert.com> + - Chad N. Tindel <ctindel at ieee dot org> + - Janice Girouard <girouard at us dot ibm dot com> + - Jay Vosburgh <fubar at us dot ibm dot com> + +Reorganized and updated Feb 2005 by Jay Vosburgh +Added Sysfs information: 2006/04/24 + - Mitch Williams <mitch.a.williams at intel.com> + +Introduction +============ + + The Linux bonding driver provides a method for aggregating +multiple network interfaces into a single logical "bonded" interface. +The behavior of the bonded interfaces depends upon the mode; generally +speaking, modes provide either hot standby or load balancing services. +Additionally, link integrity monitoring may be performed. + + The bonding driver originally came from Donald Becker's +beowulf patches for kernel 2.0. It has changed quite a bit since, and +the original tools from extreme-linux and beowulf sites will not work +with this version of the driver. + + For new versions of the driver, updated userspace tools, and +who to ask for help, please follow the links at the end of this file. + +Table of Contents +================= + +1. Bonding Driver Installation + +2. Bonding Driver Options + +3. Configuring Bonding Devices +3.1 Configuration with Sysconfig Support +3.1.1 Using DHCP with Sysconfig +3.1.2 Configuring Multiple Bonds with Sysconfig +3.2 Configuration with Initscripts Support +3.2.1 Using DHCP with Initscripts +3.2.2 Configuring Multiple Bonds with Initscripts +3.3 Configuring Bonding Manually with Ifenslave +3.3.1 Configuring Multiple Bonds Manually +3.4 Configuring Bonding Manually via Sysfs +3.5 Configuration with Interfaces Support +3.6 Overriding Configuration for Special Cases +3.7 Configuring LACP for 802.3ad mode in a more secure way + +4. Querying Bonding Configuration +4.1 Bonding Configuration +4.2 Network Configuration + +5. Switch Configuration + +6. 802.1q VLAN Support + +7. Link Monitoring +7.1 ARP Monitor Operation +7.2 Configuring Multiple ARP Targets +7.3 MII Monitor Operation + +8. Potential Trouble Sources +8.1 Adventures in Routing +8.2 Ethernet Device Renaming +8.3 Painfully Slow Or No Failed Link Detection By Miimon + +9. SNMP agents + +10. Promiscuous mode + +11. Configuring Bonding for High Availability +11.1 High Availability in a Single Switch Topology +11.2 High Availability in a Multiple Switch Topology +11.2.1 HA Bonding Mode Selection for Multiple Switch Topology +11.2.2 HA Link Monitoring for Multiple Switch Topology + +12. Configuring Bonding for Maximum Throughput +12.1 Maximum Throughput in a Single Switch Topology +12.1.1 MT Bonding Mode Selection for Single Switch Topology +12.1.2 MT Link Monitoring for Single Switch Topology +12.2 Maximum Throughput in a Multiple Switch Topology +12.2.1 MT Bonding Mode Selection for Multiple Switch Topology +12.2.2 MT Link Monitoring for Multiple Switch Topology + +13. Switch Behavior Issues +13.1 Link Establishment and Failover Delays +13.2 Duplicated Incoming Packets + +14. Hardware Specific Considerations +14.1 IBM BladeCenter + +15. Frequently Asked Questions + +16. Resources and Links + + +1. Bonding Driver Installation +============================== + + Most popular distro kernels ship with the bonding driver +already available as a module. If your distro does not, or you +have need to compile bonding from source (e.g., configuring and +installing a mainline kernel from kernel.org), you'll need to perform +the following steps: + +1.1 Configure and build the kernel with bonding +----------------------------------------------- + + The current version of the bonding driver is available in the +drivers/net/bonding subdirectory of the most recent kernel source +(which is available on http://kernel.org). Most users "rolling their +own" will want to use the most recent kernel from kernel.org. + + Configure kernel with "make menuconfig" (or "make xconfig" or +"make config"), then select "Bonding driver support" in the "Network +device support" section. It is recommended that you configure the +driver as module since it is currently the only way to pass parameters +to the driver or configure more than one bonding device. + + Build and install the new kernel and modules. + +1.2 Bonding Control Utility +------------------------------------- + + It is recommended to configure bonding via iproute2 (netlink) +or sysfs, the old ifenslave control utility is obsolete. + +2. Bonding Driver Options +========================= + + Options for the bonding driver are supplied as parameters to the +bonding module at load time, or are specified via sysfs. + + Module options may be given as command line arguments to the +insmod or modprobe command, but are usually specified in either the +/etc/modprobe.d/*.conf configuration files, or in a distro-specific +configuration file (some of which are detailed in the next section). + + Details on bonding support for sysfs is provided in the +"Configuring Bonding Manually via Sysfs" section, below. + + The available bonding driver parameters are listed below. If a +parameter is not specified the default value is used. When initially +configuring a bond, it is recommended "tail -f /var/log/messages" be +run in a separate window to watch for bonding driver error messages. + + It is critical that either the miimon or arp_interval and +arp_ip_target parameters be specified, otherwise serious network +degradation will occur during link failures. Very few devices do not +support at least miimon, so there is really no reason not to use it. + + Options with textual values will accept either the text name +or, for backwards compatibility, the option value. E.g., +"mode=802.3ad" and "mode=4" set the same mode. + + The parameters are as follows: + +active_slave + + Specifies the new active slave for modes that support it + (active-backup, balance-alb and balance-tlb). Possible values + are the name of any currently enslaved interface, or an empty + string. If a name is given, the slave and its link must be up in order + to be selected as the new active slave. If an empty string is + specified, the current active slave is cleared, and a new active + slave is selected automatically. + + Note that this is only available through the sysfs interface. No module + parameter by this name exists. + + The normal value of this option is the name of the currently + active slave, or the empty string if there is no active slave or + the current mode does not use an active slave. + +ad_actor_sys_prio + + In an AD system, this specifies the system priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 65535 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_actor_system + + In an AD system, this specifies the mac-address for the actor in + protocol packet exchanges (LACPDUs). The value cannot be a multicast + address. If the all-zeroes MAC is specified, bonding will internally + use the MAC of the bond itself. It is preferred to have the + local-admin bit set for this mac but driver does not enforce it. If + the value is not given then system defaults to using the masters' + mac address as actors' system address. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +ad_select + + Specifies the 802.3ad aggregation selection logic to use. The + possible values and their effects are: + + stable or 0 + + The active aggregator is chosen by largest aggregate + bandwidth. + + Reselection of the active aggregator occurs only when all + slaves of the active aggregator are down or the active + aggregator has no slaves. + + This is the default value. + + bandwidth or 1 + + The active aggregator is chosen by largest aggregate + bandwidth. Reselection occurs if: + + - A slave is added to or removed from the bond + + - Any slave's link state changes + + - Any slave's 802.3ad association state changes + + - The bond's administrative state changes to up + + count or 2 + + The active aggregator is chosen by the largest number of + ports (slaves). Reselection occurs as described under the + "bandwidth" setting, above. + + The bandwidth and count selection policies permit failover of + 802.3ad aggregations when partial failure of the active aggregator + occurs. This keeps the aggregator with the highest availability + (either in bandwidth or in number of ports) active at all times. + + This option was added in bonding version 3.4.0. + +ad_user_port_key + + In an AD system, the port-key has three parts as shown below - + + Bits Use + 00 Duplex + 01-05 Speed + 06-15 User-defined + + This defines the upper 10 bits of the port key. The values can be + from 0 - 1023. If not given, the system defaults to 0. + + This parameter has effect only in 802.3ad mode and is available through + SysFs interface. + +all_slaves_active + + Specifies that duplicate frames (received on inactive ports) should be + dropped (0) or delivered (1). + + Normally, bonding will drop duplicate frames (received on inactive + ports), which is desirable for most users. But there are some times + it is nice to allow duplicate frames to be delivered. + + The default value is 0 (drop duplicate frames received on inactive + ports). + +arp_interval + + Specifies the ARP link monitoring frequency in milliseconds. + + The ARP monitor works by periodically checking the slave + devices to determine whether they have sent or received + traffic recently (the precise criteria depends upon the + bonding mode, and the state of the slave). Regular traffic is + generated via ARP probes issued for the addresses specified by + the arp_ip_target option. + + This behavior can be modified by the arp_validate option, + below. + + If ARP monitoring is used in an etherchannel compatible mode + (modes 0 and 2), the switch should be configured in a mode + that evenly distributes packets across all links. If the + switch is configured to distribute the packets in an XOR + fashion, all replies from the ARP targets will be received on + the same link which could cause the other team members to + fail. ARP monitoring should not be used in conjunction with + miimon. A value of 0 disables ARP monitoring. The default + value is 0. + +arp_ip_target + + Specifies the IP addresses to use as ARP monitoring peers when + arp_interval is > 0. These are the targets of the ARP request + sent to determine the health of the link to the targets. + Specify these values in ddd.ddd.ddd.ddd format. Multiple IP + addresses must be separated by a comma. At least one IP + address must be given for ARP monitoring to function. The + maximum number of targets that can be specified is 16. The + default value is no IP addresses. + +arp_validate + + Specifies whether or not ARP probes and replies should be + validated in any mode that supports arp monitoring, or whether + non-ARP traffic should be filtered (disregarded) for link + monitoring purposes. + + Possible values are: + + none or 0 + + No validation or filtering is performed. + + active or 1 + + Validation is performed only for the active slave. + + backup or 2 + + Validation is performed only for backup slaves. + + all or 3 + + Validation is performed for all slaves. + + filter or 4 + + Filtering is applied to all slaves. No validation is + performed. + + filter_active or 5 + + Filtering is applied to all slaves, validation is performed + only for the active slave. + + filter_backup or 6 + + Filtering is applied to all slaves, validation is performed + only for backup slaves. + + Validation: + + Enabling validation causes the ARP monitor to examine the incoming + ARP requests and replies, and only consider a slave to be up if it + is receiving the appropriate ARP traffic. + + For an active slave, the validation checks ARP replies to confirm + that they were generated by an arp_ip_target. Since backup slaves + do not typically receive these replies, the validation performed + for backup slaves is on the broadcast ARP request sent out via the + active slave. It is possible that some switch or network + configurations may result in situations wherein the backup slaves + do not receive the ARP requests; in such a situation, validation + of backup slaves must be disabled. + + The validation of ARP requests on backup slaves is mainly helping + bonding to decide which slaves are more likely to work in case of + the active slave failure, it doesn't really guarantee that the + backup slave will work if it's selected as the next active slave. + + Validation is useful in network configurations in which multiple + bonding hosts are concurrently issuing ARPs to one or more targets + beyond a common switch. Should the link between the switch and + target fail (but not the switch itself), the probe traffic + generated by the multiple bonding instances will fool the standard + ARP monitor into considering the links as still up. Use of + validation can resolve this, as the ARP monitor will only consider + ARP requests and replies associated with its own instance of + bonding. + + Filtering: + + Enabling filtering causes the ARP monitor to only use incoming ARP + packets for link availability purposes. Arriving packets that are + not ARPs are delivered normally, but do not count when determining + if a slave is available. + + Filtering operates by only considering the reception of ARP + packets (any ARP packet, regardless of source or destination) when + determining if a slave has received traffic for link availability + purposes. + + Filtering is useful in network configurations in which significant + levels of third party broadcast traffic would fool the standard + ARP monitor into considering the links as still up. Use of + filtering can resolve this, as only ARP traffic is considered for + link availability purposes. + + This option was added in bonding version 3.1.0. + +arp_all_targets + + Specifies the quantity of arp_ip_targets that must be reachable + in order for the ARP monitor to consider a slave as being up. + This option affects only active-backup mode for slaves with + arp_validation enabled. + + Possible values are: + + any or 0 + + consider the slave up only when any of the arp_ip_targets + is reachable + + all or 1 + + consider the slave up only when all of the arp_ip_targets + are reachable + +downdelay + + Specifies the time, in milliseconds, to wait before disabling + a slave after a link failure has been detected. This option + is only valid for the miimon link monitor. The downdelay + value should be a multiple of the miimon value; if not, it + will be rounded down to the nearest multiple. The default + value is 0. + +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address at enslavement (the traditional + behavior), or, when enabled, perform special handling of the + bond's MAC address in accordance with the selected policy. + + Possible values are: + + none or 0 + + This setting disables fail_over_mac, and causes + bonding to set all slaves of an active-backup bond to + the same MAC address at enslavement time. This is the + default. + + active or 1 + + The "active" fail_over_mac policy indicates that the + MAC address of the bond should always be the MAC + address of the currently active slave. The MAC + address of the slaves is not changed; instead, the MAC + address of the bond changes during a failover. + + This policy is useful for devices that cannot ever + alter their MAC address, or for devices that refuse + incoming broadcasts with their own source MAC (which + interferes with the ARP monitor). + + The down side of this policy is that every device on + the network must be updated via gratuitous ARP, + vs. just updating a switch or set of switches (which + often takes place for any traffic, not just ARP + traffic, if the switch snoops incoming traffic to + update its tables) for the traditional method. If the + gratuitous ARP is lost, communication may be + disrupted. + + When this policy is used in conjunction with the mii + monitor, devices which assert link up prior to being + able to actually transmit and receive are particularly + susceptible to loss of the gratuitous ARP, and an + appropriate updelay setting may be required. + + follow or 2 + + The "follow" fail_over_mac policy causes the MAC + address of the bond to be selected normally (normally + the MAC address of the first slave added to the bond). + However, the second and subsequent slaves are not set + to this MAC address while they are in a backup role; a + slave is programmed with the bond's MAC address at + failover time (and the formerly active slave receives + the newly active slave's MAC address). + + This policy is useful for multiport devices that + either become confused or incur a performance penalty + when multiple ports are programmed with the same MAC + address. + + + The default policy is none, unless the first slave cannot + change its MAC address, in which case the active policy is + selected by default. + + This option may be modified via sysfs only when no slaves are + present in the bond. + + This option was added in bonding version 3.2.0. The "follow" + policy was added in bonding version 3.3.0. + +lacp_rate + + Option specifying the rate in which we'll ask our link partner + to transmit LACPDU packets in 802.3ad mode. Possible values + are: + + slow or 0 + Request partner to transmit LACPDUs every 30 seconds + + fast or 1 + Request partner to transmit LACPDUs every 1 second + + The default is slow. + +max_bonds + + Specifies the number of bonding devices to create for this + instance of the bonding driver. E.g., if max_bonds is 3, and + the bonding driver is not already loaded, then bond0, bond1 + and bond2 will be created. The default value is 1. Specifying + a value of 0 will load bonding, but will not create any devices. + +miimon + + Specifies the MII link monitoring frequency in milliseconds. + This determines how often the link state of each slave is + inspected for link failures. A value of zero disables MII + link monitoring. A value of 100 is a good starting point. + The use_carrier option, below, affects how the link state is + determined. See the High Availability section for additional + information. The default value is 0. + +min_links + + Specifies the minimum number of links that must be active before + asserting carrier. It is similar to the Cisco EtherChannel min-links + feature. This allows setting the minimum number of member ports that + must be up (link-up state) before marking the bond device as up + (carrier on). This is useful for situations where higher level services + such as clustering want to ensure a minimum number of low bandwidth + links are active before switchover. This option only affect 802.3ad + mode. + + The default value is 0. This will cause carrier to be asserted (for + 802.3ad mode) whenever there is an active aggregator, regardless of the + number of available links in that aggregator. Note that, because an + aggregator cannot be active without at least one available link, + setting this option to 0 or to 1 has the exact same effect. + +mode + + Specifies one of the bonding policies. The default is + balance-rr (round robin). Possible values are: + + balance-rr or 0 + + Round-robin policy: Transmit packets in sequential + order from the first available slave through the + last. This mode provides load balancing and fault + tolerance. + + active-backup or 1 + + Active-backup policy: Only one slave in the bond is + active. A different slave becomes active if, and only + if, the active slave fails. The bond's MAC address is + externally visible on only one port (network adapter) + to avoid confusing the switch. + + In bonding version 2.6.2 or later, when a failover + occurs in active-backup mode, bonding will issue one + or more gratuitous ARPs on the newly active slave. + One gratuitous ARP is issued for the bonding master + interface and each VLAN interfaces configured above + it, provided that the interface has at least one IP + address configured. Gratuitous ARPs issued for VLAN + interfaces are tagged with the appropriate VLAN id. + + This mode provides fault tolerance. The primary + option, documented below, affects the behavior of this + mode. + + balance-xor or 2 + + XOR policy: Transmit based on the selected transmit + hash policy. The default policy is a simple [(source + MAC address XOR'd with destination MAC address XOR + packet type ID) modulo slave count]. Alternate transmit + policies may be selected via the xmit_hash_policy option, + described below. + + This mode provides load balancing and fault tolerance. + + broadcast or 3 + + Broadcast policy: transmits everything on all slave + interfaces. This mode provides fault tolerance. + + 802.3ad or 4 + + IEEE 802.3ad Dynamic link aggregation. Creates + aggregation groups that share the same speed and + duplex settings. Utilizes all slaves in the active + aggregator according to the 802.3ad specification. + + Slave selection for outgoing traffic is done according + to the transmit hash policy, which may be changed from + the default simple XOR policy via the xmit_hash_policy + option, documented below. Note that not all transmit + policies may be 802.3ad compliant, particularly in + regards to the packet mis-ordering requirements of + section 43.2.4 of the 802.3ad standard. Differing + peer implementations will have varying tolerances for + noncompliance. + + Prerequisites: + + 1. Ethtool support in the base drivers for retrieving + the speed and duplex of each slave. + + 2. A switch that supports IEEE 802.3ad Dynamic link + aggregation. + + Most switches will require some type of configuration + to enable 802.3ad mode. + + balance-tlb or 5 + + Adaptive transmit load balancing: channel bonding that + does not require any special switch support. + + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. + + Prerequisite: + + Ethtool support in the base drivers for retrieving the + speed of each slave. + + balance-alb or 6 + + Adaptive load balancing: includes balance-tlb plus + receive load balancing (rlb) for IPV4 traffic, and + does not require any special switch support. The + receive load balancing is achieved by ARP negotiation. + The bonding driver intercepts the ARP Replies sent by + the local system on their way out and overwrites the + source hardware address with the unique hardware + address of one of the slaves in the bond such that + different peers use different hardware addresses for + the server. + + Receive traffic from connections created by the server + is also balanced. When the local system sends an ARP + Request the bonding driver copies and saves the peer's + IP information from the ARP packet. When the ARP + Reply arrives from the peer, its hardware address is + retrieved and the bonding driver initiates an ARP + reply to this peer assigning it to one of the slaves + in the bond. A problematic outcome of using ARP + negotiation for balancing is that each time that an + ARP request is broadcast it uses the hardware address + of the bond. Hence, peers learn the hardware address + of the bond and the balancing of receive traffic + collapses to the current slave. This is handled by + sending updates (ARP Replies) to all the peers with + their individually assigned hardware address such that + the traffic is redistributed. Receive traffic is also + redistributed when a new slave is added to the bond + and when an inactive slave is re-activated. The + receive load is distributed sequentially (round robin) + among the group of highest speed slaves in the bond. + + When a link is reconnected or a new slave joins the + bond the receive traffic is redistributed among all + active slaves in the bond by initiating ARP Replies + with the selected MAC address to each of the + clients. The updelay parameter (detailed below) must + be set to a value equal or greater than the switch's + forwarding delay so that the ARP Replies sent to the + peers will not be blocked by the switch. + + Prerequisites: + + 1. Ethtool support in the base drivers for retrieving + the speed of each slave. + + 2. Base driver support for setting the hardware + address of a device while it is open. This is + required so that there will always be one slave in the + team using the bond hardware address (the + curr_active_slave) while having a unique hardware + address for each slave in the bond. If the + curr_active_slave fails its hardware address is + swapped with the new curr_active_slave that was + chosen. + +num_grat_arp +num_unsol_na + + Specify the number of peer notifications (gratuitous ARPs and + unsolicited IPv6 Neighbor Advertisements) to be issued after a + failover event. As soon as the link is up on the new slave + (possibly immediately) a peer notification is sent on the + bonding device and each VLAN sub-device. This is repeated at + each link monitor interval (arp_interval or miimon, whichever + is active) if the number is greater than 1. + + The valid range is 0 - 255; the default value is 1. These options + affect only the active-backup mode. These options were added for + bonding versions 3.3.0 and 3.4.0 respectively. + + From Linux 3.0 and bonding version 3.7.1, these notifications + are generated by the ipv4 and ipv6 code and the numbers of + repetitions cannot be set independently. + +packets_per_slave + + Specify the number of packets to transmit through a slave before + moving to the next one. When set to 0 then a slave is chosen at + random. + + The valid range is 0 - 65535; the default value is 1. This option + has effect only in balance-rr mode. + +primary + + A string (eth0, eth2, etc) specifying which slave is the + primary device. The specified device will always be the + active slave while it is available. Only when the primary is + off-line will alternate devices be used. This is useful when + one slave is preferred over another, e.g., when one slave has + higher throughput than another. + + The primary option is only valid for active-backup(1), + balance-tlb (5) and balance-alb (6) mode. + +primary_reselect + + Specifies the reselection policy for the primary slave. This + affects how the primary slave is chosen to become the active slave + when failure of the active slave or recovery of the primary slave + occurs. This option is designed to prevent flip-flopping between + the primary slave and other slaves. Possible values are: + + always or 0 (default) + + The primary slave becomes the active slave whenever it + comes back up. + + better or 1 + + The primary slave becomes the active slave when it comes + back up, if the speed and duplex of the primary slave is + better than the speed and duplex of the current active + slave. + + failure or 2 + + The primary slave becomes the active slave only if the + current active slave fails and the primary slave is up. + + The primary_reselect setting is ignored in two cases: + + If no slaves are active, the first slave to recover is + made the active slave. + + When initially enslaved, the primary slave is always made + the active slave. + + Changing the primary_reselect policy via sysfs will cause an + immediate selection of the best active slave according to the new + policy. This may or may not result in a change of the active + slave, depending upon the circumstances. + + This option was added for bonding version 3.6.0. + +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + + +updelay + + Specifies the time, in milliseconds, to wait before enabling a + slave after a link recovery has been detected. This option is + only valid for the miimon link monitor. The updelay value + should be a multiple of the miimon value; if not, it will be + rounded down to the nearest multiple. The default value is 0. + +use_carrier + + Specifies whether or not miimon should use MII or ETHTOOL + ioctls vs. netif_carrier_ok() to determine the link + status. The MII or ETHTOOL ioctls are less efficient and + utilize a deprecated calling sequence within the kernel. The + netif_carrier_ok() relies on the device driver to maintain its + state with netif_carrier_on/off; at this writing, most, but + not all, device drivers support this facility. + + If bonding insists that the link is up when it should not be, + it may be that your network device driver does not support + netif_carrier_on/off. The default state for netif_carrier is + "carrier on," so if a driver does not support netif_carrier, + it will appear as if the link is always up. In this case, + setting use_carrier to 0 will cause bonding to revert to the + MII / ETHTOOL ioctl method to determine the link state. + + A value of 1 enables the use of netif_carrier_ok(), a value of + 0 will use the deprecated MII / ETHTOOL ioctls. The default + value is 1. + +xmit_hash_policy + + Selects the transmit hash policy to use for slave selection in + balance-xor, 802.3ad, and tlb modes. Possible values are: + + layer2 + + Uses XOR of hardware MAC addresses and packet type ID + field to generate the hash. The formula is + + hash = source MAC XOR destination MAC XOR packet type ID + slave number = hash modulo slave count + + This algorithm will place all traffic to a particular + network peer on the same slave. + + This algorithm is 802.3ad compliant. + + layer2+3 + + This policy uses a combination of layer2 and layer3 + protocol information to generate the hash. + + Uses XOR of hardware MAC addresses and IP addresses to + generate the hash. The formula is + + hash = source MAC XOR destination MAC XOR packet type ID + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + This algorithm will place all traffic to a particular + network peer on the same slave. For non-IP traffic, + the formula is the same as for the layer2 transmit + hash policy. + + This policy is intended to provide a more balanced + distribution of traffic than layer2 alone, especially + in environments where a layer3 gateway device is + required to reach most destinations. + + This algorithm is 802.3ad compliant. + + layer3+4 + + This policy uses upper layer protocol information, + when available, to generate the hash. This allows for + traffic to a particular network peer to span multiple + slaves, although a single connection will not span + multiple slaves. + + The formula for unfragmented TCP and UDP packets is + + hash = source port, destination port (as in the header) + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + For fragmented TCP or UDP packets and all other IPv4 and + IPv6 protocol traffic, the source and destination port + information is omitted. For non-IP traffic, the + formula is the same as for the layer2 transmit hash + policy. + + This algorithm is not fully 802.3ad compliant. A + single TCP or UDP conversation containing both + fragmented and unfragmented packets will see packets + striped across two interfaces. This may result in out + of order delivery. Most traffic types will not meet + this criteria, as TCP rarely fragments traffic, and + most UDP traffic is not involved in extended + conversations. Other implementations of 802.3ad may + or may not tolerate this noncompliance. + + encap2+3 + + This policy uses the same formula as layer2+3 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + encap3+4 + + This policy uses the same formula as layer3+4 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + The default value is layer2. This option was added in bonding + version 2.6.3. In earlier versions of bonding, this parameter + does not exist, and the layer2 policy is the only policy. The + layer2+3 value was added for bonding version 3.2.2. + +resend_igmp + + Specifies the number of IGMP membership reports to be issued after + a failover event. One membership report is issued immediately after + the failover, subsequent packets are sent in each 200ms interval. + + The valid range is 0 - 255; the default value is 1. A value of 0 + prevents the IGMP membership report from being issued in response + to the failover event. + + This option is useful for bonding modes balance-rr (0), active-backup + (1), balance-tlb (5) and balance-alb (6), in which a failover can + switch the IGMP traffic from one slave to another. Therefore a fresh + IGMP report must be issued to cause the switch to forward the incoming + IGMP traffic over the newly selected slave. + + This option was added for bonding version 3.7.0. + +lp_interval + + Specifies the number of seconds between instances where the bonding + driver sends learning packets to each slaves peer switch. + + The valid range is 1 - 0x7fffffff; the default value is 1. This Option + has effect only in balance-tlb and balance-alb modes. + +3. Configuring Bonding Devices +============================== + + You can configure bonding using either your distro's network +initialization scripts, or manually using either iproute2 or the +sysfs interface. Distros generally use one of three packages for the +network initialization scripts: initscripts, sysconfig or interfaces. +Recent versions of these packages have support for bonding, while older +versions do not. + + We will first describe the options for configuring bonding for +distros using versions of initscripts, sysconfig and interfaces with full +or partial support for bonding, then provide information on enabling +bonding without support from the network initialization scripts (i.e., +older versions of initscripts or sysconfig). + + If you're unsure whether your distro uses sysconfig, +initscripts or interfaces, or don't know if it's new enough, have no fear. +Determining this is fairly straightforward. + + First, look for a file called interfaces in /etc/network directory. +If this file is present in your system, then your system use interfaces. See +Configuration with Interfaces Support. + + Else, issue the command: + +$ rpm -qf /sbin/ifup + + It will respond with a line of text starting with either +"initscripts" or "sysconfig," followed by some numbers. This is the +package that provides your network initialization scripts. + + Next, to determine if your installation supports bonding, +issue the command: + +$ grep ifenslave /sbin/ifup + + If this returns any matches, then your initscripts or +sysconfig has support for bonding. + +3.1 Configuration with Sysconfig Support +---------------------------------------- + + This section applies to distros using a version of sysconfig +with bonding support, for example, SuSE Linux Enterprise Server 9. + + SuSE SLES 9's networking configuration system does support +bonding, however, at this writing, the YaST system configuration +front end does not provide any means to work with bonding devices. +Bonding devices can be managed by hand, however, as follows. + + First, if they have not already been configured, configure the +slave devices. On SLES 9, this is most easily done by running the +yast2 sysconfig configuration utility. The goal is for to create an +ifcfg-id file for each slave device. The simplest way to accomplish +this is to configure the devices for DHCP (this is only to get the +file ifcfg-id file created; see below for some issues with DHCP). The +name of the configuration file for each device will be of the form: + +ifcfg-id-xx:xx:xx:xx:xx:xx + + Where the "xx" portion will be replaced with the digits from +the device's permanent MAC address. + + Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been +created, it is necessary to edit the configuration files for the slave +devices (the MAC addresses correspond to those of the slave devices). +Before editing, the file will contain multiple lines, and will look +something like this: + +BOOTPROTO='dhcp' +STARTMODE='on' +USERCTL='no' +UNIQUE='XNzu.WeZGOGF+4wE' +_nm_name='bus-pci-0001:61:01.0' + + Change the BOOTPROTO and STARTMODE lines to the following: + +BOOTPROTO='none' +STARTMODE='off' + + Do not alter the UNIQUE or _nm_name lines. Remove any other +lines (USERCTL, etc). + + Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, +it's time to create the configuration file for the bonding device +itself. This file is named ifcfg-bondX, where X is the number of the +bonding device to create, starting at 0. The first such file is +ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig +network configuration system will correctly start multiple instances +of bonding. + + The contents of the ifcfg-bondX file is as follows: + +BOOTPROTO="static" +BROADCAST="10.0.2.255" +IPADDR="10.0.2.10" +NETMASK="255.255.0.0" +NETWORK="10.0.2.0" +REMOTE_IPADDR="" +STARTMODE="onboot" +BONDING_MASTER="yes" +BONDING_MODULE_OPTS="mode=active-backup miimon=100" +BONDING_SLAVE0="eth0" +BONDING_SLAVE1="bus-pci-0000:06:08.1" + + Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK +values with the appropriate values for your network. + + The STARTMODE specifies when the device is brought online. +The possible values are: + + onboot: The device is started at boot time. If you're not + sure, this is probably what you want. + + manual: The device is started only when ifup is called + manually. Bonding devices may be configured this + way if you do not wish them to start automatically + at boot for some reason. + + hotplug: The device is started by a hotplug event. This is not + a valid choice for a bonding device. + + off or ignore: The device configuration is ignored. + + The line BONDING_MASTER='yes' indicates that the device is a +bonding master device. The only useful value is "yes." + + The contents of BONDING_MODULE_OPTS are supplied to the +instance of the bonding module for this device. Specify the options +for the bonding mode, link monitoring, and so on here. Do not include +the max_bonds bonding parameter; this will confuse the configuration +system if you have multiple bonding devices. + + Finally, supply one BONDING_SLAVEn="slave device" for each +slave. where "n" is an increasing value, one for each slave. The +"slave device" is either an interface name, e.g., "eth0", or a device +specifier for the network device. The interface name is easier to +find, but the ethN names are subject to change at boot time if, e.g., +a device early in the sequence has failed. The device specifiers +(bus-pci-0000:06:08.1 in the example above) specify the physical +network device, and will not change unless the device's bus location +changes (for example, it is moved from one PCI slot to another). The +example above uses one of each type for demonstration purposes; most +configurations will choose one or the other for all slave devices. + + When all configuration files have been modified or created, +networking must be restarted for the configuration changes to take +effect. This can be accomplished via the following: + +# /etc/init.d/network restart + + Note that the network control script (/sbin/ifdown) will +remove the bonding module as part of the network shutdown processing, +so it is not necessary to remove the module by hand if, e.g., the +module parameters have changed. + + Also, at this writing, YaST/YaST2 will not manage bonding +devices (they do not show bonding interfaces on its list of network +devices). It is necessary to edit the configuration file by hand to +change the bonding configuration. + + Additional general options and details of the ifcfg file +format can be found in an example ifcfg template file: + +/etc/sysconfig/network/ifcfg.template + + Note that the template does not document the various BONDING_ +settings described above, but does describe many of the other options. + +3.1.1 Using DHCP with Sysconfig +------------------------------- + + Under sysconfig, configuring a device with BOOTPROTO='dhcp' +will cause it to query DHCP for its IP address information. At this +writing, this does not function for bonding devices; the scripts +attempt to obtain the device address from DHCP prior to adding any of +the slave devices. Without active slaves, the DHCP requests are not +sent to the network. + +3.1.2 Configuring Multiple Bonds with Sysconfig +----------------------------------------------- + + The sysconfig network initialization system is capable of +handling multiple bonding devices. All that is necessary is for each +bonding instance to have an appropriately configured ifcfg-bondX file +(as described above). Do not specify the "max_bonds" parameter to any +instance of bonding, as this will confuse sysconfig. If you require +multiple bonding devices with identical parameters, create multiple +ifcfg-bondX files. + + Because the sysconfig scripts supply the bonding module +options in the ifcfg-bondX file, it is not necessary to add them to +the system /etc/modules.d/*.conf configuration files. + +3.2 Configuration with Initscripts Support +------------------------------------------ + + This section applies to distros using a recent version of +initscripts with bonding support, for example, Red Hat Enterprise Linux +version 3 or later, Fedora, etc. On these systems, the network +initialization scripts have knowledge of bonding, and can be configured to +control bonding devices. Note that older versions of the initscripts +package have lower levels of support for bonding; this will be noted where +applicable. + + These distros will not automatically load the network adapter +driver unless the ethX device is configured with an IP address. +Because of this constraint, users must manually configure a +network-script file for all physical adapters that will be members of +a bondX link. Network script files are located in the directory: + +/etc/sysconfig/network-scripts + + The file name must be prefixed with "ifcfg-eth" and suffixed +with the adapter's physical adapter number. For example, the script +for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0. +Place the following text in the file: + +DEVICE=eth0 +USERCTL=no +ONBOOT=yes +MASTER=bond0 +SLAVE=yes +BOOTPROTO=none + + The DEVICE= line will be different for every ethX device and +must correspond with the name of the file, i.e., ifcfg-eth1 must have +a device line of DEVICE=eth1. The setting of the MASTER= line will +also depend on the final bonding interface name chosen for your bond. +As with other network devices, these typically start at 0, and go up +one for each device, i.e., the first bonding instance is bond0, the +second is bond1, and so on. + + Next, create a bond network script. The file name for this +script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is +the number of the bond. For bond0 the file is named "ifcfg-bond0", +for bond1 it is named "ifcfg-bond1", and so on. Within that file, +place the following text: + +DEVICE=bond0 +IPADDR=192.168.1.1 +NETMASK=255.255.255.0 +NETWORK=192.168.1.0 +BROADCAST=192.168.1.255 +ONBOOT=yes +BOOTPROTO=none +USERCTL=no + + Be sure to change the networking specific lines (IPADDR, +NETMASK, NETWORK and BROADCAST) to match your network configuration. + + For later versions of initscripts, such as that found with Fedora +7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible, +and, indeed, preferable, to specify the bonding options in the ifcfg-bond0 +file, e.g. a line of the format: + +BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254" + + will configure the bond with the specified options. The options +specified in BONDING_OPTS are identical to the bonding module parameters +except for the arp_ip_target field when using versions of initscripts older +than and 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When +using older versions each target should be included as a separate option and +should be preceded by a '+' to indicate it should be added to the list of +queried targets, e.g., + + arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 + + is the proper syntax to specify multiple targets. When specifying +options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf. + + For even older versions of initscripts that do not support +BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon +your distro) to load the bonding module with your desired options when the +bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf +will load the bonding module, and select its options: + +alias bond0 bonding +options bond0 mode=balance-alb miimon=100 + + Replace the sample parameters with the appropriate set of +options for your configuration. + + Finally run "/etc/rc.d/init.d/network restart" as root. This +will restart the networking subsystem and your bond link should be now +up and running. + +3.2.1 Using DHCP with Initscripts +--------------------------------- + + Recent versions of initscripts (the versions supplied with Fedora +Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to +work) have support for assigning IP information to bonding devices via +DHCP. + + To configure bonding for DHCP, configure it as described +above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" +and add a line consisting of "TYPE=Bonding". Note that the TYPE value +is case sensitive. + +3.2.2 Configuring Multiple Bonds with Initscripts +------------------------------------------------- + + Initscripts packages that are included with Fedora 7 and Red Hat +Enterprise Linux 5 support multiple bonding interfaces by simply +specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the +number of the bond. This support requires sysfs support in the kernel, +and a bonding driver of version 3.0.0 or later. Other configurations may +not support this method for specifying multiple bonding interfaces; for +those instances, see the "Configuring Multiple Bonds Manually" section, +below. + +3.3 Configuring Bonding Manually with iproute2 +----------------------------------------------- + + This section applies to distros whose network initialization +scripts (the sysconfig or initscripts package) do not have specific +knowledge of bonding. One such distro is SuSE Linux Enterprise Server +version 8. + + The general method for these systems is to place the bonding +module parameters into a config file in /etc/modprobe.d/ (as +appropriate for the installed distro), then add modprobe and/or +`ip link` commands to the system's global init script. The name of +the global init script differs; for sysconfig, it is +/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. + + For example, if you wanted to make a simple bond of two e100 +devices (presumed to be eth0 and eth1), and have it persist across +reboots, edit the appropriate file (/etc/init.d/boot.local or +/etc/rc.d/rc.local), and add the following: + +modprobe bonding mode=balance-alb miimon=100 +modprobe e100 +ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up +ip link set eth0 master bond0 +ip link set eth1 master bond0 + + Replace the example bonding module parameters and bond0 +network configuration (IP address, netmask, etc) with the appropriate +values for your configuration. + + Unfortunately, this method will not provide support for the +ifup and ifdown scripts on the bond devices. To reload the bonding +configuration, it is necessary to run the initialization script, e.g., + +# /etc/init.d/boot.local + + or + +# /etc/rc.d/rc.local + + It may be desirable in such a case to create a separate script +which only initializes the bonding configuration, then call that +separate script from within boot.local. This allows for bonding to be +enabled without re-running the entire global init script. + + To shut down the bonding devices, it is necessary to first +mark the bonding device itself as being down, then remove the +appropriate device driver modules. For our example above, you can do +the following: + +# ifconfig bond0 down +# rmmod bonding +# rmmod e100 + + Again, for convenience, it may be desirable to create a script +with these commands. + + +3.3.1 Configuring Multiple Bonds Manually +----------------------------------------- + + This section contains information on configuring multiple +bonding devices with differing options for those systems whose network +initialization scripts lack support for configuring multiple bonds. + + If you require multiple bonding devices, but all with the same +options, you may wish to use the "max_bonds" module parameter, +documented above. + + To create multiple bonding devices with differing options, it is +preferable to use bonding parameters exported by sysfs, documented in the +section below. + + For versions of bonding without sysfs support, the only means to +provide multiple instances of bonding with differing options is to load +the bonding driver multiple times. Note that current versions of the +sysconfig network initialization scripts handle this automatically; if +your distro uses these scripts, no special action is needed. See the +section Configuring Bonding Devices, above, if you're not sure about your +network initialization scripts. + + To load multiple instances of the module, it is necessary to +specify a different name for each instance (the module loading system +requires that every loaded module, even multiple instances of the same +module, have a unique name). This is accomplished by supplying multiple +sets of bonding options in /etc/modprobe.d/*.conf, for example: + +alias bond0 bonding +options bond0 -o bond0 mode=balance-rr miimon=100 + +alias bond1 bonding +options bond1 -o bond1 mode=balance-alb miimon=50 + + will load the bonding module two times. The first instance is +named "bond0" and creates the bond0 device in balance-rr mode with an +miimon of 100. The second instance is named "bond1" and creates the +bond1 device in balance-alb mode with an miimon of 50. + + In some circumstances (typically with older distributions), +the above does not work, and the second bonding instance never sees +its options. In that case, the second options line can be substituted +as follows: + +install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ + mode=balance-alb miimon=50 + + This may be repeated any number of times, specifying a new and +unique name in place of bond1 for each subsequent instance. + + It has been observed that some Red Hat supplied kernels are unable +to rename modules at load time (the "-o bond1" part). Attempts to pass +that option to modprobe will produce an "Operation not permitted" error. +This has been reported on some Fedora Core kernels, and has been seen on +RHEL 4 as well. On kernels exhibiting this problem, it will be impossible +to configure multiple bonds with differing parameters (as they are older +kernels, and also lack sysfs support). + +3.4 Configuring Bonding Manually via Sysfs +------------------------------------------ + + Starting with version 3.0.0, Channel Bonding may be configured +via the sysfs interface. This interface allows dynamic configuration +of all bonds in the system without unloading the module. It also +allows for adding and removing bonds at runtime. Ifenslave is no +longer required, though it is still supported. + + Use of the sysfs interface allows you to use multiple bonds +with different configurations without having to reload the module. +It also allows you to use multiple, differently configured bonds when +bonding is compiled into the kernel. + + You must have the sysfs filesystem mounted to configure +bonding this way. The examples in this document assume that you +are using the standard mount point for sysfs, e.g. /sys. If your +sysfs filesystem is mounted elsewhere, you will need to adjust the +example paths accordingly. + +Creating and Destroying Bonds +----------------------------- +To add a new bond foo: +# echo +foo > /sys/class/net/bonding_masters + +To remove an existing bond bar: +# echo -bar > /sys/class/net/bonding_masters + +To show all existing bonds: +# cat /sys/class/net/bonding_masters + +NOTE: due to 4K size limitation of sysfs files, this list may be +truncated if you have more than a few hundred bonds. This is unlikely +to occur under normal operating conditions. + +Adding and Removing Slaves +-------------------------- + Interfaces may be enslaved to a bond using the file +/sys/class/net/<bond>/bonding/slaves. The semantics for this file +are the same as for the bonding_masters file. + +To enslave interface eth0 to bond bond0: +# ifconfig bond0 up +# echo +eth0 > /sys/class/net/bond0/bonding/slaves + +To free slave eth0 from bond bond0: +# echo -eth0 > /sys/class/net/bond0/bonding/slaves + + When an interface is enslaved to a bond, symlinks between the +two are created in the sysfs filesystem. In this case, you would get +/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and +/sys/class/net/eth0/master pointing to /sys/class/net/bond0. + + This means that you can tell quickly whether or not an +interface is enslaved by looking for the master symlink. Thus: +# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves +will free eth0 from whatever bond it is enslaved to, regardless of +the name of the bond interface. + +Changing a Bond's Configuration +------------------------------- + Each bond may be configured individually by manipulating the +files located in /sys/class/net/<bond name>/bonding + + The names of these files correspond directly with the command- +line parameters described elsewhere in this file, and, with the +exception of arp_ip_target, they accept the same values. To see the +current setting, simply cat the appropriate file. + + A few examples will be given here; for specific usage +guidelines for each parameter, see the appropriate section in this +document. + +To configure bond0 for balance-alb mode: +# ifconfig bond0 down +# echo 6 > /sys/class/net/bond0/bonding/mode + - or - +# echo balance-alb > /sys/class/net/bond0/bonding/mode + NOTE: The bond interface must be down before the mode can be +changed. + +To enable MII monitoring on bond0 with a 1 second interval: +# echo 1000 > /sys/class/net/bond0/bonding/miimon + NOTE: If ARP monitoring is enabled, it will disabled when MII +monitoring is enabled, and vice-versa. + +To add ARP targets: +# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target +# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target + NOTE: up to 16 target addresses may be specified. + +To remove an ARP target: +# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target + +To configure the interval between learning packet transmits: +# echo 12 > /sys/class/net/bond0/bonding/lp_interval + NOTE: the lp_interval is the number of seconds between instances where +the bonding driver sends learning packets to each slaves peer switch. The +default interval is 1 second. + +Example Configuration +--------------------- + We begin with the same example that is shown in section 3.3, +executed with sysfs, and without using ifenslave. + + To make a simple bond of two e100 devices (presumed to be eth0 +and eth1), and have it persist across reboots, edit the appropriate +file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the +following: + +modprobe bonding +modprobe e100 +echo balance-alb > /sys/class/net/bond0/bonding/mode +ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up +echo 100 > /sys/class/net/bond0/bonding/miimon +echo +eth0 > /sys/class/net/bond0/bonding/slaves +echo +eth1 > /sys/class/net/bond0/bonding/slaves + + To add a second bond, with two e1000 interfaces in +active-backup mode, using ARP monitoring, add the following lines to +your init script: + +modprobe e1000 +echo +bond1 > /sys/class/net/bonding_masters +echo active-backup > /sys/class/net/bond1/bonding/mode +ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up +echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target +echo 2000 > /sys/class/net/bond1/bonding/arp_interval +echo +eth2 > /sys/class/net/bond1/bonding/slaves +echo +eth3 > /sys/class/net/bond1/bonding/slaves + +3.5 Configuration with Interfaces Support +----------------------------------------- + + This section applies to distros which use /etc/network/interfaces file +to describe network interface configuration, most notably Debian and it's +derivatives. + + The ifup and ifdown commands on Debian don't support bonding out of +the box. The ifenslave-2.6 package should be installed to provide bonding +support. Once installed, this package will provide bond-* options to be used +into /etc/network/interfaces. + + Note that ifenslave-2.6 package will load the bonding module and use +the ifenslave command when appropriate. + +Example Configurations +---------------------- + +In /etc/network/interfaces, the following stanza will configure bond0, in +active-backup mode, with eth0 and eth1 as slaves. + +auto bond0 +iface bond0 inet dhcp + bond-slaves eth0 eth1 + bond-mode active-backup + bond-miimon 100 + bond-primary eth0 eth1 + +If the above configuration doesn't work, you might have a system using +upstart for system startup. This is most notably true for recent +Ubuntu versions. The following stanza in /etc/network/interfaces will +produce the same result on those systems. + +auto bond0 +iface bond0 inet dhcp + bond-slaves none + bond-mode active-backup + bond-miimon 100 + +auto eth0 +iface eth0 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +auto eth1 +iface eth1 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +For a full list of bond-* supported options in /etc/network/interfaces and some +more advanced examples tailored to you particular distros, see the files in +/usr/share/doc/ifenslave-2.6. + +3.6 Overriding Configuration for Special Cases +---------------------------------------------- + +When using the bonding driver, the physical port which transmits a frame is +typically selected by the bonding driver, and is not relevant to the user or +system administrator. The output port is simply selected using the policies of +the selected bonding mode. On occasion however, it is helpful to direct certain +classes of traffic to certain physical interfaces on output to implement +slightly more complex policies. For example, to reach a web server over a +bonded interface in which eth0 connects to a private network, while eth1 +connects via a public network, it may be desirous to bias the bond to send said +traffic over eth0 first, using eth1 only as a fall back, while all other traffic +can safely be sent over either interface. Such configurations may be achieved +using the traffic control utilities inherent in linux. + +By default the bonding driver is multiqueue aware and 16 queues are created +when the driver initializes (see Documentation/networking/multiqueue.txt +for details). If more or less queues are desired the module parameter +tx_queues can be used to change this value. There is no sysfs parameter +available as the allocation is done at module init time. + +The output of the file /proc/net/bonding/bondX has changed so the output Queue +ID is now printed for each slave: + +Bonding Mode: fault-tolerance (active-backup) +Primary Slave: None +Currently Active Slave: eth0 +MII Status: up +MII Polling Interval (ms): 0 +Up Delay (ms): 0 +Down Delay (ms): 0 + +Slave Interface: eth0 +MII Status: up +Link Failure Count: 0 +Permanent HW addr: 00:1a:a0:12:8f:cb +Slave queue ID: 0 + +Slave Interface: eth1 +MII Status: up +Link Failure Count: 0 +Permanent HW addr: 00:1a:a0:12:8f:cc +Slave queue ID: 2 + +The queue_id for a slave can be set using the command: + +# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id + +Any interface that needs a queue_id set should set it with multiple calls +like the one above until proper priorities are set for all interfaces. On +distributions that allow configuration via initscripts, multiple 'queue_id' +arguments can be added to BONDING_OPTS to set all needed slave queues. + +These queue id's can be used in conjunction with the tc utility to configure +a multiqueue qdisc and filters to bias certain traffic to transmit on certain +slave devices. For instance, say we wanted, in the above configuration to +force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output +device. The following commands would accomplish this: + +# tc qdisc add dev bond0 handle 1 root multiq + +# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \ + 192.168.1.100 action skbedit queue_mapping 2 + +These commands tell the kernel to attach a multiqueue queue discipline to the +bond0 interface and filter traffic enqueued to it, such that packets with a dst +ip of 192.168.1.100 have their output queue mapping value overwritten to 2. +This value is then passed into the driver, causing the normal output path +selection policy to be overridden, selecting instead qid 2, which maps to eth1. + +Note that qid values begin at 1. Qid 0 is reserved to initiate to the driver +that normal output policy selection should take place. One benefit to simply +leaving the qid for a slave to 0 is the multiqueue awareness in the bonding +driver that is now present. This awareness allows tc filters to be placed on +slave devices as well as bond devices and the bonding driver will simply act as +a pass-through for selecting output queues on the slave device rather than +output port selection. + +This feature first appeared in bonding driver version 3.7.0 and support for +output slave selection was limited to round-robin and active-backup modes. + +3.7 Configuring LACP for 802.3ad mode in a more secure way +---------------------------------------------------------- + +When using 802.3ad bonding mode, the Actor (host) and Partner (switch) +exchange LACPDUs. These LACPDUs cannot be sniffed, because they are +destined to link local mac addresses (which switches/bridges are not +supposed to forward). However, most of the values are easily predictable +or are simply the machine's MAC address (which is trivially known to all +other hosts in the same L2). This implies that other machines in the L2 +domain can spoof LACPDU packets from other hosts to the switch and potentially +cause mayhem by joining (from the point of view of the switch) another +machine's aggregate, thus receiving a portion of that hosts incoming +traffic and / or spoofing traffic from that machine themselves (potentially +even successfully terminating some portion of flows). Though this is not +a likely scenario, one could avoid this possibility by simply configuring +few bonding parameters: + + (a) ad_actor_system : You can set a random mac-address that can be used for + these LACPDU exchanges. The value can not be either NULL or Multicast. + Also it's preferable to set the local-admin bit. Following shell code + generates a random mac-address as described above. + + # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \ + $(( (RANDOM & 0xFE) | 0x02 )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF )) \ + $(( RANDOM & 0xFF ))) + # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system + + (b) ad_actor_sys_prio : Randomize the system priority. The default value + is 65535, but system can take the value from 1 - 65535. Following shell + code generates random priority and sets it. + + # sys_prio=$(( 1 + RANDOM + RANDOM )) + # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio + + (c) ad_user_port_key : Use the user portion of the port-key. The default + keeps this empty. These are the upper 10 bits of the port-key and value + ranges from 0 - 1023. Following shell code generates these 10 bits and + sets it. + + # usr_port_key=$(( RANDOM & 0x3FF )) + # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key + + +4 Querying Bonding Configuration +================================= + +4.1 Bonding Configuration +------------------------- + + Each bonding device has a read-only file residing in the +/proc/net/bonding directory. The file contents include information +about the bonding configuration, options and state of each slave. + + For example, the contents of /proc/net/bonding/bond0 after the +driver is loaded with parameters of mode=0 and miimon=1000 is +generally as follows: + + Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004) + Bonding Mode: load balancing (round-robin) + Currently Active Slave: eth0 + MII Status: up + MII Polling Interval (ms): 1000 + Up Delay (ms): 0 + Down Delay (ms): 0 + + Slave Interface: eth1 + MII Status: up + Link Failure Count: 1 + + Slave Interface: eth0 + MII Status: up + Link Failure Count: 1 + + The precise format and contents will change depending upon the +bonding configuration, state, and version of the bonding driver. + +4.2 Network configuration +------------------------- + + The network configuration can be inspected using the ifconfig +command. Bonding devices will have the MASTER flag set; Bonding slave +devices will have the SLAVE flag set. The ifconfig output does not +contain information on which slaves are associated with which masters. + + In the example below, the bond0 interface is the master +(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of +bond0 have the same MAC address (HWaddr) as bond0 for all modes except +TLB and ALB that require a unique MAC address for each slave. + +# /sbin/ifconfig +bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 + UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 + RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 + TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 + collisions:0 txqueuelen:0 + +eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 + RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 + TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 + collisions:0 txqueuelen:100 + Interrupt:10 Base address:0x1080 + +eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 + UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 + RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 + TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 + collisions:0 txqueuelen:100 + Interrupt:9 Base address:0x1400 + +5. Switch Configuration +======================= + + For this section, "switch" refers to whatever system the +bonded devices are directly connected to (i.e., where the other end of +the cable plugs into). This may be an actual dedicated switch device, +or it may be another regular system (e.g., another computer running +Linux), + + The active-backup, balance-tlb and balance-alb modes do not +require any specific configuration of the switch. + + The 802.3ad mode requires that the switch have the appropriate +ports configured as an 802.3ad aggregation. The precise method used +to configure this varies from switch to switch, but, for example, a +Cisco 3550 series switch requires that the appropriate ports first be +grouped together in a single etherchannel instance, then that +etherchannel is set to mode "lacp" to enable 802.3ad (instead of +standard EtherChannel). + + The balance-rr, balance-xor and broadcast modes generally +require that the switch have the appropriate ports grouped together. +The nomenclature for such a group differs between switches, it may be +called an "etherchannel" (as in the Cisco example, above), a "trunk +group" or some other similar variation. For these modes, each switch +will also have its own configuration options for the switch's transmit +policy to the bond. Typical choices include XOR of either the MAC or +IP addresses. The transmit policy of the two peers does not need to +match. For these three modes, the bonding mode really selects a +transmit policy for an EtherChannel group; all three will interoperate +with another EtherChannel group. + + +6. 802.1q VLAN Support +====================== + + It is possible to configure VLAN devices over a bond interface +using the 8021q driver. However, only packets coming from the 8021q +driver and passing through bonding will be tagged by default. Self +generated packets, for example, bonding's learning packets or ARP +packets generated by either ALB mode or the ARP monitor mechanism, are +tagged internally by bonding itself. As a result, bonding must +"learn" the VLAN IDs configured above it, and use those IDs to tag +self generated packets. + + For reasons of simplicity, and to support the use of adapters +that can do VLAN hardware acceleration offloading, the bonding +interface declares itself as fully hardware offloading capable, it gets +the add_vid/kill_vid notifications to gather the necessary +information, and it propagates those actions to the slaves. In case +of mixed adapter types, hardware accelerated tagged packets that +should go through an adapter that is not offloading capable are +"un-accelerated" by the bonding driver so the VLAN tag sits in the +regular location. + + VLAN interfaces *must* be added on top of a bonding interface +only after enslaving at least one slave. The bonding interface has a +hardware address of 00:00:00:00:00:00 until the first slave is added. +If the VLAN interface is created prior to the first enslavement, it +would pick up the all-zeroes hardware address. Once the first slave +is attached to the bond, the bond device itself will pick up the +slave's hardware address, which is then available for the VLAN device. + + Also, be aware that a similar problem can occur if all slaves +are released from a bond that still has one or more VLAN interfaces on +top of it. When a new slave is added, the bonding interface will +obtain its hardware address from the first slave, which might not +match the hardware address of the VLAN interfaces (which was +ultimately copied from an earlier slave). + + There are two methods to insure that the VLAN device operates +with the correct hardware address if all slaves are removed from a +bond interface: + + 1. Remove all VLAN interfaces then recreate them + + 2. Set the bonding interface's hardware address so that it +matches the hardware address of the VLAN interfaces. + + Note that changing a VLAN interface's HW address would set the +underlying device -- i.e. the bonding interface -- to promiscuous +mode, which might not be what you want. + + +7. Link Monitoring +================== + + The bonding driver at present supports two schemes for +monitoring a slave device's link state: the ARP monitor and the MII +monitor. + + At the present time, due to implementation restrictions in the +bonding driver itself, it is not possible to enable both ARP and MII +monitoring simultaneously. + +7.1 ARP Monitor Operation +------------------------- + + The ARP monitor operates as its name suggests: it sends ARP +queries to one or more designated peer systems on the network, and +uses the response as an indication that the link is operating. This +gives some assurance that traffic is actually flowing to and from one +or more peers on the local network. + + The ARP monitor relies on the device driver itself to verify +that traffic is flowing. In particular, the driver must keep up to +date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX +flag must also update netdev_queue->trans_start. If they do not, then the +ARP monitor will immediately fail any slaves using that driver, and +those slaves will stay down. If networking monitoring (tcpdump, etc) +shows the ARP requests and replies on the network, then it may be that +your device driver is not updating last_rx and trans_start. + +7.2 Configuring Multiple ARP Targets +------------------------------------ + + While ARP monitoring can be done with just one target, it can +be useful in a High Availability setup to have several targets to +monitor. In the case of just one target, the target itself may go +down or have a problem making it unresponsive to ARP requests. Having +an additional target (or several) increases the reliability of the ARP +monitoring. + + Multiple ARP targets must be separated by commas as follows: + +# example options for ARP monitoring with three targets +alias bond0 bonding +options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 + + For just a single target the options would resemble: + +# example options for ARP monitoring with one target +alias bond0 bonding +options bond0 arp_interval=60 arp_ip_target=192.168.0.100 + + +7.3 MII Monitor Operation +------------------------- + + The MII monitor monitors only the carrier state of the local +network interface. It accomplishes this in one of three ways: by +depending upon the device driver to maintain its carrier state, by +querying the device's MII registers, or by making an ethtool query to +the device. + + If the use_carrier module parameter is 1 (the default value), +then the MII monitor will rely on the driver for carrier state +information (via the netif_carrier subsystem). As explained in the +use_carrier parameter information, above, if the MII monitor fails to +detect carrier loss on the device (e.g., when the cable is physically +disconnected), it may be that the driver does not support +netif_carrier. + + If use_carrier is 0, then the MII monitor will first query the +device's (via ioctl) MII registers and check the link state. If that +request fails (not just that it returns carrier down), then the MII +monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain +the same information. If both methods fail (i.e., the driver either +does not support or had some error in processing both the MII register +and ethtool requests), then the MII monitor will assume the link is +up. + +8. Potential Sources of Trouble +=============================== + +8.1 Adventures in Routing +------------------------- + + When bonding is configured, it is important that the slave +devices not have routes that supersede routes of the master (or, +generally, not have routes at all). For example, suppose the bonding +device bond0 has two slaves, eth0 and eth1, and the routing table is +as follows: + +Kernel IP routing table +Destination Gateway Genmask Flags MSS Window irtt Iface +10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 +10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 +10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 +127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo + + This routing configuration will likely still update the +receive/transmit times in the driver (needed by the ARP monitor), but +may bypass the bonding driver (because outgoing traffic to, in this +case, another host on network 10 would use eth0 or eth1 before bond0). + + The ARP monitor (and ARP itself) may become confused by this +configuration, because ARP requests (generated by the ARP monitor) +will be sent on one interface (bond0), but the corresponding reply +will arrive on a different interface (eth0). This reply looks to ARP +as an unsolicited ARP reply (because ARP matches replies on an +interface basis), and is discarded. The MII monitor is not affected +by the state of the routing table. + + The solution here is simply to insure that slaves do not have +routes of their own, and if for some reason they must, those routes do +not supersede routes of their master. This should generally be the +case, but unusual configurations or errant manual or automatic static +route additions may cause trouble. + +8.2 Ethernet Device Renaming +---------------------------- + + On systems with network configuration scripts that do not +associate physical devices directly with network interface names (so +that the same physical device always has the same "ethX" name), it may +be necessary to add some special logic to config files in +/etc/modprobe.d/. + + For example, given a modules.conf containing the following: + +alias bond0 bonding +options bond0 mode=some-mode miimon=50 +alias eth0 tg3 +alias eth1 tg3 +alias eth2 e1000 +alias eth3 e1000 + + If neither eth0 and eth1 are slaves to bond0, then when the +bond0 interface comes up, the devices may end up reordered. This +happens because bonding is loaded first, then its slave device's +drivers are loaded next. Since no other drivers have been loaded, +when the e1000 driver loads, it will receive eth0 and eth1 for its +devices, but the bonding configuration tries to enslave eth2 and eth3 +(which may later be assigned to the tg3 devices). + + Adding the following: + +add above bonding e1000 tg3 + + causes modprobe to load e1000 then tg3, in that order, when +bonding is loaded. This command is fully documented in the +modules.conf manual page. + + On systems utilizing modprobe an equivalent problem can occur. +In this case, the following can be added to config files in +/etc/modprobe.d/ as: + +softdep bonding pre: tg3 e1000 + + This will load tg3 and e1000 modules before loading the bonding one. +Full documentation on this can be found in the modprobe.d and modprobe +manual pages. + +8.3. Painfully Slow Or No Failed Link Detection By Miimon +--------------------------------------------------------- + + By default, bonding enables the use_carrier option, which +instructs bonding to trust the driver to maintain carrier state. + + As discussed in the options section, above, some drivers do +not support the netif_carrier_on/_off link state tracking system. +With use_carrier enabled, bonding will always see these links as up, +regardless of their actual state. + + Additionally, other drivers do support netif_carrier, but do +not maintain it in real time, e.g., only polling the link state at +some fixed interval. In this case, miimon will detect failures, but +only after some long period of time has expired. If it appears that +miimon is very slow in detecting link failures, try specifying +use_carrier=0 to see if that improves the failure detection time. If +it does, then it may be that the driver checks the carrier state at a +fixed interval, but does not cache the MII register values (so the +use_carrier=0 method of querying the registers directly works). If +use_carrier=0 does not improve the failover, then the driver may cache +the registers, or the problem may be elsewhere. + + Also, remember that miimon only checks for the device's +carrier state. It has no way to determine the state of devices on or +beyond other ports of a switch, or if a switch is refusing to pass +traffic while still maintaining carrier on. + +9. SNMP agents +=============== + + If running SNMP agents, the bonding driver should be loaded +before any network drivers participating in a bond. This requirement +is due to the interface index (ipAdEntIfIndex) being associated to +the first interface found with a given IP address. That is, there is +only one ipAdEntIfIndex for each IP address. For example, if eth0 and +eth1 are slaves of bond0 and the driver for eth0 is loaded before the +bonding driver, the interface for the IP address will be associated +with the eth0 interface. This configuration is shown below, the IP +address 192.168.1.1 has an interface index of 2 which indexes to eth0 +in the ifDescr table (ifDescr.2). + + interfaces.ifTable.ifEntry.ifDescr.1 = lo + interfaces.ifTable.ifEntry.ifDescr.2 = eth0 + interfaces.ifTable.ifEntry.ifDescr.3 = eth1 + interfaces.ifTable.ifEntry.ifDescr.4 = eth2 + interfaces.ifTable.ifEntry.ifDescr.5 = eth3 + interfaces.ifTable.ifEntry.ifDescr.6 = bond0 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 + + This problem is avoided by loading the bonding driver before +any network drivers participating in a bond. Below is an example of +loading the bonding driver first, the IP address 192.168.1.1 is +correctly associated with ifDescr.2. + + interfaces.ifTable.ifEntry.ifDescr.1 = lo + interfaces.ifTable.ifEntry.ifDescr.2 = bond0 + interfaces.ifTable.ifEntry.ifDescr.3 = eth0 + interfaces.ifTable.ifEntry.ifDescr.4 = eth1 + interfaces.ifTable.ifEntry.ifDescr.5 = eth2 + interfaces.ifTable.ifEntry.ifDescr.6 = eth3 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 + ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 + + While some distributions may not report the interface name in +ifDescr, the association between the IP address and IfIndex remains +and SNMP functions such as Interface_Scan_Next will report that +association. + +10. Promiscuous mode +==================== + + When running network monitoring tools, e.g., tcpdump, it is +common to enable promiscuous mode on the device, so that all traffic +is seen (instead of seeing only traffic destined for the local host). +The bonding driver handles promiscuous mode changes to the bonding +master device (e.g., bond0), and propagates the setting to the slave +devices. + + For the balance-rr, balance-xor, broadcast, and 802.3ad modes, +the promiscuous mode setting is propagated to all slaves. + + For the active-backup, balance-tlb and balance-alb modes, the +promiscuous mode setting is propagated only to the active slave. + + For balance-tlb mode, the active slave is the slave currently +receiving inbound traffic. + + For balance-alb mode, the active slave is the slave used as a +"primary." This slave is used for mode-specific control traffic, for +sending to peers that are unassigned or if the load is unbalanced. + + For the active-backup, balance-tlb and balance-alb modes, when +the active slave changes (e.g., due to a link failure), the +promiscuous setting will be propagated to the new active slave. + +11. Configuring Bonding for High Availability +============================================= + + High Availability refers to configurations that provide +maximum network availability by having redundant or backup devices, +links or switches between the host and the rest of the world. The +goal is to provide the maximum availability of network connectivity +(i.e., the network always works), even though other configurations +could provide higher throughput. + +11.1 High Availability in a Single Switch Topology +-------------------------------------------------- + + If two hosts (or a host and a single switch) are directly +connected via multiple physical links, then there is no availability +penalty to optimizing for maximum bandwidth. In this case, there is +only one switch (or peer), so if it fails, there is no alternative +access to fail over to. Additionally, the bonding load balance modes +support link monitoring of their members, so if individual links fail, +the load will be rebalanced across the remaining devices. + + See Section 12, "Configuring Bonding for Maximum Throughput" +for information on configuring bonding with one peer device. + +11.2 High Availability in a Multiple Switch Topology +---------------------------------------------------- + + With multiple switches, the configuration of bonding and the +network changes dramatically. In multiple switch topologies, there is +a trade off between network availability and usable bandwidth. + + Below is a sample network, configured to maximize the +availability of the network: + + | | + |port3 port3| + +-----+----+ +-----+----+ + | |port2 ISL port2| | + | switch A +--------------------------+ switch B | + | | | | + +-----+----+ +-----++---+ + |port1 port1| + | +-------+ | + +-------------+ host1 +---------------+ + eth0 +-------+ eth1 + + In this configuration, there is a link between the two +switches (ISL, or inter switch link), and multiple ports connecting to +the outside world ("port3" on each switch). There is no technical +reason that this could not be extended to a third switch. + +11.2.1 HA Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- + + In a topology such as the example above, the active-backup and +broadcast modes are the only useful bonding modes when optimizing for +availability; the other modes require all links to terminate on the +same peer for them to behave rationally. + +active-backup: This is generally the preferred mode, particularly if + the switches have an ISL and play together well. If the + network configuration is such that one switch is specifically + a backup switch (e.g., has lower capacity, higher cost, etc), + then the primary option can be used to insure that the + preferred link is always used when it is available. + +broadcast: This mode is really a special purpose mode, and is suitable + only for very specific needs. For example, if the two + switches are not connected (no ISL), and the networks beyond + them are totally independent. In this case, if it is + necessary for some specific one-way traffic to reach both + independent networks, then the broadcast mode may be suitable. + +11.2.2 HA Link Monitoring Selection for Multiple Switch Topology +---------------------------------------------------------------- + + The choice of link monitoring ultimately depends upon your +switch. If the switch can reliably fail ports in response to other +failures, then either the MII or ARP monitors should work. For +example, in the above example, if the "port3" link fails at the remote +end, the MII monitor has no direct means to detect this. The ARP +monitor could be configured with a target at the remote end of port3, +thus detecting that failure without switch support. + + In general, however, in a multiple switch topology, the ARP +monitor can provide a higher level of reliability in detecting end to +end connectivity failures (which may be caused by the failure of any +individual component to pass traffic for any reason). Additionally, +the ARP monitor should be configured with multiple targets (at least +one for each switch in the network). This will insure that, +regardless of which switch is active, the ARP monitor has a suitable +target to query. + + Note, also, that of late many switches now support a functionality +generally referred to as "trunk failover." This is a feature of the +switch that causes the link state of a particular switch port to be set +down (or up) when the state of another switch port goes down (or up). +Its purpose is to propagate link failures from logically "exterior" ports +to the logically "interior" ports that bonding is able to monitor via +miimon. Availability and configuration for trunk failover varies by +switch, but this can be a viable alternative to the ARP monitor when using +suitable switches. + +12. Configuring Bonding for Maximum Throughput +============================================== + +12.1 Maximizing Throughput in a Single Switch Topology +------------------------------------------------------ + + In a single switch configuration, the best method to maximize +throughput depends upon the application and network environment. The +various load balancing modes each have strengths and weaknesses in +different environments, as detailed below. + + For this discussion, we will break down the topologies into +two categories. Depending upon the destination of most traffic, we +categorize them into either "gatewayed" or "local" configurations. + + In a gatewayed configuration, the "switch" is acting primarily +as a router, and the majority of traffic passes through this router to +other networks. An example would be the following: + + + +----------+ +----------+ + | |eth0 port1| | to other networks + | Host A +---------------------+ router +-------------------> + | +---------------------+ | Hosts B and C are out + | |eth1 port2| | here somewhere + +----------+ +----------+ + + The router may be a dedicated router device, or another host +acting as a gateway. For our discussion, the important point is that +the majority of traffic from Host A will pass through the router to +some other network before reaching its final destination. + + In a gatewayed network configuration, although Host A may +communicate with many other systems, all of its traffic will be sent +and received via one other peer on the local network, the router. + + Note that the case of two systems connected directly via +multiple physical links is, for purposes of configuring bonding, the +same as a gatewayed configuration. In that case, it happens that all +traffic is destined for the "gateway" itself, not some other network +beyond the gateway. + + In a local configuration, the "switch" is acting primarily as +a switch, and the majority of traffic passes through this switch to +reach other stations on the same network. An example would be the +following: + + +----------+ +----------+ +--------+ + | |eth0 port1| +-------+ Host B | + | Host A +------------+ switch |port3 +--------+ + | +------------+ | +--------+ + | |eth1 port2| +------------------+ Host C | + +----------+ +----------+port4 +--------+ + + + Again, the switch may be a dedicated switch device, or another +host acting as a gateway. For our discussion, the important point is +that the majority of traffic from Host A is destined for other hosts +on the same local network (Hosts B and C in the above example). + + In summary, in a gatewayed configuration, traffic to and from +the bonded device will be to the same MAC level peer on the network +(the gateway itself, i.e., the router), regardless of its final +destination. In a local configuration, traffic flows directly to and +from the final destinations, thus, each destination (Host B, Host C) +will be addressed directly by their individual MAC addresses. + + This distinction between a gatewayed and a local network +configuration is important because many of the load balancing modes +available use the MAC addresses of the local network source and +destination to make load balancing decisions. The behavior of each +mode is described below. + + +12.1.1 MT Bonding Mode Selection for Single Switch Topology +----------------------------------------------------------- + + This configuration is the easiest to set up and to understand, +although you will have to decide which bonding mode best suits your +needs. The trade offs for each mode are detailed below: + +balance-rr: This mode is the only mode that will permit a single + TCP/IP connection to stripe traffic across multiple + interfaces. It is therefore the only mode that will allow a + single TCP/IP stream to utilize more than one interface's + worth of throughput. This comes at a cost, however: the + striping generally results in peer systems receiving packets out + of order, causing TCP/IP's congestion control system to kick + in, often by retransmitting segments. + + It is possible to adjust TCP/IP's congestion limits by + altering the net.ipv4.tcp_reordering sysctl parameter. The + usual default value is 3. But keep in mind TCP stack is able + to automatically increase this when it detects reorders. + + Note that the fraction of packets that will be delivered out of + order is highly variable, and is unlikely to be zero. The level + of reordering depends upon a variety of factors, including the + networking interfaces, the switch, and the topology of the + configuration. Speaking in general terms, higher speed network + cards produce more reordering (due to factors such as packet + coalescing), and a "many to many" topology will reorder at a + higher rate than a "many slow to one fast" configuration. + + Many switches do not support any modes that stripe traffic + (instead choosing a port based upon IP or MAC level addresses); + for those devices, traffic for a particular connection flowing + through the switch to a balance-rr bond will not utilize greater + than one interface's worth of bandwidth. + + If you are utilizing protocols other than TCP/IP, UDP for + example, and your application can tolerate out of order + delivery, then this mode can allow for single stream datagram + performance that scales near linearly as interfaces are added + to the bond. + + This mode requires the switch to have the appropriate ports + configured for "etherchannel" or "trunking." + +active-backup: There is not much advantage in this network topology to + the active-backup mode, as the inactive backup devices are all + connected to the same peer as the primary. In this case, a + load balancing mode (with link monitoring) will provide the + same level of network availability, but with increased + available bandwidth. On the plus side, active-backup mode + does not require any configuration of the switch, so it may + have value if the hardware available does not support any of + the load balance modes. + +balance-xor: This mode will limit traffic such that packets destined + for specific peers will always be sent over the same + interface. Since the destination is determined by the MAC + addresses involved, this mode works best in a "local" network + configuration (as described above), with destinations all on + the same local network. This mode is likely to be suboptimal + if all your traffic is passed through a single router (i.e., a + "gatewayed" network configuration, as described above). + + As with balance-rr, the switch ports need to be configured for + "etherchannel" or "trunking." + +broadcast: Like active-backup, there is not much advantage to this + mode in this type of network topology. + +802.3ad: This mode can be a good choice for this type of network + topology. The 802.3ad mode is an IEEE standard, so all peers + that implement 802.3ad should interoperate well. The 802.3ad + protocol includes automatic configuration of the aggregates, + so minimal manual configuration of the switch is needed + (typically only to designate that some set of devices is + available for 802.3ad). The 802.3ad standard also mandates + that frames be delivered in order (within certain limits), so + in general single connections will not see misordering of + packets. The 802.3ad mode does have some drawbacks: the + standard mandates that all devices in the aggregate operate at + the same speed and duplex. Also, as with all bonding load + balance modes other than balance-rr, no single connection will + be able to utilize more than a single interface's worth of + bandwidth. + + Additionally, the linux bonding 802.3ad implementation + distributes traffic by peer (using an XOR of MAC addresses + and packet type ID), so in a "gatewayed" configuration, all + outgoing traffic will generally use the same device. Incoming + traffic may also end up on a single device, but that is + dependent upon the balancing policy of the peer's 802.3ad + implementation. In a "local" configuration, traffic will be + distributed across the devices in the bond. + + Finally, the 802.3ad mode mandates the use of the MII monitor, + therefore, the ARP monitor is not available in this mode. + +balance-tlb: The balance-tlb mode balances outgoing traffic by peer. + Since the balancing is done according to MAC address, in a + "gatewayed" configuration (as described above), this mode will + send all traffic across a single device. However, in a + "local" network configuration, this mode balances multiple + local network peers across devices in a vaguely intelligent + manner (not a simple XOR as in balance-xor or 802.3ad mode), + so that mathematically unlucky MAC addresses (i.e., ones that + XOR to the same value) will not all "bunch up" on a single + interface. + + Unlike 802.3ad, interfaces may be of differing speeds, and no + special switch configuration is required. On the down side, + in this mode all incoming traffic arrives over a single + interface, this mode requires certain ethtool support in the + network device driver of the slave interfaces, and the ARP + monitor is not available. + +balance-alb: This mode is everything that balance-tlb is, and more. + It has all of the features (and restrictions) of balance-tlb, + and will also balance incoming traffic from local network + peers (as described in the Bonding Module Options section, + above). + + The only additional down side to this mode is that the network + device driver must support changing the hardware address while + the device is open. + +12.1.2 MT Link Monitoring for Single Switch Topology +---------------------------------------------------- + + The choice of link monitoring may largely depend upon which +mode you choose to use. The more advanced load balancing modes do not +support the use of the ARP monitor, and are thus restricted to using +the MII monitor (which does not provide as high a level of end to end +assurance as the ARP monitor). + +12.2 Maximum Throughput in a Multiple Switch Topology +----------------------------------------------------- + + Multiple switches may be utilized to optimize for throughput +when they are configured in parallel as part of an isolated network +between two or more systems, for example: + + +-----------+ + | Host A | + +-+---+---+-+ + | | | + +--------+ | +---------+ + | | | + +------+---+ +-----+----+ +-----+----+ + | Switch A | | Switch B | | Switch C | + +------+---+ +-----+----+ +-----+----+ + | | | + +--------+ | +---------+ + | | | + +-+---+---+-+ + | Host B | + +-----------+ + + In this configuration, the switches are isolated from one +another. One reason to employ a topology such as this is for an +isolated network with many hosts (a cluster configured for high +performance, for example), using multiple smaller switches can be more +cost effective than a single larger switch, e.g., on a network with 24 +hosts, three 24 port switches can be significantly less expensive than +a single 72 port switch. + + If access beyond the network is required, an individual host +can be equipped with an additional network device connected to an +external network; this host then additionally acts as a gateway. + +12.2.1 MT Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- + + In actual practice, the bonding mode typically employed in +configurations of this type is balance-rr. Historically, in this +network configuration, the usual caveats about out of order packet +delivery are mitigated by the use of network adapters that do not do +any kind of packet coalescing (via the use of NAPI, or because the +device itself does not generate interrupts until some number of +packets has arrived). When employed in this fashion, the balance-rr +mode allows individual connections between two hosts to effectively +utilize greater than one interface's bandwidth. + +12.2.2 MT Link Monitoring for Multiple Switch Topology +------------------------------------------------------ + + Again, in actual practice, the MII monitor is most often used +in this configuration, as performance is given preference over +availability. The ARP monitor will function in this topology, but its +advantages over the MII monitor are mitigated by the volume of probes +needed as the number of systems involved grows (remember that each +host in the network is configured with bonding). + +13. Switch Behavior Issues +========================== + +13.1 Link Establishment and Failover Delays +------------------------------------------- + + Some switches exhibit undesirable behavior with regard to the +timing of link up and down reporting by the switch. + + First, when a link comes up, some switches may indicate that +the link is up (carrier available), but not pass traffic over the +interface for some period of time. This delay is typically due to +some type of autonegotiation or routing protocol, but may also occur +during switch initialization (e.g., during recovery after a switch +failure). If you find this to be a problem, specify an appropriate +value to the updelay bonding module option to delay the use of the +relevant interface(s). + + Second, some switches may "bounce" the link state one or more +times while a link is changing state. This occurs most commonly while +the switch is initializing. Again, an appropriate updelay value may +help. + + Note that when a bonding interface has no active links, the +driver will immediately reuse the first link that goes up, even if the +updelay parameter has been specified (the updelay is ignored in this +case). If there are slave interfaces waiting for the updelay timeout +to expire, the interface that first went into that state will be +immediately reused. This reduces down time of the network if the +value of updelay has been overestimated, and since this occurs only in +cases with no connectivity, there is no additional penalty for +ignoring the updelay. + + In addition to the concerns about switch timings, if your +switches take a long time to go into backup mode, it may be desirable +to not activate a backup interface immediately after a link goes down. +Failover may be delayed via the downdelay bonding module option. + +13.2 Duplicated Incoming Packets +-------------------------------- + + NOTE: Starting with version 3.0.2, the bonding driver has logic to +suppress duplicate packets, which should largely eliminate this problem. +The following description is kept for reference. + + It is not uncommon to observe a short burst of duplicated +traffic when the bonding device is first used, or after it has been +idle for some period of time. This is most easily observed by issuing +a "ping" to some other host on the network, and noticing that the +output from ping flags duplicates (typically one per slave). + + For example, on a bond in active-backup mode with five slaves +all connected to one switch, the output may appear as follows: + +# ping -n 10.0.4.2 +PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms +64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms +64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms + + This is not due to an error in the bonding driver, rather, it +is a side effect of how many switches update their MAC forwarding +tables. Initially, the switch does not associate the MAC address in +the packet with a particular switch port, and so it may send the +traffic to all ports until its MAC forwarding table is updated. Since +the interfaces attached to the bond may occupy multiple ports on a +single switch, when the switch (temporarily) floods the traffic to all +ports, the bond device receives multiple copies of the same packet +(one per slave device). + + The duplicated packet behavior is switch dependent, some +switches exhibit this, and some do not. On switches that display this +behavior, it can be induced by clearing the MAC forwarding table (on +most Cisco switches, the privileged command "clear mac address-table +dynamic" will accomplish this). + +14. Hardware Specific Considerations +==================================== + + This section contains additional information for configuring +bonding on specific hardware platforms, or for interfacing bonding +with particular switches or other devices. + +14.1 IBM BladeCenter +-------------------- + + This applies to the JS20 and similar systems. + + On the JS20 blades, the bonding driver supports only +balance-rr, active-backup, balance-tlb and balance-alb modes. This is +largely due to the network topology inside the BladeCenter, detailed +below. + +JS20 network adapter information +-------------------------------- + + All JS20s come with two Broadcom Gigabit Ethernet ports +integrated on the planar (that's "motherboard" in IBM-speak). In the +BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to +I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. +An add-on Broadcom daughter card can be installed on a JS20 to provide +two more Gigabit Ethernet ports. These ports, eth2 and eth3, are +wired to I/O Modules 3 and 4, respectively. + + Each I/O Module may contain either a switch or a passthrough +module (which allows ports to be directly connected to an external +switch). Some bonding modes require a specific BladeCenter internal +network topology in order to function; these are detailed below. + + Additional BladeCenter-specific networking information can be +found in two IBM Redbooks (www.ibm.com/redbooks): + +"IBM eServer BladeCenter Networking Options" +"IBM eServer BladeCenter Layer 2-7 Network Switching" + +BladeCenter networking configuration +------------------------------------ + + Because a BladeCenter can be configured in a very large number +of ways, this discussion will be confined to describing basic +configurations. + + Normally, Ethernet Switch Modules (ESMs) are used in I/O +modules 1 and 2. In this configuration, the eth0 and eth1 ports of a +JS20 will be connected to different internal switches (in the +respective I/O modules). + + A passthrough module (OPM or CPM, optical or copper, +passthrough module) connects the I/O module directly to an external +switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 +interfaces of a JS20 can be redirected to the outside world and +connected to a common external switch. + + Depending upon the mix of ESMs and PMs, the network will +appear to bonding as either a single switch topology (all PMs) or as a +multiple switch topology (one or more ESMs, zero or more PMs). It is +also possible to connect ESMs together, resulting in a configuration +much like the example in "High Availability in a Multiple Switch +Topology," above. + +Requirements for specific modes +------------------------------- + + The balance-rr mode requires the use of passthrough modules +for devices in the bond, all connected to an common external switch. +That switch must be configured for "etherchannel" or "trunking" on the +appropriate ports, as is usual for balance-rr. + + The balance-alb and balance-tlb modes will function with +either switch modules or passthrough modules (or a mix). The only +specific requirement for these modes is that all network interfaces +must be able to reach all destinations for traffic sent over the +bonding device (i.e., the network must converge at some point outside +the BladeCenter). + + The active-backup mode has no additional requirements. + +Link monitoring issues +---------------------- + + When an Ethernet Switch Module is in place, only the ARP +monitor will reliably detect link loss to an external switch. This is +nothing unusual, but examination of the BladeCenter cabinet would +suggest that the "external" network ports are the ethernet ports for +the system, when it fact there is a switch between these "external" +ports and the devices on the JS20 system itself. The MII monitor is +only able to detect link failures between the ESM and the JS20 system. + + When a passthrough module is in place, the MII monitor does +detect failures to the "external" port, which is then directly +connected to the JS20 system. + +Other concerns +-------------- + + The Serial Over LAN (SoL) link is established over the primary +ethernet (eth0) only, therefore, any loss of link to eth0 will result +in losing your SoL connection. It will not fail over with other +network traffic, as the SoL system is beyond the control of the +bonding driver. + + It may be desirable to disable spanning tree on the switch +(either the internal Ethernet Switch Module, or an external switch) to +avoid fail-over delay issues when using bonding. + + +15. Frequently Asked Questions +============================== + +1. Is it SMP safe? + + Yes. The old 2.0.xx channel bonding patch was not SMP safe. +The new driver was designed to be SMP safe from the start. + +2. What type of cards will work with it? + + Any Ethernet type cards (you can even mix cards - a Intel +EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, +devices need not be of the same speed. + + Starting with version 3.2.1, bonding also supports Infiniband +slaves in active-backup mode. + +3. How many bonding devices can I have? + + There is no limit. + +4. How many slaves can a bonding device have? + + This is limited only by the number of network interfaces Linux +supports and/or the number of network cards you can place in your +system. + +5. What happens when a slave link dies? + + If link monitoring is enabled, then the failing device will be +disabled. The active-backup mode will fail over to a backup link, and +other modes will ignore the failed link. The link will continue to be +monitored, and should it recover, it will rejoin the bond (in whatever +manner is appropriate for the mode). See the sections on High +Availability and the documentation for each mode for additional +information. + + Link monitoring can be enabled via either the miimon or +arp_interval parameters (described in the module parameters section, +above). In general, miimon monitors the carrier state as sensed by +the underlying network device, and the arp monitor (arp_interval) +monitors connectivity to another host on the local network. + + If no link monitoring is configured, the bonding driver will +be unable to detect link failures, and will assume that all links are +always available. This will likely result in lost packets, and a +resulting degradation of performance. The precise performance loss +depends upon the bonding mode and network configuration. + +6. Can bonding be used for High Availability? + + Yes. See the section on High Availability for details. + +7. Which switches/systems does it work with? + + The full answer to this depends upon the desired mode. + + In the basic balance modes (balance-rr and balance-xor), it +works with any system that supports etherchannel (also called +trunking). Most managed switches currently available have such +support, and many unmanaged switches as well. + + The advanced balance modes (balance-tlb and balance-alb) do +not have special switch requirements, but do need device drivers that +support specific features (described in the appropriate section under +module parameters, above). + + In 802.3ad mode, it works with systems that support IEEE +802.3ad Dynamic Link Aggregation. Most managed and many unmanaged +switches currently available support 802.3ad. + + The active-backup mode should work with any Layer-II switch. + +8. Where does a bonding device get its MAC address from? + + When using slave devices that have fixed MAC addresses, or when +the fail_over_mac option is enabled, the bonding device's MAC address is +the MAC address of the active slave. + + For other configurations, if not explicitly configured (with +ifconfig or ip link), the MAC address of the bonding device is taken from +its first slave device. This MAC address is then passed to all following +slaves and remains persistent (even if the first slave is removed) until +the bonding device is brought down or reconfigured. + + If you wish to change the MAC address, you can set it with +ifconfig or ip link: + +# ifconfig bond0 hw ether 00:11:22:33:44:55 + +# ip link set bond0 address 66:77:88:99:aa:bb + + The MAC address can be also changed by bringing down/up the +device and then changing its slaves (or their order): + +# ifconfig bond0 down ; modprobe -r bonding +# ifconfig bond0 .... up +# ifenslave bond0 eth... + + This method will automatically take the address from the next +slave that is added. + + To restore your slaves' MAC addresses, you need to detach them +from the bond (`ifenslave -d bond0 eth0'). The bonding driver will +then restore the MAC addresses that the slaves had before they were +enslaved. + +16. Resources and Links +======================= + + The latest version of the bonding driver can be found in the latest +version of the linux kernel, found on http://kernel.org + + The latest version of this document can be found in the latest kernel +source (named Documentation/networking/bonding.txt). + + Discussions regarding the usage of the bonding driver take place on the +bonding-devel mailing list, hosted at sourceforge.net. If you have questions or +problems, post them to the list. The list address is: + +bonding-devel@lists.sourceforge.net + + The administrative interface (to subscribe or unsubscribe) can +be found at: + +https://lists.sourceforge.net/lists/listinfo/bonding-devel + + Discussions regarding the development of the bonding driver take place +on the main Linux network mailing list, hosted at vger.kernel.org. The list +address is: + +netdev@vger.kernel.org + + The administrative interface (to subscribe or unsubscribe) can +be found at: + +http://vger.kernel.org/vger-lists.html#netdev + +Donald Becker's Ethernet Drivers and diag programs may be found at : + - http://web.archive.org/web/*/http://www.scyld.com/network/ + +You will also find a lot of information regarding Ethernet, NWay, MII, +etc. at www.scyld.com. + +-- END -- diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst new file mode 100644 index 000000000..4aef9cddd --- /dev/null +++ b/Documentation/networking/bridge.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Ethernet Bridging +================= + +In order to use the Ethernet bridging functionality, you'll need the +userspace tools. + +Documentation for Linux bridging is on: + http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge + +The bridge-utilities are maintained at: + git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git + +Additionally, the iproute2 utilities can be used to configure +bridge devices. + +If you still have questions, don't hesitate to post to the mailing list +(more info https://lists.linux-foundation.org/mailman/listinfo/bridge). + diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/Linux-CAIF.txt new file mode 100644 index 000000000..0aa4bd381 --- /dev/null +++ b/Documentation/networking/caif/Linux-CAIF.txt @@ -0,0 +1,175 @@ +Linux CAIF +=========== +copyright (C) ST-Ericsson AB 2010 +Author: Sjur Brendeland/ sjur.brandeland@stericsson.com +License terms: GNU General Public License (GPL) version 2 + + +Introduction +------------ +CAIF is a MUX protocol used by ST-Ericsson cellular modems for +communication between Modem and host. The host processes can open virtual AT +channels, initiate GPRS Data connections, Video channels and Utility Channels. +The Utility Channels are general purpose pipes between modem and host. + +ST-Ericsson modems support a number of transports between modem +and host. Currently, UART and Loopback are available for Linux. + + +Architecture: +------------ +The implementation of CAIF is divided into: +* CAIF Socket Layer and GPRS IP Interface. +* CAIF Core Protocol Implementation +* CAIF Link Layer, implemented as NET devices. + + + RTNL + ! + ! +------+ +------+ + ! +------+! +------+! + ! ! IP !! !Socket!! + +-------> !interf!+ ! API !+ <- CAIF Client APIs + ! +------+ +------! + ! ! ! + ! +-----------+ + ! ! + ! +------+ <- CAIF Core Protocol + ! ! CAIF ! + ! ! Core ! + ! +------+ + ! +----------!---------+ + ! ! ! ! + ! +------+ +-----+ +------+ + +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices) + +------+ +-----+ +------+ + + + +I M P L E M E N T A T I O N +=========================== + + +CAIF Core Protocol Layer +========================================= + +CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson. +It implements the CAIF protocol stack in a layered approach, where +each layer described in the specification is implemented as a separate layer. +The architecture is inspired by the design patterns "Protocol Layer" and +"Protocol Packet". + +== CAIF structure == +The Core CAIF implementation contains: + - Simple implementation of CAIF. + - Layered architecture (a la Streams), each layer in the CAIF + specification is implemented in a separate c-file. + - Clients must call configuration function to add PHY layer. + - Clients must implement CAIF layer to consume/produce + CAIF payload with receive and transmit functions. + - Clients must call configuration function to add and connect the + Client layer. + - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed + to the called function (except for framing layers' receive function) + +Layered Architecture +-------------------- +The CAIF protocol can be divided into two parts: Support functions and Protocol +Implementation. The support functions include: + + - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The + CAIF Packet has functions for creating, destroying and adding content + and for adding/extracting header and trailers to protocol packets. + +The CAIF Protocol implementation contains: + + - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol + Stack and provides a Client interface for adding Link-Layer and + Driver interfaces on top of the CAIF Stack. + + - CFCTRL CAIF Control layer. Encodes and Decodes control messages + such as enumeration and channel setup. Also matches request and + response messages. + + - CFSERVL General CAIF Service Layer functionality; handles flow + control and remote shutdown requests. + + - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual + External Interface). This layer encodes/decodes VEI frames. + + - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP + traffic), encodes/decodes Datagram frames. + + - CFMUX CAIF Mux layer. Handles multiplexing between multiple + physical bearers and multiple channels such as VEI, Datagram, etc. + The MUX keeps track of the existing CAIF Channels and + Physical Instances and selects the appropriate instance based + on Channel-Id and Physical-ID. + + - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length + and frame checksum. + + - CFSERL CAIF Serial layer. Handles concatenation/split of frames + into CAIF Frames with correct length. + + + + +---------+ + | Config | + | CFCNFG | + +---------+ + ! + +---------+ +---------+ +---------+ + | AT | | Control | | Datagram| + | CFVEIL | | CFCTRL | | CFDGML | + +---------+ +---------+ +---------+ + \_____________!______________/ + ! + +---------+ + | MUX | + | | + +---------+ + _____!_____ + / \ + +---------+ +---------+ + | CFFRML | | CFFRML | + | Framing | | Framing | + +---------+ +---------+ + ! ! + +---------+ +---------+ + | | | Serial | + | | | CFSERL | + +---------+ +---------+ + + +In this layered approach the following "rules" apply. + - All layers embed the same structure "struct cflayer" + - A layer does not depend on any other layer's private data. + - Layers are stacked by setting the pointers + layer->up , layer->dn + - In order to send data upwards, each layer should do + layer->up->receive(layer->up, packet); + - In order to send data downwards, each layer should do + layer->dn->transmit(layer->dn, packet); + + +CAIF Socket and IP interface +=========================== + +The IP interface and CAIF socket API are implemented on top of the +CAIF Core protocol. The IP Interface and CAIF socket have an instance of +'struct cflayer', just like the CAIF Core protocol stack. +Net device and Socket implement the 'receive()' function defined by +'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and +receive of packets is handled as by the rest of the layers: the 'dn->transmit()' +function is called in order to transmit data. + +Configuration of Link Layer +--------------------------- +The Link Layer is implemented as Linux network devices (struct net_device). +Payload handling and registration is done using standard Linux mechanisms. + +The CAIF Protocol relies on a loss-less link layer without implementing +retransmission. This implies that packet drops must not happen. +Therefore a flow-control mechanism is implemented where the physical +interface can initiate flow stop for all CAIF Channels. diff --git a/Documentation/networking/caif/README b/Documentation/networking/caif/README new file mode 100644 index 000000000..757ccfaa1 --- /dev/null +++ b/Documentation/networking/caif/README @@ -0,0 +1,109 @@ +Copyright (C) ST-Ericsson AB 2010 +Author: Sjur Brendeland/ sjur.brandeland@stericsson.com +License terms: GNU General Public License (GPL) version 2 +--------------------------------------------------------- + +=== Start === +If you have compiled CAIF for modules do: + +$modprobe crc_ccitt +$modprobe caif +$modprobe caif_socket +$modprobe chnl_net + + +=== Preparing the setup with a STE modem === + +If you are working on integration of CAIF you should make sure +that the kernel is built with module support. + +There are some things that need to be tweaked to get the host TTY correctly +set up to talk to the modem. +Since the CAIF stack is running in the kernel and we want to use the existing +TTY, we are installing our physical serial driver as a line discipline above +the TTY device. + +To achieve this we need to install the N_CAIF ldisc from user space. +The benefit is that we can hook up to any TTY. + +The use of Start-of-frame-extension (STX) must also be set as +module parameter "ser_use_stx". + +Normally Frame Checksum is always used on UART, but this is also provided as a +module parameter "ser_use_fcs". + +$ modprobe caif_serial ser_ttyname=/dev/ttyS0 ser_use_stx=yes +$ ifconfig caif_ttyS0 up + +PLEASE NOTE: There is a limitation in Android shell. + It only accepts one argument to insmod/modprobe! + +=== Trouble shooting === + +There are debugfs parameters provided for serial communication. +/sys/kernel/debug/caif_serial/<tty-name>/ + +* ser_state: Prints the bit-mask status where + - 0x02 means SENDING, this is a transient state. + - 0x10 means FLOW_OFF_SENT, i.e. the previous frame has not been sent + and is blocking further send operation. Flow OFF has been propagated + to all CAIF Channels using this TTY. + +* tty_status: Prints the bit-mask tty status information + - 0x01 - tty->warned is on. + - 0x02 - tty->low_latency is on. + - 0x04 - tty->packed is on. + - 0x08 - tty->flow_stopped is on. + - 0x10 - tty->hw_stopped is on. + - 0x20 - tty->stopped is on. + +* last_tx_msg: Binary blob Prints the last transmitted frame. + This can be printed with + $od --format=x1 /sys/kernel/debug/caif_serial/<tty>/last_rx_msg. + The first two tx messages sent look like this. Note: The initial + byte 02 is start of frame extension (STX) used for re-syncing + upon errors. + + - Enumeration: + 0000000 02 05 00 00 03 01 d2 02 + | | | | | | + STX(1) | | | | + Length(2)| | | + Control Channel(1) + Command:Enumeration(1) + Link-ID(1) + Checksum(2) + - Channel Setup: + 0000000 02 07 00 00 00 21 a1 00 48 df + | | | | | | | | + STX(1) | | | | | | + Length(2)| | | | | + Control Channel(1) + Command:Channel Setup(1) + Channel Type(1) + Priority and Link-ID(1) + Endpoint(1) + Checksum(2) + +* last_rx_msg: Prints the last transmitted frame. + The RX messages for LinkSetup look almost identical but they have the + bit 0x20 set in the command bit, and Channel Setup has added one byte + before Checksum containing Channel ID. + NOTE: Several CAIF Messages might be concatenated. The maximum debug + buffer size is 128 bytes. + +== Error Scenarios: +- last_tx_msg contains channel setup message and last_rx_msg is empty -> + The host seems to be able to send over the UART, at least the CAIF ldisc get + notified that sending is completed. + +- last_tx_msg contains enumeration message and last_rx_msg is empty -> + The host is not able to send the message from UART, the tty has not been + able to complete the transmit operation. + +- if /sys/kernel/debug/caif_serial/<tty>/tty_status is non-zero there + might be problems transmitting over UART. + E.g. host and modem wiring is not correct you will typically see + tty_status = 0x10 (hw_stopped) and ser_state = 0x10 (FLOW_OFF_SENT). + You will probably see the enumeration message in last_tx_message + and empty last_rx_message. diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt new file mode 100644 index 000000000..9efd0687d --- /dev/null +++ b/Documentation/networking/caif/spi_porting.txt @@ -0,0 +1,208 @@ +- CAIF SPI porting - + +- CAIF SPI basics: + +Running CAIF over SPI needs some extra setup, owing to the nature of SPI. +Two extra GPIOs have been added in order to negotiate the transfers + between the master and the slave. The minimum requirement for running +CAIF over SPI is a SPI slave chip and two GPIOs (more details below). +Please note that running as a slave implies that you need to keep up +with the master clock. An overrun or underrun event is fatal. + +- CAIF SPI framework: + +To make porting as easy as possible, the CAIF SPI has been divided in +two parts. The first part (called the interface part) deals with all +generic functionality such as length framing, SPI frame negotiation +and SPI frame delivery and transmission. The other part is the CAIF +SPI slave device part, which is the module that you have to write if +you want to run SPI CAIF on a new hardware. This part takes care of +the physical hardware, both with regard to SPI and to GPIOs. + +- Implementing a CAIF SPI device: + + - Functionality provided by the CAIF SPI slave device: + + In order to implement a SPI device you will, as a minimum, + need to implement the following + functions: + + int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev): + + This function is called by the CAIF SPI interface to give + you a chance to set up your hardware to be ready to receive + a stream of data from the master. The xfer structure contains + both physical and logical addresses, as well as the total length + of the transfer in both directions.The dev parameter can be used + to map to different CAIF SPI slave devices. + + void (*sig_xfer) (bool xfer, struct cfspi_dev *dev): + + This function is called by the CAIF SPI interface when the output + (SPI_INT) GPIO needs to change state. The boolean value of the xfer + variable indicates whether the GPIO should be asserted (HIGH) or + deasserted (LOW). The dev parameter can be used to map to different CAIF + SPI slave devices. + + - Functionality provided by the CAIF SPI interface: + + void (*ss_cb) (bool assert, struct cfspi_ifc *ifc); + + This function is called by the CAIF SPI slave device in order to + signal a change of state of the input GPIO (SS) to the interface. + Only active edges are mandatory to be reported. + This function can be called from IRQ context (recommended in order + not to introduce latency). The ifc parameter should be the pointer + returned from the platform probe function in the SPI device structure. + + void (*xfer_done_cb) (struct cfspi_ifc *ifc); + + This function is called by the CAIF SPI slave device in order to + report that a transfer is completed. This function should only be + called once both the transmission and the reception are completed. + This function can be called from IRQ context (recommended in order + not to introduce latency). The ifc parameter should be the pointer + returned from the platform probe function in the SPI device structure. + + - Connecting the bits and pieces: + + - Filling in the SPI slave device structure: + + Connect the necessary callback functions. + Indicate clock speed (used to calculate toggle delays). + Chose a suitable name (helps debugging if you use several CAIF + SPI slave devices). + Assign your private data (can be used to map to your structure). + + - Filling in the SPI slave platform device structure: + Add name of driver to connect to ("cfspi_sspi"). + Assign the SPI slave device structure as platform data. + +- Padding: + +In order to optimize throughput, a number of SPI padding options are provided. +Padding can be enabled independently for uplink and downlink transfers. +Padding can be enabled for the head, the tail and for the total frame size. +The padding needs to be correctly configured on both sides of the link. +The padding can be changed via module parameters in cfspi_sspi.c or via +the sysfs directory of the cfspi_sspi driver (before device registration). + +- CAIF SPI device template: + +/* + * Copyright (C) ST-Ericsson AB 2010 + * Author: Daniel Martensson / Daniel.Martensson@stericsson.com + * License terms: GNU General Public License (GPL), version 2. + * + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/wait.h> +#include <linux/interrupt.h> +#include <linux/dma-mapping.h> +#include <net/caif/caif_spi.h> + +MODULE_LICENSE("GPL"); + +struct sspi_struct { + struct cfspi_dev sdev; + struct cfspi_xfer *xfer; +}; + +static struct sspi_struct slave; +static struct platform_device slave_device; + +static irqreturn_t sspi_irq(int irq, void *arg) +{ + /* You only need to trigger on an edge to the active state of the + * SS signal. Once a edge is detected, the ss_cb() function should be + * called with the parameter assert set to true. It is OK + * (and even advised) to call the ss_cb() function in IRQ context in + * order not to add any delay. */ + + return IRQ_HANDLED; +} + +static void sspi_complete(void *context) +{ + /* Normally the DMA or the SPI framework will call you back + * in something similar to this. The only thing you need to + * do is to call the xfer_done_cb() function, providing the pointer + * to the CAIF SPI interface. It is OK to call this function + * from IRQ context. */ +} + +static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev) +{ + /* Store transfer info. For a normal implementation you should + * set up your DMA here and make sure that you are ready to + * receive the data from the master SPI. */ + + struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; + + sspi->xfer = xfer; + + return 0; +} + +void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev) +{ + /* If xfer is true then you should assert the SPI_INT to indicate to + * the master that you are ready to receive the data from the master + * SPI. If xfer is false then you should de-assert SPI_INT to indicate + * that the transfer is done. + */ + + struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; +} + +static void sspi_release(struct device *dev) +{ + /* + * Here you should release your SPI device resources. + */ +} + +static int __init sspi_init(void) +{ + /* Here you should initialize your SPI device by providing the + * necessary functions, clock speed, name and private data. Once + * done, you can register your device with the + * platform_device_register() function. This function will return + * with the CAIF SPI interface initialized. This is probably also + * the place where you should set up your GPIOs, interrupts and SPI + * resources. */ + + int res = 0; + + /* Initialize slave device. */ + slave.sdev.init_xfer = sspi_init_xfer; + slave.sdev.sig_xfer = sspi_sig_xfer; + slave.sdev.clk_mhz = 13; + slave.sdev.priv = &slave; + slave.sdev.name = "spi_sspi"; + slave_device.dev.release = sspi_release; + + /* Initialize platform device. */ + slave_device.name = "cfspi_sspi"; + slave_device.dev.platform_data = &slave.sdev; + + /* Register platform device. */ + res = platform_device_register(&slave_device); + if (res) { + printk(KERN_WARNING "sspi_init: failed to register dev.\n"); + return -ENODEV; + } + + return res; +} + +static void __exit sspi_exit(void) +{ + platform_device_del(&slave_device); +} + +module_init(sspi_init); +module_exit(sspi_exit); diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst new file mode 100644 index 000000000..2fd0b51a8 --- /dev/null +++ b/Documentation/networking/can.rst @@ -0,0 +1,1437 @@ +=================================== +SocketCAN - Controller Area Network +=================================== + +Overview / What is SocketCAN +============================ + +The socketcan package is an implementation of CAN protocols +(Controller Area Network) for Linux. CAN is a networking technology +which has widespread use in automation, embedded devices, and +automotive fields. While there have been other CAN implementations +for Linux based on character devices, SocketCAN uses the Berkeley +socket API, the Linux network stack and implements the CAN device +drivers as network interfaces. The CAN socket API has been designed +as similar as possible to the TCP/IP protocols to allow programmers, +familiar with network programming, to easily learn how to use CAN +sockets. + + +.. _socketcan-motivation: + +Motivation / Why Using the Socket API +===================================== + +There have been CAN implementations for Linux before SocketCAN so the +question arises, why we have started another project. Most existing +implementations come as a device driver for some CAN hardware, they +are based on character devices and provide comparatively little +functionality. Usually, there is only a hardware-specific device +driver which provides a character device interface to send and +receive raw CAN frames, directly to/from the controller hardware. +Queueing of frames and higher-level transport protocols like ISO-TP +have to be implemented in user space applications. Also, most +character-device implementations support only one single process to +open the device at a time, similar to a serial interface. Exchanging +the CAN controller requires employment of another device driver and +often the need for adaption of large parts of the application to the +new driver's API. + +SocketCAN was designed to overcome all of these limitations. A new +protocol family has been implemented which provides a socket interface +to user space applications and which builds upon the Linux network +layer, enabling use all of the provided queueing functionality. A device +driver for CAN controller hardware registers itself with the Linux +network layer as a network device, so that CAN frames from the +controller can be passed up to the network layer and on to the CAN +protocol family module and also vice-versa. Also, the protocol family +module provides an API for transport protocol modules to register, so +that any number of transport protocols can be loaded or unloaded +dynamically. In fact, the can core module alone does not provide any +protocol and cannot be used without loading at least one additional +protocol module. Multiple sockets can be opened at the same time, +on different or the same protocol module and they can listen/send +frames on different or the same CAN IDs. Several sockets listening on +the same interface for frames with the same CAN ID are all passed the +same received matching CAN frames. An application wishing to +communicate using a specific transport protocol, e.g. ISO-TP, just +selects that protocol when opening the socket, and then can read and +write application data byte streams, without having to deal with +CAN-IDs, frames, etc. + +Similar functionality visible from user-space could be provided by a +character device, too, but this would lead to a technically inelegant +solution for a couple of reasons: + +* **Intricate usage:** Instead of passing a protocol argument to + socket(2) and using bind(2) to select a CAN interface and CAN ID, an + application would have to do all these operations using ioctl(2)s. + +* **Code duplication:** A character device cannot make use of the Linux + network queueing code, so all that code would have to be duplicated + for CAN networking. + +* **Abstraction:** In most existing character-device implementations, the + hardware-specific device driver for a CAN controller directly + provides the character device for the application to work with. + This is at least very unusual in Unix systems for both, char and + block devices. For example you don't have a character device for a + certain UART of a serial interface, a certain sound chip in your + computer, a SCSI or IDE controller providing access to your hard + disk or tape streamer device. Instead, you have abstraction layers + which provide a unified character or block device interface to the + application on the one hand, and a interface for hardware-specific + device drivers on the other hand. These abstractions are provided + by subsystems like the tty layer, the audio subsystem or the SCSI + and IDE subsystems for the devices mentioned above. + + The easiest way to implement a CAN device driver is as a character + device without such a (complete) abstraction layer, as is done by most + existing drivers. The right way, however, would be to add such a + layer with all the functionality like registering for certain CAN + IDs, supporting several open file descriptors and (de)multiplexing + CAN frames between them, (sophisticated) queueing of CAN frames, and + providing an API for device drivers to register with. However, then + it would be no more difficult, or may be even easier, to use the + networking framework provided by the Linux kernel, and this is what + SocketCAN does. + +The use of the networking framework of the Linux kernel is just the +natural and most appropriate way to implement CAN for Linux. + + +.. _socketcan-concept: + +SocketCAN Concept +================= + +As described in :ref:`socketcan-motivation` the main goal of SocketCAN is to +provide a socket interface to user space applications which builds +upon the Linux network layer. In contrast to the commonly known +TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!) +medium that has no MAC-layer addressing like ethernet. The CAN-identifier +(can_id) is used for arbitration on the CAN-bus. Therefore the CAN-IDs +have to be chosen uniquely on the bus. When designing a CAN-ECU +network the CAN-IDs are mapped to be sent by a specific ECU. +For this reason a CAN-ID can be treated best as a kind of source address. + + +.. _socketcan-receive-lists: + +Receive Lists +------------- + +The network transparent access of multiple applications leads to the +problem that different applications may be interested in the same +CAN-IDs from the same CAN network interface. The SocketCAN core +module - which implements the protocol family CAN - provides several +high efficient receive lists for this reason. If e.g. a user space +application opens a CAN RAW socket, the raw protocol module itself +requests the (range of) CAN-IDs from the SocketCAN core that are +requested by the user. The subscription and unsubscription of +CAN-IDs can be done for specific CAN interfaces or for all(!) known +CAN interfaces with the can_rx_(un)register() functions provided to +CAN protocol modules by the SocketCAN core (see :ref:`socketcan-core-module`). +To optimize the CPU usage at runtime the receive lists are split up +into several specific lists per device that match the requested +filter complexity for a given use-case. + + +.. _socketcan-local-loopback1: + +Local Loopback of Sent Frames +----------------------------- + +As known from other networking concepts the data exchanging +applications may run on the same or different nodes without any +change (except for the according addressing information): + +.. code:: + + ___ ___ ___ _______ ___ + | _ | | _ | | _ | | _ _ | | _ | + ||A|| ||B|| ||C|| ||A| |B|| ||C|| + |___| |___| |___| |_______| |___| + | | | | | + -----------------(1)- CAN bus -(2)--------------- + +To ensure that application A receives the same information in the +example (2) as it would receive in example (1) there is need for +some kind of local loopback of the sent CAN frames on the appropriate +node. + +The Linux network devices (by default) just can handle the +transmission and reception of media dependent frames. Due to the +arbitration on the CAN bus the transmission of a low prio CAN-ID +may be delayed by the reception of a high prio CAN frame. To +reflect the correct [#f1]_ traffic on the node the loopback of the sent +data has to be performed right after a successful transmission. If +the CAN network interface is not capable of performing the loopback for +some reason the SocketCAN core can do this task as a fallback solution. +See :ref:`socketcan-local-loopback1` for details (recommended). + +The loopback functionality is enabled by default to reflect standard +networking behaviour for CAN applications. Due to some requests from +the RT-SocketCAN group the loopback optionally may be disabled for each +separate socket. See sockopts from the CAN RAW sockets in :ref:`socketcan-raw-sockets`. + +.. [#f1] you really like to have this when you're running analyser + tools like 'candump' or 'cansniffer' on the (same) node. + + +.. _socketcan-network-problem-notifications: + +Network Problem Notifications +----------------------------- + +The use of the CAN bus may lead to several problems on the physical +and media access control layer. Detecting and logging of these lower +layer problems is a vital requirement for CAN users to identify +hardware issues on the physical transceiver layer as well as +arbitration problems and error frames caused by the different +ECUs. The occurrence of detected errors are important for diagnosis +and have to be logged together with the exact timestamp. For this +reason the CAN interface driver can generate so called Error Message +Frames that can optionally be passed to the user application in the +same way as other CAN frames. Whenever an error on the physical layer +or the MAC layer is detected (e.g. by the CAN controller) the driver +creates an appropriate error message frame. Error messages frames can +be requested by the user application using the common CAN filter +mechanisms. Inside this filter definition the (interested) type of +errors may be selected. The reception of error messages is disabled +by default. The format of the CAN error message frame is briefly +described in the Linux header file "include/uapi/linux/can/error.h". + + +How to use SocketCAN +==================== + +Like TCP/IP, you first need to open a socket for communicating over a +CAN network. Since SocketCAN implements a new protocol family, you +need to pass PF_CAN as the first argument to the socket(2) system +call. Currently, there are two CAN protocols to choose from, the raw +socket protocol and the broadcast manager (BCM). So to open a socket, +you would write:: + + s = socket(PF_CAN, SOCK_RAW, CAN_RAW); + +and:: + + s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM); + +respectively. After the successful creation of the socket, you would +normally use the bind(2) system call to bind the socket to a CAN +interface (which is different from TCP/IP due to different addressing +- see :ref:`socketcan-concept`). After binding (CAN_RAW) or connecting (CAN_BCM) +the socket, you can read(2) and write(2) from/to the socket or use +send(2), sendto(2), sendmsg(2) and the recv* counterpart operations +on the socket as usual. There are also CAN specific socket options +described below. + +The basic CAN frame structure and the sockaddr structure are defined +in include/linux/can.h: + +.. code-block:: C + + struct can_frame { + canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ + __u8 can_dlc; /* frame payload length in byte (0 .. 8) */ + __u8 __pad; /* padding */ + __u8 __res0; /* reserved / padding */ + __u8 __res1; /* reserved / padding */ + __u8 data[8] __attribute__((aligned(8))); + }; + +The alignment of the (linear) payload data[] to a 64bit boundary +allows the user to define their own structs and unions to easily access +the CAN payload. There is no given byteorder on the CAN bus by +default. A read(2) system call on a CAN_RAW socket transfers a +struct can_frame to the user space. + +The sockaddr_can structure has an interface index like the +PF_PACKET socket, that also binds to a specific interface: + +.. code-block:: C + + struct sockaddr_can { + sa_family_t can_family; + int can_ifindex; + union { + /* transport protocol class address info (e.g. ISOTP) */ + struct { canid_t rx_id, tx_id; } tp; + + /* reserved for future CAN protocols address information */ + } can_addr; + }; + +To determine the interface index an appropriate ioctl() has to +be used (example for CAN_RAW sockets without error checking): + +.. code-block:: C + + int s; + struct sockaddr_can addr; + struct ifreq ifr; + + s = socket(PF_CAN, SOCK_RAW, CAN_RAW); + + strcpy(ifr.ifr_name, "can0" ); + ioctl(s, SIOCGIFINDEX, &ifr); + + addr.can_family = AF_CAN; + addr.can_ifindex = ifr.ifr_ifindex; + + bind(s, (struct sockaddr *)&addr, sizeof(addr)); + + (..) + +To bind a socket to all(!) CAN interfaces the interface index must +be 0 (zero). In this case the socket receives CAN frames from every +enabled CAN interface. To determine the originating CAN interface +the system call recvfrom(2) may be used instead of read(2). To send +on a socket that is bound to 'any' interface sendto(2) is needed to +specify the outgoing interface. + +Reading CAN frames from a bound CAN_RAW socket (see above) consists +of reading a struct can_frame: + +.. code-block:: C + + struct can_frame frame; + + nbytes = read(s, &frame, sizeof(struct can_frame)); + + if (nbytes < 0) { + perror("can raw socket read"); + return 1; + } + + /* paranoid check ... */ + if (nbytes < sizeof(struct can_frame)) { + fprintf(stderr, "read: incomplete CAN frame\n"); + return 1; + } + + /* do something with the received CAN frame */ + +Writing CAN frames can be done similarly, with the write(2) system call:: + + nbytes = write(s, &frame, sizeof(struct can_frame)); + +When the CAN interface is bound to 'any' existing CAN interface +(addr.can_ifindex = 0) it is recommended to use recvfrom(2) if the +information about the originating CAN interface is needed: + +.. code-block:: C + + struct sockaddr_can addr; + struct ifreq ifr; + socklen_t len = sizeof(addr); + struct can_frame frame; + + nbytes = recvfrom(s, &frame, sizeof(struct can_frame), + 0, (struct sockaddr*)&addr, &len); + + /* get interface name of the received CAN frame */ + ifr.ifr_ifindex = addr.can_ifindex; + ioctl(s, SIOCGIFNAME, &ifr); + printf("Received a CAN frame from interface %s", ifr.ifr_name); + +To write CAN frames on sockets bound to 'any' CAN interface the +outgoing interface has to be defined certainly: + +.. code-block:: C + + strcpy(ifr.ifr_name, "can0"); + ioctl(s, SIOCGIFINDEX, &ifr); + addr.can_ifindex = ifr.ifr_ifindex; + addr.can_family = AF_CAN; + + nbytes = sendto(s, &frame, sizeof(struct can_frame), + 0, (struct sockaddr*)&addr, sizeof(addr)); + +An accurate timestamp can be obtained with an ioctl(2) call after reading +a message from the socket: + +.. code-block:: C + + struct timeval tv; + ioctl(s, SIOCGSTAMP, &tv); + +The timestamp has a resolution of one microsecond and is set automatically +at the reception of a CAN frame. + +Remark about CAN FD (flexible data rate) support: + +Generally the handling of CAN FD is very similar to the formerly described +examples. The new CAN FD capable CAN controllers support two different +bitrates for the arbitration phase and the payload phase of the CAN FD frame +and up to 64 bytes of payload. This extended payload length breaks all the +kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight +bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g. +the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that +switches the socket into a mode that allows the handling of CAN FD frames +and (legacy) CAN frames simultaneously (see :ref:`socketcan-rawfd`). + +The struct canfd_frame is defined in include/linux/can.h: + +.. code-block:: C + + struct canfd_frame { + canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ + __u8 len; /* frame payload length in byte (0 .. 64) */ + __u8 flags; /* additional flags for CAN FD */ + __u8 __res0; /* reserved / padding */ + __u8 __res1; /* reserved / padding */ + __u8 data[64] __attribute__((aligned(8))); + }; + +The struct canfd_frame and the existing struct can_frame have the can_id, +the payload length and the payload data at the same offset inside their +structures. This allows to handle the different structures very similar. +When the content of a struct can_frame is copied into a struct canfd_frame +all structure elements can be used as-is - only the data[] becomes extended. + +When introducing the struct canfd_frame it turned out that the data length +code (DLC) of the struct can_frame was used as a length information as the +length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve +the easy handling of the length information the canfd_frame.len element +contains a plain length value from 0 .. 64. So both canfd_frame.len and +can_frame.can_dlc are equal and contain a length information and no DLC. +For details about the distinction of CAN and CAN FD capable devices and +the mapping to the bus-relevant data length code (DLC), see :ref:`socketcan-can-fd-driver`. + +The length of the two CAN(FD) frame structures define the maximum transfer +unit (MTU) of the CAN(FD) network interface and skbuff data length. Two +definitions are specified for CAN specific MTUs in include/linux/can.h: + +.. code-block:: C + + #define CAN_MTU (sizeof(struct can_frame)) == 16 => 'legacy' CAN frame + #define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame + + +.. _socketcan-raw-sockets: + +RAW Protocol Sockets with can_filters (SOCK_RAW) +------------------------------------------------ + +Using CAN_RAW sockets is extensively comparable to the commonly +known access to CAN character devices. To meet the new possibilities +provided by the multi user SocketCAN approach, some reasonable +defaults are set at RAW socket binding time: + +- The filters are set to exactly one filter receiving everything +- The socket only receives valid data frames (=> no error message frames) +- The loopback of sent CAN frames is enabled (see :ref:`socketcan-local-loopback2`) +- The socket does not receive its own sent frames (in loopback mode) + +These default settings may be changed before or after binding the socket. +To use the referenced definitions of the socket options for CAN_RAW +sockets, include <linux/can/raw.h>. + + +.. _socketcan-rawfilter: + +RAW socket option CAN_RAW_FILTER +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The reception of CAN frames using CAN_RAW sockets can be controlled +by defining 0 .. n filters with the CAN_RAW_FILTER socket option. + +The CAN filter structure is defined in include/linux/can.h: + +.. code-block:: C + + struct can_filter { + canid_t can_id; + canid_t can_mask; + }; + +A filter matches, when: + +.. code-block:: C + + <received_can_id> & mask == can_id & mask + +which is analogous to known CAN controllers hardware filter semantics. +The filter can be inverted in this semantic, when the CAN_INV_FILTER +bit is set in can_id element of the can_filter structure. In +contrast to CAN controller hardware filters the user may set 0 .. n +receive filters for each open socket separately: + +.. code-block:: C + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + rfilter[1].can_id = 0x200; + rfilter[1].can_mask = 0x700; + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + +To disable the reception of CAN frames on the selected CAN_RAW socket: + +.. code-block:: C + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0); + +To set the filters to zero filters is quite obsolete as to not read +data causes the raw socket to discard the received CAN frames. But +having this 'send only' use-case we may remove the receive list in the +Kernel to save a little (really a very little!) CPU usage. + +CAN Filter Usage Optimisation +............................. + +The CAN filters are processed in per-device filter lists at CAN frame +reception time. To reduce the number of checks that need to be performed +while walking through the filter lists the CAN core provides an optimized +filter handling when the filter subscription focusses on a single CAN ID. + +For the possible 2048 SFF CAN identifiers the identifier is used as an index +to access the corresponding subscription list without any further checks. +For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as +hash function to retrieve the EFF table index. + +To benefit from the optimized filters for single CAN identifiers the +CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together +with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the +can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is +subscribed. E.g. in the example from above: + +.. code-block:: C + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + +both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass. + +To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the +filter has to be defined in this way to benefit from the optimized filters: + +.. code-block:: C + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK); + rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG; + rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + + +RAW Socket Option CAN_RAW_ERR_FILTER +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As described in :ref:`socketcan-network-problem-notifications` the CAN interface driver can generate so +called Error Message Frames that can optionally be passed to the user +application in the same way as other CAN frames. The possible +errors are divided into different error classes that may be filtered +using the appropriate error mask. To register for every possible +error condition CAN_ERR_MASK can be used as value for the error mask. +The values for the error mask are defined in linux/can/error.h: + +.. code-block:: C + + can_err_mask_t err_mask = ( CAN_ERR_TX_TIMEOUT | CAN_ERR_BUSOFF ); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_ERR_FILTER, + &err_mask, sizeof(err_mask)); + + +RAW Socket Option CAN_RAW_LOOPBACK +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To meet multi user needs the local loopback is enabled by default +(see :ref:`socketcan-local-loopback1` for details). But in some embedded use-cases +(e.g. when only one application uses the CAN bus) this loopback +functionality can be disabled (separately for each socket): + +.. code-block:: C + + int loopback = 0; /* 0 = disabled, 1 = enabled (default) */ + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_LOOPBACK, &loopback, sizeof(loopback)); + + +RAW socket option CAN_RAW_RECV_OWN_MSGS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When the local loopback is enabled, all the sent CAN frames are +looped back to the open CAN sockets that registered for the CAN +frames' CAN-ID on this given interface to meet the multi user +needs. The reception of the CAN frames on the same socket that was +sending the CAN frame is assumed to be unwanted and therefore +disabled by default. This default behaviour may be changed on +demand: + +.. code-block:: C + + int recv_own_msgs = 1; /* 0 = disabled (default), 1 = enabled */ + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, + &recv_own_msgs, sizeof(recv_own_msgs)); + + +.. _socketcan-rawfd: + +RAW Socket Option CAN_RAW_FD_FRAMES +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +CAN FD support in CAN_RAW sockets can be enabled with a new socket option +CAN_RAW_FD_FRAMES which is off by default. When the new socket option is +not supported by the CAN_RAW socket (e.g. on older kernels), switching the +CAN_RAW_FD_FRAMES option returns the error -ENOPROTOOPT. + +Once CAN_RAW_FD_FRAMES is enabled the application can send both CAN frames +and CAN FD frames. OTOH the application has to handle CAN and CAN FD frames +when reading from the socket: + +.. code-block:: C + + CAN_RAW_FD_FRAMES enabled: CAN_MTU and CANFD_MTU are allowed + CAN_RAW_FD_FRAMES disabled: only CAN_MTU is allowed (default) + +Example: + +.. code-block:: C + + [ remember: CANFD_MTU == sizeof(struct canfd_frame) ] + + struct canfd_frame cfd; + + nbytes = read(s, &cfd, CANFD_MTU); + + if (nbytes == CANFD_MTU) { + printf("got CAN FD frame with length %d\n", cfd.len); + /* cfd.flags contains valid data */ + } else if (nbytes == CAN_MTU) { + printf("got legacy CAN frame with length %d\n", cfd.len); + /* cfd.flags is undefined */ + } else { + fprintf(stderr, "read: invalid CAN(FD) frame\n"); + return 1; + } + + /* the content can be handled independently from the received MTU size */ + + printf("can_id: %X data length: %d data: ", cfd.can_id, cfd.len); + for (i = 0; i < cfd.len; i++) + printf("%02X ", cfd.data[i]); + +When reading with size CANFD_MTU only returns CAN_MTU bytes that have +been received from the socket a legacy CAN frame has been read into the +provided CAN FD structure. Note that the canfd_frame.flags data field is +not specified in the struct can_frame and therefore it is only valid in +CANFD_MTU sized CAN FD frames. + +Implementation hint for new CAN applications: + +To build a CAN FD aware application use struct canfd_frame as basic CAN +data structure for CAN_RAW based applications. When the application is +executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES +socket option returns an error: No problem. You'll get legacy CAN frames +or CAN FD frames and can process them the same way. + +When sending to CAN devices make sure that the device is capable to handle +CAN FD frames by checking if the device maximum transfer unit is CANFD_MTU. +The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall. + + +RAW socket option CAN_RAW_JOIN_FILTERS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The CAN_RAW socket can set multiple CAN identifier specific filters that +lead to multiple filters in the af_can.c filter processing. These filters +are indenpendent from each other which leads to logical OR'ed filters when +applied (see :ref:`socketcan-rawfilter`). + +This socket option joines the given CAN filters in the way that only CAN +frames are passed to user space that matched *all* given CAN filters. The +semantic for the applied filters is therefore changed to a logical AND. + +This is useful especially when the filterset is a combination of filters +where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or +CAN ID ranges from the incoming traffic. + + +RAW Socket Returned Message Flags +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When using recvmsg() call, the msg->msg_flags may contain following flags: + +MSG_DONTROUTE: + set when the received frame was created on the local host. + +MSG_CONFIRM: + set when the frame was sent via the socket it is received on. + This flag can be interpreted as a 'transmission confirmation' when the + CAN driver supports the echo of frames on driver level, see + :ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`. + In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set. + + +Broadcast Manager Protocol Sockets (SOCK_DGRAM) +----------------------------------------------- + +The Broadcast Manager protocol provides a command based configuration +interface to filter and send (e.g. cyclic) CAN messages in kernel space. + +Receive filters can be used to down sample frequent messages; detect events +such as message contents changes, packet length changes, and do time-out +monitoring of received messages. + +Periodic transmission tasks of CAN frames or a sequence of CAN frames can be +created and modified at runtime; both the message content and the two +possible transmit intervals can be altered. + +A BCM socket is not intended for sending individual CAN frames using the +struct can_frame as known from the CAN_RAW socket. Instead a special BCM +configuration message is defined. The basic BCM configuration message used +to communicate with the broadcast manager and the available operations are +defined in the linux/can/bcm.h include. The BCM message consists of a +message header with a command ('opcode') followed by zero or more CAN frames. +The broadcast manager sends responses to user space in the same form: + +.. code-block:: C + + struct bcm_msg_head { + __u32 opcode; /* command */ + __u32 flags; /* special flags */ + __u32 count; /* run 'count' times with ival1 */ + struct timeval ival1, ival2; /* count and subsequent interval */ + canid_t can_id; /* unique can_id for task */ + __u32 nframes; /* number of can_frames following */ + struct can_frame frames[0]; + }; + +The aligned payload 'frames' uses the same basic CAN frame structure defined +at the beginning of :ref:`socketcan-rawfd` and in the include/linux/can.h include. All +messages to the broadcast manager from user space have this structure. + +Note a CAN_BCM socket must be connected instead of bound after socket +creation (example without error checking): + +.. code-block:: C + + int s; + struct sockaddr_can addr; + struct ifreq ifr; + + s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM); + + strcpy(ifr.ifr_name, "can0"); + ioctl(s, SIOCGIFINDEX, &ifr); + + addr.can_family = AF_CAN; + addr.can_ifindex = ifr.ifr_ifindex; + + connect(s, (struct sockaddr *)&addr, sizeof(addr)); + + (..) + +The broadcast manager socket is able to handle any number of in flight +transmissions or receive filters concurrently. The different RX/TX jobs are +distinguished by the unique can_id in each BCM message. However additional +CAN_BCM sockets are recommended to communicate on multiple CAN interfaces. +When the broadcast manager socket is bound to 'any' CAN interface (=> the +interface index is set to zero) the configured receive filters apply to any +CAN interface unless the sendto() syscall is used to overrule the 'any' CAN +interface index. When using recvfrom() instead of read() to retrieve BCM +socket messages the originating CAN interface is provided in can_ifindex. + + +Broadcast Manager Operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The opcode defines the operation for the broadcast manager to carry out, +or details the broadcast managers response to several events, including +user requests. + +Transmit Operations (user space to broadcast manager): + +TX_SETUP: + Create (cyclic) transmission task. + +TX_DELETE: + Remove (cyclic) transmission task, requires only can_id. + +TX_READ: + Read properties of (cyclic) transmission task for can_id. + +TX_SEND: + Send one CAN frame. + +Transmit Responses (broadcast manager to user space): + +TX_STATUS: + Reply to TX_READ request (transmission task configuration). + +TX_EXPIRED: + Notification when counter finishes sending at initial interval + 'ival1'. Requires the TX_COUNTEVT flag to be set at TX_SETUP. + +Receive Operations (user space to broadcast manager): + +RX_SETUP: + Create RX content filter subscription. + +RX_DELETE: + Remove RX content filter subscription, requires only can_id. + +RX_READ: + Read properties of RX content filter subscription for can_id. + +Receive Responses (broadcast manager to user space): + +RX_STATUS: + Reply to RX_READ request (filter task configuration). + +RX_TIMEOUT: + Cyclic message is detected to be absent (timer ival1 expired). + +RX_CHANGED: + BCM message with updated CAN frame (detected content change). + Sent on first message received or on receipt of revised CAN messages. + + +Broadcast Manager Message Flags +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When sending a message to the broadcast manager the 'flags' element may +contain the following flag definitions which influence the behaviour: + +SETTIMER: + Set the values of ival1, ival2 and count + +STARTTIMER: + Start the timer with the actual values of ival1, ival2 + and count. Starting the timer leads simultaneously to emit a CAN frame. + +TX_COUNTEVT: + Create the message TX_EXPIRED when count expires + +TX_ANNOUNCE: + A change of data by the process is emitted immediately. + +TX_CP_CAN_ID: + Copies the can_id from the message header to each + subsequent frame in frames. This is intended as usage simplification. For + TX tasks the unique can_id from the message header may differ from the + can_id(s) stored for transmission in the subsequent struct can_frame(s). + +RX_FILTER_ID: + Filter by can_id alone, no frames required (nframes=0). + +RX_CHECK_DLC: + A change of the DLC leads to an RX_CHANGED. + +RX_NO_AUTOTIMER: + Prevent automatically starting the timeout monitor. + +RX_ANNOUNCE_RESUME: + If passed at RX_SETUP and a receive timeout occurred, a + RX_CHANGED message will be generated when the (cyclic) receive restarts. + +TX_RESET_MULTI_IDX: + Reset the index for the multiple frame transmission. + +RX_RTR_FRAME: + Send reply for RTR-request (placed in op->frames[0]). + + +Broadcast Manager Transmission Timers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Periodic transmission configurations may use up to two interval timers. +In this case the BCM sends a number of messages ('count') at an interval +'ival1', then continuing to send at another given interval 'ival2'. When +only one timer is needed 'count' is set to zero and only 'ival2' is used. +When SET_TIMER and START_TIMER flag were set the timers are activated. +The timer values can be altered at runtime when only SET_TIMER is set. + + +Broadcast Manager message sequence transmission +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Up to 256 CAN frames can be transmitted in a sequence in the case of a cyclic +TX task configuration. The number of CAN frames is provided in the 'nframes' +element of the BCM message head. The defined number of CAN frames are added +as array to the TX_SETUP BCM configuration message: + +.. code-block:: C + + /* create a struct to set up a sequence of four CAN frames */ + struct { + struct bcm_msg_head msg_head; + struct can_frame frame[4]; + } mytxmsg; + + (..) + mytxmsg.msg_head.nframes = 4; + (..) + + write(s, &mytxmsg, sizeof(mytxmsg)); + +With every transmission the index in the array of CAN frames is increased +and set to zero at index overflow. + + +Broadcast Manager Receive Filter Timers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The timer values ival1 or ival2 may be set to non-zero values at RX_SETUP. +When the SET_TIMER flag is set the timers are enabled: + +ival1: + Send RX_TIMEOUT when a received message is not received again within + the given time. When START_TIMER is set at RX_SETUP the timeout detection + is activated directly - even without a former CAN frame reception. + +ival2: + Throttle the received message rate down to the value of ival2. This + is useful to reduce messages for the application when the signal inside the + CAN frame is stateless as state changes within the ival2 periode may get + lost. + +Broadcast Manager Multiplex Message Receive Filter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To filter for content changes in multiplex message sequences an array of more +than one CAN frames can be passed in a RX_SETUP configuration message. The +data bytes of the first CAN frame contain the mask of relevant bits that +have to match in the subsequent CAN frames with the received CAN frame. +If one of the subsequent CAN frames is matching the bits in that frame data +mark the relevant content to be compared with the previous received content. +Up to 257 CAN frames (multiplex filter bit mask CAN frame plus 256 CAN +filters) can be added as array to the TX_SETUP BCM configuration message: + +.. code-block:: C + + /* usually used to clear CAN frame data[] - beware of endian problems! */ + #define U64_DATA(p) (*(unsigned long long*)(p)->data) + + struct { + struct bcm_msg_head msg_head; + struct can_frame frame[5]; + } msg; + + msg.msg_head.opcode = RX_SETUP; + msg.msg_head.can_id = 0x42; + msg.msg_head.flags = 0; + msg.msg_head.nframes = 5; + U64_DATA(&msg.frame[0]) = 0xFF00000000000000ULL; /* MUX mask */ + U64_DATA(&msg.frame[1]) = 0x01000000000000FFULL; /* data mask (MUX 0x01) */ + U64_DATA(&msg.frame[2]) = 0x0200FFFF000000FFULL; /* data mask (MUX 0x02) */ + U64_DATA(&msg.frame[3]) = 0x330000FFFFFF0003ULL; /* data mask (MUX 0x33) */ + U64_DATA(&msg.frame[4]) = 0x4F07FC0FF0000000ULL; /* data mask (MUX 0x4F) */ + + write(s, &msg, sizeof(msg)); + + +Broadcast Manager CAN FD Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The programming API of the CAN_BCM depends on struct can_frame which is +given as array directly behind the bcm_msg_head structure. To follow this +schema for the CAN FD frames a new flag 'CAN_FD_FRAME' in the bcm_msg_head +flags indicates that the concatenated CAN frame structures behind the +bcm_msg_head are defined as struct canfd_frame: + +.. code-block:: C + + struct { + struct bcm_msg_head msg_head; + struct canfd_frame frame[5]; + } msg; + + msg.msg_head.opcode = RX_SETUP; + msg.msg_head.can_id = 0x42; + msg.msg_head.flags = CAN_FD_FRAME; + msg.msg_head.nframes = 5; + (..) + +When using CAN FD frames for multiplex filtering the MUX mask is still +expected in the first 64 bit of the struct canfd_frame data section. + + +Connected Transport Protocols (SOCK_SEQPACKET) +---------------------------------------------- + +(to be written) + + +Unconnected Transport Protocols (SOCK_DGRAM) +-------------------------------------------- + +(to be written) + + +.. _socketcan-core-module: + +SocketCAN Core Module +===================== + +The SocketCAN core module implements the protocol family +PF_CAN. CAN protocol modules are loaded by the core module at +runtime. The core module provides an interface for CAN protocol +modules to subscribe needed CAN IDs (see :ref:`socketcan-receive-lists`). + + +can.ko Module Params +-------------------- + +- **stats_timer**: + To calculate the SocketCAN core statistics + (e.g. current/maximum frames per second) this 1 second timer is + invoked at can.ko module start time by default. This timer can be + disabled by using stattimer=0 on the module commandline. + +- **debug**: + (removed since SocketCAN SVN r546) + + +procfs content +-------------- + +As described in :ref:`socketcan-receive-lists` the SocketCAN core uses several filter +lists to deliver received CAN frames to CAN protocol modules. These +receive lists, their filters and the count of filter matches can be +checked in the appropriate receive list. All entries contain the +device and a protocol module identifier:: + + foo@bar:~$ cat /proc/net/can/rcvlist_all + + receive list 'rx_all': + (vcan3: no entry) + (vcan2: no entry) + (vcan1: no entry) + device can_id can_mask function userdata matches ident + vcan0 000 00000000 f88e6370 f6c6f400 0 raw + (any: no entry) + +In this example an application requests any CAN traffic from vcan0:: + + rcvlist_all - list for unfiltered entries (no filter operations) + rcvlist_eff - list for single extended frame (EFF) entries + rcvlist_err - list for error message frames masks + rcvlist_fil - list for mask/value filters + rcvlist_inv - list for mask/value filters (inverse semantic) + rcvlist_sff - list for single standard frame (SFF) entries + +Additional procfs files in /proc/net/can:: + + stats - SocketCAN core statistics (rx/tx frames, match ratios, ...) + reset_stats - manual statistic reset + version - prints the SocketCAN core version and the ABI version + + +Writing Own CAN Protocol Modules +-------------------------------- + +To implement a new protocol in the protocol family PF_CAN a new +protocol has to be defined in include/linux/can.h . +The prototypes and definitions to use the SocketCAN core can be +accessed by including include/linux/can/core.h . +In addition to functions that register the CAN protocol and the +CAN device notifier chain there are functions to subscribe CAN +frames received by CAN interfaces and to send CAN frames:: + + can_rx_register - subscribe CAN frames from a specific interface + can_rx_unregister - unsubscribe CAN frames from a specific interface + can_send - transmit a CAN frame (optional with local loopback) + +For details see the kerneldoc documentation in net/can/af_can.c or +the source code of net/can/raw.c or net/can/bcm.c . + + +CAN Network Drivers +=================== + +Writing a CAN network device driver is much easier than writing a +CAN character device driver. Similar to other known network device +drivers you mainly have to deal with: + +- TX: Put the CAN frame from the socket buffer to the CAN controller. +- RX: Put the CAN frame from the CAN controller to the socket buffer. + +See e.g. at Documentation/networking/netdevices.txt . The differences +for writing CAN network device driver are described below: + + +General Settings +---------------- + +.. code-block:: C + + dev->type = ARPHRD_CAN; /* the netdevice hardware type */ + dev->flags = IFF_NOARP; /* CAN has no arp */ + + dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */ + + or alternative, when the controller supports CAN with flexible data rate: + dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */ + +The struct can_frame or struct canfd_frame is the payload of each socket +buffer (skbuff) in the protocol family PF_CAN. + + +.. _socketcan-local-loopback2: + +Local Loopback of Sent Frames +----------------------------- + +As described in :ref:`socketcan-local-loopback1` the CAN network device driver should +support a local loopback functionality similar to the local echo +e.g. of tty devices. In this case the driver flag IFF_ECHO has to be +set to prevent the PF_CAN core from locally echoing sent frames +(aka loopback) as fallback solution:: + + dev->flags = (IFF_NOARP | IFF_ECHO); + + +CAN Controller Hardware Filters +------------------------------- + +To reduce the interrupt load on deep embedded systems some CAN +controllers support the filtering of CAN IDs or ranges of CAN IDs. +These hardware filter capabilities vary from controller to +controller and have to be identified as not feasible in a multi-user +networking approach. The use of the very controller specific +hardware filters could make sense in a very dedicated use-case, as a +filter on driver level would affect all users in the multi-user +system. The high efficient filter sets inside the PF_CAN core allow +to set different multiple filters for each socket separately. +Therefore the use of hardware filters goes to the category 'handmade +tuning on deep embedded systems'. The author is running a MPC603e +@133MHz with four SJA1000 CAN controllers from 2002 under heavy bus +load without any problems ... + + +The Virtual CAN Driver (vcan) +----------------------------- + +Similar to the network loopback devices, vcan offers a virtual local +CAN interface. A full qualified address on CAN consists of + +- a unique CAN Identifier (CAN ID) +- the CAN bus this CAN ID is transmitted on (e.g. can0) + +so in common use cases more than one virtual CAN interface is needed. + +The virtual CAN interfaces allow the transmission and reception of CAN +frames without real CAN controller hardware. Virtual CAN network +devices are usually named 'vcanX', like vcan0 vcan1 vcan2 ... +When compiled as a module the virtual CAN driver module is called vcan.ko + +Since Linux Kernel version 2.6.24 the vcan driver supports the Kernel +netlink interface to create vcan network devices. The creation and +removal of vcan network devices can be managed with the ip(8) tool:: + + - Create a virtual CAN network interface: + $ ip link add type vcan + + - Create a virtual CAN network interface with a specific name 'vcan42': + $ ip link add dev vcan42 type vcan + + - Remove a (virtual CAN) network interface 'vcan42': + $ ip link del vcan42 + + +The CAN Network Device Driver Interface +--------------------------------------- + +The CAN network device driver interface provides a generic interface +to setup, configure and monitor CAN network devices. The user can then +configure the CAN device, like setting the bit-timing parameters, via +the netlink interface using the program "ip" from the "IPROUTE2" +utility suite. The following chapter describes briefly how to use it. +Furthermore, the interface uses a common data structure and exports a +set of common functions, which all real CAN network device drivers +should use. Please have a look to the SJA1000 or MSCAN driver to +understand how to use them. The name of the module is can-dev.ko. + + +Netlink interface to set/get devices properties +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The CAN device must be configured via netlink interface. The supported +netlink message types are defined and briefly described in +"include/linux/can/netlink.h". CAN link support for the program "ip" +of the IPROUTE2 utility suite is available and it can be used as shown +below: + +Setting CAN device properties:: + + $ ip link set can0 type can help + Usage: ip link set DEVICE type can + [ bitrate BITRATE [ sample-point SAMPLE-POINT] ] | + [ tq TQ prop-seg PROP_SEG phase-seg1 PHASE-SEG1 + phase-seg2 PHASE-SEG2 [ sjw SJW ] ] + + [ dbitrate BITRATE [ dsample-point SAMPLE-POINT] ] | + [ dtq TQ dprop-seg PROP_SEG dphase-seg1 PHASE-SEG1 + dphase-seg2 PHASE-SEG2 [ dsjw SJW ] ] + + [ loopback { on | off } ] + [ listen-only { on | off } ] + [ triple-sampling { on | off } ] + [ one-shot { on | off } ] + [ berr-reporting { on | off } ] + [ fd { on | off } ] + [ fd-non-iso { on | off } ] + [ presume-ack { on | off } ] + + [ restart-ms TIME-MS ] + [ restart ] + + Where: BITRATE := { 1..1000000 } + SAMPLE-POINT := { 0.000..0.999 } + TQ := { NUMBER } + PROP-SEG := { 1..8 } + PHASE-SEG1 := { 1..8 } + PHASE-SEG2 := { 1..8 } + SJW := { 1..4 } + RESTART-MS := { 0 | NUMBER } + +Display CAN device details and statistics:: + + $ ip -details -statistics link show can0 + 2: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP qlen 10 + link/can + can <TRIPLE-SAMPLING> state ERROR-ACTIVE restart-ms 100 + bitrate 125000 sample_point 0.875 + tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1 + sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 + clock 8000000 + re-started bus-errors arbit-lost error-warn error-pass bus-off + 41 17457 0 41 42 41 + RX: bytes packets errors dropped overrun mcast + 140859 17608 17457 0 0 0 + TX: bytes packets errors dropped carrier collsns + 861 112 0 41 0 0 + +More info to the above output: + +"<TRIPLE-SAMPLING>" + Shows the list of selected CAN controller modes: LOOPBACK, + LISTEN-ONLY, or TRIPLE-SAMPLING. + +"state ERROR-ACTIVE" + The current state of the CAN controller: "ERROR-ACTIVE", + "ERROR-WARNING", "ERROR-PASSIVE", "BUS-OFF" or "STOPPED" + +"restart-ms 100" + Automatic restart delay time. If set to a non-zero value, a + restart of the CAN controller will be triggered automatically + in case of a bus-off condition after the specified delay time + in milliseconds. By default it's off. + +"bitrate 125000 sample-point 0.875" + Shows the real bit-rate in bits/sec and the sample-point in the + range 0.000..0.999. If the calculation of bit-timing parameters + is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the + bit-timing can be defined by setting the "bitrate" argument. + Optionally the "sample-point" can be specified. By default it's + 0.000 assuming CIA-recommended sample-points. + +"tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1" + Shows the time quanta in ns, propagation segment, phase buffer + segment 1 and 2 and the synchronisation jump width in units of + tq. They allow to define the CAN bit-timing in a hardware + independent format as proposed by the Bosch CAN 2.0 spec (see + chapter 8 of http://www.semiconductors.bosch.de/pdf/can2spec.pdf). + +"sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 clock 8000000" + Shows the bit-timing constants of the CAN controller, here the + "sja1000". The minimum and maximum values of the time segment 1 + and 2, the synchronisation jump width in units of tq, the + bitrate pre-scaler and the CAN system clock frequency in Hz. + These constants could be used for user-defined (non-standard) + bit-timing calculation algorithms in user-space. + +"re-started bus-errors arbit-lost error-warn error-pass bus-off" + Shows the number of restarts, bus and arbitration lost errors, + and the state changes to the error-warning, error-passive and + bus-off state. RX overrun errors are listed in the "overrun" + field of the standard network statistics. + +Setting the CAN Bit-Timing +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The CAN bit-timing parameters can always be defined in a hardware +independent format as proposed in the Bosch CAN 2.0 specification +specifying the arguments "tq", "prop_seg", "phase_seg1", "phase_seg2" +and "sjw":: + + $ ip link set canX type can tq 125 prop-seg 6 \ + phase-seg1 7 phase-seg2 2 sjw 1 + +If the kernel option CONFIG_CAN_CALC_BITTIMING is enabled, CIA +recommended CAN bit-timing parameters will be calculated if the bit- +rate is specified with the argument "bitrate":: + + $ ip link set canX type can bitrate 125000 + +Note that this works fine for the most common CAN controllers with +standard bit-rates but may *fail* for exotic bit-rates or CAN system +clock frequencies. Disabling CONFIG_CAN_CALC_BITTIMING saves some +space and allows user-space tools to solely determine and set the +bit-timing parameters. The CAN controller specific bit-timing +constants can be used for that purpose. They are listed by the +following command:: + + $ ip -details link show can0 + ... + sja1000: clock 8000000 tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 + + +Starting and Stopping the CAN Network Device +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A CAN network device is started or stopped as usual with the command +"ifconfig canX up/down" or "ip link set canX up/down". Be aware that +you *must* define proper bit-timing parameters for real CAN devices +before you can start it to avoid error-prone default settings:: + + $ ip link set canX up type can bitrate 125000 + +A device may enter the "bus-off" state if too many errors occurred on +the CAN bus. Then no more messages are received or sent. An automatic +bus-off recovery can be enabled by setting the "restart-ms" to a +non-zero value, e.g.:: + + $ ip link set canX type can restart-ms 100 + +Alternatively, the application may realize the "bus-off" condition +by monitoring CAN error message frames and do a restart when +appropriate with the command:: + + $ ip link set canX type can restart + +Note that a restart will also create a CAN error message frame (see +also :ref:`socketcan-network-problem-notifications`). + + +.. _socketcan-can-fd-driver: + +CAN FD (Flexible Data Rate) Driver Support +------------------------------------------ + +CAN FD capable CAN controllers support two different bitrates for the +arbitration phase and the payload phase of the CAN FD frame. Therefore a +second bit timing has to be specified in order to enable the CAN FD bitrate. + +Additionally CAN FD capable CAN controllers support up to 64 bytes of +payload. The representation of this length in can_frame.can_dlc and +canfd_frame.len for userspace applications and inside the Linux network +layer is a plain value from 0 .. 64 instead of the CAN 'data length code'. +The data length code was a 1:1 mapping to the payload length in the legacy +CAN frames anyway. The payload length to the bus-relevant DLC mapping is +only performed inside the CAN drivers, preferably with the helper +functions can_dlc2len() and can_len2dlc(). + +The CAN netdevice driver capabilities can be distinguished by the network +devices maximum transfer unit (MTU):: + + MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => 'legacy' CAN device + MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device + +The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall. +N.B. CAN FD capable devices can also handle and send legacy CAN frames. + +When configuring CAN FD capable CAN controllers an additional 'data' bitrate +has to be set. This bitrate for the data phase of the CAN FD frame has to be +at least the bitrate which was configured for the arbitration phase. This +second bitrate is specified analogue to the first bitrate but the bitrate +setting keywords for the 'data' bitrate start with 'd' e.g. dbitrate, +dsample-point, dsjw or dtq and similar settings. When a data bitrate is set +within the configuration process the controller option "fd on" can be +specified to enable the CAN FD mode in the CAN controller. This controller +option also switches the device MTU to 72 (CANFD_MTU). + +The first CAN FD specification presented as whitepaper at the International +CAN Conference 2012 needed to be improved for data integrity reasons. +Therefore two CAN FD implementations have to be distinguished today: + +- ISO compliant: The ISO 11898-1:2015 CAN FD implementation (default) +- non-ISO compliant: The CAN FD implementation following the 2012 whitepaper + +Finally there are three types of CAN FD controllers: + +1. ISO compliant (fixed) +2. non-ISO compliant (fixed, like the M_CAN IP core v3.0.1 in m_can.c) +3. ISO/non-ISO CAN FD controllers (switchable, like the PEAK PCAN-USB FD) + +The current ISO/non-ISO mode is announced by the CAN controller driver via +netlink and displayed by the 'ip' tool (controller option FD-NON-ISO). +The ISO/non-ISO-mode can be altered by setting 'fd-non-iso {on|off}' for +switchable CAN FD controllers only. + +Example configuring 500 kbit/s arbitration bitrate and 4 Mbit/s data bitrate:: + + $ ip link set can0 up type can bitrate 500000 sample-point 0.75 \ + dbitrate 4000000 dsample-point 0.8 fd on + $ ip -details link show can0 + 5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN \ + mode DEFAULT group default qlen 10 + link/can promiscuity 0 + can <FD> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 + bitrate 500000 sample-point 0.750 + tq 50 prop-seg 14 phase-seg1 15 phase-seg2 10 sjw 1 + pcan_usb_pro_fd: tseg1 1..64 tseg2 1..16 sjw 1..16 brp 1..1024 \ + brp-inc 1 + dbitrate 4000000 dsample-point 0.800 + dtq 12 dprop-seg 7 dphase-seg1 8 dphase-seg2 4 dsjw 1 + pcan_usb_pro_fd: dtseg1 1..16 dtseg2 1..8 dsjw 1..4 dbrp 1..1024 \ + dbrp-inc 1 + clock 80000000 + +Example when 'fd-non-iso on' is added on this switchable CAN FD adapter:: + + can <FD,FD-NON-ISO> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 + + +Supported CAN Hardware +---------------------- + +Please check the "Kconfig" file in "drivers/net/can" to get an actual +list of the support CAN hardware. On the SocketCAN project website +(see :ref:`socketcan-resources`) there might be further drivers available, also for +older kernel versions. + + +.. _socketcan-resources: + +SocketCAN Resources +=================== + +The Linux CAN / SocketCAN project resources (project site / mailing list) +are referenced in the MAINTAINERS file in the Linux source tree. +Search for CAN NETWORK [LAYERS|DRIVERS]. + +Credits +======= + +- Oliver Hartkopp (PF_CAN core, filters, drivers, bcm, SJA1000 driver) +- Urs Thuermann (PF_CAN core, kernel integration, socket interfaces, raw, vcan) +- Jan Kizka (RT-SocketCAN core, Socket-API reconciliation) +- Wolfgang Grandegger (RT-SocketCAN core & drivers, Raw Socket-API reviews, CAN device driver interface, MSCAN driver) +- Robert Schwebel (design reviews, PTXdist integration) +- Marc Kleine-Budde (design reviews, Kernel 2.6 cleanups, drivers) +- Benedikt Spranger (reviews) +- Thomas Gleixner (LKML reviews, coding style, posting hints) +- Andrey Volkov (kernel subtree structure, ioctls, MSCAN driver) +- Matthias Brukner (first SJA1000 CAN netdevice implementation Q2/2003) +- Klaus Hitschler (PEAK driver integration) +- Uwe Koppe (CAN netdevices with PF_PACKET approach) +- Michael Schulze (driver layer loopback requirement, RT CAN drivers review) +- Pavel Pisa (Bit-timing calculation) +- Sascha Hauer (SJA1000 platform driver) +- Sebastian Haas (SJA1000 EMS PCI driver) +- Markus Plessing (SJA1000 EMS PCI driver) +- Per Dalen (SJA1000 Kvaser PCI driver) +- Sam Ravnborg (reviews, coding style, kbuild help) diff --git a/Documentation/networking/can_ucan_protocol.rst b/Documentation/networking/can_ucan_protocol.rst new file mode 100644 index 000000000..4cef88d24 --- /dev/null +++ b/Documentation/networking/can_ucan_protocol.rst @@ -0,0 +1,332 @@ +================= +The UCAN Protocol +================= + +UCAN is the protocol used by the microcontroller-based USB-CAN +adapter that is integrated on System-on-Modules from Theobroma Systems +and that is also available as a standalone USB stick. + +The UCAN protocol has been designed to be hardware-independent. +It is modeled closely after how Linux represents CAN devices +internally. All multi-byte integers are encoded as Little Endian. + +All structures mentioned in this document are defined in +``drivers/net/can/usb/ucan.c``. + +USB Endpoints +============= + +UCAN devices use three USB endpoints: + +CONTROL endpoint + The driver sends device management commands on this endpoint + +IN endpoint + The device sends CAN data frames and CAN error frames + +OUT endpoint + The driver sends CAN data frames on the out endpoint + + +CONTROL Messages +================ + +UCAN devices are configured using vendor requests on the control pipe. + +To support multiple CAN interfaces in a single USB device all +configuration commands target the corresponding interface in the USB +descriptor. + +The driver uses ``ucan_ctrl_command_in/out`` and +``ucan_device_request_in`` to deliver commands to the device. + +Setup Packet +------------ + +================= ===================================================== +``bmRequestType`` Direction | Vendor | (Interface or Device) +``bRequest`` Command Number +``wValue`` Subcommand Number (16 Bit) or 0 if not used +``wIndex`` USB Interface Index (0 for device commands) +``wLength`` * Host to Device - Number of bytes to transmit + * Device to Host - Maximum Number of bytes to + receive. If the device send less. Commom ZLP + semantics are used. +================= ===================================================== + +Error Handling +-------------- + +The device indicates failed control commands by stalling the +pipe. + +Device Commands +--------------- + +UCAN_DEVICE_GET_FW_STRING +~~~~~~~~~~~~~~~~~~~~~~~~~ + +*Dev2Host; optional* + +Request the device firmware string. + + +Interface Commands +------------------ + +UCAN_COMMAND_START +~~~~~~~~~~~~~~~~~~ + +*Host2Dev; mandatory* + +Bring the CAN interface up. + +Payload Format + ``ucan_ctl_payload_t.cmd_start`` + +==== ============================ +mode or mask of ``UCAN_MODE_*`` +==== ============================ + +UCAN_COMMAND_STOP +~~~~~~~~~~~~~~~~~~ + +*Host2Dev; mandatory* + +Stop the CAN interface + +Payload Format + *empty* + +UCAN_COMMAND_RESET +~~~~~~~~~~~~~~~~~~ + +*Host2Dev; mandatory* + +Reset the CAN controller (including error counters) + +Payload Format + *empty* + +UCAN_COMMAND_GET +~~~~~~~~~~~~~~~~ + +*Host2Dev; mandatory* + +Get Information from the Device + +Subcommands +^^^^^^^^^^^ + +UCAN_COMMAND_GET_INFO + Request the device information structure ``ucan_ctl_payload_t.device_info``. + + See the ``device_info`` field for details, and + ``uapi/linux/can/netlink.h`` for an explanation of the + ``can_bittiming fields``. + + Payload Format + ``ucan_ctl_payload_t.device_info`` + +UCAN_COMMAND_GET_PROTOCOL_VERSION + + Request the device protocol version + ``ucan_ctl_payload_t.protocol_version``. The current protocol version is 3. + + Payload Format + ``ucan_ctl_payload_t.protocol_version`` + +.. note:: Devices that do not implement this command use the old + protocol version 1 + +UCAN_COMMAND_SET_BITTIMING +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +*Host2Dev; mandatory* + +Setup bittiming by sending the the structure +``ucan_ctl_payload_t.cmd_set_bittiming`` (see ``struct bittiming`` for +details) + +Payload Format + ``ucan_ctl_payload_t.cmd_set_bittiming``. + +UCAN_SLEEP/WAKE +~~~~~~~~~~~~~~~ + +*Host2Dev; optional* + +Configure sleep and wake modes. Not yet supported by the driver. + +UCAN_FILTER +~~~~~~~~~~~ + +*Host2Dev; optional* + +Setup hardware CAN filters. Not yet supported by the driver. + +Allowed interface commands +-------------------------- + +================== =================== ================== +Legal Device State Command New Device State +================== =================== ================== +stopped SET_BITTIMING stopped +stopped START started +started STOP or RESET stopped +stopped STOP or RESET stopped +started RESTART started +any GET *no change* +================== =================== ================== + +IN Message Format +================= + +A data packet on the USB IN endpoint contains one or more +``ucan_message_in`` values. If multiple messages are batched in a USB +data packet, the ``len`` field can be used to jump to the next +``ucan_message_in`` value (take care to sanity-check the ``len`` value +against the actual data size). + +.. _can_ucan_in_message_len: + +``len`` field +------------- + +Each ``ucan_message_in`` must be aligned to a 4-byte boundary (relative +to the start of the start of the data buffer). That means that there +may be padding bytes between multiple ``ucan_message_in`` values: + +.. code:: + + +----------------------------+ < 0 + | | + | struct ucan_message_in | + | | + +----------------------------+ < len + [padding] + +----------------------------+ < round_up(len, 4) + | | + | struct ucan_message_in | + | | + +----------------------------+ + [...] + +``type`` field +-------------- + +The ``type`` field specifies the type of the message. + +UCAN_IN_RX +~~~~~~~~~~ + +``subtype`` + zero + +Data received from the CAN bus (ID + payload). + +UCAN_IN_TX_COMPLETE +~~~~~~~~~~~~~~~~~~~ + +``subtype`` + zero + +The CAN device has sent a message to the CAN bus. It answers with a +list of of tuples <echo-ids, flags>. + +The echo-id identifies the frame from (echos the id from a previous +UCAN_OUT_TX message). The flag indicates the result of the +transmission. Whereas a set Bit 0 indicates success. All other bits +are reserved and set to zero. + +Flow Control +------------ + +When receiving CAN messages there is no flow control on the USB +buffer. The driver has to handle inbound message quickly enough to +avoid drops. I case the device buffer overflow the condition is +reported by sending corresponding error frames (see +:ref:`can_ucan_error_handling`) + + +OUT Message Format +================== + +A data packet on the USB OUT endpoint contains one or more ``struct +ucan_message_out`` values. If multiple messages are batched into one +data packet, the device uses the ``len`` field to jump to the next +ucan_message_out value. Each ucan_message_out must be aligned to 4 +bytes (relative to the start of the data buffer). The mechanism is +same as described in :ref:`can_ucan_in_message_len`. + +.. code:: + + +----------------------------+ < 0 + | | + | struct ucan_message_out | + | | + +----------------------------+ < len + [padding] + +----------------------------+ < round_up(len, 4) + | | + | struct ucan_message_out | + | | + +----------------------------+ + [...] + +``type`` field +-------------- + +In protocol version 3 only ``UCAN_OUT_TX`` is defined, others are used +only by legacy devices (protocol version 1). + +UCAN_OUT_TX +~~~~~~~~~~~ +``subtype`` + echo id to be replied within a CAN_IN_TX_COMPLETE message + +Transmit a CAN frame. (parameters: ``id``, ``data``) + +Flow Control +------------ + +When the device outbound buffers are full it starts sending *NAKs* on +the *OUT* pipe until more buffers are available. The driver stops the +queue when a certain threshold of out packets are incomplete. + +.. _can_ucan_error_handling: + +CAN Error Handling +================== + +If error reporting is turned on the device encodes errors into CAN +error frames (see ``uapi/linux/can/error.h``) and sends it using the +IN endpoint. The driver updates its error statistics and forwards +it. + +Although UCAN devices can suppress error frames completely, in Linux +the driver is always interested. Hence, the device is always started with +the ``UCAN_MODE_BERR_REPORT`` set. Filtering those messages for the +user space is done by the driver. + +Bus OFF +------- + +- The device does not recover from bus of automatically. +- Bus OFF is indicated by an error frame (see ``uapi/linux/can/error.h``) +- Bus OFF recovery is started by ``UCAN_COMMAND_RESTART`` +- Once Bus OFF recover is completed the device sends an error frame + indicating that it is on ERROR-ACTIVE state. +- During Bus OFF no frames are sent by the device. +- During Bus OFF transmission requests from the host are completed + immediately with the success bit left unset. + +Example Conversation +==================== + +#) Device is connected to USB +#) Host sends command ``UCAN_COMMAND_RESET``, subcmd 0 +#) Host sends command ``UCAN_COMMAND_GET``, subcmd ``UCAN_COMMAND_GET_INFO`` +#) Device sends ``UCAN_IN_DEVICE_INFO`` +#) Host sends command ``UCAN_OUT_SET_BITTIMING`` +#) Host sends command ``UCAN_COMMAND_START``, subcmd 0, mode ``UCAN_MODE_BERR_REPORT`` diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt new file mode 100644 index 000000000..4e68f0bc5 --- /dev/null +++ b/Documentation/networking/cdc_mbim.txt @@ -0,0 +1,339 @@ + cdc_mbim - Driver for CDC MBIM Mobile Broadband modems + ======================================================== + +The cdc_mbim driver supports USB devices conforming to the "Universal +Serial Bus Communications Class Subclass Specification for Mobile +Broadband Interface Model" [1], which is a further development of +"Universal Serial Bus Communications Class Subclass Specifications for +Network Control Model Devices" [2] optimized for Mobile Broadband +devices, aka "3G/LTE modems". + + +Command Line Parameters +======================= + +The cdc_mbim driver has no parameters of its own. But the probing +behaviour for NCM 1.0 backwards compatible MBIM functions (an +"NCM/MBIM function" as defined in section 3.2 of [1]) is affected +by a cdc_ncm driver parameter: + +prefer_mbim +----------- +Type: Boolean +Valid Range: N/Y (0-1) +Default Value: Y (MBIM is preferred) + +This parameter sets the system policy for NCM/MBIM functions. Such +functions will be handled by either the cdc_ncm driver or the cdc_mbim +driver depending on the prefer_mbim setting. Setting prefer_mbim=N +makes the cdc_mbim driver ignore these functions and lets the cdc_ncm +driver handle them instead. + +The parameter is writable, and can be changed at any time. A manual +unbind/bind is required to make the change effective for NCM/MBIM +functions bound to the "wrong" driver + + +Basic usage +=========== + +MBIM functions are inactive when unmanaged. The cdc_mbim driver only +provides a userspace interface to the MBIM control channel, and will +not participate in the management of the function. This implies that a +userspace MBIM management application always is required to enable a +MBIM function. + +Such userspace applications includes, but are not limited to: + - mbimcli (included with the libmbim [3] library), and + - ModemManager [4] + +Establishing a MBIM IP session reequires at least these actions by the +management application: + - open the control channel + - configure network connection settings + - connect to network + - configure IP interface + +Management application development +---------------------------------- +The driver <-> userspace interfaces are described below. The MBIM +control channel protocol is described in [1]. + + +MBIM control channel userspace ABI +================================== + +/dev/cdc-wdmX character device +------------------------------ +The driver creates a two-way pipe to the MBIM function control channel +using the cdc-wdm driver as a subdriver. The userspace end of the +control channel pipe is a /dev/cdc-wdmX character device. + +The cdc_mbim driver does not process or police messages on the control +channel. The channel is fully delegated to the userspace management +application. It is therefore up to this application to ensure that it +complies with all the control channel requirements in [1]. + +The cdc-wdmX device is created as a child of the MBIM control +interface USB device. The character device associated with a specific +MBIM function can be looked up using sysfs. For example: + + bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc + cdc-wdm0 + + bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev + 180:0 + + +USB configuration descriptors +----------------------------- +The wMaxControlMessage field of the CDC MBIM functional descriptor +limits the maximum control message size. The managament application is +responsible for negotiating a control message size complying with the +requirements in section 9.3.1 of [1], taking this descriptor field +into consideration. + +The userspace application can access the CDC MBIM functional +descriptor of a MBIM function using either of the two USB +configuration descriptor kernel interfaces described in [6] or [7]. + +See also the ioctl documentation below. + + +Fragmentation +------------- +The userspace application is responsible for all control message +fragmentation and defragmentaion, as described in section 9.5 of [1]. + + +/dev/cdc-wdmX write() +--------------------- +The MBIM control messages from the management application *must not* +exceed the negotiated control message size. + + +/dev/cdc-wdmX read() +-------------------- +The management application *must* accept control messages of up the +negotiated control message size. + + +/dev/cdc-wdmX ioctl() +-------------------- +IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size +This ioctl returns the wMaxControlMessage field of the CDC MBIM +functional descriptor for MBIM devices. This is intended as a +convenience, eliminating the need to parse the USB descriptors from +userspace. + + #include <stdio.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <linux/types.h> + #include <linux/usb/cdc-wdm.h> + int main() + { + __u16 max; + int fd = open("/dev/cdc-wdm0", O_RDWR); + if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) + printf("wMaxControlMessage is %d\n", max); + } + + +Custom device services +---------------------- +The MBIM specification allows vendors to freely define additional +services. This is fully supported by the cdc_mbim driver. + +Support for new MBIM services, including vendor specified services, is +implemented entirely in userspace, like the rest of the MBIM control +protocol + +New services should be registered in the MBIM Registry [5]. + + + +MBIM data channel userspace ABI +=============================== + +wwanY network device +-------------------- +The cdc_mbim driver represents the MBIM data channel as a single +network device of the "wwan" type. This network device is initially +mapped to MBIM IP session 0. + + +Multiplexed IP sessions (IPS) +----------------------------- +MBIM allows multiplexing up to 256 IP sessions over a single USB data +channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN +subdevices of the master wwanY device, mapping MBIM IP session Z to +VLAN ID Z for all values of Z greater than 0. + +The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure +described in section 10.5.1 of [1]. + +The userspace management application is responsible for adding new +VLAN links prior to establishing MBIM IP sessions where the SessionId +is greater than 0. These links can be added by using the normal VLAN +kernel interfaces, either ioctl or netlink. + +For example, adding a link for a MBIM IP session with SessionId 3: + + ip link add link wwan0 name wwan0.3 type vlan id 3 + +The driver will automatically map the "wwan0.3" network device to MBIM +IP session 3. + + +Device Service Streams (DSS) +---------------------------- +MBIM also allows up to 256 non-IP data streams to be multiplexed over +the same shared USB data channel. The cdc_mbim driver models these +sessions as another set of 802.1q VLAN subdevices of the master wwanY +device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values +of A. + +The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO +structure described in section 10.5.29 of [1]. + +The DSS VLAN subdevices are used as a practical interface between the +shared MBIM data channel and a MBIM DSS aware userspace application. +It is not intended to be presented as-is to an end user. The +assumption is that a userspace application initiating a DSS session +also takes care of the necessary framing of the DSS data, presenting +the stream to the end user in an appropriate way for the stream type. + +The network device ABI requires a dummy ethernet header for every DSS +data frame being transported. The contents of this header is +arbitrary, with the following exceptions: + - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped + - RX frames will have the protocol field set to ETH_P_802_3 (but will + not be properly formatted 802.3 frames) + - RX frames will have the destination address set to the hardware + address of the master device + +The DSS supporting userspace management application is responsible for +adding the dummy ethernet header on TX and stripping it on RX. + +This is a simple example using tools commonly available, exporting +DssSessionId 5 as a pty character device pointed to by a /dev/nmea +symlink: + + ip link add link wwan0 name wwan0.dss5 type vlan id 261 + ip link set dev wwan0.dss5 up + socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea + +This is only an example, most suitable for testing out a DSS +service. Userspace applications supporting specific MBIM DSS services +are expected to use the tools and programming interfaces required by +that service. + +Note that adding VLAN links for DSS sessions is entirely optional. A +management application may instead choose to bind a packet socket +directly to the master network device, using the received VLAN tags to +map frames to the correct DSS session and adding 18 byte VLAN ethernet +headers with the appropriate tag on TX. In this case using a socket +filter is recommended, matching only the DSS VLAN subset. This avoid +unnecessary copying of unrelated IP session data to userspace. For +example: + + static struct sock_filter dssfilter[] = { + /* use special negative offsets to get VLAN tag */ + BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ + + /* verify DSS VLAN range */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ + + /* verify ethertype */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), + + BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ + BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ + }; + + + +Tagged IP session 0 VLAN +------------------------ +As described above, MBIM IP session 0 is treated as special by the +driver. It is initially mapped to untagged frames on the wwanY +network device. + +This mapping implies a few restrictions on multiplexed IPS and DSS +sessions, which may not always be practical: + - no IPS or DSS session can use a frame size greater than the MTU on + IP session 0 + - no IPS or DSS session can be in the up state unless the network + device representing IP session 0 also is up + +These problems can be avoided by optionally making the driver map IP +session 0 to a VLAN subdevice, similar to all other IP sessions. This +behaviour is triggered by adding a VLAN link for the magic VLAN ID +4094. The driver will then immediately start mapping MBIM IP session +0 to this VLAN, and will drop untagged frames on the master wwanY +device. + +Tip: It might be less confusing to the end user to name this VLAN +subdevice after the MBIM SessionID instead of the VLAN ID. For +example: + + ip link add link wwan0 name wwan0.0 type vlan id 4094 + + +VLAN mapping +------------ + +Summarizing the cdc_mbim driver mapping described above, we have this +relationship between VLAN tags on the wwanY network device and MBIM +sessions on the shared USB data channel: + + VLAN ID MBIM type MBIM SessionID Notes + --------------------------------------------------------- + untagged IPS 0 a) + 1 - 255 IPS 1 - 255 <VLANID> + 256 - 511 DSS 0 - 255 <VLANID - 256> + 512 - 4093 b) + 4094 IPS 0 c) + + a) if no VLAN ID 4094 link exists, else dropped + b) unsupported VLAN range, unconditionally dropped + c) if a VLAN ID 4094 link exists, else dropped + + + + +References +========== + +[1] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specification for Mobile Broadband + Interface Model", Revision 1.0 (Errata 1), May 1, 2013 + - http://www.usb.org/developers/docs/devclass_docs/ + +[2] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specifications for Network Control + Model Devices", Revision 1.0 (Errata 1), November 24, 2010 + - http://www.usb.org/developers/docs/devclass_docs/ + +[3] libmbim - "a glib-based library for talking to WWAN modems and + devices which speak the Mobile Interface Broadband Model (MBIM) + protocol" + - http://www.freedesktop.org/wiki/Software/libmbim/ + +[4] ModemManager - "a DBus-activated daemon which controls mobile + broadband (2G/3G/4G) devices and connections" + - http://www.freedesktop.org/wiki/Software/ModemManager/ + +[5] "MBIM (Mobile Broadband Interface Model) Registry" + - http://compliance.usb.org/mbim/ + +[6] "/sys/kernel/debug/usb/devices output format" + - Documentation/driver-api/usb/usb.rst + +[7] "/sys/bus/usb/devices/.../descriptors" + - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt new file mode 100644 index 000000000..27bc09cfc --- /dev/null +++ b/Documentation/networking/checksum-offloads.txt @@ -0,0 +1,122 @@ +Checksum Offloads in the Linux Networking Stack + + +Introduction +============ + +This document describes a set of techniques in the Linux networking stack + to take advantage of checksum offload capabilities of various NICs. + +The following technologies are described: + * TX Checksum Offload + * LCO: Local Checksum Offload + * RCO: Remote Checksum Offload + +Things that should be documented here but aren't yet: + * RX Checksum Offload + * CHECKSUM_UNNECESSARY conversion + + +TX Checksum Offload +=================== + +The interface for offloading a transmit checksum to a device is explained + in detail in comments near the top of include/linux/skbuff.h. +In brief, it allows to request the device fill in a single ones-complement + checksum defined by the sk_buff fields skb->csum_start and + skb->csum_offset. The device should compute the 16-bit ones-complement + checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the + packet, and fill in the result at (csum_start + csum_offset). +Because csum_offset cannot be negative, this ensures that the previous + value of the checksum field is included in the checksum computation, thus + it can be used to supply any needed corrections to the checksum (such as + the sum of the pseudo-header for UDP or TCP). +This interface only allows a single checksum to be offloaded. Where + encapsulation is used, the packet may have multiple checksum fields in + different header layers, and the rest will have to be handled by another + mechanism such as LCO or RCO. +CRC32c can also be offloaded using this interface, by means of filling + skb->csum_start and skb->csum_offset as described above, and setting + skb->csum_not_inet: see skbuff.h comment (section 'D') for more details. +No offloading of the IP header checksum is performed; it is always done in + software. This is OK because when we build the IP header, we obviously + have it in cache, so summing it isn't expensive. It's also rather short. +The requirements for GSO are more complicated, because when segmenting an + encapsulated packet both the inner and outer checksums may need to be + edited or recomputed for each resulting segment. See the skbuff.h comment + (section 'E') for more details. + +A driver declares its offload capabilities in netdev->hw_features; see + Documentation/networking/netdev-features.txt for more. Note that a device + which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start + and csum_offset given in the SKB; if it tries to deduce these itself in + hardware (as some NICs do) the driver should check that the values in the + SKB match those which the hardware will deduce, and if not, fall back to + checksumming in software instead (with skb_csum_hwoffload_help() or one of + the skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in + include/linux/skbuff.h). + +The stack should, for the most part, assume that checksum offload is + supported by the underlying device. The only place that should check is + validate_xmit_skb(), and the functions it calls directly or indirectly. + That function compares the offload features requested by the SKB (which + may include other offloads besides TX Checksum Offload) and, if they are + not supported or enabled on the device (determined by netdev->features), + performs the corresponding offload in software. In the case of TX + Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features). + + +LCO: Local Checksum Offload +=========================== + +LCO is a technique for efficiently computing the outer checksum of an + encapsulated datagram when the inner checksum is due to be offloaded. +The ones-complement sum of a correctly checksummed TCP or UDP packet is + equal to the complement of the sum of the pseudo header, because everything + else gets 'cancelled out' by the checksum field. This is because the sum was + complemented before being written to the checksum field. +More generally, this holds in any case where the 'IP-style' ones complement + checksum is used, and thus any checksum that TX Checksum Offload supports. +That is, if we have set up TX Checksum Offload with a start/offset pair, we + know that after the device has filled in that checksum, the ones + complement sum from csum_start to the end of the packet will be equal to + the complement of whatever value we put in the checksum field beforehand. + This allows us to compute the outer checksum without looking at the payload: + we simply stop summing when we get to csum_start, then add the complement of + the 16-bit word at (csum_start + csum_offset). +Then, when the true inner checksum is filled in (either by hardware or by + skb_checksum_help()), the outer checksum will become correct by virtue of + the arithmetic. + +LCO is performed by the stack when constructing an outer UDP header for an + encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for + the IPv6 equivalents, in udp6_set_csum(). +It is also performed when constructing an IPv4 GRE header, in + net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when + constructing an IPv6 GRE header; the GRE checksum is computed over the + whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be + possible to use LCO here as IPv6 GRE still uses an IP-style checksum. +All of the LCO implementations use a helper function lco_csum(), in + include/linux/skbuff.h. + +LCO can safely be used for nested encapsulations; in this case, the outer + encapsulation layer will sum over both its own header and the 'middle' + header. This does mean that the 'middle' header will get summed multiple + times, but there doesn't seem to be a way to avoid that without incurring + bigger costs (e.g. in SKB bloat). + + +RCO: Remote Checksum Offload +============================ + +RCO is a technique for eliding the inner checksum of an encapsulated + datagram, allowing the outer checksum to be offloaded. It does, however, + involve a change to the encapsulation protocols, which the receiver must + also support. For this reason, it is disabled by default. +RCO is detailed in the following Internet-Drafts: +https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00 +https://tools.ietf.org/html/draft-herbert-vxlan-rco-00 +In Linux, RCO is implemented individually in each encapsulation protocol, + and most tunnel types have flags controlling its use. For instance, VXLAN + has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that + RCO should be used when transmitting to a given remote destination. diff --git a/Documentation/networking/conf.py b/Documentation/networking/conf.py new file mode 100644 index 000000000..40f69e67a --- /dev/null +++ b/Documentation/networking/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Linux Networking Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'networking.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/networking/cops.txt b/Documentation/networking/cops.txt new file mode 100644 index 000000000..3e344b448 --- /dev/null +++ b/Documentation/networking/cops.txt @@ -0,0 +1,63 @@ +Text File for the COPS LocalTalk Linux driver (cops.c). + By Jay Schulist <jschlst@samba.org> + +This driver has two modes and they are: Dayna mode and Tangent mode. +Each mode corresponds with the type of card. It has been found +that there are 2 main types of cards and all other cards are +the same and just have different names or only have minor differences +such as more IO ports. As this driver is tested it will +become more clear exactly what cards are supported. + +Right now these cards are known to work with the COPS driver. The +LT-200 cards work in a somewhat more limited capacity than the +DL200 cards, which work very well and are in use by many people. + +TANGENT driver mode: + Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200 +DAYNA driver mode: + Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95, + Farallon PhoneNET PC III, Farallon PhoneNET PC II +Other cards possibly supported mode unknown though: + Dayna DL2000 (Full length) + +The COPS driver defaults to using Dayna mode. To change the driver's +mode if you built a driver with dual support use board_type=1 or +board_type=2 for Dayna or Tangent with insmod. + +** Operation/loading of the driver. +Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #) +If you do not specify any options the driver will try and use the IO = 0x240, +IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing. + +To load multiple COPS driver Localtalk cards you can do one of the following. + +insmod cops io=0x240 irq=5 +insmod -o cops2 cops io=0x260 irq=3 + +Or in lilo.conf put something like this: + append="ether=5,0x240,lt0 ether=3,0x260,lt1" + +Then bring up the interface with ifconfig. It will look something like this: +lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00 + inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 + UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1 + RX packets:0 errors:0 dropped:0 overruns:0 frame:0 + TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0 + +** Netatalk Configuration +You will need to configure atalkd with something like the following to make +it work with the cops.c driver. + +* For single LTalk card use. +dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033" +lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" + +* For multiple cards, Ethernet and LocalTalk. +eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033" +lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033" + +* For multiple LocalTalk cards, and an Ethernet card. +* Order seems to matter here, Ethernet last. +lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1" +lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2" +eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk" diff --git a/Documentation/networking/cs89x0.txt b/Documentation/networking/cs89x0.txt new file mode 100644 index 000000000..0e190180e --- /dev/null +++ b/Documentation/networking/cs89x0.txt @@ -0,0 +1,624 @@ + +NOTE +---- + +This document was contributed by Cirrus Logic for kernel 2.2.5. This version +has been updated for 2.3.48 by Andrew Morton. + +Cirrus make a copy of this driver available at their website, as +described below. In general, you should use the driver version which +comes with your Linux distribution. + + + +CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS +Linux Network Interface Driver ver. 2.00 <kernel 2.3.48> +=============================================================================== + + +TABLE OF CONTENTS + +1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS + 1.1 Product Overview + 1.2 Driver Description + 1.2.1 Driver Name + 1.2.2 File in the Driver Package + 1.3 System Requirements + 1.4 Licensing Information + +2.0 ADAPTER INSTALLATION and CONFIGURATION + 2.1 CS8900-based Adapter Configuration + 2.2 CS8920-based Adapter Configuration + +3.0 LOADING THE DRIVER AS A MODULE + +4.0 COMPILING THE DRIVER + 4.1 Compiling the Driver as a Loadable Module + 4.2 Compiling the driver to support memory mode + 4.3 Compiling the driver to support Rx DMA + +5.0 TESTING AND TROUBLESHOOTING + 5.1 Known Defects and Limitations + 5.2 Testing the Adapter + 5.2.1 Diagnostic Self-Test + 5.2.2 Diagnostic Network Test + 5.3 Using the Adapter's LEDs + 5.4 Resolving I/O Conflicts + +6.0 TECHNICAL SUPPORT + 6.1 Contacting Cirrus Logic's Technical Support + 6.2 Information Required Before Contacting Technical Support + 6.3 Obtaining the Latest Driver Version + 6.4 Current maintainer + 6.5 Kernel boot parameters + + +1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS +=============================================================================== + + +1.1 PRODUCT OVERVIEW + +The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow +IEEE 802.3 standards and support half or full-duplex operation in ISA bus +computers on 10 Mbps Ethernet networks. The adapters are designed for operation +in 16-bit ISA or EISA bus expansion slots and are available in +10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5 +or fiber networks). + +CS8920-based adapters are similar to the CS8900-based adapter with additional +features for Plug and Play (PnP) support and Wakeup Frame recognition. As +such, the configuration procedures differ somewhat between the two types of +adapters. Refer to the "Adapter Configuration" section for details on +configuring both types of adapters. + + +1.2 DRIVER DESCRIPTION + +The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux +v2.3.48 or greater kernel. It can be compiled directly into the kernel +or loaded at run-time as a device driver module. + +1.2.1 Driver Name: cs89x0 + +1.2.2 Files in the Driver Archive: + +The files in the driver at Cirrus' website include: + + readme.txt - this file + build - batch file to compile cs89x0.c. + cs89x0.c - driver C code + cs89x0.h - driver header file + cs89x0.o - pre-compiled module (for v2.2.5 kernel) + config/Config.in - sample file to include cs89x0 driver in the kernel. + config/Makefile - sample file to include cs89x0 driver in the kernel. + config/Space.c - sample file to include cs89x0 driver in the kernel. + + + +1.3 SYSTEM REQUIREMENTS + +The following hardware is required: + + * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter + + * IBM or IBM-compatible PC with: + * An 80386 or higher processor + * 16 bytes of contiguous IO space available between 210h - 370h + * One available IRQ (5,10,11,or 12 for the CS8900, 3-7,9-15 for CS8920). + + * Appropriate cable (and connector for AUI, 10BASE-2) for your network + topology. + +The following software is required: + +* LINUX kernel version 2.3.48 or higher + + * CS8900/20 Setup Utility (DOS-based) + + * LINUX kernel sources for your kernel (if compiling into kernel) + + * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel + or a module) + + + +1.4 LICENSING INFORMATION + +This program is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free Software +Foundation, version 1. + +This program is distributed in the hope that it will be useful, but WITHOUT +ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for +more details. + +For a full copy of the GNU General Public License, write to the Free Software +Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + + + +2.0 ADAPTER INSTALLATION and CONFIGURATION +=============================================================================== + +Both the CS8900 and CS8920-based adapters can be configured using parameters +stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup +Utility if you want to change the adapter's configuration in EEPROM. + +When loading the driver as a module, you can specify many of the adapter's +configuration parameters on the command-line to override the EEPROM's settings +or for interface configuration when an EEPROM is not used. (CS8920-based +adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE. + +Since the CS8900/20 Setup Utility is a DOS-based application, you must install +and configure the adapter in a DOS-based system using the CS8900/20 Setup +Utility before installation in the target LINUX system. (Not required if +installing a CS8900-based adapter and the default configuration is acceptable.) + + +2.1 CS8900-BASED ADAPTER CONFIGURATION + +CS8900-based adapters shipped from Cirrus Logic have been configured +with the following "default" settings: + + Operation Mode: Memory Mode + IRQ: 10 + Base I/O Address: 300 + Memory Base Address: D0000 + Optimization: DOS Client + Transmission Mode: Half-duplex + BootProm: None + Media Type: Autodetect (3-media cards) or + 10BASE-T (10BASE-T only adapter) + +You should only change the default configuration settings if conflicts with +another adapter exists. To change the adapter's configuration, run the +CS8900/20 Setup Utility. + + +2.2 CS8920-BASED ADAPTER CONFIGURATION + +CS8920-based adapters are shipped from Cirrus Logic configured as Plug +and Play (PnP) enabled. However, since the cs89x0 driver does NOT +support PnP, you must install the CS8920 adapter in a DOS-based PC and +run the CS8900/20 Setup Utility to disable PnP and configure the +adapter before installation in the target Linux system. Failure to do +this will leave the adapter inactive and the driver will be unable to +communicate with the adapter. + + + **************************************************************** + * CS8920-BASED ADAPTERS: * + * * + * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. * + * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST * + * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND * + * TO ACTIVATE THE ADAPTER. * + **************************************************************** + + + + +3.0 LOADING THE DRIVER AS A MODULE +=============================================================================== + +If the driver is compiled as a loadable module, you can load the driver module +with the 'modprobe' command. Many of the adapter's configuration parameters can +be specified as command-line arguments to the load command. This facility +provides a means to override the EEPROM's settings or for interface +configuration when an EEPROM is not used. + +Example: + + insmod cs89x0.o io=0x200 irq=0xA media=aui + +This example loads the module and configures the adapter to use an IO port base +address of 200h, interrupt 10, and use the AUI media connection. The following +configuration options are available on the command line: + +* io=### - specify IO address (200h-360h) +* irq=## - specify interrupt level +* use_dma=1 - Enable DMA +* dma=# - specify dma channel (Driver is compiled to support + Rx DMA only) +* dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16. +* media=rj45 - specify media type + or media=bnc + or media=aui + or media=auto +* duplex=full - specify forced half/full/autonegotiate duplex + or duplex=half + or duplex=auto +* debug=# - debug level (only available if the driver was compiled + for debugging) + +NOTES: + +a) If an EEPROM is present, any specified command-line parameter + will override the corresponding configuration value stored in + EEPROM. + +b) The "io" parameter must be specified on the command-line. + +c) The driver's hardware probe routine is designed to avoid + writing to I/O space until it knows that there is a cs89x0 + card at the written addresses. This could cause problems + with device probing. To avoid this behaviour, add one + to the `io=' module parameter. This doesn't actually change + the I/O address, but it is a flag to tell the driver + to partially initialise the hardware before trying to + identify the card. This could be dangerous if you are + not sure that there is a cs89x0 card at the provided address. + + For example, to scan for an adapter located at IO base 0x300, + specify an IO address of 0x301. + +d) The "duplex=auto" parameter is only supported for the CS8920. + +e) The minimum command-line configuration required if an EEPROM is + not present is: + + io + irq + media type (no autodetect) + +f) The following additional parameters are CS89XX defaults (values + used with no EEPROM or command-line argument). + + * DMA Burst = enabled + * IOCHRDY Enabled = enabled + * UseSA = enabled + * CS8900 defaults to half-duplex if not specified on command-line + * CS8920 defaults to autoneg if not specified on command-line + * Use reset defaults for other config parameters + * dma_mode = 0 + +g) You can use ifconfig to set the adapter's Ethernet address. + +h) Many Linux distributions use the 'modprobe' command to load + modules. This program uses the '/etc/conf.modules' file to + determine configuration information which is passed to a driver + module when it is loaded. All the configuration options which are + described above may be placed within /etc/conf.modules. + + For example: + + > cat /etc/conf.modules + ... + alias eth0 cs89x0 + options cs89x0 io=0x0200 dma=5 use_dma=1 + ... + + In this example we are telling the module system that the + ethernet driver for this machine should use the cs89x0 driver. We + are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma' + arguments to the driver when it is loaded. + +i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or + 7. You will probably find that other DMA channels will not work. + +j) The cs89x0 supports DMA for receiving only. DMA mode is + significantly more efficient. Flooding a 400 MHz Celeron machine + with large ping packets consumes 82% of its CPU capacity in non-DMA + mode. With DMA this is reduced to 45%. + +k) If your Linux kernel was compiled with inbuilt plug-and-play + support you will be able to find information about the cs89x0 card + with the command + + cat /proc/isapnp + +l) If during DMA operation you find erratic behavior or network data + corruption you should use your PC's BIOS to slow the EISA bus clock. + +m) If the cs89x0 driver is compiled directly into the kernel + (non-modular) then its I/O address is automatically determined by + ISA bus probing. The IRQ number, media options, etc are determined + from the card's EEPROM. + +n) If the cs89x0 driver is compiled directly into the kernel, DMA + mode may be selected by providing the kernel with a boot option + 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7). + + Kernel boot options may be provided on the LILO command line: + + LILO boot: linux cs89x0_dma=5 + + or they may be placed in /etc/lilo.conf: + + image=/boot/bzImage-2.3.48 + append="cs89x0_dma=5" + label=linux + root=/dev/hda5 + read-only + + The DMA Rx buffer size is hardwired to 16 kbytes in this mode. + (64k mode is not available). + + +4.0 COMPILING THE DRIVER +=============================================================================== + +The cs89x0 driver can be compiled directly into the kernel or compiled into +a loadable device driver module. + + +4.1 COMPILING THE DRIVER AS A LOADABLE MODULE + +To compile the driver into a loadable module, use the following command +(single command line, without quotes): + +"gcc -D__KERNEL__ -I/usr/src/linux/include -I/usr/src/linux/net/inet -Wall +-Wstrict-prototypes -O2 -fomit-frame-pointer -DMODULE -DCONFIG_MODVERSIONS +-c cs89x0.c" + +4.2 COMPILING THE DRIVER TO SUPPORT MEMORY MODE + +Support for memory mode was not carried over into the 2.3 series kernels. + +4.3 COMPILING THE DRIVER TO SUPPORT Rx DMA + +The compile-time optionality for DMA was removed in the 2.3 kernel +series. DMA support is now unconditionally part of the driver. It is +enabled by the 'use_dma=1' module option. + + +5.0 TESTING AND TROUBLESHOOTING +=============================================================================== + +5.1 KNOWN DEFECTS and LIMITATIONS + +Refer to the RELEASE.TXT file distributed as part of this archive for a list of +known defects, driver limitations, and work arounds. + + +5.2 TESTING THE ADAPTER + +Once the adapter has been installed and configured, the diagnostic option of +the CS8900/20 Setup Utility can be used to test the functionality of the +adapter and its network connection. Use the diagnostics 'Self Test' option to +test the functionality of the adapter with the hardware configuration you have +assigned. You can use the diagnostics 'Network Test' to test the ability of the +adapter to communicate across the Ethernet with another PC equipped with a +CS8900/20-based adapter card (it must also be running the CS8900/20 Setup +Utility). + + NOTE: The Setup Utility's diagnostics are designed to run in a + DOS-only operating system environment. DO NOT run the diagnostics + from a DOS or command prompt session under Windows 95, Windows NT, + OS/2, or other operating system. + +To run the diagnostics tests on the CS8900/20 adapter: + + 1.) Boot DOS on the PC and start the CS8900/20 Setup Utility. + + 2.) The adapter's current configuration is displayed. Hit the ENTER key to + get to the main menu. + + 4.) Select 'Diagnostics' (ALT-G) from the main menu. + * Select 'Self-Test' to test the adapter's basic functionality. + * Select 'Network Test' to test the network connection and cabling. + + +5.2.1 DIAGNOSTIC SELF-TEST + +The diagnostic self-test checks the adapter's basic functionality as well as +its ability to communicate across the ISA bus based on the system resources +assigned during hardware configuration. The following tests are performed: + + * IO Register Read/Write Test + The IO Register Read/Write test insures that the CS8900/20 can be + accessed in IO mode, and that the IO base address is correct. + + * Shared Memory Test + The Shared Memory test insures the CS8900/20 can be accessed in memory + mode and that the range of memory addresses assigned does not conflict + with other devices in the system. + + * Interrupt Test + The Interrupt test insures there are no conflicts with the assigned IRQ + signal. + + * EEPROM Test + The EEPROM test insures the EEPROM can be read. + + * Chip RAM Test + The Chip RAM test insures the 4K of memory internal to the CS8900/20 is + working properly. + + * Internal Loop-back Test + The Internal Loop Back test insures the adapter's transmitter and + receiver are operating properly. If this test fails, make sure the + adapter's cable is connected to the network (check for LED activity for + example). + + * Boot PROM Test + The Boot PROM test insures the Boot PROM is present, and can be read. + Failure indicates the Boot PROM was not successfully read due to a + hardware problem or due to a conflicts on the Boot PROM address + assignment. (Test only applies if the adapter is configured to use the + Boot PROM option.) + +Failure of a test item indicates a possible system resource conflict with +another device on the ISA bus. In this case, you should use the Manual Setup +option to reconfigure the adapter by selecting a different value for the system +resource that failed. + + +5.2.2 DIAGNOSTIC NETWORK TEST + +The Diagnostic Network Test verifies a working network connection by +transferring data between two CS8900/20 adapters installed in different PCs +on the same network. (Note: the diagnostic network test should not be run +between two nodes across a router.) + +This test requires that each of the two PCs have a CS8900/20-based adapter +installed and have the CS8900/20 Setup Utility running. The first PC is +configured as a Responder and the other PC is configured as an Initiator. +Once the Initiator is started, it sends data frames to the Responder which +returns the frames to the Initiator. + +The total number of frames received and transmitted are displayed on the +Initiator's display, along with a count of the number of frames received and +transmitted OK or in error. The test can be terminated anytime by the user at +either PC. + +To setup the Diagnostic Network Test: + + 1.) Select a PC with a CS8900/20-based adapter and a known working network + connection to act as the Responder. Run the CS8900/20 Setup Utility + and select 'Diagnostics -> Network Test -> Responder' from the main + menu. Hit ENTER to start the Responder. + + 2.) Return to the PC with the CS8900/20-based adapter you want to test and + start the CS8900/20 Setup Utility. + + 3.) From the main menu, Select 'Diagnostic -> Network Test -> Initiator'. + Hit ENTER to start the test. + +You may stop the test on the Initiator at any time while allowing the Responder +to continue running. In this manner, you can move to additional PCs and test +them by starting the Initiator on another PC without having to stop/start the +Responder. + + + +5.3 USING THE ADAPTER'S LEDs + +The 2 and 3-media adapters have two LEDs visible on the back end of the board +located near the 10Base-T connector. + +Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T +connection. (Only applies to 10Base-T. The green LED has no significance for +a 10Base-2 or AUI connection.) + +TX/RX LED: The yellow LED lights briefly each time the adapter transmits or +receives data. (The yellow LED will appear to "flicker" on a typical network.) + + +5.4 RESOLVING I/O CONFLICTS + +An IO conflict occurs when two or more adapter use the same ISA resource (IO +address, memory address or IRQ). You can usually detect an IO conflict in one +of four ways after installing and or configuring the CS8900/20-based adapter: + + 1.) The system does not boot properly (or at all). + + 2.) The driver cannot communicate with the adapter, reporting an "Adapter + not found" error message. + + 3.) You cannot connect to the network or the driver will not load. + + 4.) If you have configured the adapter to run in memory mode but the driver + reports it is using IO mode when loading, this is an indication of a + memory address conflict. + +If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a +diagnostic self-test. Normally, the ISA resource in conflict will fail the +self-test. If so, reconfigure the adapter selecting another choice for the +resource in conflict. Run the diagnostics again to check for further IO +conflicts. + +In some cases, such as when the PC will not boot, it may be necessary to remove +the adapter and reconfigure it by installing it in another PC to run the +CS8900/20 Setup Utility. Once reinstalled in the target system, run the +diagnostics self-test to ensure the new configuration is free of conflicts +before loading the driver again. + +When manually configuring the adapter, keep in mind the typical ISA system +resource usage as indicated in the tables below. + +I/O Address Device IRQ Device +----------- -------- --- -------- + 200-20F Game I/O adapter 3 COM2, Bus Mouse + 230-23F Bus Mouse 4 COM1 + 270-27F LPT3: third parallel port 5 LPT2 + 2F0-2FF COM2: second serial port 6 Floppy Disk controller + 320-32F Fixed disk controller 7 LPT1 + 8 Real-time Clock + 9 EGA/VGA display adapter + 12 Mouse (PS/2) +Memory Address Device 13 Math Coprocessor +-------------- --------------------- 14 Hard Disk controller +A000-BFFF EGA Graphics Adapter +A000-C7FF VGA Graphics Adapter +B000-BFFF Mono Graphics Adapter +B800-BFFF Color Graphics Adapter +E000-FFFF AT BIOS + + + + +6.0 TECHNICAL SUPPORT +=============================================================================== + +6.1 CONTACTING CIRRUS LOGIC'S TECHNICAL SUPPORT + +Cirrus Logic's CS89XX Technical Application Support can be reached at: + +Telephone :(800) 888-5016 (from inside U.S. and Canada) + :(512) 442-7555 (from outside the U.S. and Canada) +Fax :(512) 912-3871 +Email :ethernet@crystal.cirrus.com +WWW :http://www.cirrus.com + + +6.2 INFORMATION REQUIRED BEFORE CONTACTING TECHNICAL SUPPORT + +Before contacting Cirrus Logic for technical support, be prepared to provide as +Much of the following information as possible. + +1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.) + +2.) Adapter configuration + + * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel + * Plug and Play enabled/disabled (CS8920-based adapters only) + * Configured for media auto-detect or specific media type (which type). + +3.) PC System's Configuration + + * Plug and Play system (yes/no) + * BIOS (make and version) + * System make and model + * CPU (type and speed) + * System RAM + * SCSI Adapter + +4.) Software + + * CS89XX driver and version + * Your network operating system and version + * Your system's OS version + * Version of all protocol support files + +5.) Any Error Message displayed. + + + +6.3 OBTAINING THE LATEST DRIVER VERSION + +You can obtain the latest CS89XX drivers and support software from Cirrus Logic's +Web site. You can also contact Cirrus Logic's Technical Support (email: +ethernet@crystal.cirrus.com) and request that you be registered for automatic +software-update notification. + +Cirrus Logic maintains a web page at http://www.cirrus.com with the +latest drivers and technical publications. + + +6.4 Current maintainer + +In February 2000 the maintenance of this driver was assumed by Andrew +Morton. + +6.5 Kernel module parameters + +For use in embedded environments with no cs89x0 EEPROM, the kernel boot +parameter `cs89x0_media=' has been implemented. Usage is: + + cs89x0_media=rj45 or + cs89x0_media=aui or + cs89x0_media=bnc + diff --git a/Documentation/networking/cxacru-cf.py b/Documentation/networking/cxacru-cf.py new file mode 100644 index 000000000..b41d29839 --- /dev/null +++ b/Documentation/networking/cxacru-cf.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python +# Copyright 2009 Simon Arlott +# +# This program is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation; either version 2 of the License, or (at your option) +# any later version. +# +# This program is distributed in the hope that it will be useful, but WITHOUT +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or +# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for +# more details. +# +# You should have received a copy of the GNU General Public License along with +# this program; if not, write to the Free Software Foundation, Inc., 59 +# Temple Place - Suite 330, Boston, MA 02111-1307, USA. +# +# Usage: cxacru-cf.py < cxacru-cf.bin +# Output: values string suitable for the sysfs adsl_config attribute +# +# Warning: cxacru-cf.bin with MD5 hash cdbac2689969d5ed5d4850f117702110 +# contains mis-aligned values which will stop the modem from being able +# to make a connection. If the first and last two bytes are removed then +# the values become valid, but the modulation will be forced to ANSI +# T1.413 only which may not be appropriate. +# +# The original binary format is a packed list of le32 values. + +import sys +import struct + +i = 0 +while True: + buf = sys.stdin.read(4) + + if len(buf) == 0: + break + elif len(buf) != 4: + sys.stdout.write("\n") + sys.stderr.write("Error: read {0} not 4 bytes\n".format(len(buf))) + sys.exit(1) + + if i > 0: + sys.stdout.write(" ") + sys.stdout.write("{0:x}={1}".format(i, struct.unpack("<I", buf)[0])) + i += 1 + +sys.stdout.write("\n") diff --git a/Documentation/networking/cxacru.txt b/Documentation/networking/cxacru.txt new file mode 100644 index 000000000..2cce04457 --- /dev/null +++ b/Documentation/networking/cxacru.txt @@ -0,0 +1,100 @@ +Firmware is required for this device: http://accessrunner.sourceforge.net/ + +While it is capable of managing/maintaining the ADSL connection without the +module loaded, the device will sometimes stop responding after unloading the +driver and it is necessary to unplug/remove power to the device to fix this. + +Note: support for cxacru-cf.bin has been removed. It was not loaded correctly +so it had no effect on the device configuration. Fixing it could have stopped +existing devices working when an invalid configuration is supplied. + +There is a script cxacru-cf.py to convert an existing file to the sysfs form. + +Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/ +these are directories named cxacruN where N is the device number. A symlink +named device points to the USB interface device's directory which contains +several sysfs attribute files for retrieving device statistics: + +* adsl_controller_version + +* adsl_headend +* adsl_headend_environment + Information about the remote headend. + +* adsl_config + Configuration writing interface. + Write parameters in hexadecimal format <index>=<value>, + separated by whitespace, e.g.: + "1=0 a=5" + Up to 7 parameters at a time will be sent and the modem will restart + the ADSL connection when any value is set. These are logged for future + reference. + +* downstream_attenuation (dB) +* downstream_bits_per_frame +* downstream_rate (kbps) +* downstream_snr_margin (dB) + Downstream stats. + +* upstream_attenuation (dB) +* upstream_bits_per_frame +* upstream_rate (kbps) +* upstream_snr_margin (dB) +* transmitter_power (dBm/Hz) + Upstream stats. + +* downstream_crc_errors +* downstream_fec_errors +* downstream_hec_errors +* upstream_crc_errors +* upstream_fec_errors +* upstream_hec_errors + Error counts. + +* line_startable + Indicates that ADSL support on the device + is/can be enabled, see adsl_start. + +* line_status + "initialising" + "down" + "attempting to activate" + "training" + "channel analysis" + "exchange" + "waiting" + "up" + + Changes between "down" and "attempting to activate" + if there is no signal. + +* link_status + "not connected" + "connected" + "lost" + +* mac_address + +* modulation + "" (when not connected) + "ANSI T1.413" + "ITU-T G.992.1 (G.DMT)" + "ITU-T G.992.2 (G.LITE)" + +* startup_attempts + Count of total attempts to initialise ADSL. + +To enable/disable ADSL, the following can be written to the adsl_state file: + "start" + "stop + "restart" (stops, waits 1.5s, then starts) + "poll" (used to resume status polling if it was disabled due to failure) + +Changes in adsl/line state are reported via kernel log messages: + [4942145.150704] ATM dev 0: ADSL state: running + [4942243.663766] ATM dev 0: ADSL line: down + [4942249.665075] ATM dev 0: ADSL line: attempting to activate + [4942253.654954] ATM dev 0: ADSL line: training + [4942255.666387] ATM dev 0: ADSL line: channel analysis + [4942259.656262] ATM dev 0: ADSL line: exchange + [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up) diff --git a/Documentation/networking/cxgb.txt b/Documentation/networking/cxgb.txt new file mode 100644 index 000000000..20a887615 --- /dev/null +++ b/Documentation/networking/cxgb.txt @@ -0,0 +1,352 @@ + Chelsio N210 10Gb Ethernet Network Controller + + Driver Release Notes for Linux + + Version 2.1.1 + + June 20, 2005 + +CONTENTS +======== + INTRODUCTION + FEATURES + PERFORMANCE + DRIVER MESSAGES + KNOWN ISSUES + SUPPORT + + +INTRODUCTION +============ + + This document describes the Linux driver for Chelsio 10Gb Ethernet Network + Controller. This driver supports the Chelsio N210 NIC and is backward + compatible with the Chelsio N110 model 10Gb NICs. + + +FEATURES +======== + + Adaptive Interrupts (adaptive-rx) + --------------------------------- + + This feature provides an adaptive algorithm that adjusts the interrupt + coalescing parameters, allowing the driver to dynamically adapt the latency + settings to achieve the highest performance during various types of network + load. + + The interface used to control this feature is ethtool. Please see the + ethtool manpage for additional usage information. + + By default, adaptive-rx is disabled. + To enable adaptive-rx: + + ethtool -C <interface> adaptive-rx on + + To disable adaptive-rx, use ethtool: + + ethtool -C <interface> adaptive-rx off + + After disabling adaptive-rx, the timer latency value will be set to 50us. + You may set the timer latency after disabling adaptive-rx: + + ethtool -C <interface> rx-usecs <microseconds> + + An example to set the timer latency value to 100us on eth0: + + ethtool -C eth0 rx-usecs 100 + + You may also provide a timer latency value while disabling adaptive-rx: + + ethtool -C <interface> adaptive-rx off rx-usecs <microseconds> + + If adaptive-rx is disabled and a timer latency value is specified, the timer + will be set to the specified value until changed by the user or until + adaptive-rx is enabled. + + To view the status of the adaptive-rx and timer latency values: + + ethtool -c <interface> + + + TCP Segmentation Offloading (TSO) Support + ----------------------------------------- + + This feature, also known as "large send", enables a system's protocol stack + to offload portions of outbound TCP processing to a network interface card + thereby reducing system CPU utilization and enhancing performance. + + The interface used to control this feature is ethtool version 1.8 or higher. + Please see the ethtool manpage for additional usage information. + + By default, TSO is enabled. + To disable TSO: + + ethtool -K <interface> tso off + + To enable TSO: + + ethtool -K <interface> tso on + + To view the status of TSO: + + ethtool -k <interface> + + +PERFORMANCE +=========== + + The following information is provided as an example of how to change system + parameters for "performance tuning" an what value to use. You may or may not + want to change these system parameters, depending on your server/workstation + application. Doing so is not warranted in any way by Chelsio Communications, + and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss + of data or damage to equipment. + + Your distribution may have a different way of doing things, or you may prefer + a different method. These commands are shown only to provide an example of + what to do and are by no means definitive. + + Making any of the following system changes will only last until you reboot + your system. You may want to write a script that runs at boot-up which + includes the optimal settings for your system. + + Setting PCI Latency Timer: + setpci -d 1425:* 0x0c.l=0x0000F800 + + Disabling TCP timestamp: + sysctl -w net.ipv4.tcp_timestamps=0 + + Disabling SACK: + sysctl -w net.ipv4.tcp_sack=0 + + Setting large number of incoming connection requests: + sysctl -w net.ipv4.tcp_max_syn_backlog=3000 + + Setting maximum receive socket buffer size: + sysctl -w net.core.rmem_max=1024000 + + Setting maximum send socket buffer size: + sysctl -w net.core.wmem_max=1024000 + + Set smp_affinity (on a multiprocessor system) to a single CPU: + echo 1 > /proc/irq/<interrupt_number>/smp_affinity + + Setting default receive socket buffer size: + sysctl -w net.core.rmem_default=524287 + + Setting default send socket buffer size: + sysctl -w net.core.wmem_default=524287 + + Setting maximum option memory buffers: + sysctl -w net.core.optmem_max=524287 + + Setting maximum backlog (# of unprocessed packets before kernel drops): + sysctl -w net.core.netdev_max_backlog=300000 + + Setting TCP read buffers (min/default/max): + sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" + + Setting TCP write buffers (min/pressure/max): + sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" + + Setting TCP buffer space (min/pressure/max): + sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000" + + TCP window size for single connections: + The receive buffer (RX_WINDOW) size must be at least as large as the + Bandwidth-Delay Product of the communication link between the sender and + receiver. Due to the variations of RTT, you may want to increase the buffer + size up to 2 times the Bandwidth-Delay Product. Reference page 289 of + "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens. + At 10Gb speeds, use the following formula: + RX_WINDOW >= 1.25MBytes * RTT(in milliseconds) + Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000 + RX_WINDOW sizes of 256KB - 512KB should be sufficient. + Setting the min, max, and default receive buffer (RX_WINDOW) size: + sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>" + + TCP window size for multiple connections: + The receive buffer (RX_WINDOW) size may be calculated the same as single + connections, but should be divided by the number of connections. The + smaller window prevents congestion and facilitates better pacing, + especially if/when MAC level flow control does not work well or when it is + not supported on the machine. Experimentation may be necessary to attain + the correct value. This method is provided as a starting point for the + correct receive buffer size. + Setting the min, max, and default receive buffer (RX_WINDOW) size is + performed in the same manner as single connection. + + +DRIVER MESSAGES +=============== + + The following messages are the most common messages logged by syslog. These + may be found in /var/log/messages. + + Driver up: + Chelsio Network Driver - version 2.1.1 + + NIC detected: + eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit + + Link up: + eth#: link is up at 10 Gbps, full duplex + + Link down: + eth#: link is down + + +KNOWN ISSUES +============ + + These issues have been identified during testing. The following information + is provided as a workaround to the problem. In some cases, this problem is + inherent to Linux or to a particular Linux Distribution and/or hardware + platform. + + 1. Large number of TCP retransmits on a multiprocessor (SMP) system. + + On a system with multiple CPUs, the interrupt (IRQ) for the network + controller may be bound to more than one CPU. This will cause TCP + retransmits if the packet data were to be split across different CPUs + and re-assembled in a different order than expected. + + To eliminate the TCP retransmits, set smp_affinity on the particular + interrupt to a single CPU. You can locate the interrupt (IRQ) used on + the N110/N210 by using ifconfig: + ifconfig <dev_name> | grep Interrupt + Set the smp_affinity to a single CPU: + echo 1 > /proc/irq/<interrupt_number>/smp_affinity + + It is highly suggested that you do not run the irqbalance daemon on your + system, as this will change any smp_affinity setting you have applied. + The irqbalance daemon runs on a 10 second interval and binds interrupts + to the least loaded CPU determined by the daemon. To disable this daemon: + chkconfig --level 2345 irqbalance off + + By default, some Linux distributions enable the kernel feature, + irqbalance, which performs the same function as the daemon. To disable + this feature, add the following line to your bootloader: + noirqbalance + + Example using the Grub bootloader: + title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp) + root (hd0,0) + kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance + initrd /initrd-2.4.21-27.ELsmp.img + + 2. After running insmod, the driver is loaded and the incorrect network + interface is brought up without running ifup. + + When using 2.4.x kernels, including RHEL kernels, the Linux kernel + invokes a script named "hotplug". This script is primarily used to + automatically bring up USB devices when they are plugged in, however, + the script also attempts to automatically bring up a network interface + after loading the kernel module. The hotplug script does this by scanning + the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking + for HWADDR=<mac_address>. + + If the hotplug script does not find the HWADDRR within any of the + ifcfg-eth# files, it will bring up the device with the next available + interface name. If this interface is already configured for a different + network card, your new interface will have incorrect IP address and + network settings. + + To solve this issue, you can add the HWADDR=<mac_address> key to the + interface config file of your network controller. + + To disable this "hotplug" feature, you may add the driver (module name) + to the "blacklist" file located in /etc/hotplug. It has been noted that + this does not work for network devices because the net.agent script + does not use the blacklist file. Simply remove, or rename, the net.agent + script located in /etc/hotplug to disable this feature. + + 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic + on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset. + + If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel + chipset, you may experience the "133-Mhz Mode Split Completion Data + Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the + bus PCI-X bus. + + AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel + can provide stale data via split completion cycles to a PCI-X card that + is operating at 133 Mhz", causing data corruption. + + AMD's provides three workarounds for this problem, however, Chelsio + recommends the first option for best performance with this bug: + + For 133Mhz secondary bus operation, limit the transaction length and + the number of outstanding transactions, via BIOS configuration + programming of the PCI-X card, to the following: + + Data Length (bytes): 1k + Total allowed outstanding transactions: 2 + + Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004, + section 56, "133-MHz Mode Split Completion Data Corruption" for more + details with this bug and workarounds suggested by AMD. + + It may be possible to work outside AMD's recommended PCI-X settings, try + increasing the Data Length to 2k bytes for increased performance. If you + have issues with these settings, please revert to the "safe" settings + and duplicate the problem before submitting a bug or asking for support. + + NOTE: The default setting on most systems is 8 outstanding transactions + and 2k bytes data length. + + 4. On multiprocessor systems, it has been noted that an application which + is handling 10Gb networking can switch between CPUs causing degraded + and/or unstable performance. + + If running on an SMP system and taking performance measurements, it + is suggested you either run the latest netperf-2.4.0+ or use a binding + tool such as Tim Hockin's procstate utilities (runon) + <http://www.hockin.org/~thockin/procstate/>. + + Binding netserver and netperf (or other applications) to particular + CPUs will have a significant difference in performance measurements. + You may need to experiment which CPU to bind the application to in + order to achieve the best performance for your system. + + If you are developing an application designed for 10Gb networking, + please keep in mind you may want to look at kernel functions + sched_setaffinity & sched_getaffinity to bind your application. + + If you are just running user-space applications such as ftp, telnet, + etc., you may want to try the runon tool provided by Tim Hockin's + procstate utility. You could also try binding the interface to a + particular CPU: runon 0 ifup eth0 + + +SUPPORT +======= + + If you have problems with the software or hardware, please contact our + customer support team via email at support@chelsio.com or check our website + at http://www.chelsio.com + +=============================================================================== + + Chelsio Communications + 370 San Aleso Ave. + Suite 100 + Sunnyvale, CA 94085 + http://www.chelsio.com + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License, version 2, as +published by the Free Software Foundation. + +You should have received a copy of the GNU General Public License along +with this program; if not, write to the Free Software Foundation, Inc., +59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED +WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. + + Copyright (c) 2003-2005 Chelsio Communications. All rights reserved. + +=============================================================================== diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt new file mode 100644 index 000000000..55c575fca --- /dev/null +++ b/Documentation/networking/dccp.txt @@ -0,0 +1,207 @@ +DCCP protocol +============= + + +Contents +======== +- Introduction +- Missing features +- Socket options +- Sysctl variables +- IOCTLs +- Other tunables +- Notes + + +Introduction +============ +Datagram Congestion Control Protocol (DCCP) is an unreliable, connection +oriented protocol designed to solve issues present in UDP and TCP, particularly +for real-time and multimedia (streaming) traffic. +It divides into a base protocol (RFC 4340) and pluggable congestion control +modules called CCIDs. Like pluggable TCP congestion control, at least one CCID +needs to be enabled in order for the protocol to function properly. In the Linux +implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as +the TCP-friendly CCID3 (RFC 4342), are optional. +For a brief introduction to CCIDs and suggestions for choosing a CCID to match +given applications, see section 10 of RFC 4340. + +It has a base protocol and pluggable congestion control IDs (CCIDs). + +DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol +is at http://www.ietf.org/html.charters/dccp-charter.html + + +Missing features +================ +The Linux DCCP implementation does not currently support all the features that are +specified in RFCs 4340...42. + +The known bugs are at: + http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP + +For more up-to-date versions of the DCCP implementation, please consider using +the experimental DCCP test tree; instructions for checking this out are on: +http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree + + +Socket options +============== +DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes +a policy ID as argument and can only be set before the connection (i.e. changes +during an established connection are not supported). Currently, two policies are +defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, +and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an +u32 priority value as ancillary data to sendmsg(), where higher numbers indicate +a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to +be formatted using a cmsg(3) message header filled in as follows: + cmsg->cmsg_level = SOL_DCCP; + cmsg->cmsg_type = DCCP_SCM_PRIORITY; + cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ + +DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero +value is always interpreted as unbounded queue length. If different from zero, +the interpretation of this parameter depends on the current dequeuing policy +(see above): the "simple" policy will enforce a fixed queue size by returning +EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the +lowest-priority packet first. The default value for this parameter is +initialised from /proc/sys/net/dccp/default/tx_qlen. + +DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of +service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, +the socket will fall back to 0 (which means that no meaningful service code +is present). On active sockets this is set before connect(); specifying more +than one code has no effect (all subsequent service codes are ignored). The +case is different for passive sockets, where multiple service codes (up to 32) +can be set before calling bind(). + +DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet +size (application payload size) in bytes, see RFC 4340, section 14. + +DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs +supported by the endpoint. The option value is an array of type uint8_t whose +size is passed as option length. The minimum array size is 4 elements, the +value returned in the optlen argument always reflects the true number of +built-in CCIDs. + +DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same +time, combining the operation of the next two socket options. This option is +preferable over the latter two, since often applications will use the same +type of CCID for both directions; and mixed use of CCIDs is not currently well +understood. This socket option takes as argument at least one uint8_t value, or +an array of uint8_t values, which must match available CCIDS (see above). CCIDs +must be registered on the socket before calling connect() or listen(). + +DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets +the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. +Please note that the getsockopt argument type here is `int', not uint8_t. + +DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. + +DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold +timewait state when closing the connection (RFC 4340, 8.3). The usual case is +that the closing server sends a CloseReq, whereupon the client holds timewait +state. When this boolean socket option is on, the server sends a Close instead +and will enter TIMEWAIT. This option must be set after accept() returns. + +DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the +partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums +always cover the entire packet and that only fully covered application data is +accepted by the receiver. Hence, when using this feature on the sender, it must +be enabled at the receiver, too with suitable choice of CsCov. + +DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the + range 0..15 are acceptable. The default setting is 0 (full coverage), + values between 1..15 indicate partial coverage. +DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it + sets a threshold, where again values 0..15 are acceptable. The default + of 0 means that all packets with a partial coverage will be discarded. + Values in the range 1..15 indicate that packets with minimally such a + coverage value are also acceptable. The higher the number, the more + restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage + settings are inherited to the child socket after accept(). + +The following two options apply to CCID 3 exclusively and are getsockopt()-only. +In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned. +DCCP_SOCKOPT_CCID_RX_INFO + Returns a `struct tfrc_rx_info' in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_rx_info). +DCCP_SOCKOPT_CCID_TX_INFO + Returns a `struct tfrc_tx_info' in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_tx_info). + +On unidirectional connections it is useful to close the unused half-connection +via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. + + +Sysctl variables +================ +Several DCCP default parameters can be managed by the following sysctls +(sysctl net.dccp.default or /proc/sys/net/dccp/default): + +request_retries + The number of active connection initiation retries (the number of + Requests minus one) before timing out. In addition, it also governs + the behaviour of the other, passive side: this variable also sets + the number of times DCCP repeats sending a Response when the initial + handshake does not progress from RESPOND to OPEN (i.e. when no Ack + is received after the initial Request). This value should be greater + than 0, suggested is less than 10. Analogue of tcp_syn_retries. + +retries1 + How often a DCCP Response is retransmitted until the listening DCCP + side considers its connecting peer dead. Analogue of tcp_retries1. + +retries2 + The number of times a general DCCP packet is retransmitted. This has + importance for retransmitted acknowledgments and feature negotiation, + data packets are never retransmitted. Analogue of tcp_retries2. + +tx_ccid = 2 + Default CCID for the sender-receiver half-connection. Depending on the + choice of CCID, the Send Ack Vector feature is enabled automatically. + +rx_ccid = 2 + Default CCID for the receiver-sender half-connection; see tx_ccid. + +seq_window = 100 + The initial sequence window (sec. 7.5.2) of the sender. This influences + the local ackno validity and the remote seqno validity windows (7.5.1). + Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. + +tx_qlen = 5 + The size of the transmit buffer in packets. A value of 0 corresponds + to an unbounded transmit buffer. + +sync_ratelimit = 125 ms + The timeout between subsequent DCCP-Sync packets sent in response to + sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit + of this parameter is milliseconds; a value of 0 disables rate-limiting. + + +IOCTLS +====== +FIONREAD + Works as in udp(7): returns in the `int' argument pointer the size of + the next pending datagram in bytes, or 0 when no datagram is pending. + + +Other tunables +============== +Per-route rto_min support + CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value + of the RTO timer. This setting can be modified via the 'rto_min' option + of iproute2; for example: + > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 + > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 + > ip route show dev wlan0 + CCID-3 also supports the rto_min setting: it is used to define the lower + bound for the expiry of the nofeedback timer. This can be useful on LANs + with very low RTTs (e.g., loopback, Gbit ethernet). + + +Notes +===== +DCCP does not travel through NAT successfully at present on many boxes. This is +because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT +support for DCCP has been added. diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt new file mode 100644 index 000000000..13a857753 --- /dev/null +++ b/Documentation/networking/dctcp.txt @@ -0,0 +1,44 @@ +DCTCP (DataCenter TCP) +---------------------- + +DCTCP is an enhancement to the TCP congestion control algorithm for data +center networks and leverages Explicit Congestion Notification (ECN) in +the data center network to provide multi-bit feedback to the end hosts. + +To enable it on end hosts: + + sysctl -w net.ipv4.tcp_congestion_control=dctcp + sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional) + +All switches in the data center network running DCTCP must support ECN +marking and be configured for marking when reaching defined switch buffer +thresholds. The default ECN marking threshold heuristic for DCTCP on +switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps, +but might need further careful tweaking. + +For more details, see below documents: + +Paper: + +The algorithm is further described in detail in the following two +SIGCOMM/SIGMETRICS papers: + + i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, + Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan: + "Data Center TCP (DCTCP)", Data Center Networks session + Proc. ACM SIGCOMM, New Delhi, 2010. + http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf + http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192 + +ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar: + "Analysis of DCTCP: Stability, Convergence, and Fairness" + Proc. ACM SIGMETRICS, San Jose, 2011. + http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf + +IETF informational draft: + + http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 + +DCTCP site: + + http://simula.stanford.edu/~alizade/Site/DCTCP.html diff --git a/Documentation/networking/de4x5.txt b/Documentation/networking/de4x5.txt new file mode 100644 index 000000000..c8e4ca9b2 --- /dev/null +++ b/Documentation/networking/de4x5.txt @@ -0,0 +1,178 @@ + Originally, this driver was written for the Digital Equipment + Corporation series of EtherWORKS Ethernet cards: + + DE425 TP/COAX EISA + DE434 TP PCI + DE435 TP/COAX/AUI PCI + DE450 TP/COAX/AUI PCI + DE500 10/100 PCI Fasternet + + but it will now attempt to support all cards which conform to the + Digital Semiconductor SROM Specification. The driver currently + recognises the following chips: + + DC21040 (no SROM) + DC21041[A] + DC21140[A] + DC21142 + DC21143 + + So far the driver is known to work with the following cards: + + KINGSTON + Linksys + ZNYX342 + SMC8432 + SMC9332 (w/new SROM) + ZNYX31[45] + ZNYX346 10/100 4 port (can act as a 10/100 bridge!) + + The driver has been tested on a relatively busy network using the DE425, + DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred + 16M of data to a DECstation 5000/200 as follows: + + TCP UDP + TX RX TX RX + DE425 1030k 997k 1170k 1128k + DE434 1063k 995k 1170k 1125k + DE435 1063k 995k 1170k 1125k + DE500 1063k 998k 1170k 1125k in 10Mb/s mode + + All values are typical (in kBytes/sec) from a sample of 4 for each + measurement. Their error is +/-20k on a quiet (private) network and also + depend on what load the CPU has. + + ========================================================================= + + The ability to load this driver as a loadable module has been included + and used extensively during the driver development (to save those long + reboot sequences). Loadable module support under PCI and EISA has been + achieved by letting the driver autoprobe as if it were compiled into the + kernel. Do make sure you're not sharing interrupts with anything that + cannot accommodate interrupt sharing! + + To utilise this ability, you have to do 8 things: + + 0) have a copy of the loadable modules code installed on your system. + 1) copy de4x5.c from the /linux/drivers/net directory to your favourite + temporary directory. + 2) for fixed autoprobes (not recommended), edit the source code near + line 5594 to reflect the I/O address you're using, or assign these when + loading by: + + insmod de4x5 io=0xghh where g = bus number + hh = device number + + NB: autoprobing for modules is now supported by default. You may just + use: + + insmod de4x5 + + to load all available boards. For a specific board, still use + the 'io=?' above. + 3) compile de4x5.c, but include -DMODULE in the command line to ensure + that the correct bits are compiled (see end of source code). + 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a + kernel with the de4x5 configuration turned off and reboot. + 5) insmod de4x5 [io=0xghh] + 6) run the net startup bits for your new eth?? interface(s) manually + (usually /etc/rc.inet[12] at boot time). + 7) enjoy! + + To unload a module, turn off the associated interface(s) + 'ifconfig eth?? down' then 'rmmod de4x5'. + + Automedia detection is included so that in principle you can disconnect + from, e.g. TP, reconnect to BNC and things will still work (after a + pause whilst the driver figures out where its media went). My tests + using ping showed that it appears to work.... + + By default, the driver will now autodetect any DECchip based card. + Should you have a need to restrict the driver to DIGITAL only cards, you + can compile with a DEC_ONLY define, or if loading as a module, use the + 'dec_only=1' parameter. + + I've changed the timing routines to use the kernel timer and scheduling + functions so that the hangs and other assorted problems that occurred + while autosensing the media should be gone. A bonus for the DC21040 + auto media sense algorithm is that it can now use one that is more in + line with the rest (the DC21040 chip doesn't have a hardware timer). + The downside is the 1 'jiffies' (10ms) resolution. + + IEEE 802.3u MII interface code has been added in anticipation that some + products may use it in the future. + + The SMC9332 card has a non-compliant SROM which needs fixing - I have + patched this driver to detect it because the SROM format used complies + to a previous DEC-STD format. + + I have removed the buffer copies needed for receive on Intels. I cannot + remove them for Alphas since the Tulip hardware only does longword + aligned DMA transfers and the Alphas get alignment traps with non + longword aligned data copies (which makes them really slow). No comment. + + I have added SROM decoding routines to make this driver work with any + card that supports the Digital Semiconductor SROM spec. This will help + all cards running the dc2114x series chips in particular. Cards using + the dc2104x chips should run correctly with the basic driver. I'm in + debt to <mjacob@feral.com> for the testing and feedback that helped get + this feature working. So far we have tested KINGSTON, SMC8432, SMC9332 + (with the latest SROM complying with the SROM spec V3: their first was + broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315 + (quad 21041 MAC) cards also appear to work despite their incorrectly + wired IRQs. + + I have added a temporary fix for interrupt problems when some SCSI cards + share the same interrupt as the DECchip based cards. The problem occurs + because the SCSI card wants to grab the interrupt as a fast interrupt + (runs the service routine with interrupts turned off) vs. this card + which really needs to run the service routine with interrupts turned on. + This driver will now add the interrupt service routine as a fast + interrupt if it is bounced from the slow interrupt. THIS IS NOT A + RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time + until people sort out their compatibility issues and the kernel + interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST + INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not + run on the same interrupt. PCMCIA/CardBus is another can of worms... + + Finally, I think I have really fixed the module loading problem with + more than one DECchip based card. As a side effect, I don't mess with + the device structure any more which means that if more than 1 card in + 2.0.x is installed (4 in 2.1.x), the user will have to edit + linux/drivers/net/Space.c to make room for them. Hence, module loading + is the preferred way to use this driver, since it doesn't have this + limitation. + + Where SROM media detection is used and full duplex is specified in the + SROM, the feature is ignored unless lp->params.fdx is set at compile + time OR during a module load (insmod de4x5 args='eth??:fdx' [see + below]). This is because there is no way to automatically detect full + duplex links except through autonegotiation. When I include the + autonegotiation feature in the SROM autoconf code, this detection will + occur automatically for that case. + + Command line arguments are now allowed, similar to passing arguments + through LILO. This will allow a per adapter board set up of full duplex + and media. The only lexical constraints are: the board name (dev->name) + appears in the list before its parameters. The list of parameters ends + either at the end of the parameter list or with another board name. The + following parameters are allowed: + + fdx for full duplex + autosense to set the media/speed; with the following + sub-parameters: + TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO + + Case sensitivity is important for the sub-parameters. They *must* be + upper case. Examples: + + insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'. + + For a compiled in driver, in linux/drivers/net/CONFIG, place e.g. + DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"' + + Yes, I know full duplex isn't permissible on BNC or AUI; they're just + examples. By default, full duplex is turned off and AUTO is the default + autosense setting. In reality, I expect only the full duplex option to + be used. Note the use of single quotes in the two examples above and the + lack of commas to separate items. diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt new file mode 100644 index 000000000..e12a4900c --- /dev/null +++ b/Documentation/networking/decnet.txt @@ -0,0 +1,232 @@ + Linux DECnet Networking Layer Information + =========================================== + +1) Other documentation.... + + o Project Home Pages + http://www.chygwyn.com/ - Kernel info + http://linux-decnet.sourceforge.net/ - Userland tools + http://www.sourceforge.net/projects/linux-decnet/ - Status page + +2) Configuring the kernel + +Be sure to turn on the following options: + + CONFIG_DECNET (obviously) + CONFIG_PROC_FS (to see what's going on) + CONFIG_SYSCTL (for easy configuration) + +if you want to try out router support (not properly debugged yet) +you'll need the following options as well... + + CONFIG_DECNET_ROUTER (to be able to add/delete routes) + CONFIG_NETFILTER (will be required for the DECnet routing daemon) + + CONFIG_DECNET_ROUTE_FWMARK is optional + +Don't turn on SIOCGIFCONF support for DECnet unless you are really sure +that you need it, in general you won't and it can cause ifconfig to +malfunction. + +Run time configuration has changed slightly from the 2.4 system. If you +want to configure an endnode, then the simplified procedure is as follows: + + o Set the MAC address on your ethernet card before starting _any_ other + network protocols. + +As soon as your network card is brought into the UP state, DECnet should +start working. If you need something more complicated or are unsure how +to set the MAC address, see the next section. Also all configurations which +worked with 2.4 will work under 2.5 with no change. + +3) Command line options + +You can set a DECnet address on the kernel command line for compatibility +with the 2.4 configuration procedure, but in general it's not needed any more. +If you do st a DECnet address on the command line, it has only one purpose +which is that its added to the addresses on the loopback device. + +With 2.4 kernels, DECnet would only recognise addresses as local if they +were added to the loopback device. In 2.5, any local interface address +can be used to loop back to the local machine. Of course this does not +prevent you adding further addresses to the loopback device if you +want to. + +N.B. Since the address list of an interface determines the addresses for +which "hello" messages are sent, if you don't set an address on the loopback +interface then you won't see any entries in /proc/net/neigh for the local +host until such time as you start a connection. This doesn't affect the +operation of the local communications in any other way though. + +The kernel command line takes options looking like the following: + + decnet.addr=1,2 + +the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels +and early 2.3.xx kernels, you must use a comma when specifying the +DECnet address like this. For more recent 2.3.xx kernels, you may +use almost any character except space, although a `.` would be the most +obvious choice :-) + +There used to be a third number specifying the node type. This option +has gone away in favour of a per interface node type. This is now set +using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be +set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. + +There are also equivalent options for modules. The node address can +also be set through the /proc/sys/net/decnet/ files, as can other system +parameters. + +Currently the only supported devices are ethernet and ip_gre. The +ethernet address of your ethernet card has to be set according to the DECnet +address of the node in order for it to be autoconfigured (and then appear in +/proc/net/decnet_dev). There is a utility available at the above +FTP sites called dn2ethaddr which can compute the correct ethernet +address to use. The address can be set by ifconfig either before or +at the time the device is brought up. If you are using RedHat you can +add the line: + + MACADDR=AA:00:04:00:03:04 + +or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or +wherever your network card's configuration lives. Setting the MAC address +of your ethernet card to an address starting with "hi-ord" will cause a +DECnet address which matches to be added to the interface (which you can +verify with iproute2). + +The default device for routing can be set through the /proc filesystem +by setting /proc/sys/net/decnet/default_device to the +device you want DECnet to route packets out of when no specific route +is available. Usually this will be eth0, for example: + + echo -n "eth0" >/proc/sys/net/decnet/default_device + +If you don't set the default device, then it will default to the first +ethernet card which has been autoconfigured as described above. You can +confirm that by looking in the default_device file of course. + +There is a list of what the other files under /proc/sys/net/decnet/ do +on the kernel patch web site (shown above). + +4) Run time kernel configuration + +This is either done through the sysctl/proc interface (see the kernel web +pages for details on what the various options do) or through the iproute2 +package in the same way as IPv4/6 configuration is performed. + +Documentation for iproute2 is included with the package, although there is +as yet no specific section on DECnet, most of the features apply to both +IP and DECnet, albeit with DECnet addresses instead of IP addresses and +a reduced functionality. + +If you want to configure a DECnet router you'll need the iproute2 package +since its the _only_ way to add and delete routes currently. Eventually +there will be a routing daemon to send and receive routing messages for +each interface and update the kernel routing tables accordingly. The +routing daemon will use netfilter to listen to routing packets, and +rtnetlink to update the kernels routing tables. + +The DECnet raw socket layer has been removed since it was there purely +for use by the routing daemon which will now use netfilter (a much cleaner +and more generic solution) instead. + +5) How can I tell if its working ? + +Here is a quick guide of what to look for in order to know if your DECnet +kernel subsystem is working. + + - Is the node address set (see /proc/sys/net/decnet/node_address) + - Is the node of the correct type + (see /proc/sys/net/decnet/conf/<dev>/forwarding) + - Is the Ethernet MAC address of each Ethernet card set to match + the DECnet address. If in doubt use the dn2ethaddr utility available + at the ftp archive. + - If the previous two steps are satisfied, and the Ethernet card is up, + you should find that it is listed in /proc/net/decnet_dev and also + that it appears as a directory in /proc/sys/net/decnet/conf/. The + loopback device (lo) should also appear and is required to communicate + within a node. + - If you have any DECnet routers on your network, they should appear + in /proc/net/decnet_neigh, otherwise this file will only contain the + entry for the node itself (if it doesn't check to see if lo is up). + - If you want to send to any node which is not listed in the + /proc/net/decnet_neigh file, you'll need to set the default device + to point to an Ethernet card with connection to a router. This is + again done with the /proc/sys/net/decnet/default_device file. + - Try starting a simple server and client, like the dnping/dnmirror + over the loopback interface. With luck they should communicate. + For this step and those after, you'll need the DECnet library + which can be obtained from the above ftp sites as well as the + actual utilities themselves. + - If this seems to work, then try talking to a node on your local + network, and see if you can obtain the same results. + - At this point you are on your own... :-) + +6) How to send a bug report + +If you've found a bug and want to report it, then there are several things +you can do to help me work out exactly what it is that is wrong. Useful +information (_most_ of which _is_ _essential_) includes: + + - What kernel version are you running ? + - What version of the patch are you running ? + - How far though the above set of tests can you get ? + - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? + - Which services are you running ? + - Which client caused the problem ? + - How much data was being transferred ? + - Was the network congested ? + - How can the problem be reproduced ? + - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of + tcpdump don't understand how to dump DECnet properly, so including + the hex listing of the packet contents is _essential_, usually the -x flag. + You may also need to increase the length grabbed with the -s flag. The + -e flag also provides very useful information (ethernet MAC addresses)) + +7) MAC FAQ + +A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet +interact and how to get the best performance from your hardware. + +Ethernet cards are designed to normally only pass received network frames +to a host computer when they are addressed to it, or to the broadcast address. + +Linux has an interface which allows the setting of extra addresses for +an ethernet card to listen to. If the ethernet card supports it, the +filtering operation will be done in hardware, if not the extra unwanted packets +received will be discarded by the host computer. In the latter case, +significant processor time and bus bandwidth can be used up on a busy +network (see the NAPI documentation for a longer explanation of these +effects). + +DECnet makes use of this interface to allow running DECnet on an ethernet +card which has already been configured using TCP/IP (presumably using the +built in MAC address of the card, as usual) and/or to allow multiple DECnet +addresses on each physical interface. If you do this, be aware that if your +ethernet card doesn't support perfect hashing in its MAC address filter +then your computer will be doing more work than required. Some cards +will simply set themselves into promiscuous mode in order to receive +packets from the DECnet specified addresses. So if you have one of these +cards its better to set the MAC address of the card as described above +to gain the best efficiency. Better still is to use a card which supports +NAPI as well. + + +8) Mailing list + +If you are keen to get involved in development, or want to ask questions +about configuration, or even just report bugs, then there is a mailing +list that you can join, details are at: + +http://sourceforge.net/mail/?group_id=4993 + +9) Legal Info + +The Linux DECnet project team have placed their code under the GPL. The +software is provided "as is" and without warranty express or implied. +DECnet is a trademark of Compaq. This software is not a product of +Compaq. We acknowledge the help of people at Compaq in providing extra +documentation above and beyond what was previously publicly available. + +Steve Whitehouse <SteveW@ACM.org> + diff --git a/Documentation/networking/dl2k.txt b/Documentation/networking/dl2k.txt new file mode 100644 index 000000000..cba74f7a3 --- /dev/null +++ b/Documentation/networking/dl2k.txt @@ -0,0 +1,282 @@ + + D-Link DL2000-based Gigabit Ethernet Adapter Installation + for Linux + May 23, 2002 + +Contents +======== + - Compatibility List + - Quick Install + - Compiling the Driver + - Installing the Driver + - Option parameter + - Configuration Script Sample + - Troubleshooting + + +Compatibility List +================= +Adapter Support: + +D-Link DGE-550T Gigabit Ethernet Adapter. +D-Link DGE-550SX Gigabit Ethernet Adapter. +D-Link DL2000-based Gigabit Ethernet Adapter. + + +The driver support Linux kernel 2.4.7 later. We had tested it +on the environments below. + + . Red Hat v6.2 (update kernel to 2.4.7) + . Red Hat v7.0 (update kernel to 2.4.7) + . Red Hat v7.1 (kernel 2.4.7) + . Red Hat v7.2 (kernel 2.4.7-10) + + +Quick Install +============= +Install linux driver as following command: + +1. make all +2. insmod dl2k.ko +3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0 + ^^^^^^^^^^^^^^^\ ^^^^^^^^\ + IP NETMASK +Now eth0 should active, you can test it by "ping" or get more information by +"ifconfig". If tested ok, continue the next step. + +4. cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net +5. Add the following line to /etc/modprobe.d/dl2k.conf: + alias eth0 dl2k +6. Run depmod to updated module indexes. +7. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0 + located at /etc/sysconfig/network-scripts or create it manually. + [see - Configuration Script Sample] +8. Driver will automatically load and configure at next boot time. + +Compiling the Driver +==================== + In Linux, NIC drivers are most commonly configured as loadable modules. +The approach of building a monolithic kernel has become obsolete. The driver +can be compiled as part of a monolithic kernel, but is strongly discouraged. +The remainder of this section assumes the driver is built as a loadable module. +In the Linux environment, it is a good idea to rebuild the driver from the +source instead of relying on a precompiled version. This approach provides +better reliability since a precompiled driver might depend on libraries or +kernel features that are not present in a given Linux installation. + +The 3 files necessary to build Linux device driver are dl2k.c, dl2k.h and +Makefile. To compile, the Linux installation must include the gcc compiler, +the kernel source, and the kernel headers. The Linux driver supports Linux +Kernels 2.4.7. Copy the files to a directory and enter the following command +to compile and link the driver: + +CD-ROM drive +------------ + +[root@XXX /] mkdir cdrom +[root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom +[root@XXX /] cd root +[root@XXX /root] mkdir dl2k +[root@XXX /root] cd dl2k +[root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k +[root@XXX dl2k] tar xfvz dl2k.tgz +[root@XXX dl2k] make all + +Floppy disc drive +----------------- + +[root@XXX /] cd root +[root@XXX /root] mkdir dl2k +[root@XXX /root] cd dl2k +[root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k +[root@XXX dl2k] tar xfvz dl2k.tgz +[root@XXX dl2k] make all + +Installing the Driver +===================== + + Manual Installation + ------------------- + Once the driver has been compiled, it must be loaded, enabled, and bound + to a protocol stack in order to establish network connectivity. To load a + module enter the command: + + insmod dl2k.o + + or + + insmod dl2k.o <optional parameter> ; add parameter + + =============================================================== + example: insmod dl2k.o media=100mbps_hd + or insmod dl2k.o media=3 + or insmod dl2k.o media=3,2 ; for 2 cards + =============================================================== + + Please reference the list of the command line parameters supported by + the Linux device driver below. + + The insmod command only loads the driver and gives it a name of the form + eth0, eth1, etc. To bring the NIC into an operational state, + it is necessary to issue the following command: + + ifconfig eth0 up + + Finally, to bind the driver to the active protocol (e.g., TCP/IP with + Linux), enter the following command: + + ifup eth0 + + Note that this is meaningful only if the system can find a configuration + script that contains the necessary network information. A sample will be + given in the next paragraph. + + The commands to unload a driver are as follows: + + ifdown eth0 + ifconfig eth0 down + rmmod dl2k.o + + The following are the commands to list the currently loaded modules and + to see the current network configuration. + + lsmod + ifconfig + + + Automated Installation + ---------------------- + This section describes how to install the driver such that it is + automatically loaded and configured at boot time. The following description + is based on a Red Hat 6.0/7.0 distribution, but it can easily be ported to + other distributions as well. + + Red Hat v6.x/v7.x + ----------------- + 1. Copy dl2k.o to the network modules directory, typically + /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net. + 2. Locate the boot module configuration file, most commonly in the + /etc/modprobe.d/ directory. Add the following lines: + + alias ethx dl2k + options dl2k <optional parameters> + + where ethx will be eth0 if the NIC is the only ethernet adapter, eth1 if + one other ethernet adapter is installed, etc. Refer to the table in the + previous section for the list of optional parameters. + 3. Locate the network configuration scripts, normally the + /etc/sysconfig/network-scripts directory, and create a configuration + script named ifcfg-ethx that contains network information. + 4. Note that for most Linux distributions, Red Hat included, a configuration + utility with a graphical user interface is provided to perform steps 2 + and 3 above. + + +Parameter Description +===================== +You can install this driver without any additional parameter. However, if you +are going to have extensive functions then it is necessary to set extra +parameter. Below is a list of the command line parameters supported by the +Linux device +driver. + +mtu=packet_size - Specifies the maximum packet size. default + is 1500. + +media=media_type - Specifies the media type the NIC operates at. + autosense Autosensing active media. + 10mbps_hd 10Mbps half duplex. + 10mbps_fd 10Mbps full duplex. + 100mbps_hd 100Mbps half duplex. + 100mbps_fd 100Mbps full duplex. + 1000mbps_fd 1000Mbps full duplex. + 1000mbps_hd 1000Mbps half duplex. + 0 Autosensing active media. + 1 10Mbps half duplex. + 2 10Mbps full duplex. + 3 100Mbps half duplex. + 4 100Mbps full duplex. + 5 1000Mbps half duplex. + 6 1000Mbps full duplex. + + By default, the NIC operates at autosense. + 1000mbps_fd and 1000mbps_hd types are only + available for fiber adapter. + +vlan=n - Specifies the VLAN ID. If vlan=0, the + Virtual Local Area Network (VLAN) function is + disable. + +jumbo=[0|1] - Specifies the jumbo frame support. If jumbo=1, + the NIC accept jumbo frames. By default, this + function is disabled. + Jumbo frame usually improve the performance + int gigabit. + This feature need jumbo frame compatible + remote. + +rx_coalesce=m - Number of rx frame handled each interrupt. +rx_timeout=n - Rx DMA wait time for an interrupt. + If set rx_coalesce > 0, hardware only assert + an interrupt for m frames. Hardware won't + assert rx interrupt until m frames received or + reach timeout of n * 640 nano seconds. + Set proper rx_coalesce and rx_timeout can + reduce congestion collapse and overload which + has been a bottleneck for high speed network. + + For example, rx_coalesce=10 rx_timeout=800. + that is, hardware assert only 1 interrupt + for 10 frames received or timeout of 512 us. + +tx_coalesce=n - Number of tx frame handled each interrupt. + Set n > 1 can reduce the interrupts + congestion usually lower performance of + high speed network card. Default is 16. + +tx_flow=[1|0] - Specifies the Tx flow control. If tx_flow=0, + the Tx flow control disable else driver + autodetect. +rx_flow=[1|0] - Specifies the Rx flow control. If rx_flow=0, + the Rx flow control enable else driver + autodetect. + + +Configuration Script Sample +=========================== +Here is a sample of a simple configuration script: + +DEVICE=eth0 +USERCTL=no +ONBOOT=yes +POOTPROTO=none +BROADCAST=207.200.5.255 +NETWORK=207.200.5.0 +NETMASK=255.255.255.0 +IPADDR=207.200.5.2 + + +Troubleshooting +=============== +Q1. Source files contain ^ M behind every line. + Make sure all files are Unix file format (no LF). Try the following + shell command to convert files. + + cat dl2k.c | col -b > dl2k.tmp + mv dl2k.tmp dl2k.c + + OR + + cat dl2k.c | tr -d "\r" > dl2k.tmp + mv dl2k.tmp dl2k.c + +Q2: Could not find header files (*.h) ? + To compile the driver, you need kernel header files. After + installing the kernel source, the header files are usually located in + /usr/src/linux/include, which is the default include directory configured + in Makefile. For some distributions, there is a copy of header files in + /usr/src/include/linux and /usr/src/include/asm, that you can change the + INCLUDEDIR in Makefile to /usr/include without installing kernel source. + Note that RH 7.0 didn't provide correct header files in /usr/include, + including those files will make a wrong version driver. + diff --git a/Documentation/networking/dm9000.txt b/Documentation/networking/dm9000.txt new file mode 100644 index 000000000..5552e2e57 --- /dev/null +++ b/Documentation/networking/dm9000.txt @@ -0,0 +1,167 @@ +DM9000 Network driver +===================== + +Copyright 2008 Simtec Electronics, + Ben Dooks <ben@simtec.co.uk> <ben-linux@fluff.org> + + +Introduction +------------ + +This file describes how to use the DM9000 platform-device based network driver +that is contained in the files drivers/net/dm9000.c and drivers/net/dm9000.h. + +The driver supports three DM9000 variants, the DM9000E which is the first chip +supported as well as the newer DM9000A and DM9000B devices. It is currently +maintained and tested by Ben Dooks, who should be CC: to any patches for this +driver. + + +Defining the platform device +---------------------------- + +The minimum set of resources attached to the platform device are as follows: + + 1) The physical address of the address register + 2) The physical address of the data register + 3) The IRQ line the device's interrupt pin is connected to. + +These resources should be specified in that order, as the ordering of the +two address regions is important (the driver expects these to be address +and then data). + +An example from arch/arm/mach-s3c2410/mach-bast.c is: + +static struct resource bast_dm9k_resource[] = { + [0] = { + .start = S3C2410_CS5 + BAST_PA_DM9000, + .end = S3C2410_CS5 + BAST_PA_DM9000 + 3, + .flags = IORESOURCE_MEM, + }, + [1] = { + .start = S3C2410_CS5 + BAST_PA_DM9000 + 0x40, + .end = S3C2410_CS5 + BAST_PA_DM9000 + 0x40 + 0x3f, + .flags = IORESOURCE_MEM, + }, + [2] = { + .start = IRQ_DM9000, + .end = IRQ_DM9000, + .flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL, + } +}; + +static struct platform_device bast_device_dm9k = { + .name = "dm9000", + .id = 0, + .num_resources = ARRAY_SIZE(bast_dm9k_resource), + .resource = bast_dm9k_resource, +}; + +Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags, +as this will generate a warning if it is not present. The trigger from +the flags field will be passed to request_irq() when registering the IRQ +handler to ensure that the IRQ is setup correctly. + +This shows a typical platform device, without the optional configuration +platform data supplied. The next example uses the same resources, but adds +the optional platform data to pass extra configuration data: + +static struct dm9000_plat_data bast_dm9k_platdata = { + .flags = DM9000_PLATF_16BITONLY, +}; + +static struct platform_device bast_device_dm9k = { + .name = "dm9000", + .id = 0, + .num_resources = ARRAY_SIZE(bast_dm9k_resource), + .resource = bast_dm9k_resource, + .dev = { + .platform_data = &bast_dm9k_platdata, + } +}; + +The platform data is defined in include/linux/dm9000.h and described below. + + +Platform data +------------- + +Extra platform data for the DM9000 can describe the IO bus width to the +device, whether or not an external PHY is attached to the device and +the availability of an external configuration EEPROM. + +The flags for the platform data .flags field are as follows: + +DM9000_PLATF_8BITONLY + + The IO should be done with 8bit operations. + +DM9000_PLATF_16BITONLY + + The IO should be done with 16bit operations. + +DM9000_PLATF_32BITONLY + + The IO should be done with 32bit operations. + +DM9000_PLATF_EXT_PHY + + The chip is connected to an external PHY. + +DM9000_PLATF_NO_EEPROM + + This can be used to signify that the board does not have an + EEPROM, or that the EEPROM should be hidden from the user. + +DM9000_PLATF_SIMPLE_PHY + + Switch to using the simpler PHY polling method which does not + try and read the MII PHY state regularly. This is only available + when using the internal PHY. See the section on link state polling + for more information. + + The config symbol DM9000_FORCE_SIMPLE_PHY_POLL, Kconfig entry + "Force simple NSR based PHY polling" allows this flag to be + forced on at build time. + + +PHY Link state polling +---------------------- + +The driver keeps track of the link state and informs the network core +about link (carrier) availability. This is managed by several methods +depending on the version of the chip and on which PHY is being used. + +For the internal PHY, the original (and currently default) method is +to read the MII state, either when the status changes if we have the +necessary interrupt support in the chip or every two seconds via a +periodic timer. + +To reduce the overhead for the internal PHY, there is now the option +of using the DM9000_FORCE_SIMPLE_PHY_POLL config, or DM9000_PLATF_SIMPLE_PHY +platform data option to read the summary information without the +expensive MII accesses. This method is faster, but does not print +as much information. + +When using an external PHY, the driver currently has to poll the MII +link status as there is no method for getting an interrupt on link change. + + +DM9000A / DM9000B +----------------- + +These chips are functionally similar to the DM9000E and are supported easily +by the same driver. The features are: + + 1) Interrupt on internal PHY state change. This means that the periodic + polling of the PHY status may be disabled on these devices when using + the internal PHY. + + 2) TCP/UDP checksum offloading, which the driver does not currently support. + + +ethtool +------- + +The driver supports the ethtool interface for access to the driver +state information, the PHY state and the EEPROM. diff --git a/Documentation/networking/dmfe.txt b/Documentation/networking/dmfe.txt new file mode 100644 index 000000000..25320bf19 --- /dev/null +++ b/Documentation/networking/dmfe.txt @@ -0,0 +1,66 @@ +Note: This driver doesn't have a maintainer. + +Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux. + +This program is free software; you can redistribute it and/or +modify it under the terms of the GNU General Public License +as published by the Free Software Foundation; either version 2 +of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + + +This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET +10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you +didn't compile this driver as a module, it will automatically load itself on boot and print a +line similar to : + + dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17) + +If you compiled this driver as a module, you have to load it on boot.You can load it with command : + + insmod dmfe + +This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass +a mode= setting to module while loading, like : + + insmod dmfe mode=0 # Force 10M Half Duplex + insmod dmfe mode=1 # Force 100M Half Duplex + insmod dmfe mode=4 # Force 10M Full Duplex + insmod dmfe mode=5 # Force 100M Full Duplex + +Next you should configure your network interface with a command similar to : + + ifconfig eth0 172.22.3.18 + ^^^^^^^^^^^ + Your IP Address + +Then you may have to modify the default routing table with command : + + route add default eth0 + + +Now your ethernet card should be up and running. + + +TODO: + +Implement pci_driver::suspend() and pci_driver::resume() power management methods. +Check on 64 bit boxes. +Check and fix on big endian boxes. +Test and make sure PCI latency is now correct for all cases. + + +Authors: + +Sten Wang <sten_wang@davicom.com.tw > : Original Author + +Contributors: + +Marcelo Tosatti <marcelo@conectiva.com.br> +Alan Cox <alan@lxorguk.ukuu.org.uk> +Jeff Garzik <jgarzik@pobox.com> +Vojtech Pavlik <vojtech@suse.cz> diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt new file mode 100644 index 000000000..eaa8f9a6f --- /dev/null +++ b/Documentation/networking/dns_resolver.txt @@ -0,0 +1,157 @@ + =================== + DNS Resolver Module + =================== + +Contents: + + - Overview. + - Compilation. + - Setting up. + - Usage. + - Mechanism. + - Debugging. + + +======== +OVERVIEW +======== + +The DNS resolver module provides a way for kernel services to make DNS queries +by way of requesting a key of key type dns_resolver. These queries are +upcalled to userspace through /sbin/request-key. + +These routines must be supported by userspace tools dns.upcall, cifs.upcall and +request-key. It is under development and does not yet provide the full feature +set. The features it does support include: + + (*) Implements the dns_resolver key_type to contact userspace. + +It does not yet support the following AFS features: + + (*) Dns query support for AFSDB resource record. + +This code is extracted from the CIFS filesystem. + + +=========== +COMPILATION +=========== + +The module should be enabled by turning on the kernel configuration options: + + CONFIG_DNS_RESOLVER - tristate "DNS Resolver support" + + +========== +SETTING UP +========== + +To set up this facility, the /etc/request-key.conf file must be altered so that +/sbin/request-key can appropriately direct the upcalls. For example, to handle +basic dname to IPv4/IPv6 address resolution, the following line should be +added: + + #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ... + #====== ============ ======= ======= ========================== + create dns_resolver * * /usr/sbin/cifs.upcall %k + +To direct a query for query type 'foo', a line of the following should be added +before the more general line given above as the first match is the one taken. + + create dns_resolver foo:* * /usr/sbin/dns.foo %k + + +===== +USAGE +===== + +To make use of this facility, one of the following functions that are +implemented in the module can be called after doing: + + #include <linux/dns_resolver.h> + + (1) int dns_query(const char *type, const char *name, size_t namelen, + const char *options, char **_result, time_t *_expiry); + + This is the basic access function. It looks for a cached DNS query and if + it doesn't find it, it upcalls to userspace to make a new DNS query, which + may then be cached. The key description is constructed as a string of the + form: + + [<type>:]<name> + + where <type> optionally specifies the particular upcall program to invoke, + and thus the type of query to do, and <name> specifies the string to be + looked up. The default query type is a straight hostname to IP address + set lookup. + + The name parameter is not required to be a NUL-terminated string, and its + length should be given by the namelen argument. + + The options parameter may be NULL or it may be a set of options + appropriate to the query type. + + The return value is a string appropriate to the query type. For instance, + for the default query type it is just a list of comma-separated IPv4 and + IPv6 addresses. The caller must free the result. + + The length of the result string is returned on success, and a negative + error code is returned otherwise. -EKEYREJECTED will be returned if the + DNS lookup failed. + + If _expiry is non-NULL, the expiry time (TTL) of the result will be + returned also. + +The kernel maintains an internal keyring in which it caches looked up keys. +This can be cleared by any process that has the CAP_SYS_ADMIN capability by +the use of KEYCTL_KEYRING_CLEAR on the keyring ID. + + +=============================== +READING DNS KEYS FROM USERSPACE +=============================== + +Keys of dns_resolver type can be read from userspace using keyctl_read() or +"keyctl read/print/pipe". + + +========= +MECHANISM +========= + +The dnsresolver module registers a key type called "dns_resolver". Keys of +this type are used to transport and cache DNS lookup results from userspace. + +When dns_query() is invoked, it calls request_key() to search the local +keyrings for a cached DNS result. If that fails to find one, it upcalls to +userspace to get a new result. + +Upcalls to userspace are made through the request_key() upcall vector, and are +directed by means of configuration lines in /etc/request-key.conf that tell +/sbin/request-key what program to run to instantiate the key. + +The upcall handler program is responsible for querying the DNS, processing the +result into a form suitable for passing to the keyctl_instantiate_key() +routine. This then passes the data to dns_resolver_instantiate() which strips +off and processes any options included in the data, and then attaches the +remainder of the string to the key as its payload. + +The upcall handler program should set the expiry time on the key to that of the +lowest TTL of all the records it has extracted a result from. This means that +the key will be discarded and recreated when the data it holds has expired. + +dns_query() returns a copy of the value attached to the key, or an error if +that is indicated instead. + +See <file:Documentation/security/keys/request-key.rst> for further +information about request-key function. + + +========= +DEBUGGING +========= + +Debugging messages can be turned on dynamically by writing a 1 into the +following file: + + /sys/module/dnsresolver/parameters/debug diff --git a/Documentation/networking/dpaa.txt b/Documentation/networking/dpaa.txt new file mode 100644 index 000000000..f88194f71 --- /dev/null +++ b/Documentation/networking/dpaa.txt @@ -0,0 +1,260 @@ +The QorIQ DPAA Ethernet Driver +============================== + +Authors: +Madalin Bucur <madalin.bucur@nxp.com> +Camelia Groza <camelia.groza@nxp.com> + +Contents +======== + + - DPAA Ethernet Overview + - DPAA Ethernet Supported SoCs + - Configuring DPAA Ethernet in your kernel + - DPAA Ethernet Frame Processing + - DPAA Ethernet Features + - DPAA IRQ Affinity and Receive Side Scaling + - Debugging + +DPAA Ethernet Overview +====================== + +DPAA stands for Data Path Acceleration Architecture and it is a +set of networking acceleration IPs that are available on several +generations of SoCs, both on PowerPC and ARM64. + +The Freescale DPAA architecture consists of a series of hardware blocks +that support Ethernet connectivity. The Ethernet driver depends upon the +following drivers in the Linux kernel: + + - Peripheral Access Memory Unit (PAMU) (* needed only for PPC platforms) + drivers/iommu/fsl_* + - Frame Manager (FMan) + drivers/net/ethernet/freescale/fman + - Queue Manager (QMan), Buffer Manager (BMan) + drivers/soc/fsl/qbman + +A simplified view of the dpaa_eth interfaces mapped to FMan MACs: + + dpaa_eth /eth0\ ... /ethN\ + driver | | | | + ------------- ---- ----------- ---- ------------- + -Ports / Tx Rx \ ... / Tx Rx \ + FMan | | | | + -MACs | MAC0 | | MACN | + / dtsec0 \ ... / dtsecN \ (or tgec) + / \ / \(or memac) + --------- -------------- --- -------------- --------- + FMan, FMan Port, FMan SP, FMan MURAM drivers + --------------------------------------------------------- + FMan HW blocks: MURAM, MACs, Ports, SP + --------------------------------------------------------- + +The dpaa_eth relation to the QMan, BMan and FMan: + ________________________________ + dpaa_eth / eth0 \ + driver / \ + --------- -^- -^- -^- --- --------- + QMan driver / \ / \ / \ \ / | BMan | + |Rx | |Rx | |Tx | |Tx | | driver | + --------- |Dfl| |Err| |Cnf| |FQs| | | + QMan HW |FQ | |FQ | |FQs| | | | | + / \ / \ / \ \ / | | + --------- --- --- --- -v- --------- + | FMan QMI | | + | FMan HW FMan BMI | BMan HW | + ----------------------- -------- + +where the acronyms used above (and in the code) are: +DPAA = Data Path Acceleration Architecture +FMan = DPAA Frame Manager +QMan = DPAA Queue Manager +BMan = DPAA Buffers Manager +QMI = QMan interface in FMan +BMI = BMan interface in FMan +FMan SP = FMan Storage Profiles +MURAM = Multi-user RAM in FMan +FQ = QMan Frame Queue +Rx Dfl FQ = default reception FQ +Rx Err FQ = Rx error frames FQ +Tx Cnf FQ = Tx confirmation FQs +Tx FQs = transmission frame queues +dtsec = datapath three speed Ethernet controller (10/100/1000 Mbps) +tgec = ten gigabit Ethernet controller (10 Gbps) +memac = multirate Ethernet MAC (10/100/1000/10000) + +DPAA Ethernet Supported SoCs +============================ + +The DPAA drivers enable the Ethernet controllers present on the following SoCs: + +# PPC +P1023 +P2041 +P3041 +P4080 +P5020 +P5040 +T1023 +T1024 +T1040 +T1042 +T2080 +T4240 +B4860 + +# ARM +LS1043A +LS1046A + +Configuring DPAA Ethernet in your kernel +======================================== + +To enable the DPAA Ethernet driver, the following Kconfig options are required: + +# common for arch/arm64 and arch/powerpc platforms +CONFIG_FSL_DPAA=y +CONFIG_FSL_FMAN=y +CONFIG_FSL_DPAA_ETH=y +CONFIG_FSL_XGMAC_MDIO=y + +# for arch/powerpc only +CONFIG_FSL_PAMU=y + +# common options needed for the PHYs used on the RDBs +CONFIG_VITESSE_PHY=y +CONFIG_REALTEK_PHY=y +CONFIG_AQUANTIA_PHY=y + +DPAA Ethernet Frame Processing +============================== + +On Rx, buffers for the incoming frames are retrieved from one of the three +existing buffers pools. The driver initializes and seeds these, each with +buffers of different sizes: 1KB, 2KB and 4KB. + +On Tx, all transmitted frames are returned to the driver through Tx +confirmation frame queues. The driver is then responsible for freeing the +buffers. In order to do this properly, a backpointer is added to the buffer +before transmission that points to the skb. When the buffer returns to the +driver on a confirmation FQ, the skb can be correctly consumed. + +DPAA Ethernet Features +====================== + +Currently the DPAA Ethernet driver enables the basic features required for +a Linux Ethernet driver. The support for advanced features will be added +gradually. + +The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx +checksum offload feature is enabled by default and cannot be controlled through +ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS +provides a big performance boost for the forwarding scenarios, allowing +different traffic flows received by one interface to be processed by different +CPUs in parallel. + +The driver has support for multiple prioritized Tx traffic classes. Priorities +range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with +strict priority levels. Each traffic class contains NR_CPU TX queues. By +default, only one traffic class is enabled and the lowest priority Tx queues +are used. Higher priority traffic classes can be enabled with the mqprio +qdisc. For example, all four traffic classes are enabled on an interface with +the following command. Furthermore, skb priority levels are mapped to traffic +classes as follows: + + * priorities 0 to 3 - traffic class 0 (low priority) + * priorities 4 to 7 - traffic class 1 (medium-low priority) + * priorities 8 to 11 - traffic class 2 (medium-high priority) + * priorities 12 to 15 - traffic class 3 (high priority) + +tc qdisc add dev <int> root handle 1: \ + mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1 + +DPAA IRQ Affinity and Receive Side Scaling +========================================== + +Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation +queues is seen by the CPU as ingress traffic on a certain portal. +The DPAA QMan portal interrupts are affined each to a certain CPU. +The same portal interrupt services all the QMan portal consumers. + +By default the DPAA Ethernet driver enables RSS, making use of the +DPAA FMan Parser and Keygen blocks to distribute traffic on 128 +hardware frame queues using a hash on IP v4/v6 source and destination +and L4 source and destination ports, in present in the received frame. +When RSS is disabled, all traffic received by a certain interface is +received on the default Rx frame queue. The default DPAA Rx frame +queues are configured to put the received traffic into a pool channel +that allows any available CPU portal to dequeue the ingress traffic. +The default frame queues have the HOLDACTIVE option set, ensuring that +traffic bursts from a certain queue are serviced by the same CPU. +This ensures a very low rate of frame reordering. A drawback of this +is that only one CPU at a time can service the traffic received by a +certain interface when RSS is not enabled. + +To implement RSS, the DPAA Ethernet driver allocates an extra set of +128 Rx frame queues that are configured to dedicated channels, in a +round-robin manner. The mapping of the frame queues to CPUs is now +hardcoded, there is no indirection table to move traffic for a certain +FQ (hash result) to another CPU. The ingress traffic arriving on one +of these frame queues will arrive at the same portal and will always +be processed by the same CPU. This ensures intra-flow order preservation +and workload distribution for multiple traffic flows. + +RSS can be turned off for a certain interface using ethtool, i.e. + + # ethtool -N fm1-mac9 rx-flow-hash tcp4 "" + +To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6: + + # ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn + +There is no independent control for individual protocols, any command +run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is +going to control the rx-flow-hashing for all protocols on that interface. + +Besides using the FMan Keygen computed hash for spreading traffic on the +128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when +the NETIF_F_RXHASH feature is on (active by default). This can be turned +on or off through ethtool, i.e.: + + # ethtool -K fm1-mac9 rx-hashing off + # ethtool -k fm1-mac9 | grep hash + receive-hashing: off + # ethtool -K fm1-mac9 rx-hashing on + Actual changes: + receive-hashing: on + # ethtool -k fm1-mac9 | grep hash + receive-hashing: on + +Please note that Rx hashing depends upon the rx-flow-hashing being on +for that interface - turning off rx-flow-hashing will also disable the +rx-hashing (without ethtool reporting it as off as that depends on the +NETIF_F_RXHASH feature flag). + +Debugging +========= + +The following statistics are exported for each interface through ethtool: + + - interrupt count per CPU + - Rx packets count per CPU + - Tx packets count per CPU + - Tx confirmed packets count per CPU + - Tx S/G frames count per CPU + - Tx error count per CPU + - Rx error count per CPU + - Rx error count per type + - congestion related statistics: + - congestion status + - time spent in congestion + - number of time the device entered congestion + - dropped packets count per cause + +The driver also exports the following information in sysfs: + + - the FQ IDs for each FQ type + /sys/devices/platform/dpaa-ethernet.0/net/<int>/fqids + + - the IDs of the buffer pools in use + /sys/devices/platform/dpaa-ethernet.0/net/<int>/bpids diff --git a/Documentation/networking/dpaa2/dpio-driver.rst b/Documentation/networking/dpaa2/dpio-driver.rst new file mode 100644 index 000000000..135881041 --- /dev/null +++ b/Documentation/networking/dpaa2/dpio-driver.rst @@ -0,0 +1,158 @@ +.. include:: <isonum.txt> + +DPAA2 DPIO (Data Path I/O) Overview +=================================== + +:Copyright: |copy| 2016-2018 NXP + +This document provides an overview of the Freescale DPAA2 DPIO +drivers + +Introduction +============ + +A DPAA2 DPIO (Data Path I/O) is a hardware object that provides +interfaces to enqueue and dequeue frames to/from network interfaces +and other accelerators. A DPIO also provides hardware buffer +pool management for network interfaces. + +This document provides an overview the Linux DPIO driver, its +subcomponents, and its APIs. + +See Documentation/networking/dpaa2/overview.rst for a general overview of DPAA2 +and the general DPAA2 driver architecture in Linux. + +Driver Overview +--------------- + +The DPIO driver is bound to DPIO objects discovered on the fsl-mc bus and +provides services that: + A) allow other drivers, such as the Ethernet driver, to enqueue and dequeue + frames for their respective objects + B) allow drivers to register callbacks for data availability notifications + when data becomes available on a queue or channel + C) allow drivers to manage hardware buffer pools + +The Linux DPIO driver consists of 3 primary components-- + DPIO object driver-- fsl-mc driver that manages the DPIO object + + DPIO service-- provides APIs to other Linux drivers for services + + QBman portal interface-- sends portal commands, gets responses +:: + + fsl-mc other + bus drivers + | | + +---+----+ +------+-----+ + |DPIO obj| |DPIO service| + | driver |---| (DPIO) | + +--------+ +------+-----+ + | + +------+-----+ + | QBman | + | portal i/f | + +------------+ + | + hardware + + +The diagram below shows how the DPIO driver components fit with the other +DPAA2 Linux driver components:: + +------------+ + | OS Network | + | Stack | + +------------+ +------------+ + | Allocator |. . . . . . . | Ethernet | + |(DPMCP,DPBP)| | (DPNI) | + +-.----------+ +---+---+----+ + . . ^ | + . . <data avail, | |<enqueue, + . . tx confirm> | | dequeue> + +-------------+ . | | + | DPRC driver | . +--------+ +------------+ + | (DPRC) | . . |DPIO obj| |DPIO service| + +----------+--+ | driver |-| (DPIO) | + | +--------+ +------+-----+ + |<dev add/remove> +------|-----+ + | | QBman | + +----+--------------+ | portal i/f | + | MC-bus driver | +------------+ + | | | + | /soc/fsl-mc | | + +-------------------+ | + | + =========================================|=========|======================== + +-+--DPIO---|-----------+ + | | | + | QBman Portal | + +-----------------------+ + + ============================================================================ + + +DPIO Object Driver (dpio-driver.c) +---------------------------------- + + The dpio-driver component registers with the fsl-mc bus to handle objects of + type "dpio". The implementation of probe() handles basic initialization + of the DPIO including mapping of the DPIO regions (the QBman SW portal) + and initializing interrupts and registering irq handlers. The dpio-driver + registers the probed DPIO with dpio-service. + +DPIO service (dpio-service.c, dpaa2-io.h) +------------------------------------------ + + The dpio service component provides queuing, notification, and buffers + management services to DPAA2 drivers, such as the Ethernet driver. A system + will typically allocate 1 DPIO object per CPU to allow queuing operations + to happen simultaneously across all CPUs. + + Notification handling + dpaa2_io_service_register() + + dpaa2_io_service_deregister() + + dpaa2_io_service_rearm() + + Queuing + dpaa2_io_service_pull_fq() + + dpaa2_io_service_pull_channel() + + dpaa2_io_service_enqueue_fq() + + dpaa2_io_service_enqueue_qd() + + dpaa2_io_store_create() + + dpaa2_io_store_destroy() + + dpaa2_io_store_next() + + Buffer pool management + dpaa2_io_service_release() + + dpaa2_io_service_acquire() + +QBman portal interface (qbman-portal.c) +--------------------------------------- + + The qbman-portal component provides APIs to do the low level hardware + bit twiddling for operations such as: + -initializing Qman software portals + + -building and sending portal commands + + -portal interrupt configuration and processing + + The qbman-portal APIs are not public to other drivers, and are + only used by dpio-service. + +Other (dpaa2-fd.h, dpaa2-global.h) +---------------------------------- + + Frame descriptor and scatter-gather definitions and the APIs used to + manipulate them are defined in dpaa2-fd.h. + + Dequeue result struct and parsing APIs are defined in dpaa2-global.h. diff --git a/Documentation/networking/dpaa2/index.rst b/Documentation/networking/dpaa2/index.rst new file mode 100644 index 000000000..10bea113a --- /dev/null +++ b/Documentation/networking/dpaa2/index.rst @@ -0,0 +1,9 @@ +=================== +DPAA2 Documentation +=================== + +.. toctree:: + :maxdepth: 1 + + overview + dpio-driver diff --git a/Documentation/networking/dpaa2/overview.rst b/Documentation/networking/dpaa2/overview.rst new file mode 100644 index 000000000..d638b5a8a --- /dev/null +++ b/Documentation/networking/dpaa2/overview.rst @@ -0,0 +1,405 @@ +.. include:: <isonum.txt> + +========================================================= +DPAA2 (Data Path Acceleration Architecture Gen2) Overview +========================================================= + +:Copyright: |copy| 2015 Freescale Semiconductor Inc. +:Copyright: |copy| 2018 NXP + +This document provides an overview of the Freescale DPAA2 architecture +and how it is integrated into the Linux kernel. + +Introduction +============ + +DPAA2 is a hardware architecture designed for high-speeed network +packet processing. DPAA2 consists of sophisticated mechanisms for +processing Ethernet packets, queue management, buffer management, +autonomous L2 switching, virtual Ethernet bridging, and accelerator +(e.g. crypto) sharing. + +A DPAA2 hardware component called the Management Complex (or MC) manages the +DPAA2 hardware resources. The MC provides an object-based abstraction for +software drivers to use the DPAA2 hardware. +The MC uses DPAA2 hardware resources such as queues, buffer pools, and +network ports to create functional objects/devices such as network +interfaces, an L2 switch, or accelerator instances. +The MC provides memory-mapped I/O command interfaces (MC portals) +which DPAA2 software drivers use to operate on DPAA2 objects. + +The diagram below shows an overview of the DPAA2 resource management +architecture:: + + +--------------------------------------+ + | OS | + | DPAA2 drivers | + | | | + +-----------------------------|--------+ + | + | (create,discover,connect + | config,use,destroy) + | + DPAA2 | + +------------------------| mc portal |-+ + | | | + | +- - - - - - - - - - - - -V- - -+ | + | | | | + | | Management Complex (MC) | | + | | | | + | +- - - - - - - - - - - - - - - -+ | + | | + | Hardware Hardware | + | Resources Objects | + | --------- ------- | + | -queues -DPRC | + | -buffer pools -DPMCP | + | -Eth MACs/ports -DPIO | + | -network interface -DPNI | + | profiles -DPMAC | + | -queue portals -DPBP | + | -MC portals ... | + | ... | + | | + +--------------------------------------+ + + +The MC mediates operations such as create, discover, +connect, configuration, and destroy. Fast-path operations +on data, such as packet transmit/receive, are not mediated by +the MC and are done directly using memory mapped regions in +DPIO objects. + +Overview of DPAA2 Objects +========================= + +The section provides a brief overview of some key DPAA2 objects. +A simple scenario is described illustrating the objects involved +in creating a network interfaces. + +DPRC (Datapath Resource Container) +---------------------------------- + +A DPRC is a container object that holds all the other +types of DPAA2 objects. In the example diagram below there +are 8 objects of 5 types (DPMCP, DPIO, DPBP, DPNI, and DPMAC) +in the container. + +:: + + +---------------------------------------------------------+ + | DPRC | + | | + | +-------+ +-------+ +-------+ +-------+ +-------+ | + | | DPMCP | | DPIO | | DPBP | | DPNI | | DPMAC | | + | +-------+ +-------+ +-------+ +---+---+ +---+---+ | + | | DPMCP | | DPIO | | + | +-------+ +-------+ | + | | DPMCP | | + | +-------+ | + | | + +---------------------------------------------------------+ + +From the point of view of an OS, a DPRC behaves similar to a plug and +play bus, like PCI. DPRC commands can be used to enumerate the contents +of the DPRC, discover the hardware objects present (including mappable +regions and interrupts). + +:: + + DPRC.1 (bus) + | + +--+--------+-------+-------+-------+ + | | | | | + DPMCP.1 DPIO.1 DPBP.1 DPNI.1 DPMAC.1 + DPMCP.2 DPIO.2 + DPMCP.3 + +Hardware objects can be created and destroyed dynamically, providing +the ability to hot plug/unplug objects in and out of the DPRC. + +A DPRC has a mappable MMIO region (an MC portal) that can be used +to send MC commands. It has an interrupt for status events (like +hotplug). +All objects in a container share the same hardware "isolation context". +This means that with respect to an IOMMU the isolation granularity +is at the DPRC (container) level, not at the individual object +level. + +DPRCs can be defined statically and populated with objects +via a config file passed to the MC when firmware starts it. + +DPAA2 Objects for an Ethernet Network Interface +----------------------------------------------- + +A typical Ethernet NIC is monolithic-- the NIC device contains TX/RX +queuing mechanisms, configuration mechanisms, buffer management, +physical ports, and interrupts. DPAA2 uses a more granular approach +utilizing multiple hardware objects. Each object provides specialized +functions. Groups of these objects are used by software to provide +Ethernet network interface functionality. This approach provides +efficient use of finite hardware resources, flexibility, and +performance advantages. + +The diagram below shows the objects needed for a simple +network interface configuration on a system with 2 CPUs. + +:: + + +---+---+ +---+---+ + CPU0 CPU1 + +---+---+ +---+---+ + | | + +---+---+ +---+---+ + DPIO DPIO + +---+---+ +---+---+ + \ / + \ / + \ / + +---+---+ + DPNI --- DPBP,DPMCP + +---+---+ + | + | + +---+---+ + DPMAC + +---+---+ + | + port/PHY + +Below the objects are described. For each object a brief description +is provided along with a summary of the kinds of operations the object +supports and a summary of key resources of the object (MMIO regions +and IRQs). + +DPMAC (Datapath Ethernet MAC) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Represents an Ethernet MAC, a hardware device that connects to an Ethernet +PHY and allows physical transmission and reception of Ethernet frames. + +- MMIO regions: none +- IRQs: DPNI link change +- commands: set link up/down, link config, get stats, + IRQ config, enable, reset + +DPNI (Datapath Network Interface) +Contains TX/RX queues, network interface configuration, and RX buffer pool +configuration mechanisms. The TX/RX queues are in memory and are identified +by queue number. + +- MMIO regions: none +- IRQs: link state +- commands: port config, offload config, queue config, + parse/classify config, IRQ config, enable, reset + +DPIO (Datapath I/O) +~~~~~~~~~~~~~~~~~~~ +Provides interfaces to enqueue and dequeue +packets and do hardware buffer pool management operations. The DPAA2 +architecture separates the mechanism to access queues (the DPIO object) +from the queues themselves. The DPIO provides an MMIO interface to +enqueue/dequeue packets. To enqueue something a descriptor is written +to the DPIO MMIO region, which includes the target queue number. +There will typically be one DPIO assigned to each CPU. This allows all +CPUs to simultaneously perform enqueue/dequeued operations. DPIOs are +expected to be shared by different DPAA2 drivers. + +- MMIO regions: queue operations, buffer management +- IRQs: data availability, congestion notification, buffer + pool depletion +- commands: IRQ config, enable, reset + +DPBP (Datapath Buffer Pool) +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Represents a hardware buffer pool. + +- MMIO regions: none +- IRQs: none +- commands: enable, reset + +DPMCP (Datapath MC Portal) +~~~~~~~~~~~~~~~~~~~~~~~~~~ +Provides an MC command portal. +Used by drivers to send commands to the MC to manage +objects. + +- MMIO regions: MC command portal +- IRQs: command completion +- commands: IRQ config, enable, reset + +Object Connections +================== +Some objects have explicit relationships that must +be configured: + +- DPNI <--> DPMAC +- DPNI <--> DPNI +- DPNI <--> L2-switch-port + + A DPNI must be connected to something such as a DPMAC, + another DPNI, or L2 switch port. The DPNI connection + is made via a DPRC command. + +:: + + +-------+ +-------+ + | DPNI | | DPMAC | + +---+---+ +---+---+ + | | + +==========+ + +- DPNI <--> DPBP + + A network interface requires a 'buffer pool' (DPBP + object) which provides a list of pointers to memory + where received Ethernet data is to be copied. The + Ethernet driver configures the DPBPs associated with + the network interface. + +Interrupts +========== +All interrupts generated by DPAA2 objects are message +interrupts. At the hardware level message interrupts +generated by devices will normally have 3 components-- +1) a non-spoofable 'device-id' expressed on the hardware +bus, 2) an address, 3) a data value. + +In the case of DPAA2 devices/objects, all objects in the +same container/DPRC share the same 'device-id'. +For ARM-based SoC this is the same as the stream ID. + + +DPAA2 Linux Drivers Overview +============================ + +This section provides an overview of the Linux kernel drivers for +DPAA2-- 1) the bus driver and associated "DPAA2 infrastructure" +drivers and 2) functional object drivers (such as Ethernet). + +As described previously, a DPRC is a container that holds the other +types of DPAA2 objects. It is functionally similar to a plug-and-play +bus controller. +Each object in the DPRC is a Linux "device" and is bound to a driver. +The diagram below shows the Linux drivers involved in a networking +scenario and the objects bound to each driver. A brief description +of each driver follows. + +:: + + +------------+ + | OS Network | + | Stack | + +------------+ +------------+ + | Allocator |. . . . . . . | Ethernet | + |(DPMCP,DPBP)| | (DPNI) | + +-.----------+ +---+---+----+ + . . ^ | + . . <data avail, | | <enqueue, + . . tx confirm> | | dequeue> + +-------------+ . | | + | DPRC driver | . +---+---V----+ +---------+ + | (DPRC) | . . . . . .| DPIO driver| | MAC | + +----------+--+ | (DPIO) | | (DPMAC) | + | +------+-----+ +-----+---+ + |<dev add/remove> | | + | | | + +--------+----------+ | +--+---+ + | MC-bus driver | | | PHY | + | | | |driver| + | /bus/fsl-mc | | +--+---+ + +-------------------+ | | + | | + ========================= HARDWARE =========|=================|====== + DPIO | + | | + DPNI---DPBP | + | | + DPMAC | + | | + PHY ---------------+ + ============================================|======================== + +A brief description of each driver is provided below. + +MC-bus driver +------------- +The MC-bus driver is a platform driver and is probed from a +node in the device tree (compatible "fsl,qoriq-mc") passed in by boot +firmware. It is responsible for bootstrapping the DPAA2 kernel +infrastructure. +Key functions include: + +- registering a new bus type named "fsl-mc" with the kernel, + and implementing bus call-backs (e.g. match/uevent/dev_groups) +- implementing APIs for DPAA2 driver registration and for device + add/remove +- creates an MSI IRQ domain +- doing a 'device add' to expose the 'root' DPRC, in turn triggering + a bind of the root DPRC to the DPRC driver + +The binding for the MC-bus device-tree node can be consulted at +*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*. +The sysfs bind/unbind interfaces for the MC-bus can be consulted at +*Documentation/ABI/testing/sysfs-bus-fsl-mc*. + +DPRC driver +----------- +The DPRC driver is bound to DPRC objects and does runtime management +of a bus instance. It performs the initial bus scan of the DPRC +and handles interrupts for container events such as hot plug by +re-scanning the DPRC. + +Allocator +--------- +Certain objects such as DPMCP and DPBP are generic and fungible, +and are intended to be used by other drivers. For example, +the DPAA2 Ethernet driver needs: + +- DPMCPs to send MC commands, to configure network interfaces +- DPBPs for network buffer pools + +The allocator driver registers for these allocatable object types +and those objects are bound to the allocator when the bus is probed. +The allocator maintains a pool of objects that are available for +allocation by other DPAA2 drivers. + +DPIO driver +----------- +The DPIO driver is bound to DPIO objects and provides services that allow +other drivers such as the Ethernet driver to enqueue and dequeue data for +their respective objects. +Key services include: + +- data availability notifications +- hardware queuing operations (enqueue and dequeue of data) +- hardware buffer pool management + +To transmit a packet the Ethernet driver puts data on a queue and +invokes a DPIO API. For receive, the Ethernet driver registers +a data availability notification callback. To dequeue a packet +a DPIO API is used. +There is typically one DPIO object per physical CPU for optimum +performance, allowing different CPUs to simultaneously enqueue +and dequeue data. + +The DPIO driver operates on behalf of all DPAA2 drivers +active in the kernel-- Ethernet, crypto, compression, +etc. + +Ethernet driver +--------------- +The Ethernet driver is bound to a DPNI and implements the kernel +interfaces needed to connect the DPAA2 network interface to +the network stack. +Each DPNI corresponds to a Linux network interface. + +MAC driver +---------- +An Ethernet PHY is an off-chip, board specific component and is managed +by the appropriate PHY driver via an mdio bus. The MAC driver +plays a role of being a proxy between the PHY driver and the +MC. It does this proxy via the MC commands to a DPMAC object. +If the PHY driver signals a link change, the MAC driver notifies +the MC via a DPMAC command. If a network interface is brought +up or down, the MC notifies the DPMAC driver via an interrupt and +the driver can take appropriate action. diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.txt new file mode 100644 index 000000000..da59e2884 --- /dev/null +++ b/Documentation/networking/driver.txt @@ -0,0 +1,93 @@ +Document about softnet driver issues + +Transmit path guidelines: + +1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under + any normal circumstances. It is considered a hard error unless + there is no way your device can tell ahead of time when it's + transmit function will become busy. + + Instead it must maintain the queue properly. For example, + for a driver implementing scatter-gather this means: + + static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, + struct net_device *dev) + { + struct drv *dp = netdev_priv(dev); + + lock_tx(dp); + ... + /* This is a hard error log it. */ + if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) { + netif_stop_queue(dev); + unlock_tx(dp); + printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", + dev->name); + return NETDEV_TX_BUSY; + } + + ... queue packet to card ... + ... update tx consumer index ... + + if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1)) + netif_stop_queue(dev); + + ... + unlock_tx(dp); + ... + return NETDEV_TX_OK; + } + + And then at the end of your TX reclamation event handling: + + if (netif_queue_stopped(dp->dev) && + TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) + netif_wake_queue(dp->dev); + + For a non-scatter-gather supporting card, the three tests simply become: + + /* This is a hard error log it. */ + if (TX_BUFFS_AVAIL(dp) <= 0) + + and: + + if (TX_BUFFS_AVAIL(dp) == 0) + + and: + + if (netif_queue_stopped(dp->dev) && + TX_BUFFS_AVAIL(dp) > 0) + netif_wake_queue(dp->dev); + +2) An ndo_start_xmit method must not modify the shared parts of a + cloned SKB. + +3) Do not forget that once you return NETDEV_TX_OK from your + ndo_start_xmit method, it is your driver's responsibility to free + up the SKB and in some finite amount of time. + + For example, this means that it is not allowed for your TX + mitigation scheme to let TX packets "hang out" in the TX + ring unreclaimed forever if no new TX packets are sent. + This error can deadlock sockets waiting for send buffer room + to be freed up. + + If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you + must not keep any reference to that SKB and you must not attempt + to free it up. + +Probing guidelines: + +1) Any hardware layer address you obtain for your device should + be verified. For example, for ethernet check it with + linux/etherdevice.h:is_valid_ether_addr() + +Close/stop guidelines: + +1) After the ndo_stop routine has been called, the hardware must + not receive or transmit any data. All in flight packets must + be aborted. If necessary, poll or wait for completion of + any reset commands. + +2) The ndo_stop routine will be called by unregister_netdevice + if device is still UP. diff --git a/Documentation/networking/dsa/bcm_sf2.txt b/Documentation/networking/dsa/bcm_sf2.txt new file mode 100644 index 000000000..eba3a2431 --- /dev/null +++ b/Documentation/networking/dsa/bcm_sf2.txt @@ -0,0 +1,114 @@ +Broadcom Starfighter 2 Ethernet switch driver +============================================= + +Broadcom's Starfighter 2 Ethernet switch hardware block is commonly found and +deployed in the following products: + +- xDSL gateways such as BCM63138 +- streaming/multimedia Set Top Box such as BCM7445 +- Cable Modem/residential gateways such as BCM7145/BCM3390 + +The switch is typically deployed in a configuration involving between 5 to 13 +ports, offering a range of built-in and customizable interfaces: + +- single integrated Gigabit PHY +- quad integrated Gigabit PHY +- quad external Gigabit PHY w/ MDIO multiplexer +- integrated MoCA PHY +- several external MII/RevMII/GMII/RGMII interfaces + +The switch also supports specific congestion control features which allow MoCA +fail-over not to lose packets during a MoCA role re-election, as well as out of +band back-pressure to the host CPU network interface when downstream interfaces +are connected at a lower speed. + +The switch hardware block is typically interfaced using MMIO accesses and +contains a bunch of sub-blocks/registers: + +* SWITCH_CORE: common switch registers +* SWITCH_REG: external interfaces switch register +* SWITCH_MDIO: external MDIO bus controller (there is another one in SWITCH_CORE, + which is used for indirect PHY accesses) +* SWITCH_INDIR_RW: 64-bits wide register helper block +* SWITCH_INTRL2_0/1: Level-2 interrupt controllers +* SWITCH_ACB: Admission control block +* SWITCH_FCB: Fail-over control block + +Implementation details +====================== + +The driver is located in drivers/net/dsa/bcm_sf2.c and is implemented as a DSA +driver; see Documentation/networking/dsa/dsa.txt for details on the subsystem +and what it provides. + +The SF2 switch is configured to enable a Broadcom specific 4-bytes switch tag +which gets inserted by the switch for every packet forwarded to the CPU +interface, conversely, the CPU network interface should insert a similar tag for +packets entering the CPU port. The tag format is described in +net/dsa/tag_brcm.c. + +Overall, the SF2 driver is a fairly regular DSA driver; there are a few +specifics covered below. + +Device Tree probing +------------------- + +The DSA platform device driver is probed using a specific compatible string +provided in net/dsa/dsa.c. The reason for that is because the DSA subsystem gets +registered as a platform device driver currently. DSA will provide the needed +device_node pointers which are then accessible by the switch driver setup +function to setup resources such as register ranges and interrupts. This +currently works very well because none of the of_* functions utilized by the +driver require a struct device to be bound to a struct device_node, but things +may change in the future. + +MDIO indirect accesses +---------------------- + +Due to a limitation in how Broadcom switches have been designed, external +Broadcom switches connected to a SF2 require the use of the DSA slave MDIO bus +in order to properly configure them. By default, the SF2 pseudo-PHY address, and +an external switch pseudo-PHY address will both be snooping for incoming MDIO +transactions, since they are at the same address (30), resulting in some kind of +"double" programming. Using DSA, and setting ds->phys_mii_mask accordingly, we +selectively divert reads and writes towards external Broadcom switches +pseudo-PHY addresses. Newer revisions of the SF2 hardware have introduced a +configurable pseudo-PHY address which circumvents the initial design limitation. + +Multimedia over CoAxial (MoCA) interfaces +----------------------------------------- + +MoCA interfaces are fairly specific and require the use of a firmware blob which +gets loaded onto the MoCA processor(s) for packet processing. The switch +hardware contains logic which will assert/de-assert link states accordingly for +the MoCA interface whenever the MoCA coaxial cable gets disconnected or the +firmware gets reloaded. The SF2 driver relies on such events to properly set its +MoCA interface carrier state and properly report this to the networking stack. + +The MoCA interfaces are supported using the PHY library's fixed PHY/emulated PHY +device and the switch driver registers a fixed_link_update callback for such +PHYs which reflects the link state obtained from the interrupt handler. + + +Power Management +---------------- + +Whenever possible, the SF2 driver tries to minimize the overall switch power +consumption by applying a combination of: + +- turning off internal buffers/memories +- disabling packet processing logic +- putting integrated PHYs in IDDQ/low-power +- reducing the switch core clock based on the active port count +- enabling and advertising EEE +- turning off RGMII data processing logic when the link goes down + +Wake-on-LAN +----------- + +Wake-on-LAN is currently implemented by utilizing the host processor Ethernet +MAC controller wake-on logic. Whenever Wake-on-LAN is requested, an intersection +between the user request and the supported host Ethernet interface WoL +capabilities is done and the intersection result gets configured. During +system-wide suspend/resume, only ports not participating in Wake-on-LAN are +disabled. diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt new file mode 100644 index 000000000..25170ad7d --- /dev/null +++ b/Documentation/networking/dsa/dsa.txt @@ -0,0 +1,601 @@ +Distributed Switch Architecture +=============================== + +Introduction +============ + +This document describes the Distributed Switch Architecture (DSA) subsystem +design principles, limitations, interactions with other subsystems, and how to +develop drivers for this subsystem as well as a TODO for developers interested +in joining the effort. + +Design principles +================= + +The Distributed Switch Architecture is a subsystem which was primarily designed +to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line) +using Linux, but has since evolved to support other vendors as well. + +The original philosophy behind this design was to be able to use unmodified +Linux tools such as bridge, iproute2, ifconfig to work transparently whether +they configured/queried a switch port network device or a regular network +device. + +An Ethernet switch is typically comprised of multiple front-panel ports, and one +or more CPU or management port. The DSA subsystem currently relies on the +presence of a management port connected to an Ethernet controller capable of +receiving Ethernet frames from the switch. This is a very common setup for all +kinds of Ethernet switches found in Small Home and Office products: routers, +gateways, or even top-of-the rack switches. This host Ethernet controller will +be later referred to as "master" and "cpu" in DSA terminology and code. + +The D in DSA stands for Distributed, because the subsystem has been designed +with the ability to configure and manage cascaded switches on top of each other +using upstream and downstream Ethernet links between switches. These specific +ports are referred to as "dsa" ports in DSA terminology and code. A collection +of multiple switches connected to each other is called a "switch tree". + +For each front-panel port, DSA will create specialized network devices which are +used as controlling and data-flowing endpoints for use by the Linux networking +stack. These specialized network interfaces are referred to as "slave" network +interfaces in DSA terminology and code. + +The ideal case for using DSA is when an Ethernet switch supports a "switch tag" +which is a hardware feature making the switch insert a specific tag for each +Ethernet frames it received to/from specific ports to help the management +interface figure out: + +- what port is this frame coming from +- what was the reason why this frame got forwarded +- how to send CPU originated traffic to specific ports + +The subsystem does support switches not capable of inserting/stripping tags, but +the features might be slightly limited in that case (traffic separation relies +on Port-based VLAN IDs). + +Note that DSA does not currently create network interfaces for the "cpu" and +"dsa" ports because: + +- the "cpu" port is the Ethernet switch facing side of the management + controller, and as such, would create a duplication of feature, since you + would get two interfaces for the same conduit: master netdev, and "cpu" netdev + +- the "dsa" port(s) are just conduits between two or more switches, and as such + cannot really be used as proper network interfaces either, only the + downstream, or the top-most upstream interface makes sense with that model + +Switch tagging protocols +------------------------ + +DSA currently supports 5 different tagging protocols, and a tag-less mode as +well. The different protocols are implemented in: + +net/dsa/tag_trailer.c: Marvell's 4 trailer tag mode (legacy) +net/dsa/tag_dsa.c: Marvell's original DSA tag +net/dsa/tag_edsa.c: Marvell's enhanced DSA tag +net/dsa/tag_brcm.c: Broadcom's 4 bytes tag +net/dsa/tag_qca.c: Qualcomm's 2 bytes tag + +The exact format of the tag protocol is vendor specific, but in general, they +all contain something which: + +- identifies which port the Ethernet frame came from/should be sent to +- provides a reason why this frame was forwarded to the management interface + +Master network devices +---------------------- + +Master network devices are regular, unmodified Linux network device drivers for +the CPU/management Ethernet interface. Such a driver might occasionally need to +know whether DSA is enabled (e.g.: to enable/disable specific offload features), +but the DSA subsystem has been proven to work with industry standard drivers: +e1000e, mv643xx_eth etc. without having to introduce modifications to these +drivers. Such network devices are also often referred to as conduit network +devices since they act as a pipe between the host processor and the hardware +Ethernet switch. + +Networking stack hooks +---------------------- + +When a master netdev is used with DSA, a small hook is placed in in the +networking stack is in order to have the DSA subsystem process the Ethernet +switch specific tagging protocol. DSA accomplishes this by registering a +specific (and fake) Ethernet type (later becoming skb->protocol) with the +networking stack, this is also known as a ptype or packet_type. A typical +Ethernet Frame receive sequence looks like this: + +Master network device (e.g.: e1000e): + +Receive interrupt fires: +- receive function is invoked +- basic packet processing is done: getting length, status etc. +- packet is prepared to be processed by the Ethernet layer by calling + eth_type_trans + +net/ethernet/eth.c: + +eth_type_trans(skb, dev) + if (dev->dsa_ptr != NULL) + -> skb->protocol = ETH_P_XDSA + +drivers/net/ethernet/*: + +netif_receive_skb(skb) + -> iterate over registered packet_type + -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv() + +net/dsa/dsa.c: + -> dsa_switch_rcv() + -> invoke switch tag specific protocol handler in + net/dsa/tag_*.c + +net/dsa/tag_*.c: + -> inspect and strip switch tag protocol to determine originating port + -> locate per-port network device + -> invoke eth_type_trans() with the DSA slave network device + -> invoked netif_receive_skb() + +Past this point, the DSA slave network devices get delivered regular Ethernet +frames that can be processed by the networking stack. + +Slave network devices +--------------------- + +Slave network devices created by DSA are stacked on top of their master network +device, each of these network interfaces will be responsible for being a +controlling and data-flowing end-point for each front-panel port of the switch. +These interfaces are specialized in order to: + +- insert/remove the switch tag protocol (if it exists) when sending traffic + to/from specific switch ports +- query the switch for ethtool operations: statistics, link state, + Wake-on-LAN, register dumps... +- external/internal PHY management: link, auto-negotiation etc. + +These slave network devices have custom net_device_ops and ethtool_ops function +pointers which allow DSA to introduce a level of layering between the networking +stack/ethtool, and the switch driver implementation. + +Upon frame transmission from these slave network devices, DSA will look up which +switch tagging protocol is currently registered with these network devices, and +invoke a specific transmit routine which takes care of adding the relevant +switch tag in the Ethernet frames. + +These frames are then queued for transmission using the master network device +ndo_start_xmit() function, since they contain the appropriate switch tag, the +Ethernet switch will be able to process these incoming frames from the +management interface and delivers these frames to the physical switch port. + +Graphical representation +------------------------ + +Summarized, this is basically how DSA looks like from a network device +perspective: + + + |--------------------------- + | CPU network device (eth0)| + ---------------------------- + | <tag added by switch | + | | + | | + | tag added by CPU> | + |--------------------------------------------| + | Switch driver | + |--------------------------------------------| + || || || + |-------| |-------| |-------| + | sw0p0 | | sw0p1 | | sw0p2 | + |-------| |-------| |-------| + +Slave MDIO bus +-------------- + +In order to be able to read to/from a switch PHY built into it, DSA creates a +slave MDIO bus which allows a specific switch driver to divert and intercept +MDIO reads/writes towards specific PHY addresses. In most MDIO-connected +switches, these functions would utilize direct or indirect PHY addressing mode +to return standard MII registers from the switch builtin PHYs, allowing the PHY +library and/or to return link status, link partner pages, auto-negotiation +results etc.. + +For Ethernet switches which have both external and internal MDIO busses, the +slave MII bus can be utilized to mux/demux MDIO reads and writes towards either +internal or external MDIO devices this switch might be connected to: internal +PHYs, external PHYs, or even external switches. + +Data structures +--------------- + +DSA data structures are defined in include/net/dsa.h as well as +net/dsa/dsa_priv.h. + +dsa_chip_data: platform data configuration for a given switch device, this +structure describes a switch device's parent device, its address, as well as +various properties of its ports: names/labels, and finally a routing table +indication (when cascading switches) + +dsa_platform_data: platform device configuration data which can reference a +collection of dsa_chip_data structure if multiples switches are cascaded, the +master network device this switch tree is attached to needs to be referenced + +dsa_switch_tree: structure assigned to the master network device under +"dsa_ptr", this structure references a dsa_platform_data structure as well as +the tagging protocol supported by the switch tree, and which receive/transmit +function hooks should be invoked, information about the directly attached switch +is also provided: CPU port. Finally, a collection of dsa_switch are referenced +to address individual switches in the tree. + +dsa_switch: structure describing a switch device in the tree, referencing a +dsa_switch_tree as a backpointer, slave network devices, master network device, +and a reference to the backing dsa_switch_ops + +dsa_switch_ops: structure referencing function pointers, see below for a full +description. + +Design limitations +================== + +DSA is a platform device driver +------------------------------- + +DSA is implemented as a DSA platform device driver which is convenient because +it will register the entire DSA switch tree attached to a master network device +in one-shot, facilitating the device creation and simplifying the device driver +model a bit, this comes however with a number of limitations: + +- building DSA and its switch drivers as modules is currently not working +- the device driver parenting does not necessarily reflect the original + bus/device the switch can be created from +- supporting non-MDIO and non-MMIO (platform) switches is not possible + +Limits on the number of devices and ports +----------------------------------------- + +DSA currently limits the number of maximum switches within a tree to 4 +(DSA_MAX_SWITCHES), and the number of ports per switch to 12 (DSA_MAX_PORTS). +These limits could be extended to support larger configurations would this need +arise. + +Lack of CPU/DSA network devices +------------------------------- + +DSA does not currently create slave network devices for the CPU or DSA ports, as +described before. This might be an issue in the following cases: + +- inability to fetch switch CPU port statistics counters using ethtool, which + can make it harder to debug MDIO switch connected using xMII interfaces + +- inability to configure the CPU port link parameters based on the Ethernet + controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/ + +- inability to configure specific VLAN IDs / trunking VLANs between switches + when using a cascaded setup + +Common pitfalls using DSA setups +-------------------------------- + +Once a master network device is configured to use DSA (dev->dsa_ptr becomes +non-NULL), and the switch behind it expects a tagging protocol, this network +interface can only exclusively be used as a conduit interface. Sending packets +directly through this interface (e.g.: opening a socket using this interface) +will not make us go through the switch tagging protocol transmit function, so +the Ethernet switch on the other end, expecting a tag will typically drop this +frame. + +Slave network devices check that the master network device is UP before allowing +you to administratively bring UP these slave network devices. A common +configuration mistake is forgetting to bring UP the master network device first. + +Interactions with other subsystems +================================== + +DSA currently leverages the following subsystems: + +- MDIO/PHY library: drivers/net/phy/phy.c, mdio_bus.c +- Switchdev: net/switchdev/* +- Device Tree for various of_* functions + +MDIO/PHY library +---------------- + +Slave network devices exposed by DSA may or may not be interfacing with PHY +devices (struct phy_device as defined in include/linux/phy.h), but the DSA +subsystem deals with all possible combinations: + +- internal PHY devices, built into the Ethernet switch hardware +- external PHY devices, connected via an internal or external MDIO bus +- internal PHY devices, connected via an internal MDIO bus +- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a + fixed PHYs + +The PHY configuration is done by the dsa_slave_phy_setup() function and the +logic basically looks like this: + +- if Device Tree is used, the PHY device is looked up using the standard + "phy-handle" property, if found, this PHY device is created and registered + using of_phy_connect() + +- if Device Tree is used, and the PHY device is "fixed", that is, conforms to + the definition of a non-MDIO managed PHY as defined in + Documentation/devicetree/bindings/net/fixed-link.txt, the PHY is registered + and connected transparently using the special fixed MDIO bus driver + +- finally, if the PHY is built into the switch, as is very common with + standalone switch packages, the PHY is probed using the slave MII bus created + by DSA + + +SWITCHDEV +--------- + +DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and +more specifically with its VLAN filtering portion when configuring VLANs on top +of per-port slave network devices. Since DSA primarily deals with +MDIO-connected switches, although not exclusively, SWITCHDEV's +prepare/abort/commit phases are often simplified into a prepare phase which +checks whether the operation is supported by the DSA switch driver, and a commit +phase which applies the changes. + +As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN +objects. + +Device Tree +----------- + +DSA features a standardized binding which is documented in +Documentation/devicetree/bindings/net/dsa/dsa.txt. PHY/MDIO library helper +functions such as of_get_phy_mode(), of_phy_connect() are also used to query +per-port PHY specific details: interface connection, MDIO bus location etc.. + +Driver development +================== + +DSA switch drivers need to implement a dsa_switch_ops structure which will +contain the various members described below. + +register_switch_driver() registers this dsa_switch_ops in its internal list +of drivers to probe for. unregister_switch_driver() does the exact opposite. + +Unless requested differently by setting the priv_size member accordingly, DSA +does not allocate any driver private context space. + +Switch configuration +-------------------- + +- tag_protocol: this is to indicate what kind of tagging protocol is supported, + should be a valid value from the dsa_tag_protocol enum + +- probe: probe routine which will be invoked by the DSA platform device upon + registration to test for the presence/absence of a switch device. For MDIO + devices, it is recommended to issue a read towards internal registers using + the switch pseudo-PHY and return whether this is a supported device. For other + buses, return a non-NULL string + +- setup: setup function for the switch, this function is responsible for setting + up the dsa_switch_ops private structure with all it needs: register maps, + interrupts, mutexes, locks etc.. This function is also expected to properly + configure the switch to separate all network interfaces from each other, that + is, they should be isolated by the switch hardware itself, typically by creating + a Port-based VLAN ID for each port and allowing only the CPU port and the + specific port to be in the forwarding vector. Ports that are unused by the + platform should be disabled. Past this function, the switch is expected to be + fully configured and ready to serve any kind of request. It is recommended + to issue a software reset of the switch during this setup function in order to + avoid relying on what a previous software agent such as a bootloader/firmware + may have previously configured. + +PHY devices and link management +------------------------------- + +- get_phy_flags: Some switches are interfaced to various kinds of Ethernet PHYs, + if the PHY library PHY driver needs to know about information it cannot obtain + on its own (e.g.: coming from switch memory mapped registers), this function + should return a 32-bits bitmask of "flags", that is private between the switch + driver and the Ethernet PHY driver in drivers/net/phy/*. + +- phy_read: Function invoked by the DSA slave MDIO bus when attempting to read + the switch port MDIO registers. If unavailable, return 0xffff for each read. + For builtin switch Ethernet PHYs, this function should allow reading the link + status, auto-negotiation results, link partner pages etc.. + +- phy_write: Function invoked by the DSA slave MDIO bus when attempting to write + to the switch port MDIO registers. If unavailable return a negative error + code. + +- adjust_link: Function invoked by the PHY library when a slave network device + is attached to a PHY device. This function is responsible for appropriately + configuring the switch port link parameters: speed, duplex, pause based on + what the phy_device is providing. + +- fixed_link_update: Function invoked by the PHY library, and specifically by + the fixed PHY driver asking the switch driver for link parameters that could + not be auto-negotiated, or obtained by reading the PHY registers through MDIO. + This is particularly useful for specific kinds of hardware such as QSGMII, + MoCA or other kinds of non-MDIO managed PHYs where out of band link + information is obtained + +Ethtool operations +------------------ + +- get_strings: ethtool function used to query the driver's strings, will + typically return statistics strings, private flags strings etc. + +- get_ethtool_stats: ethtool function used to query per-port statistics and + return their values. DSA overlays slave network devices general statistics: + RX/TX counters from the network device, with switch driver specific statistics + per port + +- get_sset_count: ethtool function used to query the number of statistics items + +- get_wol: ethtool function used to obtain Wake-on-LAN settings per-port, this + function may, for certain implementations also query the master network device + Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN + +- set_wol: ethtool function used to configure Wake-on-LAN settings per-port, + direct counterpart to set_wol with similar restrictions + +- set_eee: ethtool function which is used to configure a switch port EEE (Green + Ethernet) settings, can optionally invoke the PHY library to enable EEE at the + PHY level if relevant. This function should enable EEE at the switch port MAC + controller and data-processing logic + +- get_eee: ethtool function which is used to query a switch port EEE settings, + this function should return the EEE state of the switch port MAC controller + and data-processing logic as well as query the PHY for its currently configured + EEE settings + +- get_eeprom_len: ethtool function returning for a given switch the EEPROM + length/size in bytes + +- get_eeprom: ethtool function returning for a given switch the EEPROM contents + +- set_eeprom: ethtool function writing specified data to a given switch EEPROM + +- get_regs_len: ethtool function returning the register length for a given + switch + +- get_regs: ethtool function returning the Ethernet switch internal register + contents. This function might require user-land code in ethtool to + pretty-print register values and registers + +Power management +---------------- + +- suspend: function invoked by the DSA platform device when the system goes to + suspend, should quiesce all Ethernet switch activities, but keep ports + participating in Wake-on-LAN active as well as additional wake-up logic if + supported + +- resume: function invoked by the DSA platform device when the system resumes, + should resume all Ethernet switch activities and re-configure the switch to be + in a fully active state + +- port_enable: function invoked by the DSA slave network device ndo_open + function when a port is administratively brought up, this function should be + fully enabling a given switch port. DSA takes care of marking the port with + BR_STATE_BLOCKING if the port is a bridge member, or BR_STATE_FORWARDING if it + was not, and propagating these changes down to the hardware + +- port_disable: function invoked by the DSA slave network device ndo_close + function when a port is administratively brought down, this function should be + fully disabling a given switch port. DSA takes care of marking the port with + BR_STATE_DISABLED and propagating changes to the hardware if this port is + disabled while being a bridge member + +Bridge layer +------------ + +- port_bridge_join: bridge layer function invoked when a given switch port is + added to a bridge, this function should be doing the necessary at the switch + level to permit the joining port from being added to the relevant logical + domain for it to ingress/egress traffic with other members of the bridge. + +- port_bridge_leave: bridge layer function invoked when a given switch port is + removed from a bridge, this function should be doing the necessary at the + switch level to deny the leaving port from ingress/egress traffic from the + remaining bridge members. When the port leaves the bridge, it should be aged + out at the switch hardware for the switch to (re) learn MAC addresses behind + this port. + +- port_stp_state_set: bridge layer function invoked when a given switch port STP + state is computed by the bridge layer and should be propagated to switch + hardware to forward/block/learn traffic. The switch driver is responsible for + computing a STP state change based on current and asked parameters and perform + the relevant ageing based on the intersection results + +Bridge VLAN filtering +--------------------- + +- port_vlan_filtering: bridge layer function invoked when the bridge gets + configured for turning on or off VLAN filtering. If nothing specific needs to + be done at the hardware level, this callback does not need to be implemented. + When VLAN filtering is turned on, the hardware must be programmed with + rejecting 802.1Q frames which have VLAN IDs outside of the programmed allowed + VLAN ID map/rules. If there is no PVID programmed into the switch port, + untagged frames must be rejected as well. When turned off the switch must + accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are + allowed. + +- port_vlan_prepare: bridge layer function invoked when the bridge prepares the + configuration of a VLAN on the given port. If the operation is not supported + by the hardware, this function should return -EOPNOTSUPP to inform the bridge + code to fallback to a software implementation. No hardware setup must be done + in this function. See port_vlan_add for this and details. + +- port_vlan_add: bridge layer function invoked when a VLAN is configured + (tagged or untagged) for the given switch port + +- port_vlan_del: bridge layer function invoked when a VLAN is removed from the + given switch port + +- port_vlan_dump: bridge layer function invoked with a switchdev callback + function that the driver has to call for each VLAN the given port is a member + of. A switchdev object is used to carry the VID and bridge flags. + +- port_fdb_prepare: bridge layer function invoked when the bridge prepares the + installation of a Forwarding Database entry. If the operation is not + supported, this function should return -EOPNOTSUPP to inform the bridge code + to fallback to a software implementation. No hardware setup must be done in + this function. See port_fdb_add for this and details. + +- port_fdb_add: bridge layer function invoked when the bridge wants to install a + Forwarding Database entry, the switch hardware should be programmed with the + specified address in the specified VLAN Id in the forwarding database + associated with this VLAN ID + +Note: VLAN ID 0 corresponds to the port private database, which, in the context +of DSA, would be the its port-based VLAN, used by the associated bridge device. + +- port_fdb_del: bridge layer function invoked when the bridge wants to remove a + Forwarding Database entry, the switch hardware should be programmed to delete + the specified MAC address from the specified VLAN ID if it was mapped into + this port forwarding database + +- port_fdb_dump: bridge layer function invoked with a switchdev callback + function that the driver has to call for each MAC address known to be behind + the given port. A switchdev object is used to carry the VID and FDB info. + +- port_mdb_prepare: bridge layer function invoked when the bridge prepares the + installation of a multicast database entry. If the operation is not supported, + this function should return -EOPNOTSUPP to inform the bridge code to fallback + to a software implementation. No hardware setup must be done in this function. + See port_fdb_add for this and details. + +- port_mdb_add: bridge layer function invoked when the bridge wants to install + a multicast database entry, the switch hardware should be programmed with the + specified address in the specified VLAN ID in the forwarding database + associated with this VLAN ID. + +Note: VLAN ID 0 corresponds to the port private database, which, in the context +of DSA, would be the its port-based VLAN, used by the associated bridge device. + +- port_mdb_del: bridge layer function invoked when the bridge wants to remove a + multicast database entry, the switch hardware should be programmed to delete + the specified MAC address from the specified VLAN ID if it was mapped into + this port forwarding database. + +- port_mdb_dump: bridge layer function invoked with a switchdev callback + function that the driver has to call for each MAC address known to be behind + the given port. A switchdev object is used to carry the VID and MDB info. + +TODO +==== + +Making SWITCHDEV and DSA converge towards an unified codebase +------------------------------------------------------------- + +SWITCHDEV properly takes care of abstracting the networking stack with offload +capable hardware, but does not enforce a strict switch device driver model. On +the other DSA enforces a fairly strict device driver model, and deals with most +of the switch specific. At some point we should envision a merger between these +two subsystems and get the best of both worlds. + +Other hanging fruits +-------------------- + +- making the number of ports fully dynamic and not dependent on DSA_MAX_PORTS +- allowing more than one CPU/management interface: + http://comments.gmane.org/gmane.linux.network/365657 +- porting more drivers from other vendors: + http://comments.gmane.org/gmane.linux.network/365510 diff --git a/Documentation/networking/dsa/lan9303.txt b/Documentation/networking/dsa/lan9303.txt new file mode 100644 index 000000000..144b02b95 --- /dev/null +++ b/Documentation/networking/dsa/lan9303.txt @@ -0,0 +1,37 @@ +LAN9303 Ethernet switch driver +============================== + +The LAN9303 is a three port 10/100 Mbps ethernet switch with integrated phys for +the two external ethernet ports. The third port is an RMII/MII interface to a +host master network interface (e.g. fixed link). + + +Driver details +============== + +The driver is implemented as a DSA driver, see +Documentation/networking/dsa/dsa.txt. + +See Documentation/devicetree/bindings/net/dsa/lan9303.txt for device tree +binding. + +The LAN9303 can be managed both via MDIO and I2C, both supported by this driver. + +At startup the driver configures the device to provide two separate network +interfaces (which is the default state of a DSA device). Due to HW limitations, +no HW MAC learning takes place in this mode. + +When both user ports are joined to the same bridge, the normal HW MAC learning +is enabled. This means that unicast traffic is forwarded in HW. Broadcast and +multicast is flooded in HW. STP is also supported in this mode. The driver +support fdb/mdb operations as well, meaning IGMP snooping is supported. + +If one of the user ports leave the bridge, the ports goes back to the initial +separated operation. + + +Driver limitations +================== + + - Support for VLAN filtering is not implemented + - The HW does not support VLAN-specific fdb entries diff --git a/Documentation/networking/e100.rst b/Documentation/networking/e100.rst new file mode 100644 index 000000000..f81111eba --- /dev/null +++ b/Documentation/networking/e100.rst @@ -0,0 +1,186 @@ +============================================================== +Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters +============================================================== + +June 1, 2018 + +Contents +======== + +- In This Release +- Identifying Your Adapter +- Building and Installation +- Driver Configuration Parameters +- Additional Configurations +- Known Issues +- Support + + +In This Release +=============== + +This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of +Adapters. This driver includes support for Itanium(R)2-based systems. + +For questions related to hardware requirements, refer to the documentation +supplied with your Intel PRO/100 adapter. + +The following features are now available in supported kernels: + - Native VLANs + - Channel Bonding (teaming) + - SNMP + +Channel Bonding documentation can be found in the Linux kernel source: +/Documentation/networking/bonding.txt + + +Identifying Your Adapter +======================== + +For information on how to identify your adapter, and for the latest Intel +network drivers, refer to the Intel Support website: +http://www.intel.com/support + +Driver Configuration Parameters +=============================== + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +Rx Descriptors: + Number of receive descriptors. A receive descriptor is a data + structure that describes a receive buffer and its attributes to the network + controller. The data in the descriptor is used by the controller to write + data from the controller to host memory. In the 3.x.x driver the valid range + for this parameter is 64-256. The default value is 256. This parameter can be + changed using the command:: + + ethtool -G eth? rx n + + Where n is the number of desired Rx descriptors. + +Tx Descriptors: + Number of transmit descriptors. A transmit descriptor is a data + structure that describes a transmit buffer and its attributes to the network + controller. The data in the descriptor is used by the controller to read + data from the host memory to the controller. In the 3.x.x driver the valid + range for this parameter is 64-256. The default value is 128. This parameter + can be changed using the command:: + + ethtool -G eth? tx n + + Where n is the number of desired Tx descriptors. + +Speed/Duplex: + The driver auto-negotiates the link speed and duplex settings by + default. The ethtool utility can be used as follows to force speed/duplex.:: + + ethtool -s eth? autoneg off speed {10|100} duplex {full|half} + + NOTE: setting the speed/duplex to incorrect values will cause the link to + fail. + +Event Log Message Level: + The driver uses the message level flag to log events + to syslog. The message level can be set at driver load time. It can also be + set using the command:: + + ethtool -s eth? msglvl n + + +Additional Configurations +========================= + +Configuring the Driver on Different Distributions +------------------------------------------------- + +Configuring a network driver to load properly when the system is started +is distribution dependent. Typically, the configuration process involves +adding an alias line to `/etc/modprobe.d/*.conf` as well as editing other +system startup scripts and/or configuration files. Many popular Linux +distributions ship with tools to make these changes for you. To learn +the proper way to configure a network device for your system, refer to +your distribution documentation. If during this process you are asked +for the driver or module name, the name for the Linux Base Driver for +the Intel PRO/100 Family of Adapters is e100. + +As an example, if you install the e100 driver for two PRO/100 adapters +(eth0 and eth1), add the following to a configuration file in +/etc/modprobe.d/:: + + alias eth0 e100 + alias eth1 e100 + +Viewing Link Messages +--------------------- + +In order to see link messages and other Intel driver information on your +console, you must set the dmesg level up to six. This can be done by +entering the following on the command line before loading the e100 +driver:: + + dmesg -n 6 + +If you wish to see all messages issued by the driver, including debug +messages, set the dmesg level to eight. + +NOTE: This setting is not saved across reboots. + +ethtool +------- + +The driver utilizes the ethtool interface for driver configuration and +diagnostics, as well as displaying statistical information. The ethtool +version 1.6 or later is required for this functionality. + +The latest release of ethtool can be found from +https://www.kernel.org/pub/software/network/ethtool/ + +Enabling Wake on LAN* (WoL) +--------------------------- +WoL is provided through the ethtool* utility. For instructions on +enabling WoL with ethtool, refer to the ethtool man page. WoL will be +enabled on the system during the next shut down or reboot. For this +driver version, in order to enable WoL, the e100 driver must be loaded +when shutting down or rebooting the system. + +NAPI +---- + +NAPI (Rx polling mode) is supported in the e100 driver. + +See https://wiki.linuxfoundation.org/networking/napi for more +information on NAPI. + +Multiple Interfaces on Same Ethernet Broadcast Network +------------------------------------------------------ + +Due to the default ARP behavior on Linux, it is not possible to have one +system on two IP networks in the same Ethernet broadcast domain +(non-partitioned switch) behave as expected. All Ethernet interfaces +will respond to IP traffic for any IP address assigned to the system. +This results in unbalanced receive traffic. + +If you have multiple interfaces in a server, either turn on ARP +filtering by + +(1) entering:: + + echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter + + (this only works if your kernel's version is higher than 2.4.5), or + +(2) installing the interfaces in separate broadcast domains (either + in different switches or in a switch partitioned to VLANs). + + +Support +======= +For general information, go to the Intel support website at: +http://www.intel.com/support/ + +or the Intel Wired Networking project hosted by Sourceforge at: +http://sourceforge.net/projects/e1000 +If an issue is identified with the released source code on a supported kernel +with a supported adapter, email the specific information related to the issue +to e1000-devel@lists.sf.net. diff --git a/Documentation/networking/e1000.rst b/Documentation/networking/e1000.rst new file mode 100644 index 000000000..f10dd4086 --- /dev/null +++ b/Documentation/networking/e1000.rst @@ -0,0 +1,461 @@ +=========================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Command Line Parameters +- Speed and Duplex Configuration +- Additional Configurations +- Support + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the +networking link on the left to search for your adapter: + + http://support.intel.com/support/go/network/adapter/home.htm + +Command Line Parameters +======================= + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +NOTES: + For more information about the AutoNeg, Duplex, and Speed + parameters, see the "Speed and Duplex Configuration" section in + this document. + + For more information about the InterruptThrottleRate, + RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay + parameters, see the application note at: + http://www.intel.com/design/network/applnots/ap450.htm + +AutoNeg +------- + +(Supported only on adapters with copper connections) + +:Valid Range: 0x01-0x0F, 0x20-0x2F +:Default Value: 0x2F + +This parameter is a bit-mask that specifies the speed and duplex settings +advertised by the adapter. When this parameter is used, the Speed and +Duplex parameters must not be specified. + +NOTE: + Refer to the Speed and Duplex section of this readme for more + information on the AutoNeg parameter. + +Duplex +------ + +(Supported only on adapters with copper connections) + +:Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) +:Default Value: 0 + +This defines the direction in which data is allowed to flow. Can be +either one or two-directional. If both Duplex and the link partner are +set to auto-negotiate, the board auto-detects the correct duplex. If the +link partner is forced (either full or half), Duplex defaults to half- +duplex. + +FlowControl +----------- + +:Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) +:Default Value: Reads flow control settings from the EEPROM + +This parameter controls the automatic generation(Tx) and response(Rx) +to Ethernet PAUSE frames. + +InterruptThrottleRate +--------------------- + +(not supported on Intel(R) 82542, 82543 or 82544-based adapters) + +:Valid Range: + 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, + 4=simplified balancing) +:Default Value: 3 + +The driver can limit the amount of interrupts per second that the adapter +will generate for incoming packets. It does this by writing a value to the +adapter that is based on the maximum amount of interrupts that the adapter +will generate per second. + +Setting InterruptThrottleRate to a value greater or equal to 100 +will program the adapter to send out a maximum of that many interrupts +per second, even if more packets have come in. This reduces interrupt +load on the system and can lower CPU utilization under heavy load, +but will increase latency as packets are not processed as quickly. + +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types,but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. + +Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which +it dynamically adjusts the InterruptThrottleRate value based on the traffic +that it receives. After determining the type of incoming traffic in the last +timeframe, it will adjust the InterruptThrottleRate to an appropriate value +for that traffic. + +The algorithm classifies the incoming traffic every interval into +classes. Once the class is determined, the InterruptThrottleRate value is +adjusted to suit that traffic type the best. There are three classes defined: +"Bulk traffic", for large amounts of packets of normal size; "Low latency", +for small amounts of traffic and/or a significant percentage of small +packets; and "Lowest latency", for almost completely small packets or +minimal traffic. + +In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 +for traffic that falls in class "Bulk traffic". If traffic falls in the "Low +latency" or "Lowest latency" class, the InterruptThrottleRate is increased +stepwise to 20000. This default mode is suitable for most applications. + +For situations where low latency is vital such as cluster or +grid computing, the algorithm can reduce latency even more when +InterruptThrottleRate is set to mode 1. In this mode, which operates +the same as mode 3, the InterruptThrottleRate will be increased stepwise to +70000 for traffic in class "Lowest latency". + +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the +interrupt rate will drop as low as 2000 interrupts per second. If the +traffic is mostly transmit or mostly receive, the interrupt rate could +be as high as 8000. + +Setting InterruptThrottleRate to 0 turns off any interrupt moderation +and may improve small packet latency, but is generally not suitable +for bulk throughput traffic. + +NOTE: + InterruptThrottleRate takes precedence over the TxAbsIntDelay and + RxAbsIntDelay parameters. In other words, minimizing the receive + and/or transmit absolute delays does not force the controller to + generate more interrupts than what the Interrupt Throttle Rate + allows. + +CAUTION: + If you are using the Intel(R) PRO/1000 CT Network Connection + (controller 82547), setting InterruptThrottleRate to a value + greater than 75,000, may hang (stop transmitting) adapters + under certain network conditions. If this occurs a NETDEV + WATCHDOG message is logged in the system event log. In + addition, the controller is automatically reset, restoring + the network connection. To eliminate the potential for the + hang, ensure that InterruptThrottleRate is set no greater + than 75,000 and is not set to 0. + +NOTE: + When e1000 is loaded with default settings and multiple adapters + are in use simultaneously, the CPU utilization may increase non- + linearly. In order to limit the CPU utilization without impacting + the overall throughput, we recommend that you load the driver as + follows:: + + modprobe e1000 InterruptThrottleRate=3000,3000,3000 + + This sets the InterruptThrottleRate to 3000 interrupts/sec for + the first, second, and third instances of the driver. The range + of 2000 to 3000 interrupts per second works on a majority of + systems and is a good starting point, but the optimal value will + be platform-specific. If CPU utilization is not a concern, use + RX_POLLING (NAPI) and default driver settings. + +RxDescriptors +------------- + +:Valid Range: + - 48-256 for 82542 and 82543-based adapters + - 48-4096 for all other supported adapters +:Default Value: 256 + +This value specifies the number of receive buffer descriptors allocated +by the driver. Increasing this value allows the driver to buffer more +incoming packets, at the expense of increased system memory utilization. + +Each descriptor is 16 bytes. A receive buffer is also allocated for each +descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending +on the MTU setting. The maximum MTU size is 16110. + +NOTE: + MTU designates the frame size. It only needs to be set for Jumbo + Frames. Depending on the available system resources, the request + for a higher number of receive descriptors may be denied. In this + case, use a lower number. + +RxIntDelay +---------- + +:Valid Range: 0-65535 (0=off) +:Default Value: 0 + +This value delays the generation of receive interrupts in units of 1.024 +microseconds. Receive interrupt reduction can improve CPU efficiency if +properly tuned for specific network traffic. Increasing this value adds +extra latency to frame reception and can end up decreasing the throughput +of TCP traffic. If the system is reporting dropped receives, this value +may be set too high, causing the driver to run out of available receive +descriptors. + +CAUTION: + When setting RxIntDelay to a value other than 0, adapters may + hang (stop transmitting) under certain network conditions. If + this occurs a NETDEV WATCHDOG message is logged in the system + event log. In addition, the controller is automatically reset, + restoring the network connection. To eliminate the potential + for the hang ensure that RxIntDelay is set to 0. + +RxAbsIntDelay +------------- + +(This parameter is supported only on 82540, 82545 and later adapters.) + +:Valid Range: 0-65535 (0=off) +:Default Value: 128 + +This value, in units of 1.024 microseconds, limits the delay in which a +receive interrupt is generated. Useful only if RxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is received within the set amount of time. Proper tuning, +along with RxIntDelay, may improve traffic throughput in specific network +conditions. + +Speed +----- + +(This parameter is supported only on adapters with copper connections.) + +:Valid Settings: 0, 10, 100, 1000 +:Default Value: 0 (auto-negotiate at all supported speeds) + +Speed forces the line speed to the specified value in megabits per second +(Mbps). If this parameter is not specified or is set to 0 and the link +partner is set to auto-negotiate, the board will auto-detect the correct +speed. Duplex should also be set when Speed is set to either 10 or 100. + +TxDescriptors +------------- + +:Valid Range: + - 48-256 for 82542 and 82543-based adapters + - 48-4096 for all other supported adapters +:Default Value: 256 + +This value is the number of transmit descriptors allocated by the driver. +Increasing this value allows the driver to queue more transmits. Each +descriptor is 16 bytes. + +NOTE: + Depending on the available system resources, the request for a + higher number of transmit descriptors may be denied. In this case, + use a lower number. + +TxIntDelay +---------- + +:Valid Range: 0-65535 (0=off) +:Default Value: 8 + +This value delays the generation of transmit interrupts in units of +1.024 microseconds. Transmit interrupt reduction can improve CPU +efficiency if properly tuned for specific network traffic. If the +system is reporting dropped transmits, this value may be set too high +causing the driver to run out of available transmit descriptors. + +TxAbsIntDelay +------------- + +(This parameter is supported only on 82540, 82545 and later adapters.) + +:Valid Range: 0-65535 (0=off) +:Default Value: 32 + +This value, in units of 1.024 microseconds, limits the delay in which a +transmit interrupt is generated. Useful only if TxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is sent on the wire within the set amount of time. Proper tuning, +along with TxIntDelay, may improve traffic throughput in specific +network conditions. + +XsumRX +------ + +(This parameter is NOT supported on the 82542-based adapter.) + +:Valid Range: 0-1 +:Default Value: 1 + +A value of '1' indicates that the driver should enable IP checksum +offload for received packets (both UDP and TCP) to the adapter hardware. + +Copybreak +--------- + +:Valid Range: 0-xxxxxxx (0=off) +:Default Value: 256 +:Usage: modprobe e1000.ko copybreak=128 + +Driver copies all packets below or equaling this size to a fresh RX +buffer before handing it up the stack. + +This parameter is different than other parameters, in that it is a +single (not 1,1,1 etc.) parameter applied to all driver instances and +it is also available during runtime at +/sys/module/e1000/parameters/copybreak + +SmartPowerDownEnable +-------------------- + +:Valid Range: 0-1 +:Default Value: 0 (disabled) + +Allows PHY to turn off in lower power states. The user can turn off +this parameter in supported chipsets. + +Speed and Duplex Configuration +============================== + +Three keywords are used to control the speed and duplex configuration. +These keywords are Speed, Duplex, and AutoNeg. + +If the board uses a fiber interface, these keywords are ignored, and the +fiber interface board only links at 1000 Mbps full-duplex. + +For copper-based boards, the keywords interact as follows: + +- The default operation is auto-negotiate. The board advertises all + supported speed and duplex combinations, and it links at the highest + common speed and duplex mode IF the link partner is set to auto-negotiate. + +- If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps + is advertised (The 1000BaseT spec requires auto-negotiation.) + +- If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- + negotiation is disabled, and the AutoNeg parameter is ignored. Partner + SHOULD also be forced. + +The AutoNeg parameter is used when more control is required over the +auto-negotiation process. It should be used when you wish to control which +speed and duplex combinations are advertised during the auto-negotiation +process. + +The parameter may be specified as either a decimal or hexadecimal value as +determined by the bitmap below. + +============== ====== ====== ======= ======= ====== ====== ======= ====== +Bit position 7 6 5 4 3 2 1 0 +Decimal Value 128 64 32 16 8 4 2 1 +Hex value 80 40 20 10 8 4 2 1 +Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10 +Duplex Full Full Half Full Half +============== ====== ====== ======= ======= ====== ====== ======= ====== + +Some examples of using AutoNeg:: + + modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half) + modprobe e1000 AutoNeg=1 (Same as above) + modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full) + modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full) + modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half) + modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100 + Half) + modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full) + modprobe e1000 AutoNeg=32 (Same as above) + +Note that when this parameter is used, Speed and Duplex must not be specified. + +If the link partner is forced to a specific speed and duplex, then this +parameter should not be used. Instead, use the Speed and Duplex parameters +previously mentioned to force the adapter to the same speed and duplex. + +Additional Configurations +========================= + +Jumbo Frames +------------ + + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example:: + + ifconfig eth<x> mtu 9000 up + + This setting is not saved across reboots. It can be made permanent if + you add:: + + MTU=9000 + + to the file /etc/sysconfig/network-scripts/ifcfg-eth<x>. This example + applies to the Red Hat distributions; other distributions may store this + setting in a different location. + +Notes: + Degradation in throughput performance may be observed in some Jumbo frames + environments. If this is observed, increasing the application's socket buffer + size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. + See the specific application manual and /usr/src/linux*/Documentation/ + networking/ip-sysctl.txt for more details. + + - The maximum MTU setting for Jumbo Frames is 16110. This value coincides + with the maximum Jumbo Frames size of 16128. + + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + - Adapters based on the Intel(R) 82542 and 82573V/E controller do not + support Jumbo Frames. These correspond to the following product names:: + + Intel(R) PRO/1000 Gigabit Server Adapter + Intel(R) PRO/1000 PM Network Connection + +ethtool +------- + + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 1.6 or later is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool/ + +Enabling Wake on LAN* (WoL) +--------------------------- + + WoL is configured through the ethtool* utility. + + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the e1000 driver must be + loaded when shutting down or rebooting the system. + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/e1000e.txt b/Documentation/networking/e1000e.txt new file mode 100644 index 000000000..12089547b --- /dev/null +++ b/Documentation/networking/e1000e.txt @@ -0,0 +1,312 @@ +Linux* Driver for Intel(R) Ethernet Network Connection +====================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Command Line Parameters +- Additional Configurations +- Support + +Identifying Your Adapter +======================== + +The e1000e driver supports all PCI Express Intel(R) Gigabit Network +Connections, except those that are 82575, 82576 and 82580-based*. + +* NOTE: The Intel(R) PRO/1000 P Dual Port Server Adapter is supported by + the e1000 driver, not the e1000e driver due to the 82546 part being used + behind a PCI Express bridge. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the +networking link on the left to search for your adapter: + + http://support.intel.com/support/go/network/adapter/home.htm + +Command Line Parameters +======================= + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +NOTES: For more information about the InterruptThrottleRate, + RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay + parameters, see the application note at: + http://www.intel.com/design/network/applnots/ap450.htm + +InterruptThrottleRate +--------------------- +Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, + 4=simplified balancing) +Default Value: 3 + +The driver can limit the amount of interrupts per second that the adapter +will generate for incoming packets. It does this by writing a value to the +adapter that is based on the maximum amount of interrupts that the adapter +will generate per second. + +Setting InterruptThrottleRate to a value greater or equal to 100 +will program the adapter to send out a maximum of that many interrupts +per second, even if more packets have come in. This reduces interrupt +load on the system and can lower CPU utilization under heavy load, +but will increase latency as packets are not processed as quickly. + +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types, but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. + +The driver has two adaptive modes (setting 1 or 3) in which +it dynamically adjusts the InterruptThrottleRate value based on the traffic +that it receives. After determining the type of incoming traffic in the last +timeframe, it will adjust the InterruptThrottleRate to an appropriate value +for that traffic. + +The algorithm classifies the incoming traffic every interval into +classes. Once the class is determined, the InterruptThrottleRate value is +adjusted to suit that traffic type the best. There are three classes defined: +"Bulk traffic", for large amounts of packets of normal size; "Low latency", +for small amounts of traffic and/or a significant percentage of small +packets; and "Lowest latency", for almost completely small packets or +minimal traffic. + +In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 +for traffic that falls in class "Bulk traffic". If traffic falls in the "Low +latency" or "Lowest latency" class, the InterruptThrottleRate is increased +stepwise to 20000. This default mode is suitable for most applications. + +For situations where low latency is vital such as cluster or +grid computing, the algorithm can reduce latency even more when +InterruptThrottleRate is set to mode 1. In this mode, which operates +the same as mode 3, the InterruptThrottleRate will be increased stepwise to +70000 for traffic in class "Lowest latency". + +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the +interrupt rate will drop as low as 2000 interrupts per second. If the +traffic is mostly transmit or mostly receive, the interrupt rate could +be as high as 8000. + +Setting InterruptThrottleRate to 0 turns off any interrupt moderation +and may improve small packet latency, but is generally not suitable +for bulk throughput traffic. + +NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and + RxAbsIntDelay parameters. In other words, minimizing the receive + and/or transmit absolute delays does not force the controller to + generate more interrupts than what the Interrupt Throttle Rate + allows. + +NOTE: When e1000e is loaded with default settings and multiple adapters + are in use simultaneously, the CPU utilization may increase non- + linearly. In order to limit the CPU utilization without impacting + the overall throughput, we recommend that you load the driver as + follows: + + modprobe e1000e InterruptThrottleRate=3000,3000,3000 + + This sets the InterruptThrottleRate to 3000 interrupts/sec for + the first, second, and third instances of the driver. The range + of 2000 to 3000 interrupts per second works on a majority of + systems and is a good starting point, but the optimal value will + be platform-specific. If CPU utilization is not a concern, use + RX_POLLING (NAPI) and default driver settings. + +RxIntDelay +---------- +Valid Range: 0-65535 (0=off) +Default Value: 0 + +This value delays the generation of receive interrupts in units of 1.024 +microseconds. Receive interrupt reduction can improve CPU efficiency if +properly tuned for specific network traffic. Increasing this value adds +extra latency to frame reception and can end up decreasing the throughput +of TCP traffic. If the system is reporting dropped receives, this value +may be set too high, causing the driver to run out of available receive +descriptors. + +CAUTION: When setting RxIntDelay to a value other than 0, adapters may + hang (stop transmitting) under certain network conditions. If + this occurs a NETDEV WATCHDOG message is logged in the system + event log. In addition, the controller is automatically reset, + restoring the network connection. To eliminate the potential + for the hang ensure that RxIntDelay is set to 0. + +RxAbsIntDelay +------------- +Valid Range: 0-65535 (0=off) +Default Value: 8 + +This value, in units of 1.024 microseconds, limits the delay in which a +receive interrupt is generated. Useful only if RxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is received within the set amount of time. Proper tuning, +along with RxIntDelay, may improve traffic throughput in specific network +conditions. + +TxIntDelay +---------- +Valid Range: 0-65535 (0=off) +Default Value: 8 + +This value delays the generation of transmit interrupts in units of +1.024 microseconds. Transmit interrupt reduction can improve CPU +efficiency if properly tuned for specific network traffic. If the +system is reporting dropped transmits, this value may be set too high +causing the driver to run out of available transmit descriptors. + +TxAbsIntDelay +------------- +Valid Range: 0-65535 (0=off) +Default Value: 32 + +This value, in units of 1.024 microseconds, limits the delay in which a +transmit interrupt is generated. Useful only if TxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is sent on the wire within the set amount of time. Proper tuning, +along with TxIntDelay, may improve traffic throughput in specific +network conditions. + +Copybreak +--------- +Valid Range: 0-xxxxxxx (0=off) +Default Value: 256 + +Driver copies all packets below or equaling this size to a fresh RX +buffer before handing it up the stack. + +This parameter is different than other parameters, in that it is a +single (not 1,1,1 etc.) parameter applied to all driver instances and +it is also available during runtime at +/sys/module/e1000e/parameters/copybreak + +SmartPowerDownEnable +-------------------- +Valid Range: 0-1 +Default Value: 0 (disabled) + +Allows PHY to turn off in lower power states. The user can set this parameter +in supported chipsets. + +KumeranLockLoss +--------------- +Valid Range: 0-1 +Default Value: 1 (enabled) + +This workaround skips resetting the PHY at shutdown for the initial +silicon releases of ICH8 systems. + +IntMode +------- +Valid Range: 0-2 (0=legacy, 1=MSI, 2=MSI-X) +Default Value: 2 + +Allows changing the interrupt mode at module load time, without requiring a +recompile. If the driver load fails to enable a specific interrupt mode, the +driver will try other interrupt modes, from least to most compatible. The +interrupt order is MSI-X, MSI, Legacy. If specifying MSI (IntMode=1) +interrupts, only MSI and Legacy will be attempted. + +CrcStripping +------------ +Valid Range: 0-1 +Default Value: 1 (enabled) + +Strip the CRC from received packets before sending up the network stack. If +you have a machine with a BMC enabled but cannot receive IPMI traffic after +loading or enabling the driver, try disabling this feature. + +WriteProtectNVM +--------------- +Valid Range: 0,1 +Default Value: 1 + +If set to 1, configure the hardware to ignore all write/erase cycles to the +GbE region in the ICHx NVM (in order to prevent accidental corruption of the +NVM). This feature can be disabled by setting the parameter to 0 during initial +driver load. +NOTE: The machine must be power cycled (full off/on) when enabling NVM writes +via setting the parameter to zero. Once the NVM has been locked (via the +parameter at 1 when the driver loads) it cannot be unlocked except via power +cycle. + +Additional Configurations +========================= + + Jumbo Frames + ------------ + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example: + + ifconfig eth<x> mtu 9000 up + + This setting is not saved across reboots. + + Notes: + + - The maximum MTU setting for Jumbo Frames is 9216. This value coincides + with the maximum Jumbo Frames size of 9234 bytes. + + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + - Some adapters limit Jumbo Frames sized packets to a maximum of + 4096 bytes and some adapters do not support Jumbo Frames. + + - Jumbo Frames cannot be configured on an 82579-based Network device, if + MACSec is enabled on the system. + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. We + strongly recommend downloading the latest version of ethtool at: + + https://kernel.org/pub/software/network/ethtool/ + + NOTE: When validating enable/disable tests on some parts (82578, for example) + you need to add a few seconds between tests when working with ethtool. + + Speed and Duplex + ---------------- + Speed and Duplex are configured through the ethtool* utility. For + instructions, refer to the ethtool man page. + + Enabling Wake on LAN* (WoL) + --------------------------- + WoL is configured through the ethtool* utility. For instructions on + enabling WoL with ethtool, refer to the ethtool man page. + + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the e1000e driver must be + loaded when shutting down or rebooting the system. + + In most cases Wake On LAN is only supported on port A for multiple port + adapters. To verify if a port supports Wake on Lan run ethtool eth<X>. + +Support +======= + +For general information, go to the Intel support website at: + + www.intel.com/support/ + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ena.txt b/Documentation/networking/ena.txt new file mode 100644 index 000000000..2b4b6f57e --- /dev/null +++ b/Documentation/networking/ena.txt @@ -0,0 +1,305 @@ +Linux kernel driver for Elastic Network Adapter (ENA) family: +============================================================= + +Overview: +========= +ENA is a networking interface designed to make good use of modern CPU +features and system architectures. + +The ENA device exposes a lightweight management interface with a +minimal set of memory mapped registers and extendable command set +through an Admin Queue. + +The driver supports a range of ENA devices, is link-speed independent +(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has +a negotiated and extendable feature set. + +Some ENA devices support SR-IOV. This driver is used for both the +SR-IOV Physical Function (PF) and Virtual Function (VF) devices. + +ENA devices enable high speed and low overhead network traffic +processing by providing multiple Tx/Rx queue pairs (the maximum number +is advertised by the device via the Admin Queue), a dedicated MSI-X +interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, +and CPU cacheline optimized data placement. + +The ENA driver supports industry standard TCP/IP offload features such +as checksum offload and TCP transmit segmentation offload (TSO). +Receive-side scaling (RSS) is supported for multi-core scaling. + +The ENA driver and its corresponding devices implement health +monitoring mechanisms such as watchdog, enabling the device and driver +to recover in a manner transparent to the application, as well as +debug logs. + +Some of the ENA devices support a working mode called Low-latency +Queue (LLQ), which saves several more microseconds. + +Supported PCI vendor ID/device IDs: +=================================== +1d0f:0ec2 - ENA PF +1d0f:1ec2 - ENA PF with LLQ support +1d0f:ec20 - ENA VF +1d0f:ec21 - ENA VF with LLQ support + +ENA Source Code Directory Structure: +==================================== +ena_com.[ch] - Management communication layer. This layer is + responsible for the handling all the management + (admin) communication between the device and the + driver. +ena_eth_com.[ch] - Tx/Rx data path. +ena_admin_defs.h - Definition of ENA management interface. +ena_eth_io_defs.h - Definition of ENA data path interface. +ena_common_defs.h - Common definitions for ena_com layer. +ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. +ena_netdev.[ch] - Main Linux kernel driver. +ena_syfsfs.[ch] - Sysfs files. +ena_ethtool.c - ethtool callbacks. +ena_pci_id_tbl.h - Supported device IDs. + +Management Interface: +===================== +ENA management interface is exposed by means of: +- PCIe Configuration Space +- Device Registers +- Admin Queue (AQ) and Admin Completion Queue (ACQ) +- Asynchronous Event Notification Queue (AENQ) + +ENA device MMIO Registers are accessed only during driver +initialization and are not involved in further normal device +operation. + +AQ is used for submitting management commands, and the +results/responses are reported asynchronously through ACQ. + +ENA introduces a very small set of management commands with room for +vendor-specific extensions. Most of the management operations are +framed in a generic Get/Set feature command. + +The following admin queue commands are supported: +- Create I/O submission queue +- Create I/O completion queue +- Destroy I/O submission queue +- Destroy I/O completion queue +- Get feature +- Set feature +- Configure AENQ +- Get statistics + +Refer to ena_admin_defs.h for the list of supported Get/Set Feature +properties. + +The Asynchronous Event Notification Queue (AENQ) is a uni-directional +queue used by the ENA device to send to the driver events that cannot +be reported using ACQ. AENQ events are subdivided into groups. Each +group may have multiple syndromes, as shown below + +The events are: + Group Syndrome + Link state change - X - + Fatal error - X - + Notification Suspend traffic + Notification Resume traffic + Keep-Alive - X - + +ACQ and AENQ share the same MSI-X vector. + +Keep-Alive is a special mechanism that allows monitoring of the +device's health. The driver maintains a watchdog (WD) handler which, +if fired, logs the current state and statistics then resets and +restarts the ENA device and driver. A Keep-Alive event is delivered by +the device every second. The driver re-arms the WD upon reception of a +Keep-Alive event. A missed Keep-Alive event causes the WD handler to +fire. + +Data Path Interface: +==================== +I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx +SQ correspondingly). Each SQ has a completion queue (CQ) associated +with it. + +The SQs and CQs are implemented as descriptor rings in contiguous +physical memory. + +The ENA driver supports two Queue Operation modes for Tx SQs: +- Regular mode + * In this mode the Tx SQs reside in the host's memory. The ENA + device fetches the ENA Tx descriptors and packet data from host + memory. +- Low Latency Queue (LLQ) mode or "push-mode". + * In this mode the driver pushes the transmit descriptors and the + first 128 bytes of the packet directly to the ENA device memory + space. The rest of the packet payload is fetched by the + device. For this operation mode, the driver uses a dedicated PCI + device memory BAR, which is mapped with write-combine capability. + +The Rx SQs support only the regular mode. + +Note: Not all ENA devices support LLQ, and this feature is negotiated + with the device upon initialization. If the ENA device does not + support LLQ mode, the driver falls back to the regular mode. + +The driver supports multi-queue for both Tx and Rx. This has various +benefits: +- Reduced CPU/thread/process contention on a given Ethernet interface. +- Cache miss rate on completion is reduced, particularly for data + cache lines that hold the sk_buff structures. +- Increased process-level parallelism when handling received packets. +- Increased data cache hit rate, by steering kernel processing of + packets to the CPU, where the application thread consuming the + packet is running. +- In hardware interrupt re-direction. + +Interrupt Modes: +================ +The driver assigns a single MSI-X vector per queue pair (for both Tx +and Rx directions). The driver assigns an additional dedicated MSI-X vector +for management (for ACQ and AENQ). + +Management interrupt registration is performed when the Linux kernel +probes the adapter, and it is de-registered when the adapter is +removed. I/O queue interrupt registration is performed when the Linux +interface of the adapter is opened, and it is de-registered when the +interface is closed. + +The management interrupt is named: + ena-mgmnt@pci:<PCI domain:bus:slot.function> +and for each queue pair, an interrupt is named: + <interface name>-Tx-Rx-<queue index> + +The ENA device operates in auto-mask and auto-clear interrupt +modes. That is, once MSI-X is delivered to the host, its Cause bit is +automatically cleared and the interrupt is masked. The interrupt is +unmasked by the driver after NAPI processing is complete. + +Interrupt Moderation: +===================== +ENA driver and device can operate in conventional or adaptive interrupt +moderation mode. + +In conventional mode the driver instructs device to postpone interrupt +posting according to static interrupt delay value. The interrupt delay +value can be configured through ethtool(8). The following ethtool +parameters are supported by the driver: tx-usecs, rx-usecs + +In adaptive interrupt moderation mode the interrupt delay value is +updated by the driver dynamically and adjusted every NAPI cycle +according to the traffic nature. + +By default ENA driver applies adaptive coalescing on Rx traffic and +conventional coalescing on Tx traffic. + +Adaptive coalescing can be switched on/off through ethtool(8) +adaptive_rx on|off parameter. + +The driver chooses interrupt delay value according to the number of +bytes and packets received between interrupt unmasking and interrupt +posting. The driver uses interrupt delay table that subdivides the +range of received bytes/packets into 5 levels and assigns interrupt +delay value to each level. + +The user can enable/disable adaptive moderation, modify the interrupt +delay table and restore its default values through sysfs. + +The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK +and can be configured by the ETHTOOL_STUNABLE command of the +SIOCETHTOOL ioctl. + +SKB: +The driver-allocated SKB for frames received from Rx handling using +NAPI context. The allocation method depends on the size of the packet. +If the frame length is larger than rx_copybreak, napi_get_frags() +is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer +content is copied (by CPU) to the SKB, and the buffer is recycled. + +Statistics: +=========== +The user can obtain ENA device and driver statistics using ethtool. +The driver can collect regular or extended statistics (including +per-queue stats) from the device. + +In addition the driver logs the stats to syslog upon device reset. + +MTU: +==== +The driver supports an arbitrarily large MTU with a maximum that is +negotiated with the device. The driver configures MTU using the +SetFeature command (ENA_ADMIN_MTU property). The user can change MTU +via ip(8) and similar legacy tools. + +Stateless Offloads: +=================== +The ENA driver supports: +- TSO over IPv4/IPv6 +- TSO with ECN +- IPv4 header checksum offload +- TCP/UDP over IPv4/IPv6 checksum offloads + +RSS: +==== +- The ENA device supports RSS that allows flexible Rx traffic + steering. +- Toeplitz and CRC32 hash functions are supported. +- Different combinations of L2/L3/L4 fields can be configured as + inputs for hash functions. +- The driver configures RSS settings using the AQ SetFeature command + (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and + ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). +- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash + function delivered in the Rx CQ descriptor is set in the received + SKB. +- The user can provide a hash key, hash function, and configure the + indirection table through ethtool(8). + +DATA PATH: +========== +Tx: +--- +end_start_xmit() is called by the stack. This function does the following: +- Maps data buffers (skb->data and frags). +- Populates ena_buf for the push buffer (if the driver and device are + in push mode.) +- Prepares ENA bufs for the remaining frags. +- Allocates a new request ID from the empty req_id ring. The request + ID is the index of the packet in the Tx info. This is used for + out-of-order TX completions. +- Adds the packet to the proper place in the Tx ring. +- Calls ena_com_prepare_tx(), an ENA communication layer that converts + the ena_bufs to ENA descriptors (and adds meta ENA descriptors as + needed.) + * This function also copies the ENA descriptors and the push buffer + to the Device memory space (if in push mode.) +- Writes doorbell to the ENA device. +- When the ENA device finishes sending the packet, a completion + interrupt is raised. +- The interrupt handler schedules NAPI. +- The ena_clean_tx_irq() function is called. This function handles the + completion descriptors generated by the ENA, with a single + completion descriptor per completed packet. + * req_id is retrieved from the completion descriptor. The tx_info of + the packet is retrieved via the req_id. The data buffers are + unmapped and req_id is returned to the empty req_id ring. + * The function stops when the completion descriptors are completed or + the budget is reached. + +Rx: +--- +- When a packet is received from the ENA device. +- The interrupt handler schedules NAPI. +- The ena_clean_rx_irq() function is called. This function calls + ena_rx_pkt(), an ENA communication layer function, which returns the + number of descriptors used for a new unhandled packet, and zero if + no new packet is found. +- Then it calls the ena_clean_rx_irq() function. +- ena_eth_rx_skb() checks packet length: + * If the packet is small (len < rx_copybreak), the driver allocates + a SKB for the new packet, and copies the packet payload into the + SKB data buffer. + - In this way the original data buffer is not passed to the stack + and is reused for future Rx packets. + * Otherwise the function unmaps the Rx buffer, then allocates the + new SKB structure and hooks the Rx buffer to the SKB frags. +- The new SKB is updated with the necessary information (protocol, + checksum hw verify result, etc.), and then passed to the network + stack, using the NAPI interface function napi_gro_receive(). diff --git a/Documentation/networking/eql.txt b/Documentation/networking/eql.txt new file mode 100644 index 000000000..0f1550150 --- /dev/null +++ b/Documentation/networking/eql.txt @@ -0,0 +1,528 @@ + EQL Driver: Serial IP Load Balancing HOWTO + Simon "Guru Aleph-Null" Janes, simon@ncm.com + v1.1, February 27, 1995 + + This is the manual for the EQL device driver. EQL is a software device + that lets you load-balance IP serial links (SLIP or uncompressed PPP) + to increase your bandwidth. It will not reduce your latency (i.e. ping + times) except in the case where you already have lots of traffic on + your link, in which it will help them out. This driver has been tested + with the 1.1.75 kernel, and is known to have patched cleanly with + 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch + which was only created to patch cleanly in the very latest kernel + source trees. (Yes, it worked fine.) + + 1. Introduction + + Which is worse? A huge fee for a 56K leased line or two phone lines? + It's probably the former. If you find yourself craving more bandwidth, + and have a ISP that is flexible, it is now possible to bind modems + together to work as one point-to-point link to increase your + bandwidth. All without having to have a special black box on either + side. + + + The eql driver has only been tested with the Livingston PortMaster-2e + terminal server. I do not know if other terminal servers support load- + balancing, but I do know that the PortMaster does it, and does it + almost as well as the eql driver seems to do it (-- Unfortunately, in + my testing so far, the Livingston PortMaster 2e's load-balancing is a + good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps + and 14.4 Kbps connection. However, I am not sure that it really is + the PortMaster, or if it's Linux's TCP drivers. I'm told that Linux's + TCP implementation is pretty fast though.--) + + + I suggest to ISPs out there that it would probably be fair to charge + a load-balancing client 75% of the cost of the second line and 50% of + the cost of the third line etc... + + + Hey, we can all dream you know... + + + 2. Kernel Configuration + + Here I describe the general steps of getting a kernel up and working + with the eql driver. From patching, building, to installing. + + + 2.1. Patching The Kernel + + If you do not have or cannot get a copy of the kernel with the eql + driver folded into it, get your copy of the driver from + ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz. + Unpack this archive someplace obvious like /usr/local/src/. It will + create the following files: + + + + ______________________________________________________________________ + -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY + -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch + -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave + -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c + ______________________________________________________________________ + + Unpack a recent kernel (something after 1.1.92) someplace convenient + like say /usr/src/linux-1.1.92.eql. Use symbolic links to point + /usr/src/linux to this development directory. + + + Apply the patch by running the commands: + + + ______________________________________________________________________ + cd /usr/src + patch </usr/local/src/eql-1.1/eql-1.1.patch + ______________________________________________________________________ + + + + + + 2.2. Building The Kernel + + After patching the kernel, run make config and configure the kernel + for your hardware. + + + After configuration, make and install according to your habit. + + + 3. Network Configuration + + So far, I have only used the eql device with the DSLIP SLIP connection + manager by Matt Dillon (-- "The man who sold his soul to code so much + so quickly."--) . How you configure it for other "connection" + managers is up to you. Most other connection managers that I've seen + don't do a very good job when it comes to handling more than one + connection. + + + 3.1. /etc/rc.d/rc.inet1 + + In rc.inet1, ifconfig the eql device to the IP address you usually use + for your machine, and the MTU you prefer for your SLIP lines. One + could argue that MTU should be roughly half the usual size for two + modems, one-third for three, one-fourth for four, etc... But going + too far below 296 is probably overkill. Here is an example ifconfig + command that sets up the eql device: + + + + ______________________________________________________________________ + ifconfig eql 198.67.33.239 mtu 1006 + ______________________________________________________________________ + + + + + + Once the eql device is up and running, add a static default route to + it in the routing table using the cool new route syntax that makes + life so much easier: + + + + ______________________________________________________________________ + route add default eql + ______________________________________________________________________ + + + 3.2. Enslaving Devices By Hand + + Enslaving devices by hand requires two utility programs: eql_enslave + and eql_emancipate (-- eql_emancipate hasn't been written because when + an enslaved device "dies", it is automatically taken out of the queue. + I haven't found a good reason to write it yet... other than for + completeness, but that isn't a good motivator is it?--) + + + The syntax for enslaving a device is "eql_enslave <master-name> + <slave-name> <estimated-bps>". Here are some example enslavings: + + + + ______________________________________________________________________ + eql_enslave eql sl0 28800 + eql_enslave eql ppp0 14400 + eql_enslave eql sl1 57600 + ______________________________________________________________________ + + + + + + When you want to free a device from its life of slavery, you can + either down the device with ifconfig (eql will automatically bury the + dead slave and remove it from its queue) or use eql_emancipate to free + it. (-- Or just ifconfig it down, and the eql driver will take it out + for you.--) + + + + ______________________________________________________________________ + eql_emancipate eql sl0 + eql_emancipate eql ppp0 + eql_emancipate eql sl1 + ______________________________________________________________________ + + + + + + 3.3. DSLIP Configuration for the eql Device + + The general idea is to bring up and keep up as many SLIP connections + as you need, automatically. + + + 3.3.1. /etc/slip/runslip.conf + + Here is an example runslip.conf: + + + + + + + + + + + + + + + + ______________________________________________________________________ + name sl-line-1 + enabled + baud 38400 + mtu 576 + ducmd -e /etc/slip/dialout/cua2-288.xp -t 9 + command eql_enslave eql $interface 28800 + address 198.67.33.239 + line /dev/cua2 + + name sl-line-2 + enabled + baud 38400 + mtu 576 + ducmd -e /etc/slip/dialout/cua3-288.xp -t 9 + command eql_enslave eql $interface 28800 + address 198.67.33.239 + line /dev/cua3 + ______________________________________________________________________ + + + + + + 3.4. Using PPP and the eql Device + + I have not yet done any load-balancing testing for PPP devices, mainly + because I don't have a PPP-connection manager like SLIP has with + DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance: + make sure you have asyncmap set to something so that control + characters are not escaped. + + + I tried to fix up a PPP script/system for redialing lost PPP + connections for use with the eql driver the weekend of Feb 25-26 '95 + (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this + year. + + + 4. About the Slave Scheduler Algorithm + + The slave scheduler probably could be replaced with a dozen other + things and push traffic much faster. The formula in the current set + up of the driver was tuned to handle slaves with wildly different + bits-per-second "priorities". + + + All testing I have done was with two 28.8 V.FC modems, one connecting + at 28800 bps or slower, and the other connecting at 14400 bps all the + time. + + + One version of the scheduler was able to push 5.3 K/s through the + 28800 and 14400 connections, but when the priorities on the links were + very wide apart (57600 vs. 14400) the "faster" modem received all + traffic and the "slower" modem starved. + + + 5. Testers' Reports + + Some people have experimented with the eql device with newer + kernels (than 1.1.75). I have since updated the driver to patch + cleanly in newer kernels because of the removal of the old "slave- + balancing" driver config option. + + + o icee from LinuxNET patched 1.1.86 without any rejects and was able + to boot the kernel and enslave a couple of ISDN PPP links. + + 5.1. Randolph Bentson's Test Report + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995 + Date: Tue, 7 Feb 95 22:57 PST + From: Randolph Bentson <bentson@grieg.seaslug.org> + To: guru@ncm.com + Subject: EQL driver tests + + + I have been checking out your eql driver. (Nice work, that!) + Although you may already done this performance testing, here + are some data I've discovered. + + Randolph Bentson + bentson@grieg.seaslug.org + + --------------------------------------------------------- + + + A pseudo-device driver, EQL, written by Simon Janes, can be used + to bundle multiple SLIP connections into what appears to be a + single connection. This allows one to improve dial-up network + connectivity gradually, without having to buy expensive DSU/CSU + hardware and services. + + I have done some testing of this software, with two goals in + mind: first, to ensure it actually works as described and + second, as a method of exercising my device driver. + + The following performance measurements were derived from a set + of SLIP connections run between two Linux systems (1.1.84) using + a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y. + (Ports 0,1,2,3 were used. A later configuration will distribute + port selection across the different Cirrus chips on the boards.) + Once a link was established, I timed a binary ftp transfer of + 289284 bytes of data. If there were no overhead (packet headers, + inter-character and inter-packet delays, etc.) the transfers + would take the following times: + + bits/sec seconds + 345600 8.3 + 234600 12.3 + 172800 16.7 + 153600 18.8 + 76800 37.6 + 57600 50.2 + 38400 75.3 + 28800 100.4 + 19200 150.6 + 9600 301.3 + + A single line running at the lower speeds and with large packets + comes to within 2% of this. Performance is limited for the higher + speeds (as predicted by the Cirrus databook) to an aggregate of + about 160 kbits/sec. The next round of testing will distribute + the load across two or more Cirrus chips. + + The good news is that one gets nearly the full advantage of the + second, third, and fourth line's bandwidth. (The bad news is + that the connection establishment seemed fragile for the higher + speeds. Once established, the connection seemed robust enough.) + + #lines speed mtu seconds theory actual %of + kbit/sec duration speed speed max + 3 115200 900 _ 345600 + 3 115200 400 18.1 345600 159825 46 + 2 115200 900 _ 230400 + 2 115200 600 18.1 230400 159825 69 + 2 115200 400 19.3 230400 149888 65 + 4 57600 900 _ 234600 + 4 57600 600 _ 234600 + 4 57600 400 _ 234600 + 3 57600 600 20.9 172800 138413 80 + 3 57600 900 21.2 172800 136455 78 + 3 115200 600 21.7 345600 133311 38 + 3 57600 400 22.5 172800 128571 74 + 4 38400 900 25.2 153600 114795 74 + 4 38400 600 26.4 153600 109577 71 + 4 38400 400 27.3 153600 105965 68 + 2 57600 900 29.1 115200 99410.3 86 + 1 115200 900 30.7 115200 94229.3 81 + 2 57600 600 30.2 115200 95789.4 83 + 3 38400 900 30.3 115200 95473.3 82 + 3 38400 600 31.2 115200 92719.2 80 + 1 115200 600 31.3 115200 92423 80 + 2 57600 400 32.3 115200 89561.6 77 + 1 115200 400 32.8 115200 88196.3 76 + 3 38400 400 33.5 115200 86353.4 74 + 2 38400 900 43.7 76800 66197.7 86 + 2 38400 600 44 76800 65746.4 85 + 2 38400 400 47.2 76800 61289 79 + 4 19200 900 50.8 76800 56945.7 74 + 4 19200 400 53.2 76800 54376.7 70 + 4 19200 600 53.7 76800 53870.4 70 + 1 57600 900 54.6 57600 52982.4 91 + 1 57600 600 56.2 57600 51474 89 + 3 19200 900 60.5 57600 47815.5 83 + 1 57600 400 60.2 57600 48053.8 83 + 3 19200 600 62 57600 46658.7 81 + 3 19200 400 64.7 57600 44711.6 77 + 1 38400 900 79.4 38400 36433.8 94 + 1 38400 600 82.4 38400 35107.3 91 + 2 19200 900 84.4 38400 34275.4 89 + 1 38400 400 86.8 38400 33327.6 86 + 2 19200 600 87.6 38400 33023.3 85 + 2 19200 400 91.2 38400 31719.7 82 + 4 9600 900 94.7 38400 30547.4 79 + 4 9600 400 106 38400 27290.9 71 + 4 9600 600 110 38400 26298.5 68 + 3 9600 900 118 28800 24515.6 85 + 3 9600 600 120 28800 24107 83 + 3 9600 400 131 28800 22082.7 76 + 1 19200 900 155 19200 18663.5 97 + 1 19200 600 161 19200 17968 93 + 1 19200 400 170 19200 17016.7 88 + 2 9600 600 176 19200 16436.6 85 + 2 9600 900 180 19200 16071.3 83 + 2 9600 400 181 19200 15982.5 83 + 1 9600 900 305 9600 9484.72 98 + 1 9600 600 314 9600 9212.87 95 + 1 9600 400 332 9600 8713.37 90 + + + + + + 5.2. Anthony Healy's Report + + + + + + + + Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST) + From: Antony Healey <ahealey@st.nepean.uws.edu.au> + To: Simon Janes <guru@ncm.com> + Subject: Re: Load Balancing + + Hi Simon, + I've installed your patch and it works great. I have trialed + it over twin SL/IP lines, just over null modems, but I was + able to data at over 48Kb/s [ISDN link -Simon]. I managed a + transfer of up to 7.5 Kbyte/s on one go, but averaged around + 6.4 Kbyte/s, which I think is pretty cool. :) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst new file mode 100644 index 000000000..f0c8483cd --- /dev/null +++ b/Documentation/networking/failover.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +FAILOVER +======== + +Overview +======== + +The failover module provides a generic interface for paravirtual drivers +to register a netdev and a set of ops with a failover instance. The ops +are used as event handlers that get called to handle netdev register/ +unregister/link change/name change events on slave pci ethernet devices +with the same mac address as the failover netdev. + +This enables paravirtual drivers to use a VF as an accelerated low latency +datapath. It also allows live migration of VMs with direct attached VFs by +failing over to the paravirtual datapath when the VF is unplugged. diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt new file mode 100644 index 000000000..fe7193885 --- /dev/null +++ b/Documentation/networking/fib_trie.txt @@ -0,0 +1,145 @@ + LC-trie implementation notes. + +Node types +---------- +leaf + An end node with data. This has a copy of the relevant key, along + with 'hlist' with routing table entries sorted by prefix length. + See struct leaf and struct leaf_info. + +trie node or tnode + An internal node, holding an array of child (leaf or tnode) pointers, + indexed through a subset of the key. See Level Compression. + +A few concepts explained +------------------------ +Bits (tnode) + The number of bits in the key segment used for indexing into the + child array - the "child index". See Level Compression. + +Pos (tnode) + The position (in the key) of the key segment used for indexing into + the child array. See Path Compression. + +Path Compression / skipped bits + Any given tnode is linked to from the child array of its parent, using + a segment of the key specified by the parent's "pos" and "bits" + In certain cases, this tnode's own "pos" will not be immediately + adjacent to the parent (pos+bits), but there will be some bits + in the key skipped over because they represent a single path with no + deviations. These "skipped bits" constitute Path Compression. + Note that the search algorithm will simply skip over these bits when + searching, making it necessary to save the keys in the leaves to + verify that they actually do match the key we are searching for. + +Level Compression / child arrays + the trie is kept level balanced moving, under certain conditions, the + children of a full child (see "full_children") up one level, so that + instead of a pure binary tree, each internal node ("tnode") may + contain an arbitrarily large array of links to several children. + Conversely, a tnode with a mostly empty child array (see empty_children) + may be "halved", having some of its children moved downwards one level, + in order to avoid ever-increasing child arrays. + +empty_children + the number of positions in the child array of a given tnode that are + NULL. + +full_children + the number of children of a given tnode that aren't path compressed. + (in other words, they aren't NULL or leaves and their "pos" is equal + to this tnode's "pos"+"bits"). + + (The word "full" here is used more in the sense of "complete" than + as the opposite of "empty", which might be a tad confusing.) + +Comments +--------- + +We have tried to keep the structure of the code as close to fib_hash as +possible to allow verification and help up reviewing. + +fib_find_node() + A good start for understanding this code. This function implements a + straightforward trie lookup. + +fib_insert_node() + Inserts a new leaf node in the trie. This is bit more complicated than + fib_find_node(). Inserting a new node means we might have to run the + level compression algorithm on part of the trie. + +trie_leaf_remove() + Looks up a key, deletes it and runs the level compression algorithm. + +trie_rebalance() + The key function for the dynamic trie after any change in the trie + it is run to optimize and reorganize. It will walk the trie upwards + towards the root from a given tnode, doing a resize() at each step + to implement level compression. + +resize() + Analyzes a tnode and optimizes the child array size by either inflating + or shrinking it repeatedly until it fulfills the criteria for optimal + level compression. This part follows the original paper pretty closely + and there may be some room for experimentation here. + +inflate() + Doubles the size of the child array within a tnode. Used by resize(). + +halve() + Halves the size of the child array within a tnode - the inverse of + inflate(). Used by resize(); + +fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() + The route manipulation functions. Should conform pretty closely to the + corresponding functions in fib_hash. + +fn_trie_flush() + This walks the full trie (using nextleaf()) and searches for empty + leaves which have to be removed. + +fn_trie_dump() + Dumps the routing table ordered by prefix length. This is somewhat + slower than the corresponding fib_hash function, as we have to walk the + entire trie for each prefix length. In comparison, fib_hash is organized + as one "zone"/hash per prefix length. + +Locking +------- + +fib_lock is used for an RW-lock in the same way that this is done in fib_hash. +However, the functions are somewhat separated for other possible locking +scenarios. It might conceivably be possible to run trie_rebalance via RCU +to avoid read_lock in the fn_trie_lookup() function. + +Main lookup mechanism +--------------------- +fn_trie_lookup() is the main lookup function. + +The lookup is in its simplest form just like fib_find_node(). We descend the +trie, key segment by key segment, until we find a leaf. check_leaf() does +the fib_semantic_match in the leaf's sorted prefix hlist. + +If we find a match, we are done. + +If we don't find a match, we enter prefix matching mode. The prefix length, +starting out at the same as the key length, is reduced one step at a time, +and we backtrack upwards through the trie trying to find a longest matching +prefix. The goal is always to reach a leaf and get a positive result from the +fib_semantic_match mechanism. + +Inside each tnode, the search for longest matching prefix consists of searching +through the child array, chopping off (zeroing) the least significant "1" of +the child index until we find a match or the child index consists of nothing but +zeros. + +At this point we backtrack (t->stats.backtrack++) up the trie, continuing to +chop off part of the key in order to find the longest matching prefix. + +At this point we will repeatedly descend subtries to look for a match, and there +are some optimizations available that can provide us with "shortcuts" to avoid +descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. + +To alleviate any doubts about the correctness of the route selection process, +a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which +gives userland access to fib_lookup(). diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt new file mode 100644 index 000000000..e6b4ebb2b --- /dev/null +++ b/Documentation/networking/filter.txt @@ -0,0 +1,1476 @@ +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= + +Introduction +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, but when we speak of BPF or LSF in Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +setup a socket, attach a filter, lock it then drop privileges and be +assured that the filter will be kept until the socket is closed. + +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places +such as team driver, PTP code, etc where BPF is being used. + + [1] Documentation/userspace-api/seccomp_filter.rst + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include <linux/filter.h> which contains the +following relevant structures: + +struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ +}; + +Such a structure is assembled as an array of 4-tuples, that contains +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code. + +struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; +}; + +For socket filtering, a pointer to this structure (as shown in +follow-up example) is being passed to the kernel through setsockopt(2). + +Example +------- + +#include <sys/socket.h> +#include <sys/types.h> +#include <arpa/inet.h> +#include <linux/if_ether.h> +/* ... */ + +/* From the example above: tcpdump -i em1 port 22 -dd */ +struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, +}; + +struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, +}; + +sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); +if (sock < 0) + /* ... bail out ... */ + +ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); +if (ret < 0) + /* ... bail out ... */ + +/* ... */ +close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. + +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Unless i) using/linking to libpcap is not an option, ii) the required BPF +filters use Linux extensions that are not supported by libpcap's compiler, +iii) a filter might be more complex and not cleanly implementable with +libpcap's compiler, or iv) particular filter codes should be optimized +differently than libpcap's internal compiler does; then in such cases +writing such a filter "by hand" can be of an alternative. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. + +BPF engine and instruction set +------------------------------ + +Under tools/bpf/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. + +The BPF architecture consists of the following basic elements: + + Element Description + + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned): + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. what their underlying +opcodes as defined in linux/filter.h stand for: + + Instruction Addressing mode Description + + ld 1, 2, 3, 4, 10 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 10 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8 Jump on A == k + jneq 8 Jump on A != k + jne 8 Jump on A != k + jlt 8 Jump on A < k + jle 8 Jump on A <= k + jgt 7, 8 Jump on A > k + jge 7, 8 Jump on A >= k + jset 7, 8 Jump on A & k + + add 0, 4 A + <x> + sub 0, 4 A - <x> + mul 0, 4 A * <x> + div 0, 4 A / <x> + mod 0, 4 A % <x> + neg !A + and 0, 4 A & <x> + or 0, 4 A | <x> + xor 0, 4 A ^ <x> + lsh 0, 4 A << <x> + rsh 0, 4 A >> <x> + + tax Copy A into X + txa Copy X into A + + ret 4, 9 Return + +The next table shows addressing formats from the 2nd column: + + Addressing mode Syntax Description + + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 #k,Lt Jump to Lt if predicate is true + 9 a/%a Accumulator A + 10 extension BPF extension + +The Linux kernel also has a couple of BPF extensions that are used along +with the class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + Extension Description + + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci skb_vlan_tag_get(skb) + vlan_avail skb_vlan_tag_present(skb) + vlan_tpid skb->vlan_proto + rand prandom_u32() + +These extensions can also be prefixed with '#'. +Examples for low-level BPF: + +** ARP packets: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +** IPv4 TCP packets: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +** (Accelerated) VLAN w/ id 10: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +** icmp random packet sampling, 1 in 4 + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +** SECCOMP filter example: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understands and can directly be loaded with. Example with above +ARP code: + +$ ./bpf_asm foo +4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output: + +$ ./bpf_asm -c foo +{ 0x28, 0, 0, 0x0000000c }, +{ 0x15, 0, 1, 0x00000806 }, +{ 0x06, 0, 0, 0xffffffff }, +{ 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and to do BPF machine register dumps. + +Starting bpf_dbg is trivial and just requires issuing: + +# ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +> load pcap foo.pcap + Loads standard tcpdump pcap file. + +> run [<n>] +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +> disassemble +l0: ldh [12] +l1: jeq #0x800, l2, l5 +l2: ldb [23] +l3: jeq #0x1, l4, l5 +l4: ret #0xffff +l5: ret #0 + Prints out BPF code disassembly. + +> dump +/* { op, jt, jf, k }, */ +{ 0x28, 0, 0, 0x0000000c }, +{ 0x15, 0, 3, 0x00000800 }, +{ 0x30, 0, 0, 0x00000017 }, +{ 0x15, 0, 1, 0x00000001 }, +{ 0x06, 0, 0, 0x0000ffff }, +{ 0x06, 0, 0, 0000000000 }, + Prints out C-style BPF code dump. + +> breakpoint 0 +breakpoint at: l0: ldh [12] +> breakpoint 1 +breakpoint at: l1: jeq #0x800, l2, l5 + ... + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + > run + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + +> breakpoint +breakpoints: 0 1 + Prints currently set breakpoints. + +> step [-<n>, +<n>] + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time, a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +> select <n> + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +> quit +# + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC, +ARM, ARM64, MIPS and s390 and can be enabled through CONFIG_BPF_JIT. The JIT +compiler is transparently invoked for each attached filter from user space +or for internal kernel users if it has been previously enabled by root: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg: + +[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f +[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 +[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 +[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 +[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 +[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will return in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. + +In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump: + +# ./bpf_jit_disasm +70 bytes emitted from JIT compiler (pass:3, flen:6) +ffffffffa0069c8f + <x>: + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + +Issuing option `-o` will "annotate" opcodes to resulting assembler +instructions, which can be very useful for JIT developers: + +# ./bpf_jit_disasm -o +70 bytes emitted from JIT compiler (pass:3, flen:6) +ffffffffa0069c8f + <x>: + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using bpf_prog_create() for setting up the filter, resp. +bpf_prog_destroy() for destroying it. The macro +BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct bpf_prog that we +got from bpf_prog_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from bpf_check_classic() apply +before a conversion to the new layout is being done behind the scenes! + +Currently, the classic BPF format is being used for JITing on most 32-bit +architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64, arm32 perform +JIT compilation from eBPF instruction set. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. + +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as "if (cond) jump_true; + else jump_false;", they are being replaced into alternative constructs like + "if (cond) jump_true; /* else fall-through */". + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. + + Then the following internal BPF pseudo-program: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into '%rax' which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + An in-kernel eBPF verifier is used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. + +The input context pointer for invoking the interpreter function is generic, +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program, that is translated internally consists of the following elements: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using 'exclusive add' instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can used as generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths. It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + Classic BPF classes: eBPF classes: + + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 [ class 6 unused, for future if needed ] + BPF_MISC 0x07 BPF_ALU64 0x07 + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. + +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: + + BPF_JA 0x00 + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF only: function call */ + BPF_EXIT 0x90 /* eBPF only: function return */ + BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ + BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ + BPF_JSLT 0xc0 /* eBPF only: signed '<' */ + BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single 'ret' +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently +unused and reserved for future use. + +For load and store instructions the 8-bit 'code' field is divided as: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of: + + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and +(BPF_IND | <size> | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to 'struct sk_buff' and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well. When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations: + +BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg +BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 +BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) +BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg +BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + +eBPF verifier +------------- +The safety of the eBPF program is determined in two steps. + +First step does DAG check to disallow loops and other CFG validation. +In particular it will detect programs that have unreachable instructions. +(though classic BPF checker allows them) + +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. + +At the start of the program the register R1 contains a pointer to context +and has type PTR_TO_CTX. +If verifier sees an insn that does R2=R1, then R2 has now type +PTR_TO_CTX as well and can be used on the right hand side of expression. +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, +since addition of two valid pointers makes invalid pointer. +(In 'secure' mode verifier will reject any type of pointer arithmetic to make +sure that kernel addresses don't leak to unprivileged users) + +If register was never written to, it's not readable: + bpf_mov R0 = R2 + bpf_exit +will be rejected, since R2 is unreadable at the start of the program. + +After kernel function call, R1-R5 are reset to unreadable and +R0 has a return type of the function. + +Since R6-R9 are callee saved, their state is preserved across the call. + bpf_mov R6 = 1 + bpf_call foo + bpf_mov R0 = R6 + bpf_exit +is a correct program. If there was R1 instead of R6, it would have +been rejected. + +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. +For example: + bpf_mov R1 = 1 + bpf_mov R2 = 2 + bpf_xadd *(u32 *)(R1 + 3) += R2 + bpf_exit +will be rejected, since R1 doesn't have a valid pointer type at the time of +execution of instruction bpf_xadd. + +At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context') +A callback is used to customize verifier to restrict eBPF program access to only +certain fields within ctx structure with specified size and alignment. + +For example, the following insn: + bpf_ld R0 = *(u32 *)(R6 + 8) +intends to load a word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know +that offset 8 of size 4 bytes can be accessed for reading, otherwise +the verifier will reject the program. +If R6=PTR_TO_STACK, then access should be aligned and be within +stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, +so it will fail verification, since it's out of bounds. + +The verifier will allow eBPF program to read data from stack only after +it wrote into it. +Classic BPF verifier does similar check with M[0-15] memory slots. +For example: + bpf_ld R0 = *(u32 *)(R10 - 4) + bpf_exit +is invalid program. +Though R10 is correct read-only register and has type PTR_TO_STACK +and R10 - 4 is within stack bounds, there were no stores into that location. + +Pointer register spill/fill is tracked as well, since four (R6-R9) +callee saved registers may not be enough for some programs. + +Allowed function calls are customized with bpf_verifier_ops->get_func_proto() +The eBPF verifier will check that registers match argument constraints. +After the call register R0 will be set to return type of the function. + +Function calls is a main mechanism to extend functionality of eBPF programs. +Socket filters may let programs to call one set of functions, whereas tracing +filters may allow completely different set. + +If a function made accessible to eBPF program, it needs to be thought through +from safety point of view. The verifier will guarantee that the function is +called with valid arguments. + +seccomp vs socket filters have different security restrictions for classic BPF. +Seccomp solves this by two stage verifier: classic BPF verifier is followed +by seccomp verifier. In case of eBPF one configurable verifier is shared for +all use cases. + +See details of eBPF verifier in kernel/bpf/verifier.c + +Register value tracking +----------------------- +In order to determine the safety of an eBPF program, the verifier must track +the range of possible values in each register and also in each stack slot. +This is done with 'struct bpf_reg_state', defined in include/linux/ +bpf_verifier.h, which unifies tracking of scalar and pointer values. Each +register state has a type, which is either NOT_INIT (the register has not been +written to), SCALAR_VALUE (some value which is not usable as a pointer), or a +pointer type. The types of pointers describe their base, as follows: + PTR_TO_CTX Pointer to bpf_context. + CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic + on these pointers is forbidden. + PTR_TO_MAP_VALUE Pointer to the value stored in a map element. + PTR_TO_MAP_VALUE_OR_NULL + Either a pointer to a map value, or NULL; map accesses + (see section 'eBPF maps', below) return this type, + which becomes a PTR_TO_MAP_VALUE when checked != NULL. + Arithmetic on these pointers is forbidden. + PTR_TO_STACK Frame pointer. + PTR_TO_PACKET skb->data. + PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. +However, a pointer may be offset from this base (as a result of pointer +arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable +offset'. The former is used when an exactly-known value (e.g. an immediate +operand) is added to a pointer, while the latter is used for values which are +not exactly known. The variable offset is also used in SCALAR_VALUEs, to track +the range of possible values in the register. +The verifier's knowledge about the variable offset consists of: +* minimum and maximum values as unsigned +* minimum and maximum values as signed +* knowledge of the values of individual bits, in the form of a 'tnum': a u64 +'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; +1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both +mask and value; no bit should ever be 1 in both. For example, if a byte is read +into a register from memory, the register's top 56 bits are known zero, while +the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we +then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; +0x1ff), because of potential carries. + +Besides arithmetic, the register state can also be updated by conditional +branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch +it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' +branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or +BPF_JSGE) would instead update the signed minimum/maximum values. Information +from the signed and unsigned bounds can be combined; for instance if a value is +first tested < 8 and then tested s> 4, the verifier will conclude that the value +is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. + +PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all +pointers sharing that same variable offset. This is important for packet range +checks: after adding a variable to a packet pointer register A, if you then copy +it to another register B and then add a constant 4 to A, both registers will +share the same 'id' but the A will have a fixed offset of +4. Then if A is +bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is +now known to have a safe range of at least 4 bytes. See 'Direct packet access', +below, for more on PTR_TO_PACKET ranges. + +The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of +the pointer returned from a map lookup. This means that when one copy is +checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. +As well as range-checking, the tracked information is also used for enforcing +alignment of pointer accesses. For instance, on most systems the packet pointer +is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump +over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting +pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 +bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through +that pointer are safe. + +Direct packet access +-------------------- +In cls_bpf and act_bpf programs the verifier allows direct access to the packet +data via skb->data and skb->data_end pointers. +Ex: +1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ +2: r3 = *(u32 *)(r1 +76) /* load skb->data */ +3: r5 = r3 +4: r5 += 14 +5: if r5 > r4 goto pc+16 +R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp +6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ + +this 2byte load from the packet is safe to do, since the program author +did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which +means that in the fall-through case the register R3 (which points to skb->data) +has at least 14 directly accessible bytes. The verifier marks it +as R3=pkt(id=0,off=0,r=14). +id=0 means that no additional variables were added to the register. +off=0 means that no additional constants were added. +r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. +Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points +to the packet data, but constant 14 was added to the register, so +it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) +which is zero bytes. + +More complex packet access may look like: + R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ + 7: r4 = *(u8 *)(r3 +12) + 8: r4 *= 14 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ +10: r3 += r4 +11: r2 = r1 +12: r2 <<= 48 +13: r2 >>= 48 +14: r3 += r2 +15: r2 = r3 +16: r2 += 8 +17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ +18: if r2 > r1 goto pc+2 + R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp +19: r1 = *(u8 *)(r3 +4) +The state of the register R3 is R3=pkt(id=2,off=0,r=8) +id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some +offset within a packet and since the program author did +'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). +The verifier only allows 'add'/'sub' operations on packet registers. Any other +operation will set the register state to 'SCALAR_VALUE' and it won't be +available for direct packet access. +Operation 'r3 += rX' may overflow and become less than original skb->data, +therefore the verifier has to prevent that. So when it sees 'r3 += rX' +instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 +against skb->data_end will not give us 'range' information, so attempts to read +through the pointer will give "invalid access to packet" error. +Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is +R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits +of the register are guaranteed to be zero, and nothing is known about the lower +8 bits. After insn 'r4 *= 14' the state becomes +R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit +value by constant 14 will keep upper 52 bits as zero, also the least significant +bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make +R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign +extending. This logic is implemented in adjust_reg_min_max_vals() function, +which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice +versa) and adjust_scalar_min_max_vals() for operations on two scalars. + +The end result is that bpf program author can access packet directly +using normal C code as: + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct eth_hdr *eth = data; + struct iphdr *iph = data + sizeof(*eth); + struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); + + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) + return 0; + if (eth->h_proto != htons(ETH_P_IP)) + return 0; + if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) + return 0; + if (udp->dest == 53 || udp->source == 9) + ...; +which makes such programs easier to write comparing to LD_ABS insn +and significantly faster. + +eBPF maps +--------- +'maps' is a generic storage of different types for sharing data between kernel +and userspace. + +The maps are accessed from user space via BPF syscall, which has commands: +- create a map with given type and attributes + map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) + using attr->map_type, attr->key_size, attr->value_size, attr->max_entries + returns process-local file descriptor or negative error + +- lookup key in a given map + err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) + using attr->map_fd, attr->key, attr->value + returns zero and stores found elem into value or negative error + +- create or update key/value pair in a given map + err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) + using attr->map_fd, attr->key, attr->value + returns zero or negative error + +- find and delete element by key in a given map + err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) + using attr->map_fd, attr->key + +- to delete map: close(fd) + Exiting process will delete maps automatically + +userspace programs use this syscall to create/access maps that eBPF programs +are concurrently updating. + +maps can have different types: hash, array, bloom filter, radix-tree, etc. + +The map is defined by: + . type + . max number of elements + . key size in bytes + . value size in bytes + +Pruning +------- +The verifier does not actually walk all possible paths through the program. For +each new branch to analyse, the verifier looks at all the states it's previously +been in when at this instruction. If any of them contain the current state as a +subset, the branch is 'pruned' - that is, the fact that the previous state was +accepted implies the current state would be as well. For instance, if in the +previous state, r1 held a packet-pointer, and in the current state, r1 holds a +packet-pointer with a range as long or longer and at least as strict an +alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't +have been used by any path from that point, so any value in r2 (including +another NOT_INIT) is safe. The implementation is in the function regsafe(). +Pruning considers not only the registers but also the stack (and any spilled +registers it may hold). They must all be safe for the branch to be pruned. +This is implemented in states_equal(). + +Understanding eBPF verifier messages +------------------------------------ + +The following are few examples of invalid eBPF programs and verifier error +messages as seen in the log: + +Program with unreachable instructions: +static struct bpf_insn prog[] = { + BPF_EXIT_INSN(), + BPF_EXIT_INSN(), +}; +Error: + unreachable insn 1 + +Program that reads uninitialized register: + BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), + BPF_EXIT_INSN(), +Error: + 0: (bf) r0 = r2 + R2 !read_ok + +Program that doesn't initialize R0 before exiting: + BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), + BPF_EXIT_INSN(), +Error: + 0: (bf) r2 = r1 + 1: (95) exit + R0 !read_ok + +Program that accesses stack out of bounds: + BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), + BPF_EXIT_INSN(), +Error: + 0: (7a) *(u64 *)(r10 +8) = 0 + invalid stack off=8 size=8 + +Program that doesn't initialize stack before passing its address into function: + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_EXIT_INSN(), +Error: + 0: (bf) r2 = r10 + 1: (07) r2 += -8 + 2: (b7) r1 = 0x0 + 3: (85) call 1 + invalid indirect read from stack off -8+0 size 8 + +Program that uses invalid map_fd=0 while calling to map_lookup_elem() function: + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_EXIT_INSN(), +Error: + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + fd 0 is not pointing to valid bpf_map + +Program that doesn't check return value of map_lookup_elem() before accessing +map element: + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), +Error: + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 0x0 + 4: (85) call 1 + 5: (7a) *(u64 *)(r0 +0) = 0 + R0 invalid mem access 'map_value_or_null' + +Program that correctly checks map_lookup_elem() returned value for NULL, but +accesses the memory with incorrect alignment: + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), + BPF_EXIT_INSN(), +Error: + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+1 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +4) = 0 + misaligned access off 4 size 8 + +Program that correctly checks map_lookup_elem() returned value for NULL and +accesses memory with correct alignment in one side of 'if' branch, but fails +to do so in the other side of 'if' branch: + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), + BPF_EXIT_INSN(), + BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), +Error: + 0: (7a) *(u64 *)(r10 -8) = 0 + 1: (bf) r2 = r10 + 2: (07) r2 += -8 + 3: (b7) r1 = 1 + 4: (85) call 1 + 5: (15) if r0 == 0x0 goto pc+2 + R0=map_ptr R10=fp + 6: (7a) *(u64 *)(r0 +0) = 0 + 7: (95) exit + + from 5 to 8: R0=imm0 R10=fp + 8: (7a) *(u64 *)(r0 +0) = 1 + R0 invalid mem access 'imm' + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and +enabled via Kconfig: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + +Misc +---- + +Also trinity, the Linux syscall fuzzer, has built-in support for BPF and +SECCOMP-BPF kernel fuzzing. + +Written by +---------- + +The document was written in the hope that it is found useful and in order +to give potential BPF hackers or security auditors a better overview of +the underlying architecture. + +Jay Schulist <jschlst@samba.org> +Daniel Borkmann <daniel@iogearbox.net> +Alexei Starovoitov <ast@kernel.org> diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt new file mode 100644 index 000000000..1f98f62b4 --- /dev/null +++ b/Documentation/networking/fore200e.txt @@ -0,0 +1,64 @@ + +FORE Systems PCA-200E/SBA-200E ATM NIC driver +--------------------------------------------- + +This driver adds support for the FORE Systems 200E-series ATM adapters +to the Linux operating system. It is based on the earlier PCA-200E driver +written by Uwe Dannowski. + +The driver simultaneously supports PCA-200E and SBA-200E adapters on +i386, alpha (untested), powerpc, sparc and sparc64 archs. + +The intent is to enable the use of different models of FORE adapters at the +same time, by hosts that have several bus interfaces (such as PCI+SBUS, +or PCI+EISA). + +Only PCI and SBUS devices are currently supported by the driver, but support +for other bus interfaces such as EISA should not be too hard to add. + + +Firmware Copyright Notice +------------------------- + +Please read the fore200e_firmware_copyright file present +in the linux/drivers/atm directory for details and restrictions. + + +Firmware Updates +---------------- + +The FORE Systems 200E-series driver is shipped with firmware data being +uploaded to the ATM adapters at system boot time or at module loading time. +The supplied firmware images should work with all adapters. + +However, if you encounter problems (the firmware doesn't start or the driver +is unable to read the PROM data), you may consider trying another firmware +version. Alternative binary firmware images can be found somewhere on the +ForeThought CD-ROM supplied with your adapter by FORE Systems. + +You can also get the latest firmware images from FORE Systems at +https://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to +the 'software updates' pages. The firmware binaries are part of +the various ForeThought software distributions. + +Notice that different versions of the PCA-200E firmware exist, depending +on the endianness of the host architecture. The driver is shipped with +both little and big endian PCA firmware images. + +Name and location of the new firmware images can be set at kernel +configuration time: + +1. Copy the new firmware binary files (with .bin, .bin1 or .bin2 suffix) + to some directory, such as linux/drivers/atm. + +2. Reconfigure your kernel to set the new firmware name and location. + Expected pathnames are absolute or relative to the drivers/atm directory. + +3. Rebuild and re-install your kernel or your module. + + +Feedback +-------- + +Feedback is welcome. Please send success stories/bug reports/ +patches/improvement/comments/flames to <lizzi@cnam.fr>. diff --git a/Documentation/networking/framerelay.txt b/Documentation/networking/framerelay.txt new file mode 100644 index 000000000..1a0b72044 --- /dev/null +++ b/Documentation/networking/framerelay.txt @@ -0,0 +1,39 @@ +Frame Relay (FR) support for linux is built into a two tiered system of device +drivers. The upper layer implements RFC1490 FR specification, and uses the +Data Link Connection Identifier (DLCI) as its hardware address. Usually these +are assigned by your network supplier, they give you the number/numbers of +the Virtual Connections (VC) assigned to you. + +Each DLCI is a point-to-point link between your machine and a remote one. +As such, a separate device is needed to accommodate the routing. Within the +net-tools archives is 'dlcicfg'. This program will communicate with the +base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'... +The configuration script will ask you how many DLCIs you need, as well as +how many DLCIs you want to assign to each Frame Relay Access Device (FRAD). + +The DLCI uses a number of function calls to communicate with the FRAD, all +of which are stored in the FRAD's private data area. assoc/deassoc, +activate/deactivate and dlci_config. The DLCI supplies a receive function +to the FRAD to accept incoming packets. + +With this initial offering, only 1 FRAD driver is available. With many thanks +to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E & +S508 are supported. This driver is currently set up for only FR, but as +Sangoma makes more firmware modules available, it can be updated to provide +them as well. + +Configuration of the FRAD makes use of another net-tools program, 'fradcfg'. +This program makes use of a configuration file (which dlcicfg can also read) +to specify the types of boards to be configured as FRADs, as well as perform +any board specific configuration. The Sangoma module of fradcfg loads the +FR firmware into the card, sets the irq/port/memory information, and provides +an initial configuration. + +Additional FRAD device drivers can be added as hardware is available. + +At this time, the dlcicfg and fradcfg programs have not been incorporated into +the net-tools distribution. They can be found at ftp.invlogic.com, in +/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just +use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for +pre-2.0.4 and later. + diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt new file mode 100644 index 000000000..179b18ce4 --- /dev/null +++ b/Documentation/networking/gen_stats.txt @@ -0,0 +1,119 @@ +Generic networking statistics for netlink users +====================================================================== + +Statistic counters are grouped into structs: + +Struct TLV type Description +---------------------------------------------------------------------- +gnet_stats_basic TCA_STATS_BASIC Basic statistics +gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator +gnet_stats_queue TCA_STATS_QUEUE Queue statistics +none TCA_STATS_APP Application specific + + +Collecting: +----------- + +Declare the statistic structs you need: +struct mystruct { + struct gnet_stats_basic bstats; + struct gnet_stats_queue qstats; + ... +}; + +Update statistics, in dequeue() methods only, (while owning qdisc->running) +mystruct->tstats.packet++; +mystruct->qstats.backlog += skb->pkt_len; + + +Export to userspace (Dump): +--------------------------- + +my_dumping_routine(struct sk_buff *skb, ...) +{ + struct gnet_dump dump; + + if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, + TCA_PAD) < 0) + goto rtattr_failure; + + if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || + gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 || + gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0) + goto rtattr_failure; + + if (gnet_stats_finish_copy(&dump) < 0) + goto rtattr_failure; + ... +} + +TCA_STATS/TCA_XSTATS backward compatibility: +-------------------------------------------- + +Prior users of struct tc_stats and xstats can maintain backward +compatibility by calling the compat wrappers to keep providing the +existing TLV types. + +my_dumping_routine(struct sk_buff *skb, ...) +{ + if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, + TCA_XSTATS, &mystruct->lock, &dump, + TCA_PAD) < 0) + goto rtattr_failure; + ... +} + +A struct tc_stats will be filled out during gnet_stats_copy_* calls +and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app +was called. + + +Locking: +-------- + +Locks are taken before writing and released once all statistics have +been written. Locks are always released in case of an error. You +are responsible for making sure that the lock is initialized. + + +Rate Estimator: +-------------- + +0) Prepare an estimator attribute. Most likely this would be in user + space. The value of this TLV should contain a tc_estimator structure. + As usual, such a TLV needs to be 32 bit aligned and therefore the + length needs to be appropriately set, etc. The estimator interval + and ewma log need to be converted to the appropriate values. + tc_estimator.c::tc_setup_estimator() is advisable to be used as the + conversion routine. It does a few clever things. It takes a time + interval in microsecs, a time constant also in microsecs and a struct + tc_estimator to be populated. The returned tc_estimator can be + transported to the kernel. Transfer such a structure in a TLV of type + TCA_RATE to your code in the kernel. + +In the kernel when setting up: +1) make sure you have basic stats and rate stats setup first. +2) make sure you have initialized stats lock that is used to setup such + stats. +3) Now initialize a new estimator: + + int ret = gen_new_estimator(my_basicstats,my_rate_est_stats, + mystats_lock, attr_with_tcestimator_struct); + + if ret == 0 + success + else + failed + +From now on, every time you dump my_rate_est_stats it will contain +up-to-date info. + +Once you are done, call gen_kill_estimator(my_basicstats, +my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats +are still valid (i.e still exist) at the time of making this call. + + +Authors: +-------- +Thomas Graf <tgraf@suug.ch> +Jamal Hadi Salim <hadi@cyberus.ca> diff --git a/Documentation/networking/generic-hdlc.txt b/Documentation/networking/generic-hdlc.txt new file mode 100644 index 000000000..4eb3cc40b --- /dev/null +++ b/Documentation/networking/generic-hdlc.txt @@ -0,0 +1,132 @@ +Generic HDLC layer +Krzysztof Halasa <khc@pm.waw.pl> + + +Generic HDLC layer currently supports: +1. Frame Relay (ANSI, CCITT, Cisco and no LMI) + - Normal (routed) and Ethernet-bridged (Ethernet device emulation) + interfaces can share a single PVC. + - ARP support (no InARP support in the kernel - there is an + experimental InARP user-space daemon available on: + http://www.kernel.org/pub/linux/utils/net/hdlc/). +2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation +3. Cisco HDLC +4. PPP +5. X.25 (uses X.25 routines). + +Generic HDLC is a protocol driver only - it needs a low-level driver +for your particular hardware. + +Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible +with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging). + + +Make sure the hdlc.o and the hardware driver are loaded. It should +create a number of "hdlc" (hdlc0 etc) network devices, one for each +WAN port. You'll need the "sethdlc" utility, get it from: + http://www.kernel.org/pub/linux/utils/net/hdlc/ + +Compile sethdlc.c utility: + gcc -O2 -Wall -o sethdlc sethdlc.c +Make sure you're using a correct version of sethdlc for your kernel. + +Use sethdlc to set physical interface, clock rate, HDLC mode used, +and add any required PVCs if using Frame Relay. +Usually you want something like: + + sethdlc hdlc0 clock int rate 128000 + sethdlc hdlc0 cisco interval 10 timeout 25 +or + sethdlc hdlc0 rs232 clock ext + sethdlc hdlc0 fr lmi ansi + sethdlc hdlc0 create 99 + ifconfig hdlc0 up + ifconfig pvc0 localIP pointopoint remoteIP + +In Frame Relay mode, ifconfig master hdlc device up (without assigning +any IP address to it) before using pvc devices. + + +Setting interface: + +* v35 | rs232 | x21 | t1 | e1 - sets physical interface for a given port + if the card has software-selectable interfaces + loopback - activate hardware loopback (for testing only) +* clock ext - both RX clock and TX clock external +* clock int - both RX clock and TX clock internal +* clock txint - RX clock external, TX clock internal +* clock txfromrx - RX clock external, TX clock derived from RX clock +* rate - sets clock rate in bps (for "int" or "txint" clock only) + + +Setting protocol: + +* hdlc - sets raw HDLC (IP-only) mode + nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code + no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu + crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity + +* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding + as above. + +* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported) + interval - time in seconds between keepalive packets + timeout - time in seconds after last received keepalive packet before + we assume the link is down + +* ppp - sets synchronous PPP mode + +* x25 - sets X.25 mode + +* fr - Frame Relay mode + lmi ansi / ccitt / cisco / none - LMI (link management) type + dce - Frame Relay DCE (network) side LMI instead of default DTE (user). + It has nothing to do with clocks! + t391 - link integrity verification polling timer (in seconds) - user + t392 - polling verification timer (in seconds) - network + n391 - full status polling counter - user + n392 - error threshold - both user and network + n393 - monitored events count - both user and network + +Frame-Relay only: +* create n | delete n - adds / deletes PVC interface with DLCI #n. + Newly created interface will be named pvc0, pvc1 etc. + +* create ether n | delete ether n - adds a device for Ethernet-bridged + frames. The device will be named pvceth0, pvceth1 etc. + + + + +Board-specific issues +--------------------- + +n2.o and c101.o need parameters to work: + + insmod n2 hw=io,irq,ram,ports[:io,irq,...] +example: + insmod n2 hw=0x300,10,0xD0000,01 + +or + insmod c101 hw=irq,ram[:irq,...] +example: + insmod c101 hw=9,0xdc000 + +If built into the kernel, these drivers need kernel (command line) parameters: + n2.hw=io,irq,ram,ports:... +or + c101.hw=irq,ram:... + + + +If you have a problem with N2, C101 or PLX200SYN card, you can issue the +"private" command to see port's packet descriptor rings (in kernel logs): + + sethdlc hdlc0 private + +The hardware driver has to be build with #define DEBUG_RINGS. +Attaching this info to bug reports would be helpful. Anyway, let me know +if you have problems using this. + +For patches and other info look at: +<http://www.kernel.org/pub/linux/utils/net/hdlc/>. diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt new file mode 100644 index 000000000..3e071115c --- /dev/null +++ b/Documentation/networking/generic_netlink.txt @@ -0,0 +1,3 @@ +A wiki document on how to use Generic Netlink can be found here: + + * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt new file mode 100644 index 000000000..ba1daea7f --- /dev/null +++ b/Documentation/networking/gianfar.txt @@ -0,0 +1,42 @@ +The Gianfar Ethernet Driver + +Author: Andy Fleming <afleming@freescale.com> +Updated: 2005-07-28 + + +CHECKSUM OFFLOADING + +The eTSEC controller (first included in parts from late 2005 like +the 8548) has the ability to perform TCP, UDP, and IP checksums +in hardware. The Linux kernel only offloads the TCP and UDP +checksums (and always performs the pseudo header checksums), so +the driver only supports checksumming for TCP/IP and UDP/IP +packets. Use ethtool to enable or disable this feature for RX +and TX. + +VLAN + +In order to use VLAN, please consult Linux documentation on +configuring VLANs. The gianfar driver supports hardware insertion and +extraction of VLAN headers, but not filtering. Filtering will be +done by the kernel. + +MULTICASTING + +The gianfar driver supports using the group hash table on the +TSEC (and the extended hash table on the eTSEC) for multicast +filtering. On the eTSEC, the exact-match MAC registers are used +before the hash tables. See Linux documentation on how to join +multicast groups. + +PADDING + +The gianfar driver supports padding received frames with 2 bytes +to align the IP header to a 16-byte boundary, when supported by +hardware. + +ETHTOOL + +The gianfar driver supports the use of ethtool for many +configuration options. You must run ethtool only on currently +open interfaces. See ethtool documentation for details. diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt new file mode 100644 index 000000000..6966bbec1 --- /dev/null +++ b/Documentation/networking/gtp.txt @@ -0,0 +1,230 @@ +The Linux kernel GTP tunneling module +====================================================================== +Documentation by Harald Welte <laforge@gnumonks.org> and + Andreas Schultz <aschultz@tpip.net> + +In 'drivers/net/gtp.c' you are finding a kernel-level implementation +of a GTP tunnel endpoint. + +== What is GTP == + +GTP is the Generic Tunnel Protocol, which is a 3GPP protocol used for +tunneling User-IP payload between a mobile station (phone, modem) +and the interconnection between an external packet data network (such +as the internet). + +So when you start a 'data connection' from your mobile phone, the +phone will use the control plane to signal for the establishment of +such a tunnel between that external data network and the phone. The +tunnel endpoints thus reside on the phone and in the gateway. All +intermediate nodes just transport the encapsulated packet. + +The phone itself does not implement GTP but uses some other +technology-dependent protocol stack for transmitting the user IP +payload, such as LLC/SNDCP/RLC/MAC. + +At some network element inside the cellular operator infrastructure +(SGSN in case of GPRS/EGPRS or classic UMTS, hNodeB in case of a 3G +femtocell, eNodeB in case of 4G/LTE), the cellular protocol stacking +is translated into GTP *without breaking the end-to-end tunnel*. So +intermediate nodes just perform some specific relay function. + +At some point the GTP packet ends up on the so-called GGSN (GSM/UMTS) +or P-GW (LTE), which terminates the tunnel, decapsulates the packet +and forwards it onto an external packet data network. This can be +public internet, but can also be any private IP network (or even +theoretically some non-IP network like X.25). + +You can find the protocol specification in 3GPP TS 29.060, available +publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm + +A direct PDF link to v13.6.0 is provided for convenience below: +http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf + +== The Linux GTP tunnelling module == + +The module implements the function of a tunnel endpoint, i.e. it is +able to decapsulate tunneled IP packets in the uplink originated by +the phone, and encapsulate raw IP packets received from the external +packet network in downlink towards the phone. + +It *only* implements the so-called 'user plane', carrying the User-IP +payload, called GTP-U. It does not implement the 'control plane', +which is a signaling protocol used for establishment and teardown of +GTP tunnels (GTP-C). + +So in order to have a working GGSN/P-GW setup, you will need a +userspace program that implements the GTP-C protocol and which then +uses the netlink interface provided by the GTP-U module in the kernel +to configure the kernel module. + +This split architecture follows the tunneling modules of other +protocols, e.g. PPPoE or L2TP, where you also run a userspace daemon +to handle the tunnel establishment, authentication etc. and only the +data plane is accelerated inside the kernel. + +Don't be confused by terminology: The GTP User Plane goes through +kernel accelerated path, while the GTP Control Plane goes to +Userspace :) + +The official homepage of the module is at +https://osmocom.org/projects/linux-kernel-gtp-u/wiki + +== Userspace Programs with Linux Kernel GTP-U support == + +At the time of this writing, there are at least two Free Software +implementations that implement GTP-C and can use the netlink interface +to make use of the Linux kernel GTP-U support: + +* OpenGGSN (classic 2G/3G GGSN in C): + https://osmocom.org/projects/openggsn/wiki/OpenGGSN + +* ergw (GGSN + P-GW in Erlang): + https://github.com/travelping/ergw + +== Userspace Library / Command Line Utilities == + +There is a userspace library called 'libgtpnl' which is based on +libmnl and which implements a C-language API towards the netlink +interface provided by the Kernel GTP module: + +http://git.osmocom.org/libgtpnl/ + +== Protocol Versions == + +There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1 +[3GPP TS 29.281]. Both are implemented in the Kernel GTP module. +Version 0 is a legacy version, and deprecated from recent 3GPP +specifications. + +GTP-U uses UDP for transporting PDUs. The receiving UDP port is 2151 +for GTPv1-U and 3386 for GTPv0-U. + +There are three versions of GTP-C: v0, v1, and v2. As the kernel +doesn't implement GTP-C, we don't have to worry about this. It's the +responsibility of the control plane implementation in userspace to +implement that. + +== IPv6 == + +The 3GPP specifications indicate either IPv4 or IPv6 can be used both +on the inner (user) IP layer, or on the outer (transport) layer. + +Unfortunately, the Kernel module currently supports IPv6 neither for +the User IP payload, nor for the outer IP layer. Patches or other +Contributions to fix this are most welcome! + +== Mailing List == + +If yo have questions regarding how to use the Kernel GTP module from +your own software, or want to contribute to the code, please use the +osmocom-net-grps mailing list for related discussion. The list can be +reached at osmocom-net-gprs@lists.osmocom.org and the mailman +interface for managing your subscription is at +https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs + +== Issue Tracker == + +The Osmocom project maintains an issue tracker for the Kernel GTP-U +module at +https://osmocom.org/projects/linux-kernel-gtp-u/issues + +== History / Acknowledgements == + +The Module was originally created in 2012 by Harald Welte, but never +completed. Pablo came in to finish the mess Harald left behind. But +doe to a lack of user interest, it never got merged. + +In 2015, Andreas Schultz came to the rescue and fixed lots more bugs, +extended it with new features and finally pushed all of us to get it +mainline, where it was merged in 4.7.0. + +== Architectural Details == + +=== Local GTP-U entity and tunnel identification === + +GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152 +for GTPv1-U and 3386 for GTPv0-U. + +There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW +instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique +per GTP-U entity. + +A specific tunnel is only defined by the destination entity. Since the +destination port is constant, only the destination IP and TEID define +a tunnel. The source IP and Port have no meaning for the tunnel. + +Therefore: + + * when sending, the remote entity is defined by the remote IP and + the tunnel endpoint id. The source IP and port have no meaning and + can be changed at any time. + + * when receiving the local entity is defined by the local + destination IP and the tunnel endpoint id. The source IP and port + have no meaning and can change at any time. + +[3GPP TS 29.281] Section 4.3.0 defines this so: + +> The TEID in the GTP-U header is used to de-multiplex traffic +> incoming from remote tunnel endpoints so that it is delivered to the +> User plane entities in a way that allows multiplexing of different +> users, different packet protocols and different QoS levels. +> Therefore no two remote GTP-U endpoints shall send traffic to a +> GTP-U protocol entity using the same TEID value except +> for data forwarding as part of mobility procedures. + +The definition above only defines that two remote GTP-U endpoints +*should not* send to the same TEID, it *does not* forbid or exclude +such a scenario. In fact, the mentioned mobility procedures make it +necessary that the GTP-U entity accepts traffic for TEIDs from +multiple or unknown peers. + +Therefore, the receiving side identifies tunnels exclusively based on +TEIDs, not based on the source IP! + +== APN vs. Network Device == + +The GTP-U driver creates a Linux network device for each Gi/SGi +interface. + +[3GPP TS 29.281] calls the Gi/SGi reference point an interface. This +may lead to the impression that the GGSN/P-GW can have only one such +interface. + +Correct is that the Gi/SGi reference point defines the interworking +between +the 3GPP packet domain (PDN) based on GTP-U tunnel and IP +based networks. + +There is no provision in any of the 3GPP documents that limits the +number of Gi/SGi interfaces implemented by a GGSN/P-GW. + +[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a +specific Gi/SGi interfaces is made through the Access Point Name +(APN): + +> 2. each private network manages its own addressing. In general this +> will result in different private networks having overlapping +> address ranges. A logically separate connection (e.g. an IP in IP +> tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW +> and each private network. +> +> In this case the IP address alone is not necessarily unique. The +> pair of values, Access Point Name (APN) and IPv4 address and/or +> IPv6 prefixes, is unique. + +In order to support the overlapping address range use case, each APN +is mapped to a separate Gi/SGi interface (network device). + +NOTE: The Access Point Name is purely a control plane (GTP-C) concept. +At the GTP-U level, only Tunnel Endpoint Identifiers are present in +GTP-U packets and network devices are known + +Therefore for a given UE the mapping in IP to PDN network is: + * network device + MS IP -> Peer IP + Peer TEID, + +and from PDN to IP network: + * local GTP-U IP + TEID -> network device + +Furthermore, before a received T-PDU is injected into the network +device the MS IP is checked against the IP recorded in PDP context. diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/hinic.txt new file mode 100644 index 000000000..989366a40 --- /dev/null +++ b/Documentation/networking/hinic.txt @@ -0,0 +1,125 @@ +Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family +============================================================ + +Overview: +========= +HiNIC is a network interface card for the Data Center Area. + +The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). +The driver supports also a negotiated and extendable feature set. + +Some HiNIC devices support SR-IOV. This driver is used for Physical Function +(PF). + +HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and +adaptive interrupt moderation. + +HiNIC devices support also various offload features such as checksum offload, +TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and +LRO(Large Receive Offload). + + +Supported PCI vendor ID/device IDs: +=================================== + +19e5:1822 - HiNIC PF + + +Driver Architecture and Source Code: +==================================== + +hinic_dev - Implement a Logical Network device that is independent from +specific HW details about HW data structure formats. + +hinic_hwdev - Implement the HW details of the device and include the components +for accessing the PCI NIC. + +hinic_hwdev contains the following components: +=============================================== + +HW Interface: +============= + +The interface for accessing the pci device (DMA memory and PCI BARs). +(hinic_hw_if.c, hinic_hw_if.h) + +Configuration Status Registers Area that describes the HW Registers on the +configuration and status BAR0. (hinic_hw_csr.h) + +MGMT components: +================ + +Asynchronous Event Queues(AEQs) - The event queues for receiving messages from +the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Application Programmable Interface commands(API CMD) - Interface for sending +MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) + +Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT +commands to the card and receives notifications from the MGMT modules on the +card by AEQs. Also set the addresses of the IO CMDQs in HW. +(hinic_hw_mgmt.c, hinic_hw_mgmt.h) + +IO components: +============== + +Completion Event Queues(CEQs) - The completion Event Queues that describe IO +tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Work Queues(WQ) - Contain the memory and operations for use by CMD queues and +the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains +pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). +(hinic_hw_wq.c, hinic_hw_wq.h) + +Command Queues(CMDQ) - The queues for sending commands for IO management and is +used to set the QPs addresses in HW. The commands completion events are +accumulated on the CEQ that is configured to receive the CMDQ completion events. +(hinic_hw_cmdq.c, hinic_hw_cmdq.h) + +Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting +Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) + +IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) + +HW device: +========== + +HW device - de/constructs the HW Interface, the MGMT components on the +initialization of the driver and the IO components on the case of Interface +UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) + + +hinic_dev contains the following components: +=============================================== + +PCI ID table - Contains the supported PCI Vendor/Device IDs. +(hinic_pci_tbl.h) + +Port Commands - Send commands to the HW device for port management +(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) + +Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. +The Logical Tx queue is not dependent on the format of the HW Send Queue. +(hinic_tx.c, hinic_tx.h) + +Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. +The Logical Rx queue is not dependent on the format of the HW Receive Queue. +(hinic_rx.c, hinic_rx.h) + +hinic_dev - de/constructs the Logical Tx and Rx Queues. +(hinic_main.c, hinic_dev.h) + + +Miscellaneous: +============= + +Common functions that are used by HW and Logical Device. +(hinic_common.c, hinic_common.h) + + +Support +======= + +If an issue is identified with the released source code on the supported kernel +with a supported adapter, email the specific information related to the issue to +aviad.krawczyk@huawei.com. diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt new file mode 100644 index 000000000..c2d6e1824 --- /dev/null +++ b/Documentation/networking/i40e.txt @@ -0,0 +1,190 @@ +Linux Base Driver for the Intel(R) Ethernet Controller XL710 Family +=================================================================== + +Intel i40e Linux driver. +Copyright(c) 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Performance Tuning +- Known Issues +- Support + + +Identifying Your Adapter +======================== + +The driver in this release is compatible with the Intel Ethernet +Controller XL710 Family. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + + +Enabling the driver +=================== + +The driver is enabled via the standard kernel configuration system, +using the make command: + + make config/oldconfig/menuconfig/etc. + +The driver is located in the menu structure at: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet driver support + -> Intel devices + -> Intel(R) Ethernet Controller XL710 Family + +Additional Configurations +========================= + + Generic Receive Offload (GRO) + ----------------------------- + The driver supports the in-kernel software implementation of GRO. GRO has + shown that by coalescing Rx traffic into larger chunks of data, CPU + utilization can be significantly reduced when under large Rx load. GRO is + an evolution of the previously-used LRO interface. GRO is able to coalesce + other protocols besides TCP. It's also safe to use with configurations that + are problematic for LRO, namely bridging and iSCSI. + + Ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool + + + Flow Director n-ntuple traffic filters (FDir) + --------------------------------------------- + The driver utilizes the ethtool interface for configuring ntuple filters, + via "ethtool -N <device> <filter>". + + The sctp4, ip4, udp4, and tcp4 flow types are supported with the standard + fields including src-ip, dst-ip, src-port and dst-port. The driver only + supports fully enabling or fully masking the fields, so use of the mask + fields for partial matches is not supported. + + Additionally, the driver supports using the action to specify filters for a + Virtual Function. You can specify the action as a 64bit value, where the + lower 32 bits represents the queue number, while the next 8 bits represent + which VF. Note that 0 is the PF, so the VF identifier is offset by 1. For + example: + + ... action 0x800000002 ... + + Would indicate to direct traffic for Virtual Function 7 (8 minus 1) on queue + 2 of that VF. + + The driver also supports using the user-defined field to specify 2 bytes of + arbitrary data to match within the packet payload in addition to the regular + fields. The data is specified in the lower 32bits of the user-def field in + the following way: + + +----------------------------+---------------------------+ + | 31 28 24 20 16 | 15 12 8 4 0| + +----------------------------+---------------------------+ + | offset into packet payload | 2 bytes of flexible data | + +----------------------------+---------------------------+ + + As an example, + + ... user-def 0x4FFFF .... + + means to match the value 0xFFFF 4 bytes into the packet payload. Note that + the offset is based on the beginning of the payload, and not the beginning + of the packet. Thus + + flow-type tcp4 ... user-def 0x8BEAF .... + + would match TCP/IPv4 packets which have the value 0xBEAF 8bytes into the + TCP/IPv4 payload. + + For ICMP, the hardware parses the ICMP header as 4 bytes of header and 4 + bytes of payload, so if you want to match an ICMP frames payload you may need + to add 4 to the offset in order to match the data. + + Furthermore, the offset can only be up to a value of 64, as the hardware + will only read up to 64 bytes of data from the payload. It must also be even + as the flexible data is 2 bytes long and must be aligned to byte 0 of the + packet payload. + + When programming filters, the hardware is limited to using a single input + set for each flow type. This means that it is an error to program two + different filters with the same type that don't match on the same fields. + Thus the second of the following two commands will fail: + + ethtool -N <device> flow-type tcp4 src-ip 192.168.0.7 action 5 + ethtool -N <device> flow-type tcp4 dst-ip 192.168.15.18 action 1 + + This is because the first filter will be accepted and reprogram the input + set for TCPv4 filters, but the second filter will be unable to reprogram the + input set until all the conflicting TCPv4 filters are first removed. + + Note that the user-defined flexible offset is also considered part of the + input set and cannot be programmed separately for multiple filters of the + same type. However, the flexible data is not part of the input set and + multiple filters may use the same offset but match against different data. + + Data Center Bridging (DCB) + -------------------------- + DCB configuration is not currently supported. + + FCoE + ---- + The driver supports Fiber Channel over Ethernet (FCoE) and Data Center + Bridging (DCB) functionality. Configuring DCB and FCoE is outside the scope + of this driver doc. Refer to http://www.open-fcoe.org/ for FCoE project + information and http://www.open-lldp.org/ or email list + e1000-eedc@lists.sourceforge.net for DCB information. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF (n) + + Where n=the VF that attempted to do the spoofing. + + +Performance Tuning +================== + +An excellent article on performance tuning can be found at: + +http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf + + +Known Issues +============ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://e1000.sourceforge.net + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sourceforge.net and copy +netdev@vger.kernel.org. diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt new file mode 100644 index 000000000..e9b3035b9 --- /dev/null +++ b/Documentation/networking/i40evf.txt @@ -0,0 +1,54 @@ +Linux* Base Driver for Intel(R) Network Connection +================================================== + +Intel Ethernet Adaptive Virtual Function Linux driver. +Copyright(c) 2013-2017 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Known Issues/Troubleshooting +- Support + +This file describes the i40evf Linux* Base Driver. + +The i40evf driver supports the below mentioned virtual function +devices and can only be activated on kernels running the i40e or +newer Physical Function (PF) driver compiled with CONFIG_PCI_IOV. +The i40evf driver requires CONFIG_PCI_MSI to be enabled. + +The guest OS loading the i40evf driver must support MSI-X interrupts. + +Supported Hardware +================== +Intel XL710 X710 Virtual Function +Intel Ethernet Adaptive Virtual Function +Intel X722 Virtual Function + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the +Adapter & Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Known Issues/Troubleshooting +============================ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ice.txt b/Documentation/networking/ice.txt new file mode 100644 index 000000000..6261c4637 --- /dev/null +++ b/Documentation/networking/ice.txt @@ -0,0 +1,39 @@ +Intel(R) Ethernet Connection E800 Series Linux Driver +=================================================================== + +Intel ice Linux driver. +Copyright(c) 2018 Intel Corporation. + +Contents +======== +- Enabling the driver +- Support + +The driver in this release supports Intel's E800 Series of products. For +more information, visit Intel's support page at http://support.intel.com. + +Enabling the driver +=================== + +The driver is enabled via the standard kernel configuration system, +using the make command: + + Make oldconfig/silentoldconfig/menuconfig/etc. + +The driver is located in the menu structure at: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet driver support + -> Intel devices + -> Intel(R) Ethernet Connection E800 Series Support + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +If an issue is identified with the released source code, please email +the maintainer listed in the MAINTAINERS file. diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt new file mode 100644 index 000000000..e74d8e1da --- /dev/null +++ b/Documentation/networking/ieee802154.txt @@ -0,0 +1,177 @@ + + Linux IEEE 802.15.4 implementation + + +Introduction +============ +The IEEE 802.15.4 working group focuses on standardization of the bottom +two layers: Medium Access Control (MAC) and Physical access (PHY). And there +are mainly two options available for upper layers: + - ZigBee - proprietary protocol from the ZigBee Alliance + - 6LoWPAN - IPv6 networking over low rate personal area networks + +The goal of the Linux-wpan is to provide a complete implementation +of the IEEE 802.15.4 and 6LoWPAN protocols. IEEE 802.15.4 is a stack +of protocols for organizing Low-Rate Wireless Personal Area Networks. + +The stack is composed of three main parts: + - IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API, + the generic Linux networking stack to transfer IEEE 802.15.4 data + messages and a special protocol over netlink for configuration/management + - MAC - provides access to shared channel and reliable data delivery + - PHY - represents device drivers + + +Socket API +========== + +int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0); +..... + +The address family, socket addresses etc. are defined in the +include/net/af_ieee802154.h header or in the special header +in the userspace package (see either http://wpan.cakelab.org/ or the +git tree at https://github.com/linux-wpan/wpan-tools). + + +Kernel side +============= + +Like with WiFi, there are several types of devices implementing IEEE 802.15.4. +1) 'HardMAC'. The MAC layer is implemented in the device itself, the device + exports a management (e.g. MLME) and data API. +2) 'SoftMAC' or just radio. These types of devices are just radio transceivers + possibly with some kinds of acceleration like automatic CRC computation and + comparation, automagic ACK handling, address matching, etc. + +Those types of devices require different approach to be hooked into Linux kernel. + + +HardMAC +======= + +See the header include/net/ieee802154_netdev.h. You have to implement Linux +net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family +code via plain sk_buffs. On skb reception skb->cb must contain additional +info as described in the struct ieee802154_mac_cb. During packet transmission +the skb->cb is used to provide additional data to device's header_ops->create +function. Be aware that this data can be overridden later (when socket code +submits skb to qdisc), so if you need something from that cb later, you should +store info in the skb->data on your own. + +To hook the MLME interface you have to populate the ml_priv field of your +net_device with a pointer to struct ieee802154_mlme_ops instance. The fields +assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional. +All other fields are required. + + +SoftMAC +======= + +The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it +provides interface for drivers registration and management of slave interfaces. + +NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4 +stack interface for network sniffers (e.g. WireShark). + +This layer is going to be extended soon. + +See header include/net/mac802154.h and several drivers in +drivers/net/ieee802154/. + + +Device drivers API +================== + +The include/net/mac802154.h defines following functions: + - struct ieee802154_hw * + ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops): + allocation of IEEE 802.15.4 compatible hardware device + + - void ieee802154_free_hw(struct ieee802154_hw *hw): + freeing allocated hardware device + + - int ieee802154_register_hw(struct ieee802154_hw *hw): + register PHY which is the allocated hardware device, in the system + + - void ieee802154_unregister_hw(struct ieee802154_hw *hw): + freeing registered PHY + + - void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, + u8 lqi): + telling 802.15.4 module there is a new received frame in the skb with + the RF Link Quality Indicator (LQI) from the hardware device + + - void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, + bool ifs_handling): + telling 802.15.4 module the frame in the skb is or going to be + transmitted through the hardware device + +The device driver must implement the following callbacks in the IEEE 802.15.4 +operations structure at least: +struct ieee802154_ops { + ... + int (*start)(struct ieee802154_hw *hw); + void (*stop)(struct ieee802154_hw *hw); + ... + int (*xmit_async)(struct ieee802154_hw *hw, struct sk_buff *skb); + int (*ed)(struct ieee802154_hw *hw, u8 *level); + int (*set_channel)(struct ieee802154_hw *hw, u8 page, u8 channel); + ... +}; + + - int start(struct ieee802154_hw *hw): + handler that 802.15.4 module calls for the hardware device initialization. + + - void stop(struct ieee802154_hw *hw): + handler that 802.15.4 module calls for the hardware device cleanup. + + - int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb): + handler that 802.15.4 module calls for each frame in the skb going to be + transmitted through the hardware device. + + - int ed(struct ieee802154_hw *hw, u8 *level): + handler that 802.15.4 module calls for Energy Detection from the hardware + device. + + - int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel): + set radio for listening on specific channel of the hardware device. + +Moreover IEEE 802.15.4 device operations structure should be filled. + +Fake drivers +============ + +In addition there is a driver available which simulates a real device with +SoftMAC (fakelb - IEEE 802.15.4 loopback driver) interface. This option +provides a possibility to test and debug the stack without usage of real hardware. + +See sources in drivers/net/ieee802154 folder for more details. + + +6LoWPAN Linux implementation +============================ + +The IEEE 802.15.4 standard specifies an MTU of 127 bytes, yielding about 80 +octets of actual MAC payload once security is turned on, on a wireless link +with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format +[RFC4944] was specified to carry IPv6 datagrams over such constrained links, +taking into account limited bandwidth, memory, or energy resources that are +expected in applications such as wireless Sensor Networks. [RFC4944] defines +a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header +to support the IPv6 minimum MTU requirement [RFC2460], and stateless header +compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the +relatively large IPv6 and UDP headers down to (in the best case) several bytes. + +In September 2011 the standard update was published - [RFC6282]. +It deprecates HC1 and HC2 compression and defines IPHC encoding format which is +used in this Linux implementation. + +All the code related to 6lowpan you may find in files: net/6lowpan/* +and net/ieee802154/6lowpan/* + +To setup a 6LoWPAN interface you need: +1. Add IEEE802.15.4 interface and set channel and PAN ID; +2. Add 6lowpan interface by command like: + # ip link add link wpan0 name lowpan0 type lowpan +3. Bring up 'lowpan0' interface diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt new file mode 100644 index 000000000..f90643ef3 --- /dev/null +++ b/Documentation/networking/igb.txt @@ -0,0 +1,129 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Support + +Identifying Your Adapter +======================== + +This driver supports all 82575, 82576 and 82580-based Intel (R) gigabit network +connections. + +For specific information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Command Line Parameters +======================= + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +max_vfs +------- +Valid Range: 0-7 +Default Value: 0 + +This parameter adds support for SR-IOV. It causes the driver to spawn up to +max_vfs worth of virtual function. + +Additional Configurations +========================= + + Jumbo Frames + ------------ + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ip command to increase the MTU size. + For example: + + ip link set dev eth<x> mtu 9000 + + This setting is not saved across reboots. + + Notes: + + - The maximum MTU setting for Jumbo Frames is 9216. This value coincides + with the maximum Jumbo Frames size of 9234 bytes. + + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + version of ethtool can be found at: + + https://www.kernel.org/pub/software/network/ethtool/ + + Enabling Wake on LAN* (WoL) + --------------------------- + WoL is configured through the ethtool* utility. + + For instructions on enabling WoL with ethtool, refer to the ethtool man page. + + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the igb driver must be + loaded when shutting down or rebooting the system. + + Wake On LAN is only supported on port A of multi-port adapters. + + Wake On LAN is not supported for the Intel(R) Gigabit VT Quad Port Server + Adapter. + + Multiqueue + ---------- + In this mode, a separate MSI-X vector is allocated for each queue and one + for "other" interrupts such as link status change and errors. All + interrupts are throttled via interrupt moderation. Interrupt moderation + must be used to avoid interrupt storms while the driver is processing one + interrupt. The moderation value should be at least as large as the expected + time for the driver to process an interrupt. Multiqueue is off by default. + + REQUIREMENTS: MSI-X support is required for Multiqueue. If MSI-X is not + found, the system will fallback to MSI or to Legacy interrupts. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF(n) + + Where n=the VF that attempted to do the spoofing. + + Setting MAC Address, VLAN and Rate Limit Using IProute2 Tool + ------------------------------------------------------------ + You can set a MAC address of a Virtual Function (VF), a default VLAN and the + rate limit using the IProute2 tool. Download the latest version of the + iproute2 tool from Sourceforge if your version does not have all the + features you require. + + +Support +======= + +For general information, go to the Intel support website at: + + www.intel.com/support/ + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/igbvf.txt b/Documentation/networking/igbvf.txt new file mode 100644 index 000000000..bd404735f --- /dev/null +++ b/Documentation/networking/igbvf.txt @@ -0,0 +1,80 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Support + +This file describes the igbvf Linux* Base Driver for Intel Network Connection. + +The igbvf driver supports 82576-based virtual function devices that can only +be activated on kernels that support SR-IOV. SR-IOV requires the correct +platform and OS support. + +The igbvf driver requires the igb driver, version 2.0 or later. The igbvf +driver supports virtual functions generated by the igb driver with a max_vfs +value of 1 or greater. For more information on the max_vfs parameter refer +to the README included with the igb driver. + +The guest OS loading the igbvf driver must support MSI-X interrupts. + +This driver is only supported as a loadable module at this time. Intel is +not supplying patches against the kernel source to allow for static linking +of the driver. For questions related to hardware requirements, refer to the +documentation supplied with your Intel Gigabit adapter. All hardware +requirements listed apply to use with Linux. + +Instructions on updating ethtool can be found in the section "Additional +Configurations" later in this document. + +VLANs: There is a limit of a total of 32 shared VLANs to 1 or more VFs. + +Identifying Your Adapter +======================== + +The igbvf driver supports 82576-based virtual function devices that can only +be activated on kernels that support SR-IOV. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the +networking link on the left to search for your adapter: + + http://downloadcenter.intel.com/scripts-df-external/Support_Intel.aspx + +Additional Configurations +========================= + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 3.0 or later is required for this functionality, although we + strongly recommend downloading the latest version at: + + https://www.kernel.org/pub/software/network/ethtool/ + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt new file mode 100644 index 000000000..a17dac9dc --- /dev/null +++ b/Documentation/networking/ila.txt @@ -0,0 +1,285 @@ +Identifier Locator Addressing (ILA) + + +Introduction +============ + +Identifier-locator addressing (ILA) is a technique used with IPv6 that +differentiates between location and identity of a network node. Part of an +address expresses the immutable identity of the node, and another part +indicates the location of the node which can be dynamic. Identifier-locator +addressing can be used to efficiently implement overlay networks for +network virtualization as well as solutions for use cases in mobility. + +ILA can be thought of as means to implement an overlay network without +encapsulation. This is accomplished by performing network address +translation on destination addresses as a packet traverses a network. To +the network, an ILA translated packet appears to be no different than any +other IPv6 packet. For instance, if the transport protocol is TCP then an +ILA translated packet looks like just another TCP/IPv6 packet. The +advantage of this is that ILA is transparent to the network so that +optimizations in the network, such as ECMP, RSS, GRO, GSO, etc., just work. + +The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila. + + +ILA terminology +=============== + + - Identifier A number that identifies an addressable node in the network + independent of its location. ILA identifiers are sixty-four + bit values. + + - Locator A network prefix that routes to a physical host. Locators + provide the topological location of an addressed node. ILA + locators are sixty-four bit prefixes. + + - ILA mapping + A mapping of an ILA identifier to a locator (or to a + locator and meta data). An ILA domain maintains a database + that contains mappings for all destinations in the domain. + + - SIR address + An IPv6 address composed of a SIR prefix (upper sixty- + four bits) and an identifier (lower sixty-four bits). + SIR addresses are visible to applications and provide a + means for them to address nodes independent of their + location. + + - ILA address + An IPv6 address composed of a locator (upper sixty-four + bits) and an identifier (low order sixty-four bits). ILA + addresses are never visible to an application. + + - ILA host An end host that is capable of performing ILA translations + on transmit or receive. + + - ILA router A network node that performs ILA translation and forwarding + of translated packets. + + - ILA forwarding cache + A type of ILA router that only maintains a working set + cache of mappings. + + - ILA node A network node capable of performing ILA translations. This + can be an ILA router, ILA forwarding cache, or ILA host. + + +Operation +========= + +There are two fundamental operations with ILA: + + - Translate a SIR address to an ILA address. This is performed on ingress + to an ILA overlay. + + - Translate an ILA address to a SIR address. This is performed on egress + from the ILA overlay. + +ILA can be deployed either on end hosts or intermediate devices in the +network; these are provided by "ILA hosts" and "ILA routers" respectively. +Configuration and datapath for these two points of deployment is somewhat +different. + +The diagram below illustrates the flow of packets through ILA as well +as showing ILA hosts and routers. + + +--------+ +--------+ + | Host A +-+ +--->| Host B | + | | | (2) ILA (') | | + +--------+ | ...addressed.... ( ) +--------+ + V +---+--+ . packet . +---+--+ (_) + (1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR + addressed +->|router| . . |router|->-+ addressed + packet +---+--+ . IPv6 . +---+--+ packet + / . Network . + / . . +--+-++--------+ + +--------+ / . . |ILA || Host | + | Host +--+ . .- -|host|| | + | | . . +--+-++--------+ + +--------+ ................ + + +Transport checksum handling +=========================== + +When an address is translated by ILA, an encapsulated transport checksum +that includes the translated address in a pseudo header may be rendered +incorrect on the wire. This is a problem for intermediate devices, +including checksum offload in NICs, that process the checksum. There are +three options to deal with this: + +- no action Allow the checksum to be incorrect on the wire. Before + a receiver verifies a checksum the ILA to SIR address + translation must be done. + +- adjust transport checksum + When ILA translation is performed the packet is parsed + and if a transport layer checksum is found then it is + adjusted to reflect the correct checksum per the + translated address. + +- checksum neutral mapping + When an address is translated the difference can be offset + elsewhere in a part of the packet that is covered by + the checksum. The low order sixteen bits of the identifier + are used. This method is preferred since it doesn't require + parsing a packet beyond the IP header and in most cases the + adjustment can be precomputed and saved with the mapping. + +Note that the checksum neutral adjustment affects the low order sixteen +bits of the identifier. When ILA to SIR address translation is done on +egress the low order bits are restored to the original value which +restores the identifier as it was originally sent. + + +Identifier types +================ + +ILA defines different types of identifiers for different use cases. + +The defined types are: + + 0: interface identifier + + 1: locally unique identifier + + 2: virtual networking identifier for IPv4 address + + 3: virtual networking identifier for IPv6 unicast address + + 4: virtual networking identifier for IPv6 multicast address + + 5: non-local address identifier + +In the current implementation of kernel ILA only locally unique identifiers +(LUID) are supported. LUID allows for a generic, unformatted 64 bit +identifier. + + +Identifier formats +================== + +Kernel ILA supports two optional fields in an identifier for formatting: +"C-bit" and "identifier type". The presence of these fields is determined +by configuration as demonstrated below. + +If the identifier type is present it occupies the three highest order +bits of an identifier. The possible values are given in the above list. + +If the C-bit is present, this is used as an indication that checksum +neutral mapping has been done. The C-bit can only be set in an +ILA address, never a SIR address. + +In the simplest format the identifier types, C-bit, and checksum +adjustment value are not present so an identifier is considered an +unstructured sixty-four bit value. + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Identifier | + + + + | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The checksum neutral adjustment may be configured to always be +present using neutral-map-auto. In this case there is no C-bit, but the +checksum adjustment is in the low order 16 bits. The identifier is +still sixty-four bits. + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Identifier | + | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The C-bit may used to explicitly indicate that checksum neutral +mapping has been applied to an ILA address. The format is: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | |C| Identifier | + | +-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +The identifier type field may be present to indicate the identifier +type. If it is not present then the type is inferred based on mapping +configuration. The checksum neutral adjustment may automatically +used with the identifier type as illustrated below. + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Type| Identifier | + +-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +If the identifier type and the C-bit can be present simultaneously so +the identifier format would be: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Type|C| Identifier | + +-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | Checksum-neutral adjustment | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + +Configuration +============= + +There are two methods to configure ILA mappings. One is by using LWT routes +and the other is ila_xlat (called from NFHOOK PREROUTING hook). ila_xlat +is intended to be used in the receive path for ILA hosts . + +An ILA router has also been implemented in XDP. Description of that is +outside the scope of this document. + +The usage of for ILA LWT routes is: + +ip route add DEST/128 encap ila LOC csum-mode MODE ident-type TYPE via ADDR + +Destination (DEST) can either be a SIR address (for an ILA host or ingress +ILA router) or an ILA address (egress ILA router). LOC is the sixty-four +bit locator (with format W:X:Y:Z) that overwrites the upper sixty-four +bits of the destination address. Checksum MODE is one of "no-action", +"adj-transport", "neutral-map", and "neutral-map-auto". If neutral-map is +set then the C-bit will be present. Identifier TYPE one of "luid" or +"use-format." In the case of use-format, the identifier type field is +present and the effective type is taken from that. + +The usage of ila_xlat is: + +ip ila add loc_match MATCH loc LOC csum-mode MODE ident-type TYPE + +MATCH indicates the incoming locator that must be matched to apply +a the translaiton. LOC is the locator that overwrites the upper +sixty-four bits of the destination address. MODE and TYPE have the +same meanings as described above. + + +Some examples +============= + +# Configure an ILA route that uses checksum neutral mapping as well +# as type field. Note that the type field is set in the SIR address +# (the 2000 implies type is 1 which is LUID). +ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \ + csum-mode neutral-map ident-type use-format + +# Configure an ILA LWT route that uses auto checksum neutral mapping +# (no C-bit) and configure identifier type to be LUID so that the +# identifier type field will not be present. +ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \ + csum-mode neutral-map-auto ident-type luid + +ila_xlat configuration + +# Configure an ILA to SIR mapping that matches a locator and overwrites +# it with a SIR address (3333:0:0:1 in this example). The C-bit and +# identifier field are used. +ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ + csum-mode neutral-map-auto ident-type use-format + +# Configure an ILA to SIR mapping where checksum neutral is automatically +# set without the C-bit and the identifier type is configured to be LUID +# so that the identifier type field is not present. +ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \ + csum-mode neutral-map-auto ident-type use-format diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst new file mode 100644 index 000000000..fcd710f2c --- /dev/null +++ b/Documentation/networking/index.rst @@ -0,0 +1,30 @@ +Linux Networking Documentation +============================== + +Contents: + +.. toctree:: + :maxdepth: 2 + + netdev-FAQ + af_xdp + batman-adv + can + can_ucan_protocol + dpaa2/index + e100 + e1000 + kapi + z8530book + msg_zerocopy + failover + net_failover + alias + bridge + +.. only:: subproject + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt new file mode 100644 index 000000000..3c617d620 --- /dev/null +++ b/Documentation/networking/ip-sysctl.txt @@ -0,0 +1,2195 @@ +/proc/sys/net/ipv4/* Variables: + +ip_forward - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Forward Packets between interfaces. + + This variable is special, its change resets all configuration + parameters to their default state (RFC1122 for hosts, RFC1812 + for routers) + +ip_default_ttl - INTEGER + Default value of TTL field (Time To Live) for outgoing (but not + forwarded) IP packets. Should be between 1 and 255 inclusive. + Default: 64 (as recommended by RFC1700) + +ip_no_pmtu_disc - INTEGER + Disable Path MTU Discovery. If enabled in mode 1 and a + fragmentation-required ICMP is received, the PMTU to this + destination will be set to min_pmtu (see below). You will need + to raise min_pmtu to the smallest interface MTU on your system + manually if you want to avoid locally generated fragments. + + In mode 2 incoming Path MTU Discovery messages will be + discarded. Outgoing frames are handled the same as in mode 1, + implicitly setting IP_PMTUDISC_DONT on every created socket. + + Mode 3 is a hardened pmtu discover mode. The kernel will only + accept fragmentation-needed errors if the underlying protocol + can verify them besides a plain socket lookup. Current + protocols for which pmtu events will be honored are TCP, SCTP + and DCCP as they verify e.g. the sequence number or the + association. This mode should not be enabled globally but is + only intended to secure e.g. name servers in namespaces where + TCP path mtu must still work but path MTU information of other + protocols should be discarded. If enabled globally this mode + could break other protocols. + + Possible values: 0-3 + Default: FALSE + +min_pmtu - INTEGER + default 552 - minimum discovered Path MTU + +ip_forward_use_pmtu - BOOLEAN + By default we don't trust protocol path MTUs while forwarding + because they could be easily forged and can lead to unwanted + fragmentation by the router. + You only need to enable this if you have user-space software + which tries to discover path mtus by itself and depends on the + kernel honoring this information. This is normally not the + case. + Default: 0 (disabled) + Possible values: + 0 - disabled + 1 - enabled + +fwmark_reflect - BOOLEAN + Controls the fwmark of kernel-generated IPv4 reply packets that are not + associated with a socket for example, TCP RSTs or ICMP echo replies). + If unset, these packets have a fwmark of zero. If set, they have the + fwmark of the packet they are replying to. + Default: 0 + +fib_multipath_use_neigh - BOOLEAN + Use status of existing neighbor entry when determining nexthop for + multipath routes. If disabled, neighbor information is not used and + packets could be directed to a failed nexthop. Only valid for kernels + built with CONFIG_IP_ROUTE_MULTIPATH enabled. + Default: 0 (disabled) + Possible values: + 0 - disabled + 1 - enabled + +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. Only valid + for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled. + Default: 0 (Layer 3) + Possible values: + 0 - Layer 3 + 1 - Layer 4 + +ip_forward_update_priority - INTEGER + Whether to update SKB priority from "TOS" field in IPv4 header after it + is forwarded. The new SKB priority is mapped from TOS field value + according to an rt_tos2priority table (see e.g. man tc-prio). + Default: 1 (Update priority.) + Possible values: + 0 - Do not update priority. + 1 - Update priority. + +route/max_size - INTEGER + Maximum number of routes allowed in the kernel. Increase + this when using large numbers of interfaces and/or routes. + From linux kernel 3.6 onwards, this is deprecated for ipv4 + as route cache is no longer used. + +neigh/default/gc_thresh1 - INTEGER + Minimum number of entries to keep. Garbage collector will not + purge entries if there are fewer than this number. + Default: 128 + +neigh/default/gc_thresh2 - INTEGER + Threshold when garbage collector becomes more aggressive about + purging entries. Entries older than 5 seconds will be cleared + when over this number. + Default: 512 + +neigh/default/gc_thresh3 - INTEGER + Maximum number of neighbor entries allowed. Increase this + when using large numbers of interfaces and when communicating + with large numbers of directly-connected peers. + Default: 1024 + +neigh/default/unres_qlen_bytes - INTEGER + The maximum number of bytes which may be used by packets + queued for each unresolved address by other network layers. + (added in linux 3.3) + Setting negative value is meaningless and will return error. + Default: SK_WMEM_MAX, (same as net.core.wmem_default). + Exact value depends on architecture and kernel options, + but should be enough to allow queuing 256 packets + of medium size. + +neigh/default/unres_qlen - INTEGER + The maximum number of packets which may be queued for each + unresolved address by other network layers. + (deprecated in linux 3.3) : use unres_qlen_bytes instead. + Prior to linux 3.3, the default value is 3 which may cause + unexpected packet loss. The current default value is calculated + according to default value of unres_qlen_bytes and true size of + packet. + Default: 101 + +mtu_expires - INTEGER + Time, in seconds, that cached PMTU information is kept. + +min_adv_mss - INTEGER + The advertised MSS depends on the first hop route MTU, but will + never be lower than this setting. + +IP Fragmentation: + +ipfrag_high_thresh - LONG INTEGER + Maximum memory used to reassemble IP fragments. + +ipfrag_low_thresh - LONG INTEGER + (Obsolete since linux-4.17) + Maximum memory used to reassemble IP fragments before the kernel + begins to remove incomplete fragment queues to free up resources. + The kernel still accepts new fragments for defragmentation. + +ipfrag_time - INTEGER + Time in seconds to keep an IP fragment in memory. + +ipfrag_max_dist - INTEGER + ipfrag_max_dist is a non-negative integer value which defines the + maximum "disorder" which is allowed among fragments which share a + common IP source address. Note that reordering of packets is + not unusual, but if a large number of fragments arrive from a source + IP address while a particular fragment queue remains incomplete, it + probably indicates that one or more fragments belonging to that queue + have been lost. When ipfrag_max_dist is positive, an additional check + is done on fragments before they are added to a reassembly queue - if + ipfrag_max_dist (or more) fragments have arrived from a particular IP + address between additions to any IP fragment queue using that source + address, it's presumed that one or more fragments in the queue are + lost. The existing fragment queue will be dropped, and a new one + started. An ipfrag_max_dist value of zero disables this check. + + Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can + result in unnecessarily dropping fragment queues when normal + reordering of packets occurs, which could lead to poor application + performance. Using a very large value, e.g. 50000, increases the + likelihood of incorrectly reassembling IP fragments that originate + from different IP datagrams, which could result in data corruption. + Default: 64 + +INET peer storage: + +inet_peer_threshold - INTEGER + The approximate size of the storage. Starting from this threshold + entries will be thrown aggressively. This threshold also determines + entries' time-to-live and time intervals between garbage collection + passes. More entries, less time-to-live, less GC interval. + +inet_peer_minttl - INTEGER + Minimum time-to-live of entries. Should be enough to cover fragment + time-to-live on the reassembling side. This minimum time-to-live is + guaranteed if the pool size is less than inet_peer_threshold. + Measured in seconds. + +inet_peer_maxttl - INTEGER + Maximum time-to-live of entries. Unused entries will expire after + this period of time if there is no memory pressure on the pool (i.e. + when the number of entries in the pool is very small). + Measured in seconds. + +TCP variables: + +somaxconn - INTEGER + Limit of socket listen() backlog, known in userspace as SOMAXCONN. + Defaults to 128. See also tcp_max_syn_backlog for additional tuning + for TCP sockets. + +tcp_abort_on_overflow - BOOLEAN + If listening service is too slow to accept new connections, + reset them. Default state is FALSE. It means that if overflow + occurred due to a burst, connection will recover. Enable this + option _only_ if you are really sure that listening daemon + cannot be tuned to accept connections faster. Enabling this + option can harm clients of your server. + +tcp_adv_win_scale - INTEGER + Count buffering overhead as bytes/2^tcp_adv_win_scale + (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), + if it is <= 0. + Possible values are [-31, 31], inclusive. + Default: 1 + +tcp_allowed_congestion_control - STRING + Show/set the congestion control choices available to non-privileged + processes. The list is a subset of those listed in + tcp_available_congestion_control. + Default is "reno" and the default setting (tcp_congestion_control). + +tcp_app_win - INTEGER + Reserve max(window/2^tcp_app_win, mss) of window for application + buffer. Value 0 is special, it means that nothing is reserved. + Default: 31 + +tcp_autocorking - BOOLEAN + Enable TCP auto corking : + When applications do consecutive small write()/sendmsg() system calls, + we try to coalesce these small writes as much as possible, to lower + total amount of sent packets. This is done if at least one prior + packet for the flow is waiting in Qdisc queues or device transmit + queue. Applications can still use TCP_CORK for optimal behavior + when they know how/when to uncork their sockets. + Default : 1 + +tcp_available_congestion_control - STRING + Shows the available congestion control choices that are registered. + More congestion control algorithms may be available as modules, + but not loaded. + +tcp_base_mss - INTEGER + The initial value of search_low to be used by the packetization layer + Path MTU discovery (MTU probing). If MTU probing is enabled, + this is the initial MSS used by the connection. + +tcp_min_snd_mss - INTEGER + TCP SYN and SYNACK messages usually advertise an ADVMSS option, + as described in RFC 1122 and RFC 6691. + If this ADVMSS option is smaller than tcp_min_snd_mss, + it is silently capped to tcp_min_snd_mss. + + Default : 48 (at least 8 bytes of payload per segment) + +tcp_congestion_control - STRING + Set the congestion control algorithm to be used for new + connections. The algorithm "reno" is always available, but + additional choices may be available based on kernel configuration. + Default is set as part of kernel configuration. + For passive connections, the listener congestion control choice + is inherited. + [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] + +tcp_dsack - BOOLEAN + Allows TCP to send "duplicate" SACKs. + +tcp_early_retrans - INTEGER + Tail loss probe (TLP) converts RTOs occurring due to tail + losses into fast recovery (draft-ietf-tcpm-rack). Note that + TLP requires RACK to function properly (see tcp_recovery below) + Possible values: + 0 disables TLP + 3 or 4 enables TLP + Default: 3 + +tcp_ecn - INTEGER + Control use of Explicit Congestion Notification (ECN) by TCP. + ECN is used only when both ends of the TCP connection indicate + support for it. This feature is useful in avoiding losses due + to congestion by allowing supporting routers to signal + congestion before having to drop packets. + Possible values are: + 0 Disable ECN. Neither initiate nor accept ECN. + 1 Enable ECN when requested by incoming connections and + also request ECN on outgoing connection attempts. + 2 Enable ECN when requested by incoming connections + but do not request ECN on outgoing connections. + Default: 2 + +tcp_ecn_fallback - BOOLEAN + If the kernel detects that ECN connection misbehaves, enable fall + back to non-ECN. Currently, this knob implements the fallback + from RFC3168, section 6.1.1.1., but we reserve that in future, + additional detection mechanisms could be implemented under this + knob. The value is not used, if tcp_ecn or per route (or congestion + control) ECN settings are disabled. + Default: 1 (fallback enabled) + +tcp_fack - BOOLEAN + This is a legacy option, it has no effect anymore. + +tcp_fin_timeout - INTEGER + The length of time an orphaned (no longer referenced by any + application) connection will remain in the FIN_WAIT_2 state + before it is aborted at the local end. While a perfectly + valid "receive only" state for an un-orphaned connection, an + orphaned connection in FIN_WAIT_2 state could otherwise wait + forever for the remote to close its end of the connection. + Cf. tcp_max_orphans + Default: 60 seconds + +tcp_frto - INTEGER + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. + F-RTO is an enhanced recovery algorithm for TCP retransmission + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. + +tcp_invalid_ratelimit - INTEGER + Limit the maximal rate for sending duplicate acknowledgments + in response to incoming TCP packets that are for an existing + connection but that are invalid due to any of these reasons: + + (a) out-of-window sequence number, + (b) out-of-window acknowledgment number, or + (c) PAWS (Protection Against Wrapped Sequence numbers) check failure + + This can help mitigate simple "ack loop" DoS attacks, wherein + a buggy or malicious middlebox or man-in-the-middle can + rewrite TCP header fields in manner that causes each endpoint + to think that the other is sending invalid TCP segments, thus + causing each side to send an unterminating stream of duplicate + acknowledgments for invalid segments. + + Using 0 disables rate-limiting of dupacks in response to + invalid segments; otherwise this value specifies the minimal + space between sending such dupacks, in milliseconds. + + Default: 500 (milliseconds). + +tcp_keepalive_time - INTEGER + How often TCP sends out keepalive messages when keepalive is enabled. + Default: 2hours. + +tcp_keepalive_probes - INTEGER + How many keepalive probes TCP sends out, until it decides that the + connection is broken. Default value: 9. + +tcp_keepalive_intvl - INTEGER + How frequently the probes are send out. Multiplied by + tcp_keepalive_probes it is time to kill not responding connection, + after probes started. Default value: 75sec i.e. connection + will be aborted after ~11 minutes of retries. + +tcp_l3mdev_accept - BOOLEAN + Enables child sockets to inherit the L3 master device index. + Enabling this option allows a "global" listen socket to work + across L3 master domains (e.g., VRFs) with connected sockets + derived from the listen socket to be bound to the L3 domain in + which the packets originated. Only valid when the kernel was + compiled with CONFIG_NET_L3_MASTER_DEV. + +tcp_low_latency - BOOLEAN + This is a legacy option, it has no effect anymore. + +tcp_max_orphans - INTEGER + Maximal number of TCP sockets not attached to any user file handle, + held by system. If this number is exceeded orphaned connections are + reset immediately and warning is printed. This limit exists + only to prevent simple DoS attacks, you _must_ not rely on this + or lower the limit artificially, but rather increase it + (probably, after increasing installed memory), + if network conditions require more than default value, + and tune network services to linger and kill such states + more aggressively. Let me to remind again: each orphan eats + up to ~64K of unswappable memory. + +tcp_max_syn_backlog - INTEGER + Maximal number of remembered connection requests, which have not + received an acknowledgment from connecting client. + The minimal value is 128 for low memory machines, and it will + increase in proportion to the memory of machine. + If server suffers from overload, try increasing this number. + +tcp_max_tw_buckets - INTEGER + Maximal number of timewait sockets held by system simultaneously. + If this number is exceeded time-wait socket is immediately destroyed + and warning is printed. This limit exists only to prevent + simple DoS attacks, you _must_ not lower the limit artificially, + but rather increase it (probably, after increasing installed memory), + if network conditions require more than default value. + +tcp_mem - vector of 3 INTEGERs: min, pressure, max + min: below this number of pages TCP is not bothered about its + memory appetite. + + pressure: when amount of memory allocated by TCP exceeds this number + of pages, TCP moderates its memory consumption and enters memory + pressure mode, which is exited when memory consumption falls + under "min". + + max: number of pages allowed for queueing by all TCP sockets. + + Defaults are calculated at boot time from amount of available + memory. + +tcp_min_rtt_wlen - INTEGER + The window length of the windowed min filter to track the minimum RTT. + A shorter window lets a flow more quickly pick up new (higher) + minimum RTT when it is moved to a longer path (e.g., due to traffic + engineering). A longer window makes the filter more resistant to RTT + inflations such as transient congestion. The unit is seconds. + Possible values: 0 - 86400 (1 day) + Default: 300 + +tcp_moderate_rcvbuf - BOOLEAN + If set, TCP performs receive buffer auto-tuning, attempting to + automatically size the buffer (no greater than tcp_rmem[2]) to + match the size required by the path for full throughput. Enabled by + default. + +tcp_mtu_probing - INTEGER + Controls TCP Packetization-Layer Path MTU Discovery. Takes three + values: + 0 - Disabled + 1 - Disabled by default, enabled when an ICMP black hole detected + 2 - Always enabled, use initial MSS of tcp_base_mss. + +tcp_probe_interval - UNSIGNED INTEGER + Controls how often to start TCP Packetization-Layer Path MTU + Discovery reprobe. The default is reprobing every 10 minutes as + per RFC4821. + +tcp_probe_threshold - INTEGER + Controls when TCP Packetization-Layer Path MTU Discovery probing + will stop in respect to the width of search range in bytes. Default + is 8 bytes. + +tcp_no_metrics_save - BOOLEAN + By default, TCP saves various connection metrics in the route cache + when the connection closes, so that connections established in the + near future can use these to set initial conditions. Usually, this + increases overall performance, but may sometimes cause performance + degradation. If set, TCP will not cache metrics on closing + connections. + +tcp_orphan_retries - INTEGER + This value influences the timeout of a locally closed TCP connection, + when RTO retransmissions remain unacknowledged. + See tcp_retries2 for more details. + + The default value is 8. + If your machine is a loaded WEB server, + you should think about lowering this value, such sockets + may consume significant resources. Cf. tcp_max_orphans. + +tcp_recovery - INTEGER + This value is a bitmap to enable various experimental loss recovery + features. + + RACK: 0x1 enables the RACK loss detection for fast detection of lost + retransmissions and tail drops. It also subsumes and disables + RFC6675 recovery for SACK connections. + RACK: 0x2 makes RACK's reordering window static (min_rtt/4). + RACK: 0x4 disables RACK's DUPACK threshold heuristic + + Default: 0x1 + +tcp_reordering - INTEGER + Initial reordering level of packets in a TCP stream. + TCP stack can then dynamically adjust flow reordering level + between this initial value and tcp_max_reordering + Default: 3 + +tcp_max_reordering - INTEGER + Maximal reordering level of packets in a TCP stream. + 300 is a fairly conservative value, but you might increase it + if paths are using per packet load balancing (like bonding rr mode) + Default: 300 + +tcp_retrans_collapse - BOOLEAN + Bug-to-bug compatibility with some broken printers. + On retransmit try to send bigger packets to work around bugs in + certain TCP stacks. + +tcp_retries1 - INTEGER + This value influences the time, after which TCP decides, that + something is wrong due to unacknowledged RTO retransmissions, + and reports this suspicion to the network layer. + See tcp_retries2 for more details. + + RFC 1122 recommends at least 3 retransmissions, which is the + default. + +tcp_retries2 - INTEGER + This value influences the timeout of an alive TCP connection, + when RTO retransmissions remain unacknowledged. + Given a value of N, a hypothetical TCP connection following + exponential backoff with an initial RTO of TCP_RTO_MIN would + retransmit N times before killing the connection at the (N+1)th RTO. + + The default value of 15 yields a hypothetical timeout of 924.6 + seconds and is a lower bound for the effective timeout. + TCP will effectively time out at the first RTO which exceeds the + hypothetical timeout. + + RFC 1122 recommends at least 100 seconds for the timeout, + which corresponds to a value of at least 8. + +tcp_rfc1337 - BOOLEAN + If set, the TCP stack behaves conforming to RFC1337. If unset, + we are not conforming to RFC, but prevent TCP TIME_WAIT + assassination. + Default: 0 + +tcp_rmem - vector of 3 INTEGERs: min, default, max + min: Minimal size of receive buffer used by TCP sockets. + It is guaranteed to each TCP socket, even under moderate memory + pressure. + Default: 4K + + default: initial size of receive buffer used by TCP sockets. + This value overrides net.core.rmem_default used by other protocols. + Default: 87380 bytes. This value results in window of 65535 with + default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit + less for default tcp_app_win. See below about these variables. + + max: maximal size of receive buffer allowed for automatically + selected receiver buffers for TCP socket. This value does not override + net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + automatic tuning of that socket's receive buffer size, in which + case this value is ignored. + Default: between 87380B and 6MB, depending on RAM size. + +tcp_sack - BOOLEAN + Enable select acknowledgments (SACKS). + +tcp_comp_sack_delay_ns - LONG INTEGER + TCP tries to reduce number of SACK sent, using a timer + based on 5% of SRTT, capped by this sysctl, in nano seconds. + The default is 1ms, based on TSO autosizing period. + + Default : 1,000,000 ns (1 ms) + +tcp_comp_sack_nr - INTEGER + Max numer of SACK that can be compressed. + Using 0 disables SACK compression. + + Detault : 44 + +tcp_slow_start_after_idle - BOOLEAN + If set, provide RFC2861 behavior and time out the congestion + window after an idle period. An idle period is defined at + the current RTO. If unset, the congestion window will not + be timed out after an idle period. + Default: 1 + +tcp_stdurg - BOOLEAN + Use the Host requirements interpretation of the TCP urgent pointer field. + Most hosts use the older BSD interpretation, so if you turn this on + Linux might not communicate correctly with them. + Default: FALSE + +tcp_synack_retries - INTEGER + Number of times SYNACKs for a passive TCP connection attempt will + be retransmitted. Should not be higher than 255. Default value + is 5, which corresponds to 31seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for a passive TCP connection will happen after 63seconds. + +tcp_syncookies - BOOLEAN + Only valid when the kernel was compiled with CONFIG_SYN_COOKIES + Send out syncookies when the syn backlog queue of a socket + overflows. This is to prevent against the common 'SYN flood attack' + Default: 1 + + Note, that syncookies is fallback facility. + It MUST NOT be used to help highly loaded servers to stand + against legal connection rate. If you see SYN flood warnings + in your logs, but investigation shows that they occur + because of overload with legal connections, you should tune + another parameters until this warning disappear. + See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. + + syncookies seriously violate TCP protocol, do not allow + to use TCP extensions, can result in serious degradation + of some services (f.e. SMTP relaying), visible not by you, + but your clients and relays, contacting you. While you see + SYN flood warnings in logs not being really flooded, your server + is seriously misconfigured. + + If you want to test which effects syncookies have to your + network connections you can set this knob to 2 to enable + unconditionally generation of syncookies. + +tcp_fastopen - INTEGER + Enable TCP Fast Open (RFC7413) to send and accept data in the opening + SYN packet. + + The client support is enabled by flag 0x1 (on by default). The client + then must use sendmsg() or sendto() with the MSG_FASTOPEN flag, + rather than connect() to send data in SYN. + + The server support is enabled by flag 0x2 (off by default). Then + either enable for all listeners with another flag (0x400) or + enable individual listeners via TCP_FASTOPEN socket option with + the option value being the length of the syn-data backlog. + + The values (bitmap) are + 0x1: (client) enables sending data in the opening SYN on the client. + 0x2: (server) enables the server support, i.e., allowing data in + a SYN packet to be accepted and passed to the + application before 3-way handshake finishes. + 0x4: (client) send data in the opening SYN regardless of cookie + availability and without a cookie option. + 0x200: (server) accept data-in-SYN w/o any cookie option present. + 0x400: (server) enable all listeners to support Fast Open by + default without explicit TCP_FASTOPEN socket option. + + Default: 0x1 + + Note that that additional client or server features are only + effective if the basic support (0x1 and 0x2) are enabled respectively. + +tcp_fastopen_blackhole_timeout_sec - INTEGER + Initial time period in second to disable Fastopen on active TCP sockets + when a TFO firewall blackhole issue happens. + This time period will grow exponentially when more blackhole issues + get detected right after Fastopen is re-enabled and will reset to + initial value when the blackhole issue goes away. + 0 to disable the blackhole detection. + By default, it is set to 1hr. + +tcp_syn_retries - INTEGER + Number of times initial SYNs for an active TCP connection attempt + will be retransmitted. Should not be higher than 127. Default value + is 6, which corresponds to 63seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for an active TCP connection attempt will happen after 127seconds. + +tcp_timestamps - INTEGER +Enable timestamps as defined in RFC1323. + 0: Disabled. + 1: Enable timestamps as defined in RFC1323 and use random offset for + each connection rather than only using the current time. + 2: Like 1, but without random offsets. + Default: 1 + +tcp_min_tso_segs - INTEGER + Minimal number of segments per TSO frame. + Since linux-3.12, TCP does an automatic sizing of TSO frames, + depending on flow rate, instead of filling 64Kbytes packets. + For specific usages, it's possible to force TCP to build big + TSO frames. Note that TCP stack might split too big TSO packets + if available window is too small. + Default: 2 + +tcp_pacing_ss_ratio - INTEGER + sk->sk_pacing_rate is set by TCP stack using a ratio applied + to current rate. (current_rate = cwnd * mss / srtt) + If TCP is in slow start, tcp_pacing_ss_ratio is applied + to let TCP probe for bigger speeds, assuming cwnd can be + doubled every other RTT. + Default: 200 + +tcp_pacing_ca_ratio - INTEGER + sk->sk_pacing_rate is set by TCP stack using a ratio applied + to current rate. (current_rate = cwnd * mss / srtt) + If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio + is applied to conservatively probe for bigger throughput. + Default: 120 + +tcp_tso_win_divisor - INTEGER + This allows control over what percentage of the congestion window + can be consumed by a single TSO frame. + The setting of this parameter is a choice between burstiness and + building larger TSO frames. + Default: 3 + +tcp_tw_reuse - INTEGER + Enable reuse of TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. + 0 - disable + 1 - global enable + 2 - enable for loopback traffic only + It should not be changed without advice/request of technical + experts. + Default: 2 + +tcp_window_scaling - BOOLEAN + Enable window scaling as defined in RFC1323. + +tcp_wmem - vector of 3 INTEGERs: min, default, max + min: Amount of memory reserved for send buffers for TCP sockets. + Each TCP socket has rights to use it due to fact of its birth. + Default: 4K + + default: initial size of send buffer used by TCP sockets. This + value overrides net.core.wmem_default used by other protocols. + It is usually lower than net.core.wmem_default. + Default: 16K + + max: Maximal amount of memory allowed for automatically tuned + send buffers for TCP sockets. This value does not override + net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables + automatic tuning of that socket's send buffer size, in which case + this value is ignored. + Default: between 64K and 4MB, depending on RAM size. + +tcp_notsent_lowat - UNSIGNED INTEGER + A TCP socket can control the amount of unsent bytes in its write queue, + thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() + reports POLLOUT events if the amount of unsent bytes is below a per + socket value, and if the write queue is not full. sendmsg() will + also not add new buffers if the limit is hit. + + This global variable controls the amount of unsent data for + sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change + to the global variable has immediate effect. + + Default: UINT_MAX (0xFFFFFFFF) + +tcp_workaround_signed_windows - BOOLEAN + If set, assume no receipt of a window scaling option means the + remote TCP is broken and treats the window as a signed quantity. + If unset, assume the remote TCP is not broken even if we do + not receive a window scaling option from them. + Default: 0 + +tcp_thin_linear_timeouts - BOOLEAN + Enable dynamic triggering of linear timeouts for thin streams. + If set, a check is performed upon retransmission by timeout to + determine if the stream is thin (less than 4 packets in flight). + As long as the stream is found to be thin, up to 6 linear + timeouts may be performed before exponential backoff mode is + initiated. This improves retransmission latency for + non-aggressive thin streams, often found to be time-dependent. + For more information on thin streams, see + Documentation/networking/tcp-thin.txt + Default: 0 + +tcp_limit_output_bytes - INTEGER + Controls TCP Small Queue limit per tcp socket. + TCP bulk sender tends to increase packets in flight until it + gets losses notifications. With SNDBUF autotuning, this can + result in a large amount of packets queued on the local machine + (e.g.: qdiscs, CPU backlog, or device) hurting latency of other + flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes + limits the number of bytes on qdisc or device to reduce artificial + RTT/cwnd and reduce bufferbloat. + Default: 262144 + +tcp_challenge_ack_limit - INTEGER + Limits number of Challenge ACK sent per second, as recommended + in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) + Default: 100 + +UDP variables: + +udp_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + +udp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all UDP sockets. + + min: Below this number of pages UDP is not bothered about its + memory appetite. When amount of memory allocated by UDP exceeds + this number, UDP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all UDP sockets. + + Default is calculated at boot time from amount of available memory. + +udp_rmem_min - INTEGER + Minimal size of receive buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for receiving data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + Default: 4K + +udp_wmem_min - INTEGER + Minimal size of send buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for sending data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + Default: 4K + +CIPSOv4 Variables: + +cipso_cache_enable - BOOLEAN + If set, enable additions to and lookups from the CIPSO label mapping + cache. If unset, additions are ignored and lookups always result in a + miss. However, regardless of the setting the cache is still + invalidated when required when means you can safely toggle this on and + off and the cache will always be "safe". + Default: 1 + +cipso_cache_bucket_size - INTEGER + The CIPSO label cache consists of a fixed size hash table with each + hash bucket containing a number of cache entries. This variable limits + the number of entries in each hash bucket; the larger the value the + more CIPSO label mappings that can be cached. When the number of + entries in a given hash bucket reaches this limit adding new entries + causes the oldest entry in the bucket to be removed to make room. + Default: 10 + +cipso_rbm_optfmt - BOOLEAN + Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of + the CIPSO draft specification (see Documentation/netlabel for details). + This means that when set the CIPSO tag will be padded with empty + categories in order to make the packet data 32-bit aligned. + Default: 0 + +cipso_rbm_structvalid - BOOLEAN + If set, do a very strict check of the CIPSO option when + ip_options_compile() is called. If unset, relax the checks done during + ip_options_compile(). Either way is "safe" as errors are caught else + where in the CIPSO processing code but setting this to 0 (False) should + result in less work (i.e. it should be faster) but could cause problems + with other implementations that require strict checking. + Default: 0 + +IP Variables: + +ip_local_port_range - 2 INTEGERS + Defines the local port range that is used by TCP and UDP to + choose the local port. The first number is the first, the + second the last local port number. + If possible, it is better these numbers have different parity. + (one even and one odd values) + The default values are 32768 and 60999 respectively. + +ip_local_reserved_ports - list of comma separated ranges + Specify the ports which are reserved for known third-party + applications. These ports will not be used by automatic port + assignments (e.g. when calling connect() or bind() with port + number 0). Explicit port allocation behavior is unchanged. + + The format used for both input and output is a comma separated + list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and + 10). Writing to the file will clear all previously reserved + ports and update the current list with the one given in the + input. + + Note that ip_local_port_range and ip_local_reserved_ports + settings are independent and both are considered by the kernel + when determining which ports are available for automatic port + assignments. + + You can reserve ports which are not in the current + ip_local_port_range, e.g.: + + $ cat /proc/sys/net/ipv4/ip_local_port_range + 32000 60999 + $ cat /proc/sys/net/ipv4/ip_local_reserved_ports + 8080,9148 + + although this is redundant. However such a setting is useful + if later the port range is changed to a value that will + include the reserved ports. + + Default: Empty + +ip_unprivileged_port_start - INTEGER + This is a per-namespace sysctl. It defines the first + unprivileged port in the network namespace. Privileged ports + require root or CAP_NET_BIND_SERVICE in order to bind to them. + To disable all privileged ports, set this to 0. It may not + overlap with the ip_local_reserved_ports range. + + Default: 1024 + +ip_nonlocal_bind - BOOLEAN + If set, allows processes to bind() to non-local IP addresses, + which can be quite useful - but may break some applications. + Default: 0 + +ip_dynaddr - BOOLEAN + If set non-zero, enables support for dynamic addresses. + If set to a non-zero value larger than 1, a kernel log + message will be printed when dynamic address rewriting + occurs. + Default: 0 + +ip_early_demux - BOOLEAN + Optimize input packet processing down to one demux for + certain kinds of local sockets. Currently we only do this + for established TCP and connected UDP sockets. + + It may add an additional cost for pure routing workloads that + reduces overall throughput, in such case you should disable it. + Default: 1 + +tcp_early_demux - BOOLEAN + Enable early demux for established TCP sockets. + Default: 1 + +udp_early_demux - BOOLEAN + Enable early demux for connected UDP sockets. Disable this if + your system could experience more unconnected load. + Default: 1 + +icmp_echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it. + Default: 0 + +icmp_echo_ignore_broadcasts - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO and + TIMESTAMP requests sent to it via broadcast/multicast. + Default: 1 + +icmp_ratelimit - INTEGER + Limit the maximal rates for sending ICMP packets whose type matches + icmp_ratemask (see below) to specific targets. + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + Note that another sysctl, icmp_msgs_per_sec limits the number + of ICMP packets sent on all targets. + Default: 1000 + +icmp_msgs_per_sec - INTEGER + Limit maximal number of ICMP packets sent per second from this host. + Only messages whose type matches icmp_ratemask (see below) are + controlled by this limit. For security reasons, the precise count + of messages per second is randomized. + Default: 1000 + +icmp_msgs_burst - INTEGER + icmp_msgs_per_sec controls number of ICMP packets sent per second, + while icmp_msgs_burst controls the burst size of these packets. + For security reasons, the precise burst size is randomized. + Default: 50 + +icmp_ratemask - INTEGER + Mask made of ICMP types for which rates are being limited. + Significant bits: IHGFEDCBA9876543210 + Default mask: 0000001100000011000 (6168) + + Bit definitions (see include/linux/icmp.h): + 0 Echo Reply + 3 Destination Unreachable * + 4 Source Quench * + 5 Redirect + 8 Echo Request + B Time Exceeded * + C Parameter Problem * + D Timestamp Request + E Timestamp Reply + F Info Request + G Info Reply + H Address Mask Request + I Address Mask Reply + + * These are rate limited by default (see default mask above) + +icmp_ignore_bogus_error_responses - BOOLEAN + Some routers violate RFC1122 by sending bogus responses to broadcast + frames. Such violations are normally logged via a kernel warning. + If this is set to TRUE, the kernel will not give such warnings, which + will avoid log file clutter. + Default: 1 + +icmp_errors_use_inbound_ifaddr - BOOLEAN + + If zero, icmp error messages are sent with the primary address of + the exiting interface. + + If non-zero, the message will be sent with the primary address of + the interface that received the packet that caused the icmp error. + This is the behaviour network many administrators will expect from + a router. And it can make debugging complicated network layouts + much easier. + + Note that if no primary address exists for the interface selected, + then the primary address of the first non-loopback interface that + has one will be used regardless of this setting. + + Default: 0 + +igmp_max_memberships - INTEGER + Change the maximum number of multicast groups we can subscribe to. + Default: 20 + + Theoretical maximum value is bounded by having to send a membership + report in a single datagram (i.e. the report can't span multiple + datagrams, or risk confusing the switch and leaving groups you don't + intend to). + + The number of supported groups 'M' is bounded by the number of group + report entries you can fit into a single datagram of 65535 bytes. + + M = 65536-sizeof (ip header)/(sizeof(Group record)) + + Group records are variable length, with a minimum of 12 bytes. + So net.ipv4.igmp_max_memberships should not be set higher than: + + (65536-24) / 12 = 5459 + + The value 5459 assumes no IP header options, so in practice + this number may be lower. + +igmp_max_msf - INTEGER + Maximum number of addresses allowed in the source filter list for a + multicast group. + Default: 10 + +igmp_qrv - INTEGER + Controls the IGMP query robustness variable (see RFC2236 8.1). + Default: 2 (as specified by RFC2236 8.1) + Minimum: 1 (as specified by RFC6636 4.5) + +force_igmp_version - INTEGER + 0 - (default) No enforcement of a IGMP version, IGMPv1/v2 fallback + allowed. Will back to IGMPv3 mode again if all IGMPv1/v2 Querier + Present timer expires. + 1 - Enforce to use IGMP version 1. Will also reply IGMPv1 report if + receive IGMPv2/v3 query. + 2 - Enforce to use IGMP version 2. Will fallback to IGMPv1 if receive + IGMPv1 query message. Will reply report if receive IGMPv3 query. + 3 - Enforce to use IGMP version 3. The same react with default 0. + + Note: this is not the same with force_mld_version because IGMPv3 RFC3376 + Security Considerations does not have clear description that we could + ignore other version messages completely as MLDv2 RFC3810. So make + this value as default 0 is recommended. + +conf/interface/* changes special settings per interface (where +"interface" is the name of your network interface) + +conf/all/* is special, changes the settings for all interfaces + +log_martians - BOOLEAN + Log packets with impossible addresses to kernel log. + log_martians for the interface will be enabled if at least one of + conf/{all,interface}/log_martians is set to TRUE, + it will be disabled otherwise + +accept_redirects - BOOLEAN + Accept ICMP redirect messages. + accept_redirects for the interface will be enabled if: + - both conf/{all,interface}/accept_redirects are TRUE in the case + forwarding for the interface is enabled + or + - at least one of conf/{all,interface}/accept_redirects is TRUE in the + case forwarding for the interface is disabled + accept_redirects for the interface will be disabled otherwise + default TRUE (host) + FALSE (router) + +forwarding - BOOLEAN + Enable IP forwarding on this interface. This controls whether packets + received _on_ this interface can be forwarded. + +mc_forwarding - BOOLEAN + Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE + and a multicast routing daemon is required. + conf/all/mc_forwarding must also be set to TRUE to enable multicast + routing for the interface + +medium_id - INTEGER + Integer value used to differentiate the devices by the medium they + are attached to. Two devices can have different id values when + the broadcast packets are received only on one of them. + The default value 0 means that the device is the only interface + to its medium, value of -1 means that medium is not known. + + Currently, it is used to change the proxy_arp behavior: + the proxy_arp feature is enabled for packets forwarded between + two devices attached to different media. + +proxy_arp - BOOLEAN + Do proxy arp. + proxy_arp for the interface will be enabled if at least one of + conf/{all,interface}/proxy_arp is set to TRUE, + it will be disabled otherwise + +proxy_arp_pvlan - BOOLEAN + Private VLAN proxy arp. + Basically allow proxy arp replies back to the same interface + (from which the ARP request/solicitation was received). + + This is done to support (ethernet) switch features, like RFC + 3069, where the individual ports are NOT allowed to + communicate with each other, but they are allowed to talk to + the upstream router. As described in RFC 3069, it is possible + to allow these hosts to communicate through the upstream + router by proxy_arp'ing. Don't need to be used together with + proxy_arp. + + This technology is known by different names: + In RFC 3069 it is called VLAN Aggregation. + Cisco and Allied Telesyn call it Private VLAN. + Hewlett-Packard call it Source-Port filtering or port-isolation. + Ericsson call it MAC-Forced Forwarding (RFC Draft). + +shared_media - BOOLEAN + Send(router) or accept(host) RFC1620 shared media redirects. + Overrides secure_redirects. + shared_media for the interface will be enabled if at least one of + conf/{all,interface}/shared_media is set to TRUE, + it will be disabled otherwise + default TRUE + +secure_redirects - BOOLEAN + Accept ICMP redirect messages only to gateways listed in the + interface's current gateway list. Even if disabled, RFC1122 redirect + rules still apply. + Overridden by shared_media. + secure_redirects for the interface will be enabled if at least one of + conf/{all,interface}/secure_redirects is set to TRUE, + it will be disabled otherwise + default TRUE + +send_redirects - BOOLEAN + Send redirects, if router. + send_redirects for the interface will be enabled if at least one of + conf/{all,interface}/send_redirects is set to TRUE, + it will be disabled otherwise + Default: TRUE + +bootp_relay - BOOLEAN + Accept packets with source address 0.b.c.d destined + not to this host as local ones. It is supposed, that + BOOTP relay daemon will catch and forward such packets. + conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay + for the interface + default FALSE + Not Implemented Yet. + +accept_source_route - BOOLEAN + Accept packets with SRR option. + conf/all/accept_source_route must also be set to TRUE to accept packets + with SRR option on the interface + default TRUE (router) + FALSE (host) + +accept_local - BOOLEAN + Accept packets with local source addresses. In combination with + suitable routing, this can be used to direct packets between two + local interfaces over the wire and have them accepted properly. + default FALSE + +route_localnet - BOOLEAN + Do not consider loopback addresses as martian source or destination + while routing. This enables the use of 127/8 for local routing purposes. + default FALSE + +rp_filter - INTEGER + 0 - No source validation. + 1 - Strict mode as defined in RFC3704 Strict Reverse Path + Each incoming packet is tested against the FIB and if the interface + is not the best reverse path the packet check will fail. + By default failed packets are discarded. + 2 - Loose mode as defined in RFC3704 Loose Reverse Path + Each incoming packet's source address is also tested against the FIB + and if the source address is not reachable via any interface + the packet check will fail. + + Current recommended practice in RFC3704 is to enable strict mode + to prevent IP spoofing from DDos attacks. If using asymmetric routing + or other complicated routing, then loose mode is recommended. + + The max value from conf/{all,interface}/rp_filter is used + when doing source validation on the {interface}. + + Default value is 0. Note that some distributions enable it + in startup scripts. + +arp_filter - BOOLEAN + 1 - Allows you to have multiple network interfaces on the same + subnet, and have the ARPs for each interface be answered + based on whether or not the kernel would route a packet from + the ARP'd IP out that interface (therefore you must use source + based routing for this to work). In other words it allows control + of which cards (usually 1) will respond to an arp request. + + 0 - (default) The kernel can respond to arp requests with addresses + from other interfaces. This may seem wrong but it usually makes + sense, because it increases the chance of successful communication. + IP addresses are owned by the complete host on Linux, not by + particular interfaces. Only for more complex setups like load- + balancing, does this behaviour cause problems. + + arp_filter for the interface will be enabled if at least one of + conf/{all,interface}/arp_filter is set to TRUE, + it will be disabled otherwise + +arp_announce - INTEGER + Define different restriction levels for announcing the local + source IP address from IP packets in ARP requests sent on + interface: + 0 - (default) Use any local address, configured on any interface + 1 - Try to avoid local addresses that are not in the target's + subnet for this interface. This mode is useful when target + hosts reachable via this interface require the source IP + address in ARP requests to be part of their logical network + configured on the receiving interface. When we generate the + request we will check all our subnets that include the + target IP and will preserve the source address if it is from + such subnet. If there is no such subnet we select source + address according to the rules for level 2. + 2 - Always use the best local address for this target. + In this mode we ignore the source address in the IP packet + and try to select local address that we prefer for talks with + the target host. Such local address is selected by looking + for primary IP addresses on all our subnets on the outgoing + interface that include the target IP address. If no suitable + local address is found we select the first local address + we have on the outgoing interface or on all other interfaces, + with the hope we will receive reply for our request and + even sometimes no matter the source IP address we announce. + + The max value from conf/{all,interface}/arp_announce is used. + + Increasing the restriction level gives more chance for + receiving answer from the resolved target while decreasing + the level announces more valid sender's information. + +arp_ignore - INTEGER + Define different modes for sending replies in response to + received ARP requests that resolve local target IP addresses: + 0 - (default): reply for any local target IP address, configured + on any interface + 1 - reply only if the target IP address is local address + configured on the incoming interface + 2 - reply only if the target IP address is local address + configured on the incoming interface and both with the + sender's IP address are part from same subnet on this interface + 3 - do not reply for local addresses configured with scope host, + only resolutions for global and link addresses are replied + 4-7 - reserved + 8 - do not reply for all local addresses + + The max value from conf/{all,interface}/arp_ignore is used + when ARP request is received on the {interface} + +arp_notify - BOOLEAN + Define mode for notification of address and device changes. + 0 - (default): do nothing + 1 - Generate gratuitous arp requests when device is brought up + or hardware address changes. + +arp_accept - BOOLEAN + Define behavior for gratuitous ARP frames who's IP is not + already present in the ARP table: + 0 - don't create new entries in the ARP table + 1 - create new entries in the ARP table + + Both replies and requests type gratuitous arp will trigger the + ARP table to be updated, if this setting is on. + + If the ARP table already contains the IP address of the + gratuitous arp frame, the arp table will be updated regardless + if this setting is on or off. + +mcast_solicit - INTEGER + The maximum number of multicast probes in INCOMPLETE state, + when the associated hardware address is unknown. Defaults + to 3. + +ucast_solicit - INTEGER + The maximum number of unicast probes in PROBE state, when + the hardware address is being reconfirmed. Defaults to 3. + +app_solicit - INTEGER + The maximum number of probes to send to the user space ARP daemon + via netlink before dropping back to multicast probes (see + mcast_resolicit). Defaults to 0. + +mcast_resolicit - INTEGER + The maximum number of multicast probes after unicast and + app probes in PROBE state. Defaults to 0. + +disable_policy - BOOLEAN + Disable IPSEC policy (SPD) for this interface + +disable_xfrm - BOOLEAN + Disable IPSEC encryption on this interface, whatever the policy + +igmpv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv1 or IGMPv2 report retransmit will take place. + Default: 10000 (10 seconds) + +igmpv3_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv3 report retransmit will take place. + Default: 1000 (1 seconds) + +promote_secondaries - BOOLEAN + When a primary IP address is removed from this interface + promote a corresponding secondary IP address instead of + removing all the corresponding secondary IP addresses. + +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IP packets that are received in link-layer + multicast (or broadcast) frames. + This behavior (for multicast) is actually a SHOULD in RFC + 1122, but is disabled by default for compatibility reasons. + Default: off (0) + +drop_gratuitous_arp - BOOLEAN + Drop all gratuitous ARP frames, for example if there's a known + good ARP proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + Default: off (0) + + +tag - INTEGER + Allows you to write a number, which can be used as required. + Default value is 0. + +xfrm4_gc_thresh - INTEGER + The threshold at which we will start garbage collecting for IPv4 + destination cache entries. At twice this value the system will + refuse new allocations. + +igmp_link_local_mcast_reports - BOOLEAN + Enable IGMP reports for link local multicast groups in the + 224.0.0.X range. + Default TRUE + +Alexey Kuznetsov. +kuznet@ms2.inr.ac.ru + +Updated by: +Andi Kleen +ak@muc.de +Nicolas Delon +delon.nicolas@wanadoo.fr + + + + +/proc/sys/net/ipv6/* Variables: + +IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also +apply to IPv6 [XXX?]. + +bindv6only - BOOLEAN + Default value for IPV6_V6ONLY socket option, + which restricts use of the IPv6 socket to IPv6 communication + only. + TRUE: disable IPv4-mapped address feature + FALSE: enable IPv4-mapped address feature + + Default: FALSE (as specified in RFC3493) + +flowlabel_consistency - BOOLEAN + Protect the consistency (and unicity) of flow label. + You have to disable it to use IPV6_FL_F_REFLECT flag on the + flow label manager. + TRUE: enabled + FALSE: disabled + Default: TRUE + +auto_flowlabels - INTEGER + Automatically generate flow labels based on a flow hash of the + packet. This allows intermediate devices, such as routers, to + identify packet flows for mechanisms like Equal Cost Multipath + Routing (see RFC 6438). + 0: automatic flow labels are completely disabled + 1: automatic flow labels are enabled by default, they can be + disabled on a per socket basis using the IPV6_AUTOFLOWLABEL + socket option + 2: automatic flow labels are allowed, they may be enabled on a + per socket basis using the IPV6_AUTOFLOWLABEL socket option + 3: automatic flow labels are enabled and enforced, they cannot + be disabled by the socket option + Default: 1 + +flowlabel_state_ranges - BOOLEAN + Split the flow label number space into two ranges. 0-0x7FFFF is + reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF + is reserved for stateless flow labels as described in RFC6437. + TRUE: enabled + FALSE: disabled + Default: true + +flowlabel_reflect - BOOLEAN + Automatically reflect the flow label. Needed for Path MTU + Discovery to work with Equal Cost Multipath Routing in anycast + environments. See RFC 7690 and: + https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 + TRUE: enabled + FALSE: disabled + Default: FALSE + +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. + Default: 0 (Layer 3) + Possible values: + 0 - Layer 3 (source and destination addresses plus flow label) + 1 - Layer 4 (standard 5-tuple) + +anycast_src_echo_reply - BOOLEAN + Controls the use of anycast addresses as source addresses for ICMPv6 + echo reply + TRUE: enabled + FALSE: disabled + Default: FALSE + +idgen_delay - INTEGER + Controls the delay in seconds after which time to retry + privacy stable address generation if a DAD conflict is + detected. + Default: 1 (as specified in RFC7217) + +idgen_retries - INTEGER + Controls the number of retries to generate a stable privacy + address if a DAD conflict is detected. + Default: 3 (as specified in RFC7217) + +mld_qrv - INTEGER + Controls the MLD query robustness variable (see RFC3810 9.1). + Default: 2 (as specified by RFC3810 9.1) + Minimum: 1 (as specified by RFC6636 4.5) + +max_dst_opts_number - INTEGER + Maximum number of non-padding TLVs allowed in a Destination + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + Default: 8 + +max_hbh_opts_number - INTEGER + Maximum number of non-padding TLVs allowed in a Hop-by-Hop + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + Default: 8 + +max_dst_opts_length - INTEGER + Maximum length allowed for a Destination options extension + header. + Default: INT_MAX (unlimited) + +max_hbh_length - INTEGER + Maximum length allowed for a Hop-by-Hop options extension + header. + Default: INT_MAX (unlimited) + +IPv6 Fragmentation: + +ip6frag_high_thresh - INTEGER + Maximum memory used to reassemble IPv6 fragments. When + ip6frag_high_thresh bytes of memory is allocated for this purpose, + the fragment handler will toss packets until ip6frag_low_thresh + is reached. + +ip6frag_low_thresh - INTEGER + See ip6frag_high_thresh + +ip6frag_time - INTEGER + Time in seconds to keep an IPv6 fragment in memory. + +IPv6 Segment Routing: + +seg6_flowlabel - INTEGER + Controls the behaviour of computing the flowlabel of outer + IPv6 header in case of SR T.encaps + + -1 set flowlabel to zero. + 0 copy flowlabel from Inner packet in case of Inner IPv6 + (Set flowlabel to 0 in case IPv4/L2) + 1 Compute the flowlabel using seg6_make_flowlabel() + + Default is 0. + +conf/default/*: + Change the interface-specific default settings. + + +conf/all/*: + Change all the interface-specific settings. + + [XXX: Other special features than forwarding?] + +conf/all/forwarding - BOOLEAN + Enable global IPv6 forwarding between all interfaces. + + IPv4 and IPv6 work differently here; e.g. netfilter must be used + to control which interfaces may forward packets and which not. + + This also sets all interfaces' Host/Router setting + 'forwarding' to the specified value. See below for details. + + This referred to as global forwarding. + +proxy_ndp - BOOLEAN + Do proxy ndp. + +fwmark_reflect - BOOLEAN + Controls the fwmark of kernel-generated IPv6 reply packets that are not + associated with a socket for example, TCP RSTs or ICMPv6 echo replies). + If unset, these packets have a fwmark of zero. If set, they have the + fwmark of the packet they are replying to. + Default: 0 + +conf/interface/*: + Change special settings per interface. + + The functional behaviour for certain settings is different + depending on whether local forwarding is enabled or not. + +accept_ra - INTEGER + Accept Router Advertisements; autoconfigure using them. + + It also determines whether or not to transmit Router + Solicitations. If and only if the functional setting is to + accept Router Advertisements, Router Solicitations will be + transmitted. + + Possible values are: + 0 Do not accept Router Advertisements. + 1 Accept Router Advertisements if forwarding is disabled. + 2 Overrule forwarding behaviour. Accept Router Advertisements + even if forwarding is enabled. + + Functional default: enabled if local forwarding is disabled. + disabled if local forwarding is enabled. + +accept_ra_defrtr - BOOLEAN + Learn default router in Router Advertisement. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_from_local - BOOLEAN + Accept RA with source-address that is found on local machine + if the RA is otherwise proper and able to be accepted. + Default is to NOT accept these as it may be an un-intended + network loop. + + Functional default: + enabled if accept_ra_from_local is enabled + on a specific interface. + disabled if accept_ra_from_local is disabled + on a specific interface. + +accept_ra_min_hop_limit - INTEGER + Minimum hop limit Information in Router Advertisement. + + Hop limit Information in Router Advertisement less than this + variable shall be ignored. + + Default: 1 + +accept_ra_pinfo - BOOLEAN + Learn Prefix Information in Router Advertisement. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_rt_info_min_plen - INTEGER + Minimum prefix length of Route Information in RA. + + Route Information w/ prefix smaller than this variable shall + be ignored. + + Functional default: 0 if accept_ra_rtr_pref is enabled. + -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rt_info_max_plen - INTEGER + Maximum prefix length of Route Information in RA. + + Route Information w/ prefix larger than this variable shall + be ignored. + + Functional default: 0 if accept_ra_rtr_pref is enabled. + -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rtr_pref - BOOLEAN + Accept Router Preference in RA. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_mtu - BOOLEAN + Apply the MTU value specified in RA option 5 (RFC4861). If + disabled, the MTU specified in the RA will be ignored. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_redirects - BOOLEAN + Accept Redirects. + + Functional default: enabled if local forwarding is disabled. + disabled if local forwarding is enabled. + +accept_source_route - INTEGER + Accept source routing (routing extension header). + + >= 0: Accept only routing header type 2. + < 0: Do not accept routing header. + + Default: 0 + +autoconf - BOOLEAN + Autoconfigure addresses using Prefix Information in Router + Advertisements. + + Functional default: enabled if accept_ra_pinfo is enabled. + disabled if accept_ra_pinfo is disabled. + +dad_transmits - INTEGER + The amount of Duplicate Address Detection probes to send. + Default: 1 + +forwarding - INTEGER + Configure interface-specific Host/Router behaviour. + + Note: It is recommended to have the same setting on all + interfaces; mixed router/host scenarios are rather uncommon. + + Possible values are: + 0 Forwarding disabled + 1 Forwarding enabled + + FALSE (0): + + By default, Host behaviour is assumed. This means: + + 1. IsRouter flag is not set in Neighbour Advertisements. + 2. If accept_ra is TRUE (default), transmit Router + Solicitations. + 3. If accept_ra is TRUE (default), accept Router + Advertisements (and do autoconfiguration). + 4. If accept_redirects is TRUE (default), accept Redirects. + + TRUE (1): + + If local forwarding is enabled, Router behaviour is assumed. + This means exactly the reverse from the above: + + 1. IsRouter flag is set in Neighbour Advertisements. + 2. Router Solicitations are not sent unless accept_ra is 2. + 3. Router Advertisements are ignored unless accept_ra is 2. + 4. Redirects are ignored. + + Default: 0 (disabled) if global forwarding is disabled (default), + otherwise 1 (enabled). + +hop_limit - INTEGER + Default Hop Limit to set. + Default: 64 + +mtu - INTEGER + Default Maximum Transfer Unit + Default: 1280 (IPv6 required minimum) + +ip_nonlocal_bind - BOOLEAN + If set, allows processes to bind() to non-local IPv6 addresses, + which can be quite useful - but may break some applications. + Default: 0 + +router_probe_interval - INTEGER + Minimum interval (in seconds) between Router Probing described + in RFC4191. + + Default: 60 + +router_solicitation_delay - INTEGER + Number of seconds to wait after interface is brought up + before sending Router Solicitations. + Default: 1 + +router_solicitation_interval - INTEGER + Number of seconds to wait between Router Solicitations. + Default: 4 + +router_solicitations - INTEGER + Number of Router Solicitations to send until assuming no + routers are present. + Default: 3 + +use_oif_addrs_only - BOOLEAN + When enabled, the candidate source addresses for destinations + routed via this interface are restricted to the set of addresses + configured on this interface (vis. RFC 6724, section 4). + + Default: false + +use_tempaddr - INTEGER + Preference for Privacy Extensions (RFC3041). + <= 0 : disable Privacy Extensions + == 1 : enable Privacy Extensions, but prefer public + addresses over temporary addresses. + > 1 : enable Privacy Extensions and prefer temporary + addresses over public addresses. + Default: 0 (for most devices) + -1 (for point-to-point devices and loopback devices) + +temp_valid_lft - INTEGER + valid lifetime (in seconds) for temporary addresses. + Default: 604800 (7 days) + +temp_prefered_lft - INTEGER + Preferred lifetime (in seconds) for temporary addresses. + Default: 86400 (1 day) + +keep_addr_on_down - INTEGER + Keep all IPv6 addresses on an interface down event. If set static + global addresses with no expiration time are not flushed. + >0 : enabled + 0 : system default + <0 : disabled + + Default: 0 (addresses are removed) + +max_desync_factor - INTEGER + Maximum value for DESYNC_FACTOR, which is a random value + that ensures that clients don't synchronize with each + other and generate new addresses at exactly the same time. + value is in seconds. + Default: 600 + +regen_max_retry - INTEGER + Number of attempts before give up attempting to generate + valid temporary addresses. + Default: 5 + +max_addresses - INTEGER + Maximum number of autoconfigured addresses per interface. Setting + to zero disables the limitation. It is not recommended to set this + value too large (or to zero) because it would be an easy way to + crash the kernel by allowing too many addresses to be created. + Default: 16 + +disable_ipv6 - BOOLEAN + Disable IPv6 operation. If accept_dad is set to 2, this value + will be dynamically set to TRUE if DAD fails for the link-local + address. + Default: FALSE (enable IPv6 operation) + + When this value is changed from 1 to 0 (IPv6 is being enabled), + it will dynamically create a link-local address on the given + interface and start Duplicate Address Detection, if necessary. + + When this value is changed from 0 to 1 (IPv6 is being disabled), + it will dynamically delete all addresses and routes on the given + interface. From now on it will not possible to add addresses/routes + to the selected interface. + +accept_dad - INTEGER + Whether to accept DAD (Duplicate Address Detection). + 0: Disable DAD + 1: Enable DAD (default) + 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate + link-local address has been found. + + DAD operation and mode on a given interface will be selected according + to the maximum value of conf/{all,interface}/accept_dad. + +force_tllao - BOOLEAN + Enable sending the target link-layer address option even when + responding to a unicast neighbor solicitation. + Default: FALSE + + Quoting from RFC 2461, section 4.4, Target link-layer address: + + "The option MUST be included for multicast solicitations in order to + avoid infinite Neighbor Solicitation "recursion" when the peer node + does not have a cache entry to return a Neighbor Advertisements + message. When responding to unicast solicitations, the option can be + omitted since the sender of the solicitation has the correct link- + layer address; otherwise it would not have be able to send the unicast + solicitation in the first place. However, including the link-layer + address in this case adds little overhead and eliminates a potential + race condition where the sender deletes the cached link-layer address + prior to receiving a response to a previous solicitation." + +ndisc_notify - BOOLEAN + Define mode for notification of address and device changes. + 0 - (default): do nothing + 1 - Generate unsolicited neighbour advertisements when device is brought + up or hardware address changes. + +ndisc_tclass - INTEGER + The IPv6 Traffic Class to use by default when sending IPv6 Neighbor + Discovery (Router Solicitation, Router Advertisement, Neighbor + Solicitation, Neighbor Advertisement, Redirect) messages. + These 8 bits can be interpreted as 6 high order bits holding the DSCP + value and 2 low order bits representing ECN (which you probably want + to leave cleared). + 0 - (default) + +mldv1_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv1 report retransmit will take place. + Default: 10000 (10 seconds) + +mldv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv2 report retransmit will take place. + Default: 1000 (1 second) + +force_mld_version - INTEGER + 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed + 1 - Enforce to use MLD version 1 + 2 - Enforce to use MLD version 2 + +suppress_frag_ndisc - INTEGER + Control RFC 6980 (Security Implications of IPv6 Fragmentation + with IPv6 Neighbor Discovery) behavior: + 1 - (default) discard fragmented neighbor discovery packets + 0 - allow fragmented neighbor discovery packets + +optimistic_dad - BOOLEAN + Whether to perform Optimistic Duplicate Address Detection (RFC 4429). + 0: disabled (default) + 1: enabled + + Optimistic Duplicate Address Detection for the interface will be enabled + if at least one of conf/{all,interface}/optimistic_dad is set to 1, + it will be disabled otherwise. + +use_optimistic - BOOLEAN + If enabled, do not classify optimistic addresses as deprecated during + source address selection. Preferred addresses will still be chosen + before optimistic addresses, subject to other ranking in the source + address selection algorithm. + 0: disabled (default) + 1: enabled + + This will be enabled if at least one of + conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. + +stable_secret - IPv6 address + This IPv6 address will be used as a secret to generate IPv6 + addresses for link-local addresses and autoconfigured + ones. All addresses generated after setting this secret will + be stable privacy ones by default. This can be changed via the + addrgenmode ip-link. conf/default/stable_secret is used as the + secret for the namespace, the interface specific ones can + overwrite that. Writes to conf/all/stable_secret are refused. + + It is recommended to generate this secret during installation + of a system and keep it stable after that. + + By default the stable secret is unset. + +addr_gen_mode - INTEGER + Defines how link-local and autoconf addresses are generated. + + 0: generate address based on EUI64 (default) + 1: do no generate a link-local address, use EUI64 for addresses generated + from autoconf + 2: generate stable privacy addresses, using the secret from + stable_secret (RFC7217) + 3: generate stable privacy addresses, using a random secret if unset + +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IPv6 packets that are received in link-layer + multicast (or broadcast) frames. + + By default this is turned off. + +drop_unsolicited_na - BOOLEAN + Drop all unsolicited neighbor advertisements, for example if there's + a known good NA proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + + By default this is turned off. + +enhanced_dad - BOOLEAN + Include a nonce option in the IPv6 neighbor solicitation messages used for + duplicate address detection per RFC7527. A received DAD NS will only signal + a duplicate address if the nonce is different. This avoids any false + detection of duplicates due to loopback of the NS messages that we send. + The nonce option will be sent on an interface unless both of + conf/{all,interface}/enhanced_dad are set to FALSE. + Default: TRUE + +icmp/*: +ratelimit - INTEGER + Limit the maximal rates for sending ICMPv6 packets. + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + Default: 1000 + +echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it over the IPv6 protocol. + Default: 0 + +xfrm6_gc_thresh - INTEGER + The threshold at which we will start garbage collecting for IPv6 + destination cache entries. At twice this value the system will + refuse new allocations. + + +IPv6 Update by: +Pekka Savola <pekkas@netcore.fi> +YOSHIFUJI Hideaki / USAGI Project <yoshfuji@linux-ipv6.org> + + +/proc/sys/net/bridge/* Variables: + +bridge-nf-call-arptables - BOOLEAN + 1 : pass bridged ARP traffic to arptables' FORWARD chain. + 0 : disable this. + Default: 1 + +bridge-nf-call-iptables - BOOLEAN + 1 : pass bridged IPv4 traffic to iptables' chains. + 0 : disable this. + Default: 1 + +bridge-nf-call-ip6tables - BOOLEAN + 1 : pass bridged IPv6 traffic to ip6tables' chains. + 0 : disable this. + Default: 1 + +bridge-nf-filter-vlan-tagged - BOOLEAN + 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. + 0 : disable this. + Default: 0 + +bridge-nf-filter-pppoe-tagged - BOOLEAN + 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. + 0 : disable this. + Default: 0 + +bridge-nf-pass-vlan-input-dev - BOOLEAN + 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan + interface on the bridge and set the netfilter input device to the vlan. + This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT + target work with vlan-on-top-of-bridge interfaces. When no matching + vlan interface is found, or this switch is off, the input device is + set to the bridge interface. + 0: disable bridge netfilter vlan interface lookup. + Default: 0 + +proc/sys/net/sctp/* Variables: + +addip_enable - BOOLEAN + Enable or disable extension of Dynamic Address Reconfiguration + (ADD-IP) functionality specified in RFC5061. This extension provides + the ability to dynamically add and remove new addresses for the SCTP + associations. + + 1: Enable extension. + + 0: Disable extension. + + Default: 0 + +pf_enable - INTEGER + Enable or disable pf (pf is short for potentially failed) state. A value + of pf_retrans > path_max_retrans also disables pf state. That is, one of + both pf_enable and pf_retrans > path_max_retrans can disable pf state. + Since pf_retrans and path_max_retrans can be changed by userspace + application, sometimes user expects to disable pf state by the value of + pf_retrans > path_max_retrans, but occasionally the value of pf_retrans + or path_max_retrans is changed by the user application, this pf state is + enabled. As such, it is necessary to add this to dynamically enable + and disable pf state. See: + https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for + details. + + 1: Enable pf. + + 0: Disable pf. + + Default: 1 + +addip_noauth_enable - BOOLEAN + Dynamic Address Reconfiguration (ADD-IP) requires the use of + authentication to protect the operations of adding or removing new + addresses. This requirement is mandated so that unauthorized hosts + would not be able to hijack associations. However, older + implementations may not have implemented this requirement while + allowing the ADD-IP extension. For reasons of interoperability, + we provide this variable to control the enforcement of the + authentication requirement. + + 1: Allow ADD-IP extension to be used without authentication. This + should only be set in a closed environment for interoperability + with older implementations. + + 0: Enforce the authentication requirement + + Default: 0 + +auth_enable - BOOLEAN + Enable or disable Authenticated Chunks extension. This extension + provides the ability to send and receive authenticated chunks and is + required for secure operation of Dynamic Address Reconfiguration + (ADD-IP) extension. + + 1: Enable this extension. + 0: Disable this extension. + + Default: 0 + +prsctp_enable - BOOLEAN + Enable or disable the Partial Reliability extension (RFC3758) which + is used to notify peers that a given DATA should no longer be expected. + + 1: Enable extension + 0: Disable + + Default: 1 + +max_burst - INTEGER + The limit of the number of new packets that can be initially sent. It + controls how bursty the generated traffic can be. + + Default: 4 + +association_max_retrans - INTEGER + Set the maximum number for retransmissions that an association can + attempt deciding that the remote end is unreachable. If this value + is exceeded, the association is terminated. + + Default: 10 + +max_init_retransmits - INTEGER + The maximum number of retransmissions of INIT and COOKIE-ECHO chunks + that an association will attempt before declaring the destination + unreachable and terminating. + + Default: 8 + +path_max_retrans - INTEGER + The maximum number of retransmissions that will be attempted on a given + path. Once this threshold is exceeded, the path is considered + unreachable, and new traffic will use a different path when the + association is multihomed. + + Default: 5 + +pf_retrans - INTEGER + The number of retransmissions that will be attempted on a given path + before traffic is redirected to an alternate transport (should one + exist). Note this is distinct from path_max_retrans, as a path that + passes the pf_retrans threshold can still be used. Its only + deprioritized when a transmission path is selected by the stack. This + setting is primarily used to enable fast failover mechanisms without + having to reduce path_max_retrans to a very low value. See: + http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt + for details. Note also that a value of pf_retrans > path_max_retrans + disables this feature. Since both pf_retrans and path_max_retrans can + be changed by userspace application, a variable pf_enable is used to + disable pf state. + + Default: 0 + +rto_initial - INTEGER + The initial round trip timeout value in milliseconds that will be used + in calculating round trip times. This is the initial time interval + for retransmissions. + + Default: 3000 + +rto_max - INTEGER + The maximum value (in milliseconds) of the round trip timeout. This + is the largest time interval that can elapse between retransmissions. + + Default: 60000 + +rto_min - INTEGER + The minimum value (in milliseconds) of the round trip timeout. This + is the smallest time interval the can elapse between retransmissions. + + Default: 1000 + +hb_interval - INTEGER + The interval (in milliseconds) between HEARTBEAT chunks. These chunks + are sent at the specified interval on idle paths to probe the state of + a given path between 2 associations. + + Default: 30000 + +sack_timeout - INTEGER + The amount of time (in milliseconds) that the implementation will wait + to send a SACK. + + Default: 200 + +valid_cookie_life - INTEGER + The default lifetime of the SCTP cookie (in milliseconds). The cookie + is used during association establishment. + + Default: 60000 + +cookie_preserve_enable - BOOLEAN + Enable or disable the ability to extend the lifetime of the SCTP cookie + that is used during the establishment phase of SCTP association + + 1: Enable cookie lifetime extension. + 0: Disable + + Default: 1 + +cookie_hmac_alg - STRING + Select the hmac algorithm used when generating the cookie value sent by + a listening sctp socket to a connecting client in the INIT-ACK chunk. + Valid values are: + * md5 + * sha1 + * none + Ability to assign md5 or sha1 as the selected alg is predicated on the + configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and + CONFIG_CRYPTO_SHA1). + + Default: Dependent on configuration. MD5 if available, else SHA1 if + available, else none. + +rcvbuf_policy - INTEGER + Determines if the receive buffer is attributed to the socket or to + association. SCTP supports the capability to create multiple + associations on a single socket. When using this capability, it is + possible that a single stalled association that's buffering a lot + of data may block other associations from delivering their data by + consuming all of the receive buffer space. To work around this, + the rcvbuf_policy could be set to attribute the receiver buffer space + to each association instead of the socket. This prevents the described + blocking. + + 1: rcvbuf space is per association + 0: rcvbuf space is per socket + + Default: 0 + +sndbuf_policy - INTEGER + Similar to rcvbuf_policy above, this applies to send buffer space. + + 1: Send buffer is tracked per association + 0: Send buffer is tracked per socket. + + Default: 0 + +sctp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all SCTP sockets. + + min: Below this number of pages SCTP is not bothered about its + memory appetite. When amount of memory allocated by SCTP exceeds + this number, SCTP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all SCTP sockets. + + Default is calculated at boot time from amount of available memory. + +sctp_rmem - vector of 3 INTEGERs: min, default, max + Only the first value ("min") is used, "default" and "max" are + ignored. + + min: Minimal size of receive buffer used by SCTP socket. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. + + Default: 4K + +sctp_wmem - vector of 3 INTEGERs: min, default, max + Currently this tunable has no effect. + +addr_scope_policy - INTEGER + Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + + 0 - Disable IPv4 address scoping + 1 - Enable IPv4 address scoping + 2 - Follow draft but allow IPv4 private addresses + 3 - Follow draft but allow IPv4 link local addresses + + Default: 1 + + +/proc/sys/net/core/* + Please see: Documentation/sysctl/net.txt for descriptions of these entries. + + +/proc/sys/net/unix/* +max_dgram_qlen - INTEGER + The maximum length of dgram socket receive queue + + Default: 10 + diff --git a/Documentation/networking/ip_dynaddr.txt b/Documentation/networking/ip_dynaddr.txt new file mode 100644 index 000000000..45f3c1268 --- /dev/null +++ b/Documentation/networking/ip_dynaddr.txt @@ -0,0 +1,29 @@ +IP dynamic address hack-port v0.03 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +This stuff allows diald ONESHOT connections to get established by +dynamically changing packet source address (and socket's if local procs). +It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2). + +If enabled[*] and forwarding interface has changed: + 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS + while in SYN_SENT state (diald-box processes). + 2) Out-bounded MASQueraded source address changes ON OUTPUT (when + internal host does retransmission) until a packet from outside is + received by the tunnel. + +This is specially helpful for auto dialup links (diald), where the +``actual'' outgoing address is unknown at the moment the link is +going up. So, the *same* (local AND masqueraded) connections requests that +bring the link up will be able to get established. + +[*] At boot, by default no address rewriting is attempted. + To enable: + # echo 1 > /proc/sys/net/ipv4/ip_dynaddr + To enable verbose mode: + # echo 2 > /proc/sys/net/ipv4/ip_dynaddr + To disable (default) + # echo 0 > /proc/sys/net/ipv4/ip_dynaddr + +Enjoy! + +-- Juanjo <jjciarla@raiz.uncu.edu.ar> diff --git a/Documentation/networking/ipddp.txt b/Documentation/networking/ipddp.txt new file mode 100644 index 000000000..ba5c217ff --- /dev/null +++ b/Documentation/networking/ipddp.txt @@ -0,0 +1,73 @@ +Text file for ipddp.c: + AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation + +This text file is written by Jay Schulist <jschlst@samba.org> + +Introduction +------------ + +AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk +networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams +inside AppleTalk packets. + +Through this driver you can either allow your Linux box to communicate +IP over an AppleTalk network or you can provide IP gatewaying functions +for your AppleTalk users. + +You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk, +EtherTalk and PPPTalk. The only limit on the protocol is that of what +kernel AppleTalk layer and drivers are available. + +Each mode requires its own user space software. + +Compiling AppleTalk-IP Decapsulation/Encapsulation +================================================= + +AppleTalk-IP decapsulation needs to be compiled into your kernel. You +will need to turn on AppleTalk-IP driver support. Then you will need to +select ONE of the two options; IP to AppleTalk-IP encapsulation support or +AppleTalk-IP to IP decapsulation support. If you compile the driver +statically you will only be able to use the driver for the function you have +enabled in the kernel. If you compile the driver as a module you can +select what mode you want it to run in via a module loading param. +ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for +AppleTalk-IP to IP decapsulation. + +Basic instructions for user space tools +======================================= + +I will briefly describe the operation of the tools, but you will +need to consult the supporting documentation for each set of tools. + +Decapsulation - You will need to download a software package called +MacGate. In this distribution there will be a tool called MacRoute +which enables you to add routes to the kernel for your Macs by hand. +Also the tool MacRegGateWay is included to register the +proper IP Gateway and IP addresses for your machine. Included in this +distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from +ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional +but it allows automatic adding and deleting of routes for Macs. (Handy +for locations with large Mac installations) + +Encapsulation - You will need to download a software daemon called ipddpd. +This software expects there to be an AppleTalk-IP gateway on the network. +You will also need to add the proper routes to route your Linux box's IP +traffic out the ipddp interface. + +Common Uses of ipddp.c +---------------------- +Of course AppleTalk-IP decapsulation and encapsulation, but specifically +decapsulation is being used most for connecting LocalTalk networks to +IP networks. Although it has been used on EtherTalk networks to allow +Macs that are only able to tunnel IP over EtherTalk. + +Encapsulation has been used to allow a Linux box stuck on a LocalTalk +network to use IP. It should work equally well if you are stuck on an +EtherTalk only network. + +Further Assistance +------------------- +You can contact me (Jay Schulist <jschlst@samba.org>) with any +questions regarding decapsulation or encapsulation. Bradford W. Johnson +<johns393@maroon.tc.umn.edu> originally wrote the ipddp.c driver for IP +encapsulation in AppleTalk. diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/iphase.txt new file mode 100644 index 000000000..670b72f16 --- /dev/null +++ b/Documentation/networking/iphase.txt @@ -0,0 +1,158 @@ + + READ ME FISRT + ATM (i)Chip IA Linux Driver Source +-------------------------------------------------------------------------------- + Read This Before You Begin! +-------------------------------------------------------------------------------- + +Description +----------- + +This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver +source release. + +The features and limitations of this driver are as follows: + - A single VPI (VPI value of 0) is supported. + - Supports 4K VCs for the server board (with 512K control memory) and 1K + VCs for the client board (with 128K control memory). + - UBR, ABR and CBR service categories are supported. + - Only AAL5 is supported. + - Supports setting of PCR on the VCs. + - Multiple adapters in a system are supported. + - All variants of Interphase ATM PCI (i)Chip adapter cards are supported, + including x575 (OC3, control memory 128K , 512K and packet memory 128K, + 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See + http://www.iphase.com/ + for details. + - Only x86 platforms are supported. + - SMP is supported. + + +Before You Start +---------------- + + +Installation +------------ + +1. Installing the adapters in the system + To install the ATM adapters in the system, follow the steps below. + a. Login as root. + b. Shut down the system and power off the system. + c. Install one or more ATM adapters in the system. + d. Connect each adapter to a port on an ATM switch. The green 'Link' + LED on the front panel of the adapter will be on if the adapter is + connected to the switch properly when the system is powered up. + e. Power on and boot the system. + +2. [ Removed ] + +3. Rebuild kernel with ABR support + [ a. and b. removed ] + c. Reconfigure the kernel, choose the Interphase ia driver through "make + menuconfig" or "make xconfig". + d. Rebuild the kernel, loadable modules and the atm tools. + e. Install the new built kernel and modules and reboot. + +4. Load the adapter hardware driver (ia driver) if it is built as a module + a. Login as root. + b. Change directory to /lib/modules/<kernel-version>/atm. + c. Run "insmod suni.o;insmod iphase.o" + The yellow 'status' LED on the front panel of the adapter will blink + while the driver is loaded in the system. + d. To verify that the 'ia' driver is loaded successfully, run the + following command: + + cat /proc/atm/devices + + If the driver is loaded successfully, the output of the command will + be similar to the following lines: + + Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ... + 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 ) + + You can also check the system log file /var/log/messages for messages + related to the ATM driver. + +5. Ia Driver Configuration + +5.1 Configuration of adapter buffers + The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and + 1M. The RAM size decides the number of buffers and buffer size. The default + size and number of buffers are set as following: + + Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf + RAM size size size size size cnt cnt + -------- ------ ------ ------ ------ ------ ------ + 128K 64K 64K 10K 10K 6 6 + 512K 256K 256K 10K 10K 25 25 + 1M 512K 512K 10K 10K 51 51 + + These setting should work well in most environments, but can be + changed by typing the following command: + + insmod <IA_DIR>/ia.o IA_RX_BUF=<RX_CNT> IA_RX_BUF_SZ=<RX_SIZE> \ + IA_TX_BUF=<TX_CNT> IA_TX_BUF_SZ=<TX_SIZE> + Where: + RX_CNT = number of receive buffers in the range (1-128) + RX_SIZE = size of receive buffers in the range (48-64K) + TX_CNT = number of transmit buffers in the range (1-128) + TX_SIZE = size of transmit buffers in the range (48-64K) + + 1. Transmit and receive buffer size must be a multiple of 4. + 2. Care should be taken so that the memory required for the + transmit and receive buffers is less than or equal to the + total adapter packet memory. + +5.2 Turn on ia debug trace + + When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver + can provide more debug trace if needed. There is a bit mask variable, + IADebugFlag, which controls the output of the traces. You can find the bit + map of the IADebugFlag in iphase.h. + The debug trace can be turn on through the insmod command line option, for + example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug + traces together with loading the driver. + +6. Ia Driver Test Using ttcp_atm and PVC + + For the PVC setup, the test machines can either be connected back-to-back or + through a switch. If connected through the switch, the switch must be + configured for the PVC(s). + + a. For UBR test: + At the test machine intended to receive data, type: + ttcp_atm -r -a -s 0.100 + At the other test machine, type: + ttcp_atm -t -a -s 0.100 -n 10000 + Run "ttcp_atm -h" to display more options of the ttcp_atm tool. + b. For ABR test: + It is the same as the UBR testing, but with an extra command option: + -Pabr:max_pcr=<xxx> + where: + xxx = the maximum peak cell rate, from 170 - 353207. + This option must be set on both the machines. + c. For CBR test: + It is the same as the UBR testing, but with an extra command option: + -Pcbr:max_pcr=<xxx> + where: + xxx = the maximum peak cell rate, from 170 - 353207. + This option may only be set on the transmit machine. + + +OUTSTANDING ISSUES +------------------ + + + +Contact Information +------------------- + + Customer Support: + United States: Telephone: (214) 654-5555 + Fax: (214) 654-5500 + E-Mail: intouch@iphase.com + Europe: Telephone: 33 (0)1 41 15 44 00 + Fax: 33 (0)1 41 15 12 13 + World Wide Web: http://www.iphase.com + Anonymous FTP: ftp.iphase.com diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt new file mode 100644 index 000000000..ba794b7e5 --- /dev/null +++ b/Documentation/networking/ipsec.txt @@ -0,0 +1,38 @@ + +Here documents known IPsec corner cases which need to be keep in mind when +deploy various IPsec configuration in real world production environment. + +1. IPcomp: Small IP packet won't get compressed at sender, and failed on + policy check on receiver. + +Quote from RFC3173: +2.2. Non-Expansion Policy + + If the total size of a compressed payload and the IPComp header, as + defined in section 3, is not smaller than the size of the original + payload, the IP datagram MUST be sent in the original non-compressed + form. To clarify: If an IP datagram is sent non-compressed, no + + IPComp header is added to the datagram. This policy ensures saving + the decompression processing cycles and avoiding incurring IP + datagram fragmentation when the expanded datagram is larger than the + MTU. + + Small IP datagrams are likely to expand as a result of compression. + Therefore, a numeric threshold should be applied before compression, + where IP datagrams of size smaller than the threshold are sent in the + original form without attempting compression. The numeric threshold + is implementation dependent. + +Current IPComp implementation is indeed by the book, while as in practice +when sending non-compressed packet to the peer (whether or not packet len +is smaller than the threshold or the compressed len is larger than original +packet len), the packet is dropped when checking the policy as this packet +matches the selector but not coming from any XFRM layer, i.e., with no +security path. Such naked packet will not eventually make it to upper layer. +The result is much more wired to the user when ping peer with different +payload length. + +One workaround is try to set "level use" for each policy if user observed +above scenario. The consequence of doing so is small packet(uncompressed) +will skip policy checking on receiver side. diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt new file mode 100644 index 000000000..6cd74fa55 --- /dev/null +++ b/Documentation/networking/ipv6.txt @@ -0,0 +1,72 @@ + +Options for the ipv6 module are supplied as parameters at load time. + +Module options may be given as command line arguments to the insmod +or modprobe command, but are usually specified in either +/etc/modules.d/*.conf configuration files, or in a distro-specific +configuration file. + +The available ipv6 module parameters are listed below. If a parameter +is not specified the default value is used. + +The parameters are as follows: + +disable + + Specifies whether to load the IPv6 module, but disable all + its functionality. This might be used when another module + has a dependency on the IPv6 module being loaded, but no + IPv6 addresses or operations are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled. + + This is the default value. + + 1 + IPv6 is disabled. + + No IPv6 addresses will be added to interfaces, and + it will not be possible to open an IPv6 socket. + + A reboot is required to enable IPv6. + +autoconf + + Specifies whether to enable IPv6 address autoconfiguration + on all interfaces. This might be used when one does not wish + for addresses to be automatically generated from prefixes + received in Router Advertisements. + + The possible values and their effects are: + + 0 + IPv6 address autoconfiguration is disabled on all interfaces. + + Only the IPv6 loopback address (::1) and link-local addresses + will be added to interfaces. + + 1 + IPv6 address autoconfiguration is enabled on all interfaces. + + This is the default value. + +disable_ipv6 + + Specifies whether to disable IPv6 on all interfaces. + This might be used when no IPv6 addresses are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled on all interfaces. + + This is the default value. + + 1 + IPv6 is disabled on all interfaces. + + No IPv6 addresses will be added to interfaces. + diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt new file mode 100644 index 000000000..27a38e50c --- /dev/null +++ b/Documentation/networking/ipvlan.txt @@ -0,0 +1,146 @@ + + IPVLAN Driver HOWTO + +Initial Release: + Mahesh Bandewar <maheshb AT google.com> + +1. Introduction: + This is conceptually very similar to the macvlan driver with one major +exception of using L3 for mux-ing /demux-ing among slaves. This property makes +the master device share the L2 with it's slave devices. I have developed this +driver in conjunction with network namespaces and not sure if there is use case +outside of it. + + +2. Building and Installation: + In order to build the driver, please select the config item CONFIG_IPVLAN. +The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module +(CONFIG_IPVLAN=m). + + +3. Configuration: + There are no module parameters for this driver and it can be configured +using IProute2/ip utility. + + ip link add link <master> name <slave> type ipvlan [ mode MODE ] [ FLAGS ] + where + MODE: l3 (default) | l3s | l2 + FLAGS: bridge (default) | private | vepa + + e.g. + (a) Following will create IPvlan link with eth0 as master in + L3 bridge mode + bash# ip link add link eth0 name ipvl0 type ipvlan + (b) This command will create IPvlan link in L2 bridge mode. + bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge + (c) This command will create an IPvlan device in L2 private mode. + bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private + (d) This command will create an IPvlan device in L2 vepa mode. + bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa + + +4. Operating modes: + IPvlan has two modes of operation - L2 and L3. For a given master device, +you can select one of these two modes and all slaves on that master will +operate in the same (selected) mode. The RX mode is almost identical except +that in L3 mode the slaves wont receive any multicast / broadcast traffic. +L3 mode is more restrictive since routing is controlled from the other (mostly) +default namespace. + +4.1 L2 mode: + In this mode TX processing happens on the stack instance attached to the +slave device and packets are switched and queued to the master device to send +out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) +as well. + +4.2 L3 mode: + In this mode TX processing up to L3 happens on the stack instance attached +to the slave device and packets are switched to the stack instance of the +master device for the L2 processing and routing from that instance will be +used before packets are queued on the outbound device. In this mode the slaves +will not receive nor can send multicast / broadcast traffic. + +4.3 L3S mode: + This is very similar to the L3 mode except that iptables (conn-tracking) +works in this mode and hence it is L3-symmetric (L3s). This will have slightly less +performance but that shouldn't matter since you are choosing this mode over plain-L3 +mode to make conn-tracking work. + +5. Mode flags: + At this time following mode flags are available + +5.1 bridge: + This is the default option. To configure the IPvlan port in this mode, +user can choose to either add this option on the command-line or don't specify +anything. This is the traditional mode where slaves can cross-talk among +themselves apart from talking through the master device. + +5.2 private: + If this option is added to the command-line, the port is set in private +mode. i.e. port won't allow cross communication between slaves. + +5.3 vepa: + If this is added to the command-line, the port is set in VEPA mode. +i.e. port will offload switching functionality to the external entity as +described in 802.1Qbg +Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the +master-device, so the packets which are emitted in this mode for the adjacent +neighbor will have source and destination mac same. This will make the switch / +router send the redirect message. + +6. What to choose (macvlan vs. ipvlan)? + These two devices are very similar in many regards and the specific use +case could very well define which device to choose. if one of the following +situations defines your use case then you can choose to use ipvlan - + (a) The Linux host that is connected to the external switch / router has +policy configured that allows only one mac per port. + (b) No of virtual devices created on a master exceed the mac capacity and +puts the NIC in promiscuous mode and degraded performance is a concern. + (c) If the slave device is to be put into the hostile / untrusted network +namespace where L2 on the slave could be changed / misused. + + +6. Example configuration: + + +=============================================================+ + | Host: host1 | + | | + | +----------------------+ +----------------------+ | + | | NS:ns0 | | NS:ns1 | | + | | | | | | + | | | | | | + | | ipvl0 | | ipvl1 | | + | +----------#-----------+ +-----------#----------+ | + | # # | + | ################################ | + | # eth0 | + +==============================#==============================+ + + + (a) Create two network namespaces - ns0, ns1 + ip netns add ns0 + ip netns add ns1 + + (b) Create two ipvlan slaves on eth0 (master device) + ip link add link eth0 ipvl0 type ipvlan mode l2 + ip link add link eth0 ipvl1 type ipvlan mode l2 + + (c) Assign slaves to the respective network namespaces + ip link set dev ipvl0 netns ns0 + ip link set dev ipvl1 netns ns1 + + (d) Now switch to the namespace (ns0 or ns1) to configure the slave devices + - For ns0 + (1) ip netns exec ns0 bash + (2) ip link set dev ipvl0 up + (3) ip link set dev lo up + (4) ip -4 addr add 127.0.0.1 dev lo + (5) ip -4 addr add $IPADDR dev ipvl0 + (6) ip -4 route add default via $ROUTER dev ipvl0 + - For ns1 + (1) ip netns exec ns1 bash + (2) ip link set dev ipvl1 up + (3) ip link set dev lo up + (4) ip -4 addr add 127.0.0.1 dev lo + (5) ip -4 addr add $IPADDR dev ipvl1 + (6) ip -4 route add default via $ROUTER dev ipvl1 diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt new file mode 100644 index 000000000..fc531c29a --- /dev/null +++ b/Documentation/networking/ipvs-sysctl.txt @@ -0,0 +1,293 @@ +/proc/sys/net/ipv4/vs/* Variables: + +am_droprate - INTEGER + default 10 + + It sets the always mode drop rate, which is used in the mode 3 + of the drop_rate defense. + +amemthresh - INTEGER + default 1024 + + It sets the available memory threshold (in pages), which is + used in the automatic modes of defense. When there is no + enough available memory, the respective strategy will be + enabled and the variable is automatically set to 2, otherwise + the strategy is disabled and the variable is set to 1. + +backup_only - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + +conn_reuse_mode - INTEGER + 1 - default + + Controls how ipvs will deal with connections that are detected + port reuse. It is a bitmap, with the values being: + + 0: disable any special handling on port reuse. The new + connection will be delivered to the same real server that was + servicing the previous connection. + + bit 1: enable rescheduling of new connections when it is safe. + That is, whenever expire_nodest_conn and for TCP sockets, when + the connection is in TIME_WAIT state (which is only possible if + you use NAT mode). + + bit 2: it is bit 1 plus, for TCP connections, when connections + are in FIN_WAIT state, as this is the last state seen by load + balancer in Direct Routing mode. This bit helps on adding new + real servers to a very busy cluster. + +conntrack - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, maintain connection tracking entries for + connections handled by IPVS. + + This should be enabled if connections handled by IPVS are to be + also handled by stateful firewall rules. That is, iptables rules + that make use of connection tracking. It is a performance + optimisation to disable this setting otherwise. + + Connections handled by the IPVS FTP application module + will have connection tracking entries regardless of this setting. + + Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. + +cache_bypass - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If it is enabled, forward packets to the original destination + directly when no cache server is available and destination + address is not local (iph->daddr is RTN_UNICAST). It is mostly + used in transparent web cache cluster. + +debug_level - INTEGER + 0 - transmission error messages (default) + 1 - non-fatal error messages + 2 - configuration + 3 - destination trash + 4 - drop entry + 5 - service lookup + 6 - scheduling + 7 - connection new/expire, lookup and synchronization + 8 - state transition + 9 - binding destination, template checks and applications + 10 - IPVS packet transmission + 11 - IPVS packet handling (ip_vs_in/ip_vs_out) + 12 or more - packet traversal + + Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. + + Higher debugging levels include the messages for lower debugging + levels, so setting debug level 2, includes level 0, 1 and 2 + messages. Thus, logging becomes more and more verbose the higher + the level. + +drop_entry - INTEGER + 0 - disabled (default) + + The drop_entry defense is to randomly drop entries in the + connection hash table, just in order to collect back some + memory for new connections. In the current code, the + drop_entry procedure can be activated every second, then it + randomly scans 1/32 of the whole and drops entries that are in + the SYN-RECV/SYNACK state, which should be effective against + syn-flooding attack. + + The valid values of drop_entry are from 0 to 3, where 0 means + that this strategy is always disabled, 1 and 2 mean automatic + modes (when there is no enough available memory, the strategy + is enabled and the variable is automatically set to 2, + otherwise the strategy is disabled and the variable is set to + 1), and 3 means that that the strategy is always enabled. + +drop_packet - INTEGER + 0 - disabled (default) + + The drop_packet defense is designed to drop 1/rate packets + before forwarding them to real servers. If the rate is 1, then + drop all the incoming packets. + + The value definition is the same as that of the drop_entry. In + the automatic mode, the rate is determined by the follow + formula: rate = amemthresh / (amemthresh - available_memory) + when available memory is less than the available memory + threshold. When the mode 3 is set, the always mode drop rate + is controlled by the /proc/sys/net/ipv4/vs/am_droprate. + +expire_nodest_conn - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + The default value is 0, the load balancer will silently drop + packets when its destination server is not available. It may + be useful, when user-space monitoring program deletes the + destination server (because of server overload or wrong + detection) and add back the server later, and the connections + to the server can continue. + + If this feature is enabled, the load balancer will expire the + connection immediately when a packet arrives and its + destination server is not available, then the client program + will be notified that the connection is closed. This is + equivalent to the feature some people requires to flush + connections when its destination is not available. + +expire_quiescent_template - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + When set to a non-zero value, the load balancer will expire + persistent templates when the destination server is quiescent. + This may be useful, when a user makes a destination server + quiescent by setting its weight to 0 and it is desired that + subsequent otherwise persistent connections are sent to a + different destination server. By default new persistent + connections are allowed to quiescent destination servers. + + If this feature is enabled, the load balancer will expire the + persistence template if it is to be used to schedule a new + connection and the destination server is quiescent. + +ignore_tunneled - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, ipvs will set the ipvs_property on all packets which are of + unrecognized protocols. This prevents us from routing tunneled + protocols like ipip, which is useful to prevent rescheduling + packets that have been tunneled to the ipvs host (i.e. to prevent + ipvs routing loops when ipvs is also acting as a real server). + +nat_icmp_send - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + It controls sending icmp error messages (ICMP_DEST_UNREACH) + for VS/NAT when the load balancer receives packets from real + servers but the connection entries don't exist. + +pmtu_disc - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + By default, reject with FRAG_NEEDED all DF packets that exceed + the PMTU, irrespective of the forwarding method. For TUN method + the flag can be disabled to fragment such packets. + +secure_tcp - INTEGER + 0 - disabled (default) + + The secure_tcp defense is to use a more complicated TCP state + transition table. For VS/NAT, it also delays entering the + TCP ESTABLISHED state until the three way handshake is completed. + + The value definition is the same as that of drop_entry and + drop_packet. + +sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period + default 3 50 + + It sets synchronization threshold, which is the minimum number + of incoming packets that a connection needs to receive before + the connection will be synchronized. A connection will be + synchronized, every time the number of its incoming packets + modulus sync_period equals the threshold. The range of the + threshold is from 0 to sync_period. + + When sync_period and sync_refresh_period are 0, send sync only + for state changes or only once when pkts matches sync_threshold + +sync_refresh_period - UNSIGNED INTEGER + default 0 + + In seconds, difference in reported connection timer that triggers + new sync message. It can be used to avoid sync messages for the + specified period (or half of the connection timeout if it is lower) + if connection state is not changed since last sync. + + This is useful for normal connections with high traffic to reduce + sync rate. Additionally, retry sync_retries times with period of + sync_refresh_period/8. + +sync_retries - INTEGER + default 0 + + Defines sync retries with period of sync_refresh_period/8. Useful + to protect against loss of sync messages. The range of the + sync_retries is from 0 to 3. + +sync_qlen_max - UNSIGNED LONG + + Hard limit for queued sync messages that are not sent yet. It + defaults to 1/32 of the memory pages but actually represents + number of messages. It will protect us from allocating large + parts of memory when the sending rate is lower than the queuing + rate. + +sync_sock_size - INTEGER + default 0 + + Configuration of SNDBUF (master) or RCVBUF (slave) socket limit. + Default value is 0 (preserve system defaults). + +sync_ports - INTEGER + default 1 + + The number of threads that master and backup servers can use for + sync traffic. Every thread will use single UDP port, thread 0 will + use the default port 8848 while last thread will use port + 8848+sync_ports-1. + +snat_reroute - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If enabled, recalculate the route of SNATed packets from + realservers so that they are routed as if they originate from the + director. Otherwise they are routed as if they are forwarded by the + director. + + If policy routing is in effect then it is possible that the route + of a packet originating from a director is routed differently to a + packet being forwarded by the director. + + If policy routing is not in effect then the recalculated route will + always be the same as the original route so it is an optimisation + to disable snat_reroute and avoid the recalculation. + +sync_persist_mode - INTEGER + default 0 + + Controls the synchronisation of connections when using persistence + + 0: All types of connections are synchronised + 1: Attempt to reduce the synchronisation traffic depending on + the connection type. For persistent services avoid synchronisation + for normal connections, do it only for persistence templates. + In such case, for TCP and SCTP it may need enabling sloppy_tcp and + sloppy_sctp flags on backup servers. For non-persistent services + such optimization is not applied, mode 0 is assumed. + +sync_version - INTEGER + default 1 + + The version of the synchronisation protocol used when sending + synchronisation messages. + + 0 selects the original synchronisation protocol (version 0). This + should be used when sending synchronisation messages to a legacy + system that only understands the original synchronisation protocol. + + 1 selects the current synchronisation protocol (version 1). This + should be used where possible. + + Kernels with this sync_version entry are able to receive messages + of both version 1 and version 2 of the synchronisation protocol. diff --git a/Documentation/networking/ixgb.txt b/Documentation/networking/ixgb.txt new file mode 100644 index 000000000..09f71d719 --- /dev/null +++ b/Documentation/networking/ixgb.txt @@ -0,0 +1,433 @@ +Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection +===================================================================== + +March 14, 2011 + + +Contents +======== + +- In This Release +- Identifying Your Adapter +- Building and Installation +- Command Line Parameters +- Improving Performance +- Additional Configurations +- Known Issues/Troubleshooting +- Support + + + +In This Release +=============== + +This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R) +Network Connection. This driver includes support for Itanium(R)2-based +systems. + +For questions related to hardware requirements, refer to the documentation +supplied with your 10 Gigabit adapter. All hardware requirements listed apply +to use with Linux. + +The following features are available in this kernel: + - Native VLANs + - Channel Bonding (teaming) + - SNMP + +Channel Bonding documentation can be found in the Linux kernel source: +/Documentation/networking/bonding.txt + +The driver information previously displayed in the /proc filesystem is not +supported in this release. Alternatively, you can use ethtool (version 1.6 +or later), lspci, and iproute2 to obtain the same information. + +Instructions on updating ethtool can be found in the section "Additional +Configurations" later in this document. + + +Identifying Your Adapter +======================== + +The following Intel network adapters are compatible with the drivers in this +release: + +Controller Adapter Name Physical Layer +---------- ------------ -------------- +82597EX Intel(R) PRO/10GbE LR/SR/CX4 10G Base-LR (1310 nm optical fiber) + Server Adapters 10G Base-SR (850 nm optical fiber) + 10G Base-CX4(twin-axial copper cabling) + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + + +Building and Installation +========================= + +select m for "Intel(R) PRO/10GbE support" located at: + Location: + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet (10000 Mbit) (NETDEV_10000 [=y]) +1. make modules && make modules_install + +2. Load the module: + + modprobe ixgb <parameter>=<value> + + The insmod command can be used if the full + path to the driver module is specified. For example: + + insmod /lib/modules/<KERNEL VERSION>/kernel/drivers/net/ixgb/ixgb.ko + + With 2.6 based kernels also make sure that older ixgb drivers are + removed from the kernel, before loading the new module: + + rmmod ixgb; modprobe ixgb + +3. Assign an IP address to the interface by entering the following, where + x is the interface number: + + ip addr add ethx <IP_address> + +4. Verify that the interface works. Enter the following, where <IP_address> + is the IP address for another machine on the same subnet as the interface + that is being tested: + + ping <IP_address> + + +Command Line Parameters +======================= + +If the driver is built as a module, the following optional parameters are +used by entering them on the command line with the modprobe command using +this syntax: + + modprobe ixgb [<option>=<VAL1>,<VAL2>,...] + +For example, with two 10GbE PCI adapters, entering: + + modprobe ixgb TxDescriptors=80,128 + +loads the ixgb driver with 80 TX resources for the first adapter and 128 TX +resources for the second adapter. + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +FlowControl +Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) +Default: Read from the EEPROM + If EEPROM is not detected, default is 1 + This parameter controls the automatic generation(Tx) and response(Rx) to + Ethernet PAUSE frames. There are hardware bugs associated with enabling + Tx flow control so beware. + +RxDescriptors +Valid Range: 64-512 +Default Value: 512 + This value is the number of receive descriptors allocated by the driver. + Increasing this value allows the driver to buffer more incoming packets. + Each descriptor is 16 bytes. A receive buffer is also allocated for + each descriptor and can be either 2048, 4056, 8192, or 16384 bytes, + depending on the MTU setting. When the MTU size is 1500 or less, the + receive buffer size is 2048 bytes. When the MTU is greater than 1500 the + receive buffer size will be either 4056, 8192, or 16384 bytes. The + maximum MTU size is 16114. + +RxIntDelay +Valid Range: 0-65535 (0=off) +Default Value: 72 + This value delays the generation of receive interrupts in units of + 0.8192 microseconds. Receive interrupt reduction can improve CPU + efficiency if properly tuned for specific network traffic. Increasing + this value adds extra latency to frame reception and can end up + decreasing the throughput of TCP traffic. If the system is reporting + dropped receives, this value may be set too high, causing the driver to + run out of available receive descriptors. + +TxDescriptors +Valid Range: 64-4096 +Default Value: 256 + This value is the number of transmit descriptors allocated by the driver. + Increasing this value allows the driver to queue more transmits. Each + descriptor is 16 bytes. + +XsumRX +Valid Range: 0-1 +Default Value: 1 + A value of '1' indicates that the driver should enable IP checksum + offload for received packets (both UDP and TCP) to the adapter hardware. + + +Improving Performance +===================== + +With the 10 Gigabit server adapters, the default Linux configuration will +very likely limit the total available throughput artificially. There is a set +of configuration changes that, when applied together, will increase the ability +of Linux to transmit and receive data. The following enhancements were +originally acquired from settings published at http://www.spec.org/web99/ for +various submitted results using Linux. + +NOTE: These changes are only suggestions, and serve as a starting point for + tuning your network performance. + +The changes are made in three major ways, listed in order of greatest effect: +- Use ip link to modify the mtu (maximum transmission unit) and the txqueuelen + parameter. +- Use sysctl to modify /proc parameters (essentially kernel tuning) +- Use setpci to modify the MMRBC field in PCI-X configuration space to increase + transmit burst lengths on the bus. + +NOTE: setpci modifies the adapter's configuration registers to allow it to read +up to 4k bytes at a time (for transmits). However, for some systems the +behavior after modifying this register may be undefined (possibly errors of +some kind). A power-cycle, hard reset or explicitly setting the e6 register +back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a +stable configuration. + +- COPY these lines and paste them into ixgb_perf.sh: +#!/bin/bash +echo "configuring network performance , edit this file to change the interface +or device ID of 10GbE card" +# set mmrbc to 4k reads, modify only Intel 10GbE device IDs +# replace 1a48 with appropriate 10GbE device's ID installed on the system, +# if needed. +setpci -d 8086:1a48 e6.b=2e +# set the MTU (max transmission unit) - it requires your switch and clients +# to change as well. +# set the txqueuelen +# your ixgb adapter should be loaded as eth1 for this to work, change if needed +ip li set dev eth1 mtu 9000 txqueuelen 1000 up +# call the sysctl utility to modify /proc/sys entries +sysctl -p ./sysctl_ixgb.conf +- END ixgb_perf.sh + +- COPY these lines and paste them into sysctl_ixgb.conf: +# some of the defaults may be different for your kernel +# call this file with sysctl -p <this file> +# these are just suggested values that worked well to increase throughput in +# several network benchmark tests, your mileage may vary + +### IPV4 specific settings +# turn TCP timestamp support off, default 1, reduces CPU use +net.ipv4.tcp_timestamps = 0 +# turn SACK support off, default on +# on systems with a VERY fast bus -> memory interface this is the big gainer +net.ipv4.tcp_sack = 0 +# set min/default/max TCP read buffer, default 4096 87380 174760 +net.ipv4.tcp_rmem = 10000000 10000000 10000000 +# set min/pressure/max TCP write buffer, default 4096 16384 131072 +net.ipv4.tcp_wmem = 10000000 10000000 10000000 +# set min/pressure/max TCP buffer space, default 31744 32256 32768 +net.ipv4.tcp_mem = 10000000 10000000 10000000 + +### CORE settings (mostly for socket and UDP effect) +# set maximum receive socket buffer size, default 131071 +net.core.rmem_max = 524287 +# set maximum send socket buffer size, default 131071 +net.core.wmem_max = 524287 +# set default receive socket buffer size, default 65535 +net.core.rmem_default = 524287 +# set default send socket buffer size, default 65535 +net.core.wmem_default = 524287 +# set maximum amount of option memory buffers, default 10240 +net.core.optmem_max = 524287 +# set number of unprocessed input packets before kernel starts dropping them; default 300 +net.core.netdev_max_backlog = 300000 +- END sysctl_ixgb.conf + +Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface +your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's +ID installed on the system. + +NOTE: Unless these scripts are added to the boot process, these changes will + only last only until the next system reboot. + + +Resolving Slow UDP Traffic +-------------------------- +If your server does not seem to be able to receive UDP traffic as fast as it +can receive TCP traffic, it could be because Linux, by default, does not set +the network stack buffers as large as they need to be to support high UDP +transfer rates. One way to alleviate this problem is to allow more memory to +be used by the IP stack to store incoming data. + +For instance, use the commands: + sysctl -w net.core.rmem_max=262143 +and + sysctl -w net.core.rmem_default=262143 +to increase the read buffer memory max and default to 262143 (256k - 1) from +defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables +will increase the amount of memory used by the network stack for receives, and +can be increased significantly more if necessary for your application. + + +Additional Configurations +========================= + + Configuring the Driver on Different Distributions + ------------------------------------------------- + Configuring a network driver to load properly when the system is started is + distribution dependent. Typically, the configuration process involves adding + an alias line to /etc/modprobe.conf as well as editing other system startup + scripts and/or configuration files. Many popular Linux distributions ship + with tools to make these changes for you. To learn the proper way to + configure a network device for your system, refer to your distribution + documentation. If during this process you are asked for the driver or module + name, the name for the Linux Base Driver for the Intel 10GbE Family of + Adapters is ixgb. + + Viewing Link Messages + --------------------- + Link messages will not be displayed to the console if the distribution is + restricting system messages. In order to see network driver link messages on + your console, set dmesg to eight by entering the following: + + dmesg -n 8 + + NOTE: This setting is not saved across reboots. + + + Jumbo Frames + ------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16114. Use the ip command to + increase the MTU size. For example: + + ip li set dev ethx mtu 9000 + + The maximum MTU setting for Jumbo Frames is 16114. This value coincides + with the maximum Jumbo Frames size of 16128. + + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 1.6 or later is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool/ + + NOTE: The ethtool version 1.6 only supports a limited set of ethtool options. + Support for a more complete ethtool feature set can be enabled by + upgrading to the latest version. + + + NAPI + ---- + + NAPI (Rx polling mode) is supported in the ixgb driver. NAPI is enabled + or disabled based on the configuration of the kernel. see CONFIG_IXGB_NAPI + + See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. + + +Known Issues/Troubleshooting +============================ + + NOTE: After installing the driver, if your Intel Network Connection is not + working, verify in the "In This Release" section of the readme that you have + installed the correct driver. + + Intel(R) PRO/10GbE CX4 Server Adapter Cable Interoperability Issue with + Fujitsu XENPAK Module in SmartBits Chassis + --------------------------------------------------------------------- + Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 + Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits + chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni. + The CRC errors may be received either by the Intel(R) PRO/10GbE CX4 + Server adapter or the SmartBits. If this situation occurs using a different + cable assembly may resolve the issue. + + CX4 Server Adapter Cable Interoperability Issues with HP Procurve 3400cl + Switch Port + ------------------------------------------------------------------------ + Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server + adapter is connected to an HP Procurve 3400cl switch port using short cables + (1 m or shorter). If this situation occurs, using a longer cable may resolve + the issue. + + Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that + Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC + errors may be received either by the CX4 Server adapter or at the switch. If + this situation occurs, using a different cable assembly may resolve the issue. + + + Jumbo Frames System Requirement + ------------------------------- + Memory allocation failures have been observed on Linux systems with 64 MB + of RAM or less that are running Jumbo Frames. If you are using Jumbo + Frames, your system may require more than the advertised minimum + requirement of 64 MB of system memory. + + + Performance Degradation with Jumbo Frames + ----------------------------------------- + Degradation in throughput performance may be observed in some Jumbo frames + environments. If this is observed, increasing the application's socket buffer + size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. + See the specific application manual and /usr/src/linux*/Documentation/ + networking/ip-sysctl.txt for more details. + + + Allocating Rx Buffers when Using Jumbo Frames + --------------------------------------------- + Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if + the available memory is heavily fragmented. This issue may be seen with PCI-X + adapters or with packet split disabled. This can be reduced or eliminated + by changing the amount of available memory for receive buffer allocation, by + increasing /proc/sys/vm/min_free_kbytes. + + + Multiple Interfaces on Same Ethernet Broadcast Network + ------------------------------------------------------ + Due to the default ARP behavior on Linux, it is not possible to have + one system on two IP networks in the same Ethernet broadcast domain + (non-partitioned switch) behave as expected. All Ethernet interfaces + will respond to IP traffic for any IP address assigned to the system. + This results in unbalanced receive traffic. + + If you have multiple interfaces in a server, do either of the following: + + - Turn on ARP filtering by entering: + echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter + + - Install the interfaces in separate broadcast domains - either in + different switches or in a switch partitioned to VLANs. + + + UDP Stress Test Dropped Packet Issue + -------------------------------------- + Under small packets UDP stress test with 10GbE driver, the Linux system + may drop UDP packets due to the fullness of socket buffers. You may want + to change the driver's Flow Control variables to the minimum value for + controlling packet reception. + + + Tx Hangs Possible Under Stress + ------------------------------ + Under stress conditions, if TX hangs occur, turning off TSO + "ethtool -K eth0 tso off" may resolve the problem. + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt new file mode 100644 index 000000000..687835415 --- /dev/null +++ b/Documentation/networking/ixgbe.txt @@ -0,0 +1,349 @@ +Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of +Adapters +============================================================================= + +Intel 10 Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Performance Tuning +- Known Issues +- Support + +Identifying Your Adapter +======================== + +The driver in this release is compatible with 82598, 82599 and X540-based +Intel Network Connections. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + +SFP+ Devices with Pluggable Optics +---------------------------------- + +82599-BASED ADAPTERS + +NOTES: If your 82599-based Intel(R) Network Adapter came with Intel optics, or +is an Intel(R) Ethernet Server Adapter X520-2, then it only supports Intel +optics and/or the direct attach cables listed below. + +When 82599-based SFP+ devices are connected back to back, they should be set to +the same Speed setting via ethtool. Results may vary if you mix speed settings. +82598-based adapters support all passive direct attach cables that comply +with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach +cables are not supported. + +Supplier Type Part Numbers + +SR Modules +Intel DUAL RATE 1G/10G SFP+ SR (bailed) FTLX8571D3BCV-IT +Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDDZ-IN1 +Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDZ-IN2 +LR Modules +Intel DUAL RATE 1G/10G SFP+ LR (bailed) FTLX1471D3BCV-IT +Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDDZ-IN1 +Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDZ-IN2 + +The following is a list of 3rd party SFP+ modules and direct attach cables that +have received some testing. Not all modules are applicable to all devices. + +Supplier Type Part Numbers + +Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL +Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ +Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL + +Finisar DUAL RATE 1G/10G SFP+ SR (No Bail) FTLX8571D3QCV-IT +Avago DUAL RATE 1G/10G SFP+ SR (No Bail) AFBR-703SDZ-IN1 +Finisar DUAL RATE 1G/10G SFP+ LR (No Bail) FTLX1471D3QCV-IT +Avago DUAL RATE 1G/10G SFP+ LR (No Bail) AFCT-701SDZ-IN1 +Finistar 1000BASE-T SFP FCLF8522P2BTL +Avago 1000BASE-T SFP ABCU-5710RZ + +82599-based adapters support all passive and active limiting direct attach +cables that comply with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. + +Laser turns off for SFP+ when device is down +------------------------------------------- +"ip link set down" turns off the laser for 82599-based SFP+ fiber adapters. +"ip link set up" turns on the laser. + + +82598-BASED ADAPTERS + +NOTES for 82598-Based Adapters: +- Intel(R) Network Adapters that support removable optical modules only support + their original module type (i.e., the Intel(R) 10 Gigabit SR Dual Port + Express Module only supports SR optical modules). If you plug in a different + type of module, the driver will not load. +- Hot Swapping/hot plugging optical modules is not supported. +- Only single speed, 10 gigabit modules are supported. +- LAN on Motherboard (LOMs) may support DA, SR, or LR modules. Other module + types are not supported. Please see your system documentation for details. + +The following is a list of 3rd party SFP+ modules and direct attach cables that +have received some testing. Not all modules are applicable to all devices. + +Supplier Type Part Numbers + +Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL +Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ +Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL + +82598-based adapters support all passive direct attach cables that comply +with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach +cables are not supported. + + +Flow Control +------------ +Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable +receiving and transmitting pause frames for ixgbe. When TX is enabled, PAUSE +frames are generated when the receive packet buffer crosses a predefined +threshold. When rx is enabled, the transmit unit will halt for the time delay +specified when a PAUSE frame is received. + +Flow Control is enabled by default. If you want to disable a flow control +capable link partner, use ethtool: + + ethtool -A eth? autoneg off RX off TX off + +NOTE: For 82598 backplane cards entering 1 gig mode, flow control default +behavior is changed to off. Flow control in 1 gig mode on these devices can +lead to Tx hangs. + +Intel(R) Ethernet Flow Director +------------------------------- +Supports advanced filters that direct receive packets by their flows to +different queues. Enables tight control on routing a flow in the platform. +Matches flows and CPU cores for flow affinity. Supports multiple parameters +for flexible flow classification and load balancing. + +Flow director is enabled only if the kernel is multiple TX queue capable. + +An included script (set_irq_affinity.sh) automates setting the IRQ to CPU +affinity. + +You can verify that the driver is using Flow Director by looking at the counter +in ethtool: fdir_miss and fdir_match. + +Other ethtool Commands: +To enable Flow Director + ethtool -K ethX ntuple on +To add a filter + Use -U switch. e.g., ethtool -U ethX flow-type tcp4 src-ip 10.0.128.23 + action 1 +To see the list of filters currently present: + ethtool -u ethX + +Perfect Filter: Perfect filter is an interface to load the filter table that +funnels all flow into queue_0 unless an alternative queue is specified using +"action". In that case, any flow that matches the filter criteria will be +directed to the appropriate queue. + +If the queue is defined as -1, filter will drop matching packets. + +To account for filter matches and misses, there are two stats in ethtool: +fdir_match and fdir_miss. In addition, rx_queue_N_packets shows the number of +packets processed by the Nth queue. + +NOTE: Receive Packet Steering (RPS) and Receive Flow Steering (RFS) are not +compatible with Flow Director. IF Flow Director is enabled, these will be +disabled. + +The following three parameters impact Flow Director. + +FdirMode +-------- +Valid Range: 0-2 (0=off, 1=ATR, 2=Perfect filter mode) +Default Value: 1 + + Flow Director filtering modes. + +FdirPballoc +----------- +Valid Range: 0-2 (0=64k, 1=128k, 2=256k) +Default Value: 0 + + Flow Director allocated packet buffer size. + +AtrSampleRate +-------------- +Valid Range: 1-100 +Default Value: 20 + + Software ATR Tx packet sample rate. For example, when set to 20, every 20th + packet, looks to see if the packet will create a new flow. + +Node +---- +Valid Range: 0-n +Default Value: 1 (off) + + 0 - n: where n is the number of NUMA nodes (i.e. 0 - 3) currently online in + your system + 1: turns this option off + + The Node parameter will allow you to pick which NUMA node you want to have + the adapter allocate memory on. + +max_vfs +------- +Valid Range: 1-63 +Default Value: 0 + + If the value is greater than 0 it will also force the VMDq parameter to be 1 + or more. + + This parameter adds support for SR-IOV. It causes the driver to spawn up to + max_vfs worth of virtual function. + + +Additional Configurations +========================= + + Jumbo Frames + ------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16110. Use the ip command to + increase the MTU size. For example: + + ip link set dev ethx mtu 9000 + + The maximum MTU setting for Jumbo Frames is 9710. This value coincides + with the maximum Jumbo Frames size of 9728. + + Generic Receive Offload, aka GRO + -------------------------------- + The driver supports the in-kernel software implementation of GRO. GRO has + shown that by coalescing Rx traffic into larger chunks of data, CPU + utilization can be significantly reduced when under large Rx load. GRO is an + evolution of the previously-used LRO interface. GRO is able to coalesce + other protocols besides TCP. It's also safe to use with configurations that + are problematic for LRO, namely bridging and iSCSI. + + Data Center Bridging, aka DCB + ----------------------------- + DCB is a configuration Quality of Service implementation in hardware. + It uses the VLAN priority tag (802.1p) to filter traffic. That means + that there are 8 different priorities that traffic can be filtered into. + It also enables priority flow control which can limit or eliminate the + number of dropped packets during network stress. Bandwidth can be + allocated to each of these priorities, which is enforced at the hardware + level. + + To enable DCB support in ixgbe, you must enable the DCB netlink layer to + allow the userspace tools (see below) to communicate with the driver. + This can be found in the kernel configuration here: + + -> Networking support + -> Networking options + -> Data Center Bridging support + + Once this is selected, DCB support must be selected for ixgbe. This can + be found here: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet (10000 Mbit) (NETDEV_10000 [=y]) + -> Intel(R) 10GbE PCI Express adapters support + -> Data Center Bridging (DCB) Support + + After these options are selected, you must rebuild your kernel and your + modules. + + In order to use DCB, userspace tools must be downloaded and installed. + The dcbd tools can be found at: + + http://e1000.sf.net + + Ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool/ + + FCoE + ---- + This release of the ixgbe driver contains new code to enable users to use + Fiber Channel over Ethernet (FCoE) and Data Center Bridging (DCB) + functionality that is supported by the 82598-based hardware. This code has + no default effect on the regular driver operation, and configuring DCB and + FCoE is outside the scope of this driver README. Refer to + http://www.open-fcoe.org/ for FCoE project information and contact + e1000-eedc@lists.sourceforge.net for DCB information. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF (n) + + Where n=the VF that attempted to do the spoofing. + + +Performance Tuning +================== + +An excellent article on performance tuning can be found at: + +http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf + + +Known Issues +============ + + Enabling SR-IOV in a 32-bit or 64-bit Microsoft* Windows* Server 2008/R2 + Guest OS using Intel (R) 82576-based GbE or Intel (R) 82599-based 10GbE + controller under KVM + ------------------------------------------------------------------------ + KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This + includes traditional PCIe devices, as well as SR-IOV-capable devices using + Intel 82576-based and 82599-based controllers. + + While direct assignment of a PCIe device or an SR-IOV Virtual Function (VF) + to a Linux-based VM running 2.6.32 or later kernel works fine, there is a + known issue with Microsoft Windows Server 2008 VM that results in a "yellow + bang" error. This problem is within the KVM VMM itself, not the Intel driver, + or the SR-IOV logic of the VMM, but rather that KVM emulates an older CPU + model for the guests, and this older CPU model does not support MSI-X + interrupts, which is a requirement for Intel SR-IOV. + + If you wish to use the Intel 82576 or 82599-based controllers in SR-IOV mode + with KVM and a Microsoft Windows Server 2008 guest try the following + workaround. The workaround is to tell KVM to emulate a different model of CPU + when using qemu to create the KVM guest: + + "-cpu qemu64,model=13" + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://e1000.sourceforge.net + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ixgbevf.txt b/Documentation/networking/ixgbevf.txt new file mode 100644 index 000000000..53d8d2a5a --- /dev/null +++ b/Documentation/networking/ixgbevf.txt @@ -0,0 +1,52 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Known Issues/Troubleshooting +- Support + +This file describes the ixgbevf Linux* Base Driver for Intel Network +Connection. + +The ixgbevf driver supports 82599-based virtual function devices that can only +be activated on kernels with CONFIG_PCI_IOV enabled. + +The ixgbevf driver supports virtual functions generated by the ixgbe driver +with a max_vfs value of 1 or greater. + +The guest OS loading the ixgbevf driver must support MSI-X interrupts. + +VLANs: There is a limit of a total of 32 shared VLANs to 1 or more VFs. + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Known Issues/Troubleshooting +============================ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst new file mode 100644 index 000000000..f03ae64be --- /dev/null +++ b/Documentation/networking/kapi.rst @@ -0,0 +1,171 @@ +========================================= +Linux Networking and Network Devices APIs +========================================= + +Linux Networking +================ + +Networking Base Types +--------------------- + +.. kernel-doc:: include/linux/net.h + :internal: + +Socket Buffer Functions +----------------------- + +.. kernel-doc:: include/linux/skbuff.h + :internal: + +.. kernel-doc:: include/net/sock.h + :internal: + +.. kernel-doc:: net/socket.c + :export: + +.. kernel-doc:: net/core/skbuff.c + :export: + +.. kernel-doc:: net/core/sock.c + :export: + +.. kernel-doc:: net/core/datagram.c + :export: + +.. kernel-doc:: net/core/stream.c + :export: + +Socket Filter +------------- + +.. kernel-doc:: net/core/filter.c + :export: + +Generic Network Statistics +-------------------------- + +.. kernel-doc:: include/uapi/linux/gen_stats.h + :internal: + +.. kernel-doc:: net/core/gen_stats.c + :export: + +.. kernel-doc:: net/core/gen_estimator.c + :export: + +SUN RPC subsystem +----------------- + +.. kernel-doc:: net/sunrpc/xdr.c + :export: + +.. kernel-doc:: net/sunrpc/svc_xprt.c + :export: + +.. kernel-doc:: net/sunrpc/xprt.c + :export: + +.. kernel-doc:: net/sunrpc/sched.c + :export: + +.. kernel-doc:: net/sunrpc/socklib.c + :export: + +.. kernel-doc:: net/sunrpc/stats.c + :export: + +.. kernel-doc:: net/sunrpc/rpc_pipe.c + :export: + +.. kernel-doc:: net/sunrpc/rpcb_clnt.c + :export: + +.. kernel-doc:: net/sunrpc/clnt.c + :export: + +WiMAX +----- + +.. kernel-doc:: net/wimax/op-msg.c + :export: + +.. kernel-doc:: net/wimax/op-reset.c + :export: + +.. kernel-doc:: net/wimax/op-rfkill.c + :export: + +.. kernel-doc:: net/wimax/stack.c + :export: + +.. kernel-doc:: include/net/wimax.h + :internal: + +.. kernel-doc:: include/uapi/linux/wimax.h + :internal: + +Network device support +====================== + +Driver Support +-------------- + +.. kernel-doc:: net/core/dev.c + :export: + +.. kernel-doc:: net/ethernet/eth.c + :export: + +.. kernel-doc:: net/sched/sch_generic.c + :export: + +.. kernel-doc:: include/linux/etherdevice.h + :internal: + +.. kernel-doc:: include/linux/netdevice.h + :internal: + +PHY Support +----------- + +.. kernel-doc:: drivers/net/phy/phy.c + :export: + +.. kernel-doc:: drivers/net/phy/phy.c + :internal: + +.. kernel-doc:: drivers/net/phy/phy_device.c + :export: + +.. kernel-doc:: drivers/net/phy/phy_device.c + :internal: + +.. kernel-doc:: drivers/net/phy/mdio_bus.c + :export: + +.. kernel-doc:: drivers/net/phy/mdio_bus.c + :internal: + +PHYLINK +------- + + PHYLINK interfaces traditional network drivers with PHYLIB, fixed-links, + and SFF modules (eg, hot-pluggable SFP) that may contain PHYs. PHYLINK + provides management of the link state and link modes. + +.. kernel-doc:: include/linux/phylink.h + :internal: + +.. kernel-doc:: drivers/net/phy/phylink.c + +SFP support +----------- + +.. kernel-doc:: drivers/net/phy/sfp-bus.c + :internal: + +.. kernel-doc:: include/linux/sfp.h + :internal: + +.. kernel-doc:: drivers/net/phy/sfp-bus.c + :export: diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt new file mode 100644 index 000000000..b773a5278 --- /dev/null +++ b/Documentation/networking/kcm.txt @@ -0,0 +1,285 @@ +Kernel Connection Multiplexor +----------------------------- + +Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based +interface over TCP for generic application protocols. With KCM an application +can efficiently send and receive application protocol messages over TCP using +datagram sockets. + +KCM implements an NxM multiplexor in the kernel as diagrammed below: + ++------------+ +------------+ +------------+ +------------+ +| KCM socket | | KCM socket | | KCM socket | | KCM socket | ++------------+ +------------+ +------------+ +------------+ + | | | | + +-----------+ | | +----------+ + | | | | + +----------------------------------+ + | Multiplexor | + +----------------------------------+ + | | | | | + +---------+ | | | ------------+ + | | | | | ++----------+ +----------+ +----------+ +----------+ +----------+ +| Psock | | Psock | | Psock | | Psock | | Psock | ++----------+ +----------+ +----------+ +----------+ +----------+ + | | | | | ++----------+ +----------+ +----------+ +----------+ +----------+ +| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | ++----------+ +----------+ +----------+ +----------+ +----------+ + +KCM sockets +----------- + +The KCM sockets provide the user interface to the multiplexor. All the KCM sockets +bound to a multiplexor are considered to have equivalent function, and I/O +operations in different sockets may be done in parallel without the need for +synchronization between threads in userspace. + +Multiplexor +----------- + +The multiplexor provides the message steering. In the transmit path, messages +written on a KCM socket are sent atomically on an appropriate TCP socket. +Similarly, in the receive path, messages are constructed on each TCP socket +(Psock) and complete messages are steered to a KCM socket. + +TCP sockets & Psocks +-------------------- + +TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated +for each bound TCP socket, this structure holds the state for constructing +messages on receive as well as other connection specific information for KCM. + +Connected mode semantics +------------------------ + +Each multiplexor assumes that all attached TCP connections are to the same +destination and can use the different connections for load balancing when +transmitting. The normal send and recv calls (include sendmmsg and recvmmsg) +can be used to send and receive messages from the KCM socket. + +Socket types +------------ + +KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. + +Message delineation +------------------- + +Messages are sent over a TCP stream with some application protocol message +format that typically includes a header which frames the messages. The length +of a received message can be deduced from the application protocol header +(often just a simple length field). + +A TCP stream must be parsed to determine message boundaries. Berkeley Packet +Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a +BPF program must be specified. The program is called at the start of receiving +a new message and is given an skbuff that contains the bytes received so far. +It parses the message header and returns the length of the message. Given this +information, KCM will construct the message of the stated length and deliver it +to a KCM socket. + +TCP socket management +--------------------- + +When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and +write space available (POLLOUT) events are handled by the multiplexor. If there +is a state change (disconnection) or other error on a TCP socket, an error is +posted on the TCP socket so that a POLLERR event happens and KCM discontinues +using the socket. When the application gets the error notification for a +TCP socket, it should unattach the socket from KCM and then handle the error +condition (the typical response is to close the socket and create a new +connection if necessary). + +KCM limits the maximum receive message size to be the size of the receive +socket buffer on the attached TCP socket (the socket buffer size can be set by +SO_RCVBUF). If the length of a new message reported by the BPF program is +greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP +socket. The BPF program may also enforce a maximum messages size and report an +error when it is exceeded. + +A timeout may be set for assembling messages on a receive socket. The timeout +value is taken from the receive timeout of the attached TCP socket (this is set +by SO_RCVTIMEO). If the timer expires before assembly is complete an error +(ETIMEDOUT) is posted on the socket. + +User interface +============== + +Creating a multiplexor +---------------------- + +A new multiplexor and initial KCM socket is created by a socket call: + + socket(AF_KCM, type, protocol) + + - type is either SOCK_DGRAM or SOCK_SEQPACKET + - protocol is KCMPROTO_CONNECTED + +Cloning KCM sockets +------------------- + +After the first KCM socket is created using the socket call as described +above, additional sockets for the multiplexor can be created by cloning +a KCM socket. This is accomplished by an ioctl on a KCM socket: + + /* From linux/kcm.h */ + struct kcm_clone { + int fd; + }; + + struct kcm_clone info; + + memset(&info, 0, sizeof(info)); + + err = ioctl(kcmfd, SIOCKCMCLONE, &info); + + if (!err) + newkcmfd = info.fd; + +Attach transport sockets +------------------------ + +Attaching of transport sockets to a multiplexor is performed by calling an +ioctl on a KCM socket for the multiplexor. e.g.: + + /* From linux/kcm.h */ + struct kcm_attach { + int fd; + int bpf_fd; + }; + + struct kcm_attach info; + + memset(&info, 0, sizeof(info)); + + info.fd = tcpfd; + info.bpf_fd = bpf_prog_fd; + + ioctl(kcmfd, SIOCKCMATTACH, &info); + +The kcm_attach structure contains: + fd: file descriptor for TCP socket being attached + bpf_prog_fd: file descriptor for compiled BPF program downloaded + +Unattach transport sockets +-------------------------- + +Unattaching a transport socket from a multiplexor is straightforward. An +"unattach" ioctl is done with the kcm_unattach structure as the argument: + + /* From linux/kcm.h */ + struct kcm_unattach { + int fd; + }; + + struct kcm_unattach info; + + memset(&info, 0, sizeof(info)); + + info.fd = cfd; + + ioctl(fd, SIOCKCMUNATTACH, &info); + +Disabling receive on KCM socket +------------------------------- + +A setsockopt is used to disable or enable receiving on a KCM socket. +When receive is disabled, any pending messages in the socket's +receive buffer are moved to other sockets. This feature is useful +if an application thread knows that it will be doing a lot of +work on a request and won't be able to service new messages for a +while. Example use: + + int val = 1; + + setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)) + +BFP programs for message delineation +------------------------------------ + +BPF programs can be compiled using the BPF LLVM backend. For example, +the BPF program for parsing Thrift is: + + #include "bpf.h" /* for __sk_buff */ + #include "bpf_helpers.h" /* for load_word intrinsic */ + + SEC("socket_kcm") + int bpf_prog1(struct __sk_buff *skb) + { + return load_word(skb, 0) + 4; + } + + char _license[] SEC("license") = "GPL"; + +Use in applications +=================== + +KCM accelerates application layer protocols. Specifically, it allows +applications to use a message based interface for sending and receiving +messages. The kernel provides necessary assurances that messages are sent +and received atomically. This relieves much of the burden applications have +in mapping a message based protocol onto the TCP stream. KCM also make +application layer messages a unit of work in the kernel for the purposes of +steering and scheduling, which in turn allows a simpler networking model in +multithreaded applications. + +Configurations +-------------- + +In an Nx1 configuration, KCM logically provides multiple socket handles +to the same TCP connection. This allows parallelism between in I/O +operations on the TCP socket (for instance copyin and copyout of data is +parallelized). In an application, a KCM socket can be opened for each +processing thread and inserted into the epoll (similar to how SO_REUSEPORT +is used to allow multiple listener sockets on the same port). + +In a MxN configuration, multiple connections are established to the +same destination. These are used for simple load balancing. + +Message batching +---------------- + +The primary purpose of KCM is load balancing between KCM sockets and hence +threads in a nominal use case. Perfect load balancing, that is steering +each received message to a different KCM socket or steering each sent +message to a different TCP socket, can negatively impact performance +since this doesn't allow for affinities to be established. Balancing +based on groups, or batches of messages, can be beneficial for performance. + +On transmit, there are three ways an application can batch (pipeline) +messages on a KCM socket. + 1) Send multiple messages in a single sendmmsg. + 2) Send a group of messages each with a sendmsg call, where all messages + except the last have MSG_BATCH in the flags of sendmsg call. + 3) Create "super message" composed of multiple messages and send this + with a single sendmsg. + +On receive, the KCM module attempts to queue messages received on the +same KCM socket during each TCP ready callback. The targeted KCM socket +changes at each receive ready callback on the KCM socket. The application +does not need to configure this. + +Error handling +-------------- + +An application should include a thread to monitor errors raised on +the TCP connection. Normally, this will be done by placing each +TCP socket attached to a KCM multiplexor in epoll set for POLLERR +event. If an error occurs on an attached TCP socket, KCM sets an EPIPE +on the socket thus waking up the application thread. When the application +sees the error (which may just be a disconnect) it should unattach the +socket from KCM and then close it. It is assumed that once an error is +posted on the TCP socket the data stream is unrecoverable (i.e. an error +may have occurred in the middle of receiving a message). + +TCP connection monitoring +------------------------- + +In KCM there is no means to correlate a message to the TCP socket that +was used to send or receive the message (except in the case there is +only one attached TCP socket). However, the application does retain +an open file descriptor to the socket so it will be able to get statistics +from the socket which can be used in detecting issues (such as high +retransmissions on the socket). diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt new file mode 100644 index 000000000..9bc271cdc --- /dev/null +++ b/Documentation/networking/l2tp.txt @@ -0,0 +1,345 @@ +This document describes how to use the kernel's L2TP drivers to +provide L2TP functionality. L2TP is a protocol that tunnels one or +more sessions over an IP tunnel. It is commonly used for VPNs +(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP +network infrastructure. With L2TPv3, it is also useful as a Layer-2 +tunneling infrastructure. + +Features +======== + +L2TPv2 (PPP over L2TP (UDP tunnels)). +L2TPv3 ethernet pseudowires. +L2TPv3 PPP pseudowires. +L2TPv3 IP encapsulation. +Netlink sockets for L2TPv3 configuration management. + +History +======= + +The original pppol2tp driver was introduced in 2.6.23 and provided +L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP +sessions over a UDP tunnel. + +L2TPv3 (rfc3931) changes the protocol to allow different frame types +to be passed over an L2TP tunnel by moving the PPP-specific parts of +the protocol out of the core L2TP packet headers. Each frame type is +known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM +pseudowires for L2TP are defined in separate RFC standards. Another +change for L2TPv3 is that it can be carried directly over IP with no +UDP header (UDP is optional). It is also possible to create static +unmanaged L2TPv3 tunnels manually without a control protocol +(userspace daemon) to manage them. + +To support L2TPv3, the original pppol2tp driver was split up to +separate the L2TP and PPP functionality. Existing L2TPv2 userspace +apps should be unaffected as the original pppol2tp sockets API is +retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and +sessions. + +Design +====== + +The L2TP protocol separates control and data frames. The L2TP kernel +drivers handle only L2TP data frames; control frames are always +handled by userspace. L2TP control frames carry messages between L2TP +clients/servers and are used to setup / teardown tunnels and +sessions. An L2TP client or server is implemented in userspace. + +Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP +provides L2TPv3 IP encapsulation (no UDP) and is implemented using a +new l2tpip socket family. The tunnel socket is typically created by +userspace, though for unmanaged L2TPv3 tunnels, the socket can also be +created by the kernel. Each L2TP session (pseudowire) gets a network +interface instance. In the case of PPP, these interfaces are created +indirectly by pppd using a pppol2tp socket. In the case of ethernet, +the netdevice is created upon a netlink request to create an L2TPv3 +ethernet pseudowire. + +For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a +mechanism by which PPP frames carried through an L2TP session are +passed through the kernel's PPP subsystem. The standard PPP daemon, +pppd, handles all PPP interaction with the peer. PPP network +interfaces are created for each local PPP endpoint. The kernel's PPP +subsystem arranges for PPP control frames to be delivered to pppd, +while data frames are forwarded as usual. + +For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a +netdevice driver, managing virtual ethernet devices, one per +pseudowire. These interfaces can be managed using standard Linux tools +such as "ip" and "ifconfig". If only IP frames are passed over the +tunnel, the interface can be given an IP addresses of itself and its +peer. If non-IP frames are to be passed over the tunnel, the interface +can be added to a bridge using brctl. All L2TP datapath protocol +functions are handled by the L2TP core driver. + +Each tunnel and session within a tunnel is assigned a unique tunnel_id +and session_id. These ids are carried in the L2TP header of every +control and data packet. (Actually, in L2TPv3, the tunnel_id isn't +present in data frames - it is inferred from the IP connection on +which the packet was received.) The L2TP driver uses the ids to lookup +internal tunnel and/or session contexts to determine how to handle the +packet. Zero tunnel / session ids are treated specially - zero ids are +never assigned to tunnels or sessions in the network. In the driver, +the tunnel context keeps a reference to the tunnel UDP or L2TPIP +socket. The session context holds data that lets the driver interface +to the kernel's network frame type subsystems, i.e. PPP, ethernet. + +Userspace Programming +===================== + +For L2TPv2, there are a number of requirements on the userspace L2TP +daemon in order to use the pppol2tp driver. + +1. Use a UDP socket per tunnel. + +2. Create a single PPPoL2TP socket per tunnel bound to a special null + session id. This is used only for communicating with the driver but + must remain open while the tunnel is active. Opening this tunnel + management socket causes the driver to mark the tunnel socket as an + L2TP UDP encapsulation socket and flags it for use by the + referenced tunnel id. This hooks up the UDP receive path via + udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed + in this special PPPoX socket. + +3. Create a PPPoL2TP socket per L2TP session. This is typically done + by starting pppd with the pppol2tp plugin and appropriate + arguments. A PPPoL2TP tunnel management socket (Step 2) must be + created before the first PPPoL2TP session socket is created. + +When creating PPPoL2TP sockets, the application provides information +to the driver about the socket in a socket connect() call. Source and +destination tunnel and session ids are provided, as well as the file +descriptor of a UDP socket. See struct pppol2tp_addr in +include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are +treated specially. When creating the per-tunnel PPPoL2TP management +socket in Step 2 above, zero source and destination session ids are +specified, which tells the driver to prepare the supplied UDP file +descriptor for use as an L2TP tunnel socket. + +Userspace may control behavior of the tunnel or session using +setsockopt and ioctl on the PPPoX socket. The following socket +options are supported:- + +DEBUG - bitmask of debug message categories. See below. +SENDSEQ - 0 => don't send packets with sequence numbers + 1 => send packets with sequence numbers +RECVSEQ - 0 => receive packet sequence numbers are optional + 1 => drop receive packets without sequence numbers +LNSMODE - 0 => act as LAC. + 1 => act as LNS. +REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder. + +Only the DEBUG option is supported by the special tunnel management +PPPoX socket. + +In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided +to retrieve tunnel and session statistics from the kernel using the +PPPoX socket of the appropriate tunnel or session. + +For L2TPv3, userspace must use the netlink API defined in +include/linux/l2tp.h to manage tunnel and session contexts. The +general procedure to create a new L2TP tunnel with one session is:- + +1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel + using netlink. + +2. Create a UDP or L2TPIP socket for the tunnel. + +3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE + request. Set attributes according to desired tunnel parameters, + referencing the UDP or L2TPIP socket created in the previous step. + +4. Create a new L2TP session in the tunnel using a + L2TP_CMD_SESSION_CREATE request. + +The tunnel and all of its sessions are closed when the tunnel socket +is closed. The netlink API may also be used to delete sessions and +tunnels. Configuration and status info may be set or read using netlink. + +The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These +are where there is no L2TP control message exchange with the peer to +setup the tunnel; the tunnel is configured manually at each end of the +tunnel. There is no need for an L2TP userspace application in this +case -- the tunnel socket is created by the kernel and configured +using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink +request. The "ip" utility of iproute2 has commands for managing static +L2TPv3 tunnels; do "ip l2tp help" for more information. + +Debugging +========= + +The driver supports a flexible debug scheme where kernel trace +messages may be optionally enabled per tunnel and per session. Care is +needed when debugging a live system since the messages are not +rate-limited and a busy system could be swamped. Userspace uses +setsockopt on the PPPoX socket to set a debug mask. + +The following debug mask bits are available: + +L2TP_MSG_DEBUG verbose debug (if compiled in) +L2TP_MSG_CONTROL userspace - kernel interface +L2TP_MSG_SEQ sequence numbers handling +L2TP_MSG_DATA data packets + +If enabled, files under a l2tp debugfs directory can be used to dump +kernel state about L2TP tunnels and sessions. To access it, the +debugfs filesystem must first be mounted. + +# mount -t debugfs debugfs /debug + +Files under the l2tp directory can then be accessed. + +# cat /debug/l2tp/tunnels + +The debugfs files should not be used by applications to obtain L2TP +state information because the file format is subject to change. It is +implemented to provide extra debug information to help diagnose +problems.) Users should use the netlink API. + +/proc/net/pppol2tp is also provided for backwards compatibility with +the original pppol2tp driver. It lists information about L2TPv2 +tunnels and sessions only. Its use is discouraged. + +Unmanaged L2TPv3 Tunnels +======================== + +Some commercial L2TP products support unmanaged L2TPv3 ethernet +tunnels, where there is no L2TP control protocol; tunnels are +configured at each side manually. New commands are available in +iproute2's ip utility to support this. + +To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1 +and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the +tunnel endpoints:- + +# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \ + udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2 +# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1 +# ip -s -d show dev l2tpeth0 +# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0 +# ip li set dev l2tpeth0 up + +Choose IP addresses to be the address of a local IP interface and that +of the remote system. The IP addresses of the l2tpeth0 interface can be +anything suitable. + +Repeat the above at the peer, with ports, tunnel/session ids and IP +addresses reversed. The tunnel and session IDs can be any non-zero +32-bit number, but the values must be reversed at the peer. + +Host 1 Host2 +udp_sport=5000 udp_sport=5001 +udp_dport=5001 udp_dport=5000 +tunnel_id=42 tunnel_id=45 +peer_tunnel_id=45 peer_tunnel_id=42 +session_id=128 session_id=5196755 +peer_session_id=5196755 peer_session_id=128 + +When done at both ends of the tunnel, it should be possible to send +data over the network. e.g. + +# ping 10.5.1.1 + + +Sample Userspace Code +===================== + +1. Create tunnel management PPPoX socket + + kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); + if (kernel_fd >= 0) { + struct sockaddr_pppol2tp sax; + struct sockaddr_in const *peer_addr; + + peer_addr = l2tp_tunnel_get_peer_addr(tunnel); + memset(&sax, 0, sizeof(sax)); + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */ + sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = peer_addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = 0; /* special case: mgmt socket */ + sax.pppol2tp.d_tunnel = 0; + sax.pppol2tp.d_session = 0; /* special case: mgmt socket */ + + if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) { + perror("connect failed"); + result = -errno; + goto err; + } + } + +2. Create session PPPoX data socket + + struct sockaddr_pppol2tp sax; + int fd; + + /* Note, the target socket must be bound already, else it will not be ready */ + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = tunnel_fd; + sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = session_id; + sax.pppol2tp.d_tunnel = peer_tunnel_id; + sax.pppol2tp.d_session = peer_session_id; + + /* session_fd is the fd of the session's PPPoL2TP socket. + * tunnel_fd is the fd of the tunnel UDP socket. + */ + fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); + if (fd < 0 ) { + return -errno; + } + return 0; + +Internal Implementation +======================= + +The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a +struct l2tp_session context for each session. The l2tp_tunnel is +always associated with a UDP or L2TP/IP socket and keeps a list of +sessions in the tunnel. The l2tp_session context keeps kernel state +about the session. It has private data which is used for data specific +to the session type. With L2TPv2, the session always carried PPP +traffic. With L2TPv3, the session can also carry ethernet frames +(ethernet pseudowire) or other data types such as ATM, HDLC or Frame +Relay. + +When a tunnel is first opened, the reference count on the socket is +increased using sock_hold(). This ensures that the kernel socket +cannot be removed while L2TP's data structures reference it. + +Some L2TP sessions also have a socket (PPP pseudowires) while others +do not (ethernet pseudowires). We can't use the socket reference count +as the reference count for session contexts. The L2TP implementation +therefore has its own internal reference counts on the session +contexts. + +To Do +===== + +Add L2TP tunnel switching support. This would route tunneled traffic +from one L2TP tunnel into another. Specified in +http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08 + +Add L2TPv3 VLAN pseudowire support. + +Add L2TPv3 IP pseudowire support. + +Add L2TPv3 ATM pseudowire support. + +Miscellaneous +============= + +The L2TP drivers were developed as part of the OpenL2TP project by +Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server, +designed from the ground up to have the L2TP datapath in the +kernel. The project also implemented the pppol2tp plugin for pppd +which allows pppd to use the kernel driver. Details can be found at +http://www.openl2tp.org. diff --git a/Documentation/networking/lapb-module.txt b/Documentation/networking/lapb-module.txt new file mode 100644 index 000000000..d4fc8f221 --- /dev/null +++ b/Documentation/networking/lapb-module.txt @@ -0,0 +1,263 @@ + The Linux LAPB Module Interface 1.3 + + Jonathan Naylor 29.12.96 + +Changed (Henner Eisen, 2000-10-29): int return value for data_indication() + +The LAPB module will be a separately compiled module for use by any parts of +the Linux operating system that require a LAPB service. This document +defines the interfaces to, and the services provided by this module. The +term module in this context does not imply that the LAPB module is a +separately loadable module, although it may be. The term module is used in +its more standard meaning. + +The interface to the LAPB module consists of functions to the module, +callbacks from the module to indicate important state changes, and +structures for getting and setting information about the module. + +Structures +---------- + +Probably the most important structure is the skbuff structure for holding +received and transmitted data, however it is beyond the scope of this +document. + +The two LAPB specific structures are the LAPB initialisation structure and +the LAPB parameter structure. These will be defined in a standard header +file, <linux/lapb.h>. The header file <net/lapb.h> is internal to the LAPB +module and is not for use. + +LAPB Initialisation Structure +----------------------------- + +This structure is used only once, in the call to lapb_register (see below). +It contains information about the device driver that requires the services +of the LAPB module. + +struct lapb_register_struct { + void (*connect_confirmation)(int token, int reason); + void (*connect_indication)(int token, int reason); + void (*disconnect_confirmation)(int token, int reason); + void (*disconnect_indication)(int token, int reason); + int (*data_indication)(int token, struct sk_buff *skb); + void (*data_transmit)(int token, struct sk_buff *skb); +}; + +Each member of this structure corresponds to a function in the device driver +that is called when a particular event in the LAPB module occurs. These will +be described in detail below. If a callback is not required (!!) then a NULL +may be substituted. + + +LAPB Parameter Structure +------------------------ + +This structure is used with the lapb_getparms and lapb_setparms functions +(see below). They are used to allow the device driver to get and set the +operational parameters of the LAPB implementation for a given connection. + +struct lapb_parms_struct { + unsigned int t1; + unsigned int t1timer; + unsigned int t2; + unsigned int t2timer; + unsigned int n2; + unsigned int n2count; + unsigned int window; + unsigned int state; + unsigned int mode; +}; + +T1 and T2 are protocol timing parameters and are given in units of 100ms. N2 +is the maximum number of tries on the link before it is declared a failure. +The window size is the maximum number of outstanding data packets allowed to +be unacknowledged by the remote end, the value of the window is between 1 +and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB +link. + +The mode variable is a bit field used for setting (at present) three values. +The bit fields have the following meanings: + +Bit Meaning +0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED). +1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP). +2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE) +3-31 Reserved, must be 0. + +Extended LAPB operation indicates the use of extended sequence numbers and +consequently larger window sizes, the default is standard LAPB operation. +MLP operation is the same as SLP operation except that the addresses used by +LAPB are different to indicate the mode of operation, the default is Single +Link Procedure. The difference between DCE and DTE operation is (i) the +addresses used for commands and responses, and (ii) when the DCE is not +connected, it sends DM without polls set, every T1. The upper case constant +names will be defined in the public LAPB header file. + + +Functions +--------- + +The LAPB module provides a number of function entry points. + + +int lapb_register(void *token, struct lapb_register_struct); + +This must be called before the LAPB module may be used. If the call is +successful then LAPB_OK is returned. The token must be a unique identifier +generated by the device driver to allow for the unique identification of the +instance of the LAPB link. It is returned by the LAPB module in all of the +callbacks, and is used by the device driver in all calls to the LAPB module. +For multiple LAPB links in a single device driver, multiple calls to +lapb_register must be made. The format of the lapb_register_struct is given +above. The return values are: + +LAPB_OK LAPB registered successfully. +LAPB_BADTOKEN Token is already registered. +LAPB_NOMEM Out of memory + + +int lapb_unregister(void *token); + +This releases all the resources associated with a LAPB link. Any current +LAPB link will be abandoned without further messages being passed. After +this call, the value of token is no longer valid for any calls to the LAPB +function. The valid return values are: + +LAPB_OK LAPB unregistered successfully. +LAPB_BADTOKEN Invalid/unknown LAPB token. + + +int lapb_getparms(void *token, struct lapb_parms_struct *parms); + +This allows the device driver to get the values of the current LAPB +variables, the lapb_parms_struct is described above. The valid return values +are: + +LAPB_OK LAPB getparms was successful. +LAPB_BADTOKEN Invalid/unknown LAPB token. + + +int lapb_setparms(void *token, struct lapb_parms_struct *parms); + +This allows the device driver to set the values of the current LAPB +variables, the lapb_parms_struct is described above. The values of t1timer, +t2timer and n2count are ignored, likewise changing the mode bits when +connected will be ignored. An error implies that none of the values have +been changed. The valid return values are: + +LAPB_OK LAPB getparms was successful. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_INVALUE One of the values was out of its allowable range. + + +int lapb_connect_request(void *token); + +Initiate a connect using the current parameter settings. The valid return +values are: + +LAPB_OK LAPB is starting to connect. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_CONNECTED LAPB module is already connected. + + +int lapb_disconnect_request(void *token); + +Initiate a disconnect. The valid return values are: + +LAPB_OK LAPB is starting to disconnect. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_NOTCONNECTED LAPB module is not connected. + + +int lapb_data_request(void *token, struct sk_buff *skb); + +Queue data with the LAPB module for transmitting over the link. If the call +is successful then the skbuff is owned by the LAPB module and may not be +used by the device driver again. The valid return values are: + +LAPB_OK LAPB has accepted the data. +LAPB_BADTOKEN Invalid/unknown LAPB token. +LAPB_NOTCONNECTED LAPB module is not connected. + + +int lapb_data_received(void *token, struct sk_buff *skb); + +Queue data with the LAPB module which has been received from the device. It +is expected that the data passed to the LAPB module has skb->data pointing +to the beginning of the LAPB data. If the call is successful then the skbuff +is owned by the LAPB module and may not be used by the device driver again. +The valid return values are: + +LAPB_OK LAPB has accepted the data. +LAPB_BADTOKEN Invalid/unknown LAPB token. + + +Callbacks +--------- + +These callbacks are functions provided by the device driver for the LAPB +module to call when an event occurs. They are registered with the LAPB +module with lapb_register (see above) in the structure lapb_register_struct +(see above). + + +void (*connect_confirmation)(void *token, int reason); + +This is called by the LAPB module when a connection is established after +being requested by a call to lapb_connect_request (see above). The reason is +always LAPB_OK. + + +void (*connect_indication)(void *token, int reason); + +This is called by the LAPB module when the link is established by the remote +system. The value of reason is always LAPB_OK. + + +void (*disconnect_confirmation)(void *token, int reason); + +This is called by the LAPB module when an event occurs after the device +driver has called lapb_disconnect_request (see above). The reason indicates +what has happened. In all cases the LAPB link can be regarded as being +terminated. The values for reason are: + +LAPB_OK The LAPB link was terminated normally. +LAPB_NOTCONNECTED The remote system was not connected. +LAPB_TIMEDOUT No response was received in N2 tries from the remote + system. + + +void (*disconnect_indication)(void *token, int reason); + +This is called by the LAPB module when the link is terminated by the remote +system or another event has occurred to terminate the link. This may be +returned in response to a lapb_connect_request (see above) if the remote +system refused the request. The values for reason are: + +LAPB_OK The LAPB link was terminated normally by the remote + system. +LAPB_REFUSED The remote system refused the connect request. +LAPB_NOTCONNECTED The remote system was not connected. +LAPB_TIMEDOUT No response was received in N2 tries from the remote + system. + + +int (*data_indication)(void *token, struct sk_buff *skb); + +This is called by the LAPB module when data has been received from the +remote system that should be passed onto the next layer in the protocol +stack. The skbuff becomes the property of the device driver and the LAPB +module will not perform any more actions on it. The skb->data pointer will +be pointing to the first byte of data after the LAPB header. + +This method should return NET_RX_DROP (as defined in the header +file include/linux/netdevice.h) if and only if the frame was dropped +before it could be delivered to the upper layer. + + +void (*data_transmit)(void *token, struct sk_buff *skb); + +This is called by the LAPB module when data is to be transmitted to the +remote system by the device driver. The skbuff becomes the property of the +device driver and the LAPB module will not perform any more actions on it. +The skb->data pointer will be pointing to the first byte of the LAPB header. diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt new file mode 100644 index 000000000..0bf3220c7 --- /dev/null +++ b/Documentation/networking/ltpc.txt @@ -0,0 +1,131 @@ +This is the ALPHA version of the ltpc driver. + +In order to use it, you will need at least version 1.3.3 of the +netatalk package, and the Apple or Farallon LocalTalk PC card. +There are a number of different LocalTalk cards for the PC; this +driver applies only to the one with the 65c02 processor chip on it. + +To include it in the kernel, select the CONFIG_LTPC switch in the +configuration dialog. You can also compile it as a module. + +While the driver will attempt to autoprobe the I/O port address, IRQ +line, and DMA channel of the card, this does not always work. For +this reason, you should be prepared to supply these parameters +yourself. (see "Card Configuration" below for how to determine or +change the settings on your card) + +When the driver is compiled into the kernel, you can add a line such +as the following to your /etc/lilo.conf: + + append="ltpc=0x240,9,1" + +where the parameters (in order) are the port address, IRQ, and DMA +channel. The second and third values can be omitted, in which case +the driver will try to determine them itself. + +If you load the driver as a module, you can pass the parameters "io=", +"irq=", and "dma=" on the command line with insmod or modprobe, or add +them as options in a configuration file in /etc/modprobe.d/ directory: + + alias lt0 ltpc # autoload the module when the interface is configured + options ltpc io=0x240 irq=9 dma=1 + +Before starting up the netatalk demons (perhaps in rc.local), you +need to add a line such as: + + /sbin/ifconfig lt0 127.0.0.42 + +The address is unimportant - however, the card needs to be configured +with ifconfig so that Netatalk can find it. + +The appropriate netatalk configuration depends on whether you are +attached to a network that includes AppleTalk routers or not. If, +like me, you are simply connecting to your home Macintoshes and +printers, you need to set up netatalk to "seed". The way I do this +is to have the lines + + dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033" + lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033" + +in my atalkd.conf. What is going on here is that I need to fool +netatalk into thinking that there are two AppleTalk interfaces +present; otherwise, it refuses to seed. This is a hack, and a more +permanent solution would be to alter the netatalk code. Also, make +sure you have the correct name for the dummy interface - If it's +compiled as a module, you will need to refer to it as "dummy0" or some +such. + +If you are attached to an extended AppleTalk network, with routers on +it, then you don't need to fool around with this -- the appropriate +line in atalkd.conf is + + lt0 -phase 1 + +-------------------------------------- + +Card Configuration: + +The interrupts and so forth are configured via the dipswitch on the +board. Set the switches so as not to conflict with other hardware. + + Interrupts -- set at most one. If none are set, the driver uses + polled mode. Because the card was developed in the XT era, the + original documentation refers to IRQ2. Since you'll be running + this on an AT (or later) class machine, that really means IRQ9. + + SW1 IRQ 4 + SW2 IRQ 3 + SW3 IRQ 9 (2 in original card documentation only applies to XT) + + + DMA -- choose DMA 1 or 3, and set both corresponding switches. + + SW4 DMA 3 + SW5 DMA 1 + SW6 DMA 3 + SW7 DMA 1 + + + I/O address -- choose one. + + SW8 220 / 240 + +-------------------------------------- + +IP: + +Yes, it is possible to do IP over LocalTalk. However, you can't just +treat the LocalTalk device like an ordinary Ethernet device, even if +that's what it looks like to Netatalk. + +Instead, you follow the same procedure as for doing IP in EtherTalk. +See Documentation/networking/ipddp.txt for more information about the +kernel driver and userspace tools needed. + +-------------------------------------- + +BUGS: + +IRQ autoprobing often doesn't work on a cold boot. To get around +this, either compile the driver as a module, or pass the parameters +for the card to the kernel as described above. + +Also, as usual, autoprobing is not recommended when you use the driver +as a module. (though it usually works at boot time, at least) + +Polled mode is *really* slow sometimes, but this seems to depend on +the configuration of the network. + +It may theoretically be possible to use two LTPC cards in the same +machine, but this is unsupported, so if you really want to do this, +you'll probably have to hack the initialization code a bit. + +______________________________________ + +THANKS: + Thanks to Alan Cox for helpful discussions early on in this +work, and to Denis Hainsworth for doing the bleeding-edge testing. + +-- Bradford Johnson <bradford@math.umn.edu> + +-- Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org> diff --git a/Documentation/networking/mac80211-auth-assoc-deauth.txt b/Documentation/networking/mac80211-auth-assoc-deauth.txt new file mode 100644 index 000000000..d7a15fe91 --- /dev/null +++ b/Documentation/networking/mac80211-auth-assoc-deauth.txt @@ -0,0 +1,95 @@ +# +# This outlines the Linux authentication/association and +# deauthentication/disassociation flows. +# +# This can be converted into a diagram using the service +# at http://www.websequencediagrams.com/ +# + +participant userspace +participant mac80211 +participant driver + +alt authentication needed (not FT) +userspace->mac80211: authenticate + +alt authenticated/authenticating already +mac80211->driver: sta_state(AP, not-exists) +mac80211->driver: bss_info_changed(clear BSSID) +else associated +note over mac80211,driver +like deauth/disassoc, without sending the +BA session stop & deauth/disassoc frames +end note +end + +mac80211->driver: config(channel, channel type) +mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap) +mac80211->driver: sta_state(AP, exists) + +alt no probe request data known +mac80211->driver: TX directed probe request +driver->mac80211: RX probe response +end + +mac80211->driver: TX auth frame +driver->mac80211: RX auth frame + +alt WEP shared key auth +mac80211->driver: TX auth frame +driver->mac80211: RX auth frame +end + +mac80211->driver: sta_state(AP, authenticated) +mac80211->userspace: RX auth frame + +end + +userspace->mac80211: associate +alt authenticated or associated +note over mac80211,driver: cleanup like for authenticate +end + +alt not previously authenticated (FT) +mac80211->driver: config(channel, channel type) +mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap) +mac80211->driver: sta_state(AP, exists) +mac80211->driver: sta_state(AP, authenticated) +end +mac80211->driver: TX assoc +driver->mac80211: RX assoc response +note over mac80211: init rate control +mac80211->driver: sta_state(AP, associated) + +alt not using WPA +mac80211->driver: sta_state(AP, authorized) +end + +mac80211->driver: set up QoS parameters + +mac80211->driver: bss_info_changed(QoS, HT, associated with AID) +mac80211->userspace: associated + +note left of userspace: associated now + +alt using WPA +note over userspace +do 4-way-handshake +(data frames) +end note +userspace->mac80211: authorized +mac80211->driver: sta_state(AP, authorized) +end + +userspace->mac80211: deauthenticate/disassociate +mac80211->driver: stop BA sessions +mac80211->driver: TX deauth/disassoc +mac80211->driver: flush frames +mac80211->driver: sta_state(AP,associated) +mac80211->driver: sta_state(AP,authenticated) +mac80211->driver: sta_state(AP,exists) +mac80211->driver: sta_state(AP,not-exists) +mac80211->driver: turn off powersave +mac80211->driver: bss_info_changed(clear BSSID, not associated, no QoS, ...) +mac80211->driver: config(channel type to non-HT) +mac80211->userspace: disconnected diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt new file mode 100644 index 000000000..d58d78df9 --- /dev/null +++ b/Documentation/networking/mac80211-injection.txt @@ -0,0 +1,97 @@ +How to use packet injection with mac80211 +========================================= + +mac80211 now allows arbitrary packets to be injected down any Monitor Mode +interface from userland. The packet you inject needs to be composed in the +following format: + + [ radiotap header ] + [ ieee80211 header ] + [ payload ] + +The radiotap format is discussed in +./Documentation/networking/radiotap-headers.txt. + +Despite many radiotap parameters being currently defined, most only make sense +to appear on received packets. The following information is parsed from the +radiotap headers and used to control injection: + + * IEEE80211_RADIOTAP_FLAGS + + IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated + IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available + IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the + current fragmentation threshold. + + * IEEE80211_RADIOTAP_TX_FLAGS + + IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for + an ACK even if it is a unicast frame + + * IEEE80211_RADIOTAP_RATE + + legacy rate for the transmission (only for devices without own rate control) + + * IEEE80211_RADIOTAP_MCS + + HT rate for the transmission (only for devices without own rate control). + Also some flags are parsed + + IEEE80211_RADIOTAP_MCS_SGI: use short guard interval + IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode + + * IEEE80211_RADIOTAP_DATA_RETRIES + + number of retries when either IEEE80211_RADIOTAP_RATE or + IEEE80211_RADIOTAP_MCS was used + + * IEEE80211_RADIOTAP_VHT + + VHT mcs and number of streams used in the transmission (only for devices + without own rate control). Also other fields are parsed + + flags field + IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval + + bandwidth field + 1: send using 40MHz channel width + 4: send using 80MHz channel width + 11: send using 160MHz channel width + +The injection code can also skip all other currently defined radiotap fields +facilitating replay of captured radiotap headers directly. + +Here is an example valid radiotap header defining some parameters + + 0x00, 0x00, // <-- radiotap version + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate + 0x0c, //<-- tx power + 0x01 //<-- antenna + +The ieee80211 header follows immediately afterwards, looking for example like +this: + + 0x08, 0x01, 0x00, 0x00, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x10, 0x86 + +Then lastly there is the payload. + +After composing the packet contents, it is sent by send()-ing it to a logical +mac80211 interface that is in Monitor mode. Libpcap can also be used, +(which is easier than doing the work to bind the socket to the right +interface), along the following lines: + + ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf); +... + r = pcap_inject(ppcap, u8aSendBuffer, nLength); + +You can also find a link to a complete inject application here: + +http://wireless.kernel.org/en/users/Documentation/packetspammer + +Andy Green <andy@warmcat.com> diff --git a/Documentation/networking/mac80211_hwsim/README b/Documentation/networking/mac80211_hwsim/README new file mode 100644 index 000000000..3566a725d --- /dev/null +++ b/Documentation/networking/mac80211_hwsim/README @@ -0,0 +1,68 @@ +mac80211_hwsim - software simulator of 802.11 radio(s) for mac80211 +Copyright (c) 2008, Jouni Malinen <j@w1.fi> + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License version 2 as +published by the Free Software Foundation. + + +Introduction + +mac80211_hwsim is a Linux kernel module that can be used to simulate +arbitrary number of IEEE 802.11 radios for mac80211. It can be used to +test most of the mac80211 functionality and user space tools (e.g., +hostapd and wpa_supplicant) in a way that matches very closely with +the normal case of using real WLAN hardware. From the mac80211 view +point, mac80211_hwsim is yet another hardware driver, i.e., no changes +to mac80211 are needed to use this testing tool. + +The main goal for mac80211_hwsim is to make it easier for developers +to test their code and work with new features to mac80211, hostapd, +and wpa_supplicant. The simulated radios do not have the limitations +of real hardware, so it is easy to generate an arbitrary test setup +and always reproduce the same setup for future tests. In addition, +since all radio operation is simulated, any channel can be used in +tests regardless of regulatory rules. + +mac80211_hwsim kernel module has a parameter 'radios' that can be used +to select how many radios are simulated (default 2). This allows +configuration of both very simply setups (e.g., just a single access +point and a station) or large scale tests (multiple access points with +hundreds of stations). + +mac80211_hwsim works by tracking the current channel of each virtual +radio and copying all transmitted frames to all other radios that are +currently enabled and on the same channel as the transmitting +radio. Software encryption in mac80211 is used so that the frames are +actually encrypted over the virtual air interface to allow more +complete testing of encryption. + +A global monitoring netdev, hwsim#, is created independent of +mac80211. This interface can be used to monitor all transmitted frames +regardless of channel. + + +Simple example + +This example shows how to use mac80211_hwsim to simulate two radios: +one to act as an access point and the other as a station that +associates with the AP. hostapd and wpa_supplicant are used to take +care of WPA2-PSK authentication. In addition, hostapd is also +processing access point side of association. + + +# Build mac80211_hwsim as part of kernel configuration + +# Load the module +modprobe mac80211_hwsim + +# Run hostapd (AP) for wlan0 +hostapd hostapd.conf + +# Run wpa_supplicant (station) for wlan1 +wpa_supplicant -Dnl80211 -iwlan1 -c wpa_supplicant.conf + + +More test cases are available in hostap.git: +git://w1.fi/srv/git/hostap.git and mac80211_hwsim/tests subdirectory +(http://w1.fi/gitweb/gitweb.cgi?p=hostap.git;a=tree;f=mac80211_hwsim/tests) diff --git a/Documentation/networking/mac80211_hwsim/hostapd.conf b/Documentation/networking/mac80211_hwsim/hostapd.conf new file mode 100644 index 000000000..08cde7e35 --- /dev/null +++ b/Documentation/networking/mac80211_hwsim/hostapd.conf @@ -0,0 +1,11 @@ +interface=wlan0 +driver=nl80211 + +hw_mode=g +channel=1 +ssid=mac80211 test + +wpa=2 +wpa_key_mgmt=WPA-PSK +wpa_pairwise=CCMP +wpa_passphrase=12345678 diff --git a/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf new file mode 100644 index 000000000..299128cff --- /dev/null +++ b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf @@ -0,0 +1,10 @@ +ctrl_interface=/var/run/wpa_supplicant + +network={ + ssid="mac80211 test" + psk="12345678" + key_mgmt=WPA-PSK + proto=WPA2 + pairwise=CCMP + group=CCMP +} diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt new file mode 100644 index 000000000..2f24a1912 --- /dev/null +++ b/Documentation/networking/mpls-sysctl.txt @@ -0,0 +1,48 @@ +/proc/sys/net/mpls/* Variables: + +platform_labels - INTEGER + Number of entries in the platform label table. It is not + possible to configure forwarding for label values equal to or + greater than the number of platform labels. + + A dense utilization of the entries in the platform label table + is possible and expected as the platform labels are locally + allocated. + + If the number of platform label table entries is set to 0 no + label will be recognized by the kernel and mpls forwarding + will be disabled. + + Reducing this value will remove all label routing entries that + no longer fit in the table. + + Possible values: 0 - 1048575 + Default: 0 + +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + 0 - disabled / RFC 3443 [Short] Pipe Model + 1 - enabled / RFC 3443 Uniform Model (default) + +default_ttl - BOOL + Default TTL value to use for MPLS packets where it cannot be + propagated from an IP header, either because one isn't present + or ip_ttl_propagate has been disabled. + + Possible values: 1 - 255 + Default: 255 + +conf/<interface>/input - BOOL + Control whether packets can be input on this interface. + + If disabled, packets will be discarded without further + processing. + + 0 - disabled (default) + not 0 - enabled diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst new file mode 100644 index 000000000..fe46d4867 --- /dev/null +++ b/Documentation/networking/msg_zerocopy.rst @@ -0,0 +1,256 @@ + +============ +MSG_ZEROCOPY +============ + +Intro +===== + +The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. +The feature is currently implemented for TCP sockets. + + +Opportunity and Caveats +----------------------- + +Copying large buffers between user process and kernel can be +expensive. Linux supports various interfaces that eschew copying, +such as sendpage and splice. The MSG_ZEROCOPY flag extends the +underlying copy avoidance mechanism to common socket send calls. + +Copy avoidance is not a free lunch. As implemented, with page pinning, +it replaces per byte copy cost with page accounting and completion +notification overhead. As a result, MSG_ZEROCOPY is generally only +effective at writes over around 10 KB. + +Page pinning also changes system call semantics. It temporarily shares +the buffer between process and network stack. Unlike with copying, the +process cannot immediately overwrite the buffer after system call +return without possibly modifying the data in flight. Kernel integrity +is not affected, but a buggy program can possibly corrupt its own data +stream. + +The kernel returns a notification when it is safe to modify data. +Converting an existing application to MSG_ZEROCOPY is not always as +trivial as just passing the flag, then. + + +More Info +--------- + +Much of this document was derived from a longer paper presented at +netdev 2.1. For more in-depth information see that paper and talk, +the excellent reporting over at LWN.net or read the original code. + + paper, slides, video + https://netdevconf.org/2.1/session.html?debruijn + + LWN article + https://lwn.net/Articles/726917/ + + patchset + [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY + http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com + + +Interface +========= + +Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy +avoidance, but not the only one. + +Socket Setup +------------ + +The kernel is permissive when applications pass undefined flags to the +send system call. By default it simply ignores these. To avoid enabling +copy avoidance mode for legacy processes that accidentally already pass +this flag, a process must first signal intent by setting a socket option: + +:: + + if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) + error(1, errno, "setsockopt zerocopy"); + +Transmission +------------ + +The change to send (or sendto, sendmsg, sendmmsg) itself is trivial. +Pass the new flag. + +:: + + ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY); + +A zerocopy failure will return -1 with errno ENOBUFS. This happens if +the socket option was not set, the socket exceeds its optmem limit or +the user exceeds its ulimit on locked pages. + + +Mixing copy avoidance and copying +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Many workloads have a mixture of large and small buffers. Because copy +avoidance is more expensive than copying for small packets, the +feature is implemented as a flag. It is safe to mix calls with the flag +with those without. + + +Notifications +------------- + +The kernel has to notify the process when it is safe to reuse a +previously passed buffer. It queues completion notifications on the +socket error queue, akin to the transmit timestamping interface. + +The notification itself is a simple scalar value. Each socket +maintains an internal unsigned 32-bit counter. Each send call with +MSG_ZEROCOPY that successfully sends data increments the counter. The +counter is not incremented on failure or if called with length zero. +The counter counts system call invocations, not bytes. It wraps after +UINT_MAX calls. + + +Notification Reception +~~~~~~~~~~~~~~~~~~~~~~ + +The below snippet demonstrates the API. In the simplest case, each +send syscall is followed by a poll and recvmsg on the error queue. + +Reading from the error queue is always a non-blocking operation. The +poll call is there to block until an error is outstanding. It will set +POLLERR in its output flags. That flag does not have to be set in the +events field. Errors are signaled unconditionally. + +:: + + pfd.fd = fd; + pfd.events = 0; + if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0) + error(1, errno, "poll"); + + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret == -1) + error(1, errno, "recvmsg"); + + read_notification(msg); + +The example is for demonstration purpose only. In practice, it is more +efficient to not wait for notifications, but read without blocking +every couple of send calls. + +Notifications can be processed out of order with other operations on +the socket. A socket that has an error queued would normally block +other operations until the error is read. Zerocopy notifications have +a zero error code, however, to not block send and recv calls. + + +Notification Batching +~~~~~~~~~~~~~~~~~~~~~ + +Multiple outstanding packets can be read at once using the recvmmsg +call. This is often not needed. In each message the kernel returns not +a single value, but a range. It coalesces consecutive notifications +while one is outstanding for reception on the error queue. + +When a new notification is about to be queued, it checks whether the +new value extends the range of the notification at the tail of the +queue. If so, it drops the new notification packet and instead increases +the range upper value of the outstanding notification. + +For protocols that acknowledge data in-order, like TCP, each +notification can be squashed into the previous one, so that no more +than one notification is outstanding at any one point. + +Ordered delivery is the common case, but not guaranteed. Notifications +may arrive out of order on retransmission and socket teardown. + + +Notification Parsing +~~~~~~~~~~~~~~~~~~~~ + +The below snippet demonstrates how to parse the control message: the +read_notification() call in the previous snippet. A notification +is encoded in the standard error format, sock_extended_err. + +The level and type fields in the control data are protocol family +specific, IP_RECVERR or IPV6_RECVERR. + +Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero, +as explained before, to avoid blocking read and write system calls on +the socket. + +The 32-bit notification range is encoded as [ee_info, ee_data]. This +range is inclusive. Other fields in the struct must be treated as +undefined, bar for ee_code, as discussed below. + +:: + + struct sock_extended_err *serr; + struct cmsghdr *cm; + + cm = CMSG_FIRSTHDR(msg); + if (cm->cmsg_level != SOL_IP && + cm->cmsg_type != IP_RECVERR) + error(1, 0, "cmsg"); + + serr = (void *) CMSG_DATA(cm); + if (serr->ee_errno != 0 || + serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) + error(1, 0, "serr"); + + printf("completed: %u..%u\n", serr->ee_info, serr->ee_data); + + +Deferred copies +~~~~~~~~~~~~~~~ + +Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy +avoidance, and a contract that the kernel will queue a completion +notification. It is not a guarantee that the copy is elided. + +Copy avoidance is not always feasible. Devices that do not support +scatter-gather I/O cannot send packets made up of kernel generated +protocol headers plus zerocopy user data. A packet may need to be +converted to a private copy of data deep in the stack, say to compute +a checksum. + +In all these cases, the kernel returns a completion notification when +it releases its hold on the shared pages. That notification may arrive +before the (copied) data is fully transmitted. A zerocopy completion +notification is not a transmit completion notification, therefore. + +Deferred copies can be more expensive than a copy immediately in the +system call, if the data is no longer warm in the cache. The process +also incurs notification processing cost for no benefit. For this +reason, the kernel signals if data was completed with a copy, by +setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return. +A process may use this signal to stop passing flag MSG_ZEROCOPY on +subsequent requests on the same socket. + + +Implementation +============== + +Loopback +-------- + +Data sent to local sockets can be queued indefinitely if the receive +process does not read its socket. Unbound notification latency is not +acceptable. For this reason all packets generated with MSG_ZEROCOPY +that are looped to a local socket will incur a deferred copy. This +includes looping onto packet sockets (e.g., tcpdump) and tun devices. + + +Testing +======= + +More realistic example code can be found in the kernel source under +tools/testing/selftests/net/msg_zerocopy.c. + +Be cognizant of the loopback constraint. The test can be run between +a pair of hosts. But if run between a local pair of processes, for +instance when run with msg_zerocopy.sh between a veth pair across +namespaces, the test will not show any improvement. For testing, the +loopback restriction can be temporarily relaxed by making +skb_orphan_frags_rx identical to skb_orphan_frags. diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt new file mode 100644 index 000000000..4caa0e314 --- /dev/null +++ b/Documentation/networking/multiqueue.txt @@ -0,0 +1,79 @@ + + HOWTO for multiqueue network device support + =========================================== + +Section 1: Base driver requirements for implementing multiqueue support + +Intro: Kernel support for multiqueue devices +--------------------------------------------------------- + +Kernel support for multiqueue devices is always present. + +Section 1: Base driver requirements for implementing multiqueue support +----------------------------------------------------------------------- + +Base drivers are required to use the new alloc_etherdev_mq() or +alloc_netdev_mq() functions to allocate the subqueues for the device. The +underlying kernel API will take care of the allocation and deallocation of +the subqueue memory, as well as netdev configuration of where the queues +exist in memory. + +The base driver will also need to manage the queues as it does the global +netdev->queue_lock today. Therefore base drivers should use the +netif_{start|stop|wake}_subqueue() functions to manage each queue while the +device is still operational. netdev->queue_lock is still used when the device +comes online or when it's completely shut down (unregister_netdev(), etc.). + + +Section 2: Qdisc support for multiqueue devices + +----------------------------------------------- + +Currently two qdiscs are optimized for multiqueue devices. The first is the +default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue. +A new round-robin qdisc, sch_multiq also supports multiple hardware queues. The +qdisc is responsible for classifying the skb's and then directing the skb's to +bands and queues based on the value in skb->queue_mapping. Use this field in +the base driver to determine which queue to send the skb to. + +sch_multiq has been added for hardware that wishes to avoid head-of-line +blocking. It will cycle though the bands and verify that the hardware queue +associated with the band is not stopped prior to dequeuing a packet. + +On qdisc load, the number of bands is based on the number of queues on the +hardware. Once the association is made, any skb with skb->queue_mapping set, +will be queued to the band associated with the hardware queue. + + +Section 3: Brief howto using MULTIQ for multiqueue devices +--------------------------------------------------------------- + +The userspace command 'tc,' part of the iproute2 package, is used to configure +qdiscs. To add the MULTIQ qdisc to your network device, assuming the device +is called eth0, run the following command: + +# tc qdisc add dev eth0 root handle 1: multiq + +The qdisc will allocate the number of bands to equal the number of queues that +the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx +queues, the band mapping would look like: + +band 0 => queue 0 +band 1 => queue 1 +band 2 => queue 2 +band 3 => queue 3 + +Traffic will begin flowing through each queue based on either the simple_tx_hash +function or based on netdev->select_queue() if you have it defined. + +The behavior of tc filters remains the same. However a new tc action, +skbedit, has been added. Assuming you wanted to route all traffic to a +specific host, for example 192.168.0.3, through a specific queue you could use +this action and establish a filter such as: + +tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \ + match ip dst 192.168.0.3 \ + action skbedit queue_mapping 3 + +Author: Alexander Duyck <alexander.h.duyck@intel.com> +Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt new file mode 100644 index 000000000..9cb31c5e2 --- /dev/null +++ b/Documentation/networking/net_dim.txt @@ -0,0 +1,174 @@ +Net DIM - Generic Network Dynamic Interrupt Moderation +====================================================== + +Author: + Tal Gilboa <talgi@mellanox.com> + + +Contents +========= + +- Assumptions +- Introduction +- The Net DIM Algorithm +- Registering a Network Device to DIM +- Example + +Part 0: Assumptions +====================== + +This document assumes the reader has basic knowledge in network drivers +and in general interrupt moderation. + + +Part I: Introduction +====================== + +Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the +interrupt moderation configuration of a channel in order to optimize packet +processing. The mechanism includes an algorithm which decides if and how to +change moderation parameters for a channel, usually by performing an analysis on +runtime data sampled from the system. Net DIM is such a mechanism. In each +iteration of the algorithm, it analyses a given sample of the data, compares it +to the previous sample and if required, it can decide to change some of the +interrupt moderation configuration fields. The data sample is composed of data +bandwidth, the number of packets and the number of events. The time between +samples is also measured. Net DIM compares the current and the previous data and +returns an adjusted interrupt moderation configuration object. In some cases, +the algorithm might decide not to change anything. The configuration fields are +the minimum duration (microseconds) allowed between events and the maximum +number of wanted packets per event. The Net DIM algorithm ascribes importance to +increase bandwidth over reducing interrupt rate. + + +Part II: The Net DIM Algorithm +=============================== + +Each iteration of the Net DIM algorithm follows these steps: +1. Calculates new data sample. +2. Compares it to previous sample. +3. Makes a decision - suggests interrupt moderation configuration fields. +4. Applies a schedule work function, which applies suggested configuration. + +The first two steps are straightforward, both the new and the previous data are +supplied by the driver registered to Net DIM. The previous data is the new data +supplied to the previous iteration. The comparison step checks the difference +between the new and previous data and decides on the result of the last step. +A step would result as "better" if bandwidth increases and as "worse" if +bandwidth reduces. If there is no change in bandwidth, the packet rate is +compared in a similar fashion - increase == "better" and decrease == "worse". +In case there is no change in the packet rate as well, the interrupt rate is +compared. Here the algorithm tries to optimize for lower interrupt rate so an +increase in the interrupt rate is considered "worse" and a decrease is +considered "better". Step #2 has an optimization for avoiding false results: it +only considers a difference between samples as valid if it is greater than a +certain percentage. Also, since Net DIM does not measure anything by itself, it +assumes the data provided by the driver is valid. + +Step #3 decides on the suggested configuration based on the result from step #2 +and the internal state of the algorithm. The states reflect the "direction" of +the algorithm: is it going left (reducing moderation), right (increasing +moderation) or standing still. Another optimization is that if a decision +to stay still is made multiple times, the interval between iterations of the +algorithm would increase in order to reduce calculation overhead. Also, after +"parking" on one of the most left or most right decisions, the algorithm may +decide to verify this decision by taking a step in the other direction. This is +done in order to avoid getting stuck in a "deep sleep" scenario. Once a +decision is made, an interrupt moderation configuration is selected from +the predefined profiles. + +The last step is to notify the registered driver that it should apply the +suggested configuration. This is done by scheduling a work function, defined by +the Net DIM API and provided by the registered driver. + +As you can see, Net DIM itself does not actively interact with the system. It +would have trouble making the correct decisions if the wrong data is supplied to +it and it would be useless if the work function would not apply the suggested +configuration. This does, however, allow the registered driver some room for +manoeuvre as it may provide partial data or ignore the algorithm suggestion +under some conditions. + + +Part III: Registering a Network Device to DIM +============================================== + +Net DIM API exposes the main function net_dim(struct net_dim *dim, +struct net_dim_sample end_sample). This function is the entry point to the Net +DIM algorithm and has to be called every time the driver would like to check if +it should change interrupt moderation parameters. The driver should provide two +data structures: struct net_dim and struct net_dim_sample. Struct net_dim +describes the state of DIM for a specific object (RX queue, TX queue, +other queues, etc.). This includes the current selected profile, previous data +samples, the callback function provided by the driver and more. +Struct net_dim_sample describes a data sample, which will be compared to the +data sample stored in struct net_dim in order to decide on the algorithm's next +step. The sample should include bytes, packets and interrupts, measured by +the driver. + +In order to use Net DIM from a networking driver, the driver needs to call the +main net_dim() function. The recommended method is to call net_dim() on each +interrupt. Since Net DIM has a built-in moderation and it might decide to skip +iterations under certain conditions, there is no need to moderate the net_dim() +calls as well. As mentioned above, the driver needs to provide an object of type +struct net_dim to the net_dim() function call. It is advised for each entity +using Net DIM to hold a struct net_dim as part of its data structure and use it +as the main Net DIM API object. The struct net_dim_sample should hold the latest +bytes, packets and interrupts count. No need to perform any calculations, just +include the raw data. + +The net_dim() call itself does not return anything. Instead Net DIM relies on +the driver to provide a callback function, which is called when the algorithm +decides to make a change in the interrupt moderation parameters. This callback +will be scheduled and run in a separate thread in order not to add overhead to +the data flow. After the work is done, Net DIM algorithm needs to be set to +the proper state in order to move to the next iteration. + + +Part IV: Example +================= + +The following code demonstrates how to register a driver to Net DIM. The actual +usage is not complete but it should make the outline of the usage clear. + +my_driver.c: + +#include <linux/net_dim.h> + +/* Callback for net DIM to schedule on a decision to change moderation */ +void my_driver_do_dim_work(struct work_struct *work) +{ + /* Get struct net_dim from struct work_struct */ + struct net_dim *dim = container_of(work, struct net_dim, + work); + /* Do interrupt moderation related stuff */ + ... + + /* Signal net DIM work is done and it should move to next iteration */ + dim->state = NET_DIM_START_MEASURE; +} + +/* My driver's interrupt handler */ +int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...) +{ + ... + /* A struct to hold current measured data */ + struct net_dim_sample dim_sample; + ... + /* Initiate data sample struct with current data */ + net_dim_sample(my_entity->events, + my_entity->packets, + my_entity->bytes, + &dim_sample); + /* Call net DIM */ + net_dim(&my_entity->dim, dim_sample); + ... +} + +/* My entity's initialization function (my_entity was already allocated) */ +int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...) +{ + ... + /* Initiate struct work_struct with my driver's callback function */ + INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work); + ... +} diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst new file mode 100644 index 000000000..06c97dcb5 --- /dev/null +++ b/Documentation/networking/net_failover.rst @@ -0,0 +1,119 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +NET_FAILOVER +============ + +Overview +======== + +The net_failover driver provides an automated failover mechanism via APIs +to create and destroy a failover master netdev and mananges a primary and +standby slave netdevs that get registered via the generic failover +infrastructrure. + +The failover netdev acts a master device and controls 2 slave devices. The +original paravirtual interface is registered as 'standby' slave netdev and +a passthru/vf device with the same MAC gets registered as 'primary' slave +netdev. Both 'standby' and 'failover' netdevs are associated with the same +'pci' device. The user accesses the network interface via 'failover' netdev. +The 'failover' netdev chooses 'primary' netdev as default for transmits when +it is available with link up and running. + +This can be used by paravirtual drivers to enable an alternate low latency +datapath. It also enables hypervisor controlled live migration of a VM with +direct attached VF by failing over to the paravirtual datapath when the VF +is unplugged. + +virtio-net accelerated datapath: STANDBY mode +============================================= + +net_failover enables hypervisor controlled accelerated datapath to virtio-net +enabled VMs in a transparent manner with no/minimal guest userspace chanages. + +To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY +feature on the virtio-net interface and assign the same MAC address to both +virtio-net and VF interfaces. + +Here is an example XML snippet that shows such configuration. +:: + + <interface type='network'> + <mac address='52:54:00:00:12:53'/> + <source network='enp66s0f0_br'/> + <target dev='tap01'/> + <model type='virtio'/> + <driver name='vhost' queues='4'/> + <link state='down'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/> + </interface> + <interface type='hostdev' managed='yes'> + <mac address='52:54:00:00:12:53'/> + <source> + <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> + </interface> + +Booting a VM with the above configuration will result in the following 3 +netdevs created in the VM. +:: + + 4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 + link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff + inet 192.168.12.53/24 brd 192.168.12.255 scope global dynamic ens10 + valid_lft 42482sec preferred_lft 42482sec + inet6 fe80::97d8:db2:8c10:b6d6/64 scope link + valid_lft forever preferred_lft forever + 5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000 + link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff + 7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000 + link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff + +ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave +'standby' and 'primary' netdevs respectively. + +Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode +================================================================== + +net_failover also enables hypervisor controlled live migration to be supported +with VMs that have direct attached SR-IOV VF devices by automatic failover to +the paravirtual datapath when the VF is unplugged. + +Here is a sample script that shows the steps to initiate live migration on +the source hypervisor. +:: + + # cat vf_xml + <interface type='hostdev' managed='yes'> + <mac address='52:54:00:00:12:53'/> + <source> + <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> + </interface> + + # Source Hypervisor + #!/bin/bash + + DOMAIN=fedora27-tap01 + PF=enp66s0f0 + VF_NUM=5 + TAP_IF=tap01 + VF_XML= + + MAC=52:54:00:00:12:53 + ZERO_MAC=00:00:00:00:00:00 + + virsh domif-setlink $DOMAIN $TAP_IF up + bridge fdb del $MAC dev $PF master + virsh detach-device $DOMAIN $VF_XML + ip link set $PF vf $VF_NUM mac $ZERO_MAC + + virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system + + # Destination Hypervisor + #!/bin/bash + + virsh attach-device $DOMAIN $VF_XML + virsh domif-setlink $DOMAIN $TAP_IF down diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt new file mode 100644 index 000000000..296ea00fd --- /dev/null +++ b/Documentation/networking/netconsole.txt @@ -0,0 +1,210 @@ + +started by Ingo Molnar <mingo@redhat.com>, 2001.09.17 +2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003 +IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013 +Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015 + +Please send bug reports to Matt Mackall <mpm@selenic.com> +Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com> + +Introduction: +============= + +This module logs kernel printk messages over UDP allowing debugging of +problem where disk logging fails and serial consoles are impractical. + +It can be used either built-in or as a module. As a built-in, +netconsole initializes immediately after NIC cards and will bring up +the specified interface as soon as possible. While this doesn't allow +capture of early kernel panics, it does capture most of the boot +process. + +Sender and receiver configuration: +================================== + +It takes a string configuration parameter "netconsole" in the +following format: + + netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr] + + where + + if present, enable extended console support + src-port source for UDP packets (defaults to 6665) + src-ip source IP to use (interface address) + dev network interface (eth0) + tgt-port port for logging agent (6666) + tgt-ip IP address for logging agent + tgt-macaddr ethernet MAC address for logging agent (broadcast) + +Examples: + + linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc + + or + + insmod netconsole netconsole=@/,@10.0.0.2/ + + or using IPv6 + + insmod netconsole netconsole=@/,@fd00:1:2:3::1/ + +It also supports logging to multiple remote agents by specifying +parameters for the multiple agents separated by semicolons and the +complete string enclosed in "quotes", thusly: + + modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/" + +Built-in netconsole starts immediately after the TCP stack is +initialized and attempts to bring up the supplied dev at the supplied +address. + +The remote host has several options to receive the kernel messages, +for example: + +1) syslogd + +2) netcat + + On distributions using a BSD-based netcat version (e.g. Fedora, + openSUSE and Ubuntu) the listening port must be specified without + the -p switch: + + 'nc -u -l -p <port>' / 'nc -u -l <port>' or + 'netcat -u -l -p <port>' / 'netcat -u -l <port>' + +3) socat + + 'socat udp-recv:<port> -' + +Dynamic reconfiguration: +======================== + +Dynamic reconfigurability is a useful addition to netconsole that enables +remote logging targets to be dynamically added, removed, or have their +parameters reconfigured at runtime from a configfs-based userspace interface. +[ Note that the parameters of netconsole targets that were specified/created +from the boot/module option are not exposed via this interface, and hence +cannot be modified dynamically. ] + +To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the +netconsole module (or kernel, if netconsole is built-in). + +Some examples follow (where configfs is mounted at the /sys/kernel/config +mountpoint). + +To add a remote logging target (target names can be arbitrary): + + cd /sys/kernel/config/netconsole/ + mkdir target1 + +Note that newly created targets have default parameter values (as mentioned +above) and are disabled by default -- they must first be enabled by writing +"1" to the "enabled" attribute (usually after setting parameters accordingly) +as described below. + +To remove a target: + + rmdir /sys/kernel/config/netconsole/othertarget/ + +The interface exposes these parameters of a netconsole target to userspace: + + enabled Is this target currently enabled? (read-write) + extended Extended mode enabled (read-write) + dev_name Local network interface name (read-write) + local_port Source UDP port to use (read-write) + remote_port Remote agent's UDP port (read-write) + local_ip Source IP address to use (read-write) + remote_ip Remote agent's IP address (read-write) + local_mac Local interface's MAC address (read-only) + remote_mac Remote agent's MAC address (read-write) + +The "enabled" attribute is also used to control whether the parameters of +a target can be updated or not -- you can modify the parameters of only +disabled targets (i.e. if "enabled" is 0). + +To update a target's parameters: + + cat enabled # check if enabled is 1 + echo 0 > enabled # disable the target (if required) + echo eth2 > dev_name # set local interface + echo 10.0.0.4 > remote_ip # update some parameter + echo cb:a9:87:65:43:21 > remote_mac # update more parameters + echo 1 > enabled # enable target again + +You can also update the local interface dynamically. This is especially +useful if you want to use interfaces that have newly come up (and may not +have existed when netconsole was loaded / initialized). + +Extended console: +================= + +If '+' is prefixed to the configuration line or "extended" config file +is set to 1, extended console support is enabled. An example boot +param follows. + + linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc + +Log messages are transmitted with extended metadata header in the +following format which is the same as /dev/kmsg. + + <level>,<sequnum>,<timestamp>,<contflag>;<message text> + +Non printable characters in <message text> are escaped using "\xff" +notation. If the message contains optional dictionary, verbatim +newline is used as the delimeter. + +If a message doesn't fit in certain number of bytes (currently 1000), +the message is split into multiple fragments by netconsole. These +fragments are transmitted with "ncfrag" header field added. + + ncfrag=<byte-offset>/<total-bytes> + +For example, assuming a lot smaller chunk size, a message "the first +chunk, the 2nd chunk." may be split as follows. + + 6,416,1758426,-,ncfrag=0/31;the first chunk, + 6,416,1758426,-,ncfrag=16/31; the 2nd chunk. + +Miscellaneous notes: +==================== + +WARNING: the default target ethernet setting uses the broadcast +ethernet address to send packets, which can cause increased load on +other systems on the same ethernet segment. + +TIP: some LAN switches may be configured to suppress ethernet broadcasts +so it is advised to explicitly specify the remote agents' MAC addresses +from the config parameters passed to netconsole. + +TIP: to find out the MAC address of, say, 10.0.0.2, you may try using: + + ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2 + +TIP: in case the remote logging agent is on a separate LAN subnet than +the sender, it is suggested to try specifying the MAC address of the +default gateway (you may use /sbin/route -n to find it out) as the +remote MAC address instead. + +NOTE: the network device (eth1 in the above case) can run any kind +of other network traffic, netconsole is not intrusive. Netconsole +might cause slight delays in other traffic if the volume of kernel +messages is high, but should have no other impact. + +NOTE: if you find that the remote logging agent is not receiving or +printing all messages from the sender, it is likely that you have set +the "console_loglevel" parameter (on the sender) to only send high +priority messages to the console. You can change this at runtime using: + + dmesg -n 8 + +or by specifying "debug" on the kernel command line at boot, to send +all kernel messages to the console. A specific value for this parameter +can also be set using the "loglevel" kernel boot option. See the +dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst for details. + +Netconsole was designed to be as instantaneous as possible, to +enable the logging of even the most critical kernel bugs. It works +from IRQ contexts as well, and does not enable interrupts while +sending packets. Due to these unique needs, configuration cannot +be more automatic, and some fundamental limitations will remain: +only IP networks, UDP packets and ethernet devices are supported. diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst new file mode 100644 index 000000000..0ac5fa77f --- /dev/null +++ b/Documentation/networking/netdev-FAQ.rst @@ -0,0 +1,259 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _netdev-FAQ: + +========== +netdev FAQ +========== + +Q: What is netdev? +------------------ +A: It is a mailing list for all network-related Linux stuff. This +includes anything found under net/ (i.e. core code like IPv6) and +drivers/net (i.e. hardware specific drivers) in the Linux source tree. + +Note that some subsystems (e.g. wireless drivers) which have a high +volume of traffic have their own specific mailing lists. + +The netdev list is managed (like many other Linux mailing lists) through +VGER (http://vger.kernel.org/) and archives can be found below: + +- http://marc.info/?l=linux-netdev +- http://www.spinics.net/lists/netdev/ + +Aside from subsystems like that mentioned above, all network-related +Linux development (i.e. RFC, review, comments, etc.) takes place on +netdev. + +Q: How do the changes posted to netdev make their way into Linux? +----------------------------------------------------------------- +A: There are always two trees (git repositories) in play. Both are +driven by David Miller, the main network maintainer. There is the +``net`` tree, and the ``net-next`` tree. As you can probably guess from +the names, the ``net`` tree is for fixes to existing code already in the +mainline tree from Linus, and ``net-next`` is where the new code goes +for the future release. You can find the trees here: + +- https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git +- https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git + +Q: How often do changes from these trees make it to the mainline Linus tree? +---------------------------------------------------------------------------- +A: To understand this, you need to know a bit of background information on +the cadence of Linux development. Each new release starts off with a +two week "merge window" where the main maintainers feed their new stuff +to Linus for merging into the mainline tree. After the two weeks, the +merge window is closed, and it is called/tagged ``-rc1``. No new +features get mainlined after this -- only fixes to the rc1 content are +expected. After roughly a week of collecting fixes to the rc1 content, +rc2 is released. This repeats on a roughly weekly basis until rc7 +(typically; sometimes rc6 if things are quiet, or rc8 if things are in a +state of churn), and a week after the last vX.Y-rcN was done, the +official vX.Y is released. + +Relating that to netdev: At the beginning of the 2-week merge window, +the ``net-next`` tree will be closed - no new changes/features. The +accumulated new content of the past ~10 weeks will be passed onto +mainline/Linus via a pull request for vX.Y -- at the same time, the +``net`` tree will start accumulating fixes for this pulled content +relating to vX.Y + +An announcement indicating when ``net-next`` has been closed is usually +sent to netdev, but knowing the above, you can predict that in advance. + +IMPORTANT: Do not send new ``net-next`` content to netdev during the +period during which ``net-next`` tree is closed. + +Shortly after the two weeks have passed (and vX.Y-rc1 is released), the +tree for ``net-next`` reopens to collect content for the next (vX.Y+1) +release. + +If you aren't subscribed to netdev and/or are simply unsure if +``net-next`` has re-opened yet, simply check the ``net-next`` git +repository link above for any new networking-related commits. You may +also check the following website for the current status: + + http://vger.kernel.org/~davem/net-next.html + +The ``net`` tree continues to collect fixes for the vX.Y content, and is +fed back to Linus at regular (~weekly) intervals. Meaning that the +focus for ``net`` is on stabilization and bug fixes. + +Finally, the vX.Y gets released, and the whole cycle starts over. + +Q: So where are we now in this cycle? + +Load the mainline (Linus) page here: + + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git + +and note the top of the "tags" section. If it is rc1, it is early in +the dev cycle. If it was tagged rc7 a week ago, then a release is +probably imminent. + +Q: How do I indicate which tree (net vs. net-next) my patch should be in? +------------------------------------------------------------------------- +A: Firstly, think whether you have a bug fix or new "next-like" content. +Then once decided, assuming that you use git, use the prefix flag, i.e. +:: + + git format-patch --subject-prefix='PATCH net-next' start..finish + +Use ``net`` instead of ``net-next`` (always lower case) in the above for +bug-fix ``net`` content. If you don't use git, then note the only magic +in the above is just the subject text of the outgoing e-mail, and you +can manually change it yourself with whatever MUA you are comfortable +with. + +Q: I sent a patch and I'm wondering what happened to it? +-------------------------------------------------------- +Q: How can I tell whether it got merged? +A: Start by looking at the main patchworks queue for netdev: + + http://patchwork.ozlabs.org/project/netdev/list/ + +The "State" field will tell you exactly where things are at with your +patch. + +Q: The above only says "Under Review". How can I find out more? +---------------------------------------------------------------- +A: Generally speaking, the patches get triaged quickly (in less than +48h). So be patient. Asking the maintainer for status updates on your +patch is a good way to ensure your patch is ignored or pushed to the +bottom of the priority list. + +Q: I submitted multiple versions of the patch series +---------------------------------------------------- +Q: should I directly update patchwork for the previous versions of these +patch series? +A: No, please don't interfere with the patch status on patchwork, leave +it to the maintainer to figure out what is the most recent and current +version that should be applied. If there is any doubt, the maintainer +will reply and ask what should be done. + +Q: How can I tell what patches are queued up for backporting to the various stable releases? +-------------------------------------------------------------------------------------------- +A: Normally Greg Kroah-Hartman collects stable commits himself, but for +networking, Dave collects up patches he deems critical for the +networking subsystem, and then hands them off to Greg. + +There is a patchworks queue that you can see here: + + http://patchwork.ozlabs.org/bundle/davem/stable/?state=* + +It contains the patches which Dave has selected, but not yet handed off +to Greg. If Greg already has the patch, then it will be here: + + https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git + +A quick way to find whether the patch is in this stable-queue is to +simply clone the repo, and then git grep the mainline commit ID, e.g. +:: + + stable-queue$ git grep -l 284041ef21fdf2e + releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + stable/stable-queue$ + +Q: I see a network patch and I think it should be backported to stable. +----------------------------------------------------------------------- +Q: Should I request it via stable@vger.kernel.org like the references in +the kernel's Documentation/process/stable-kernel-rules.rst file say? +A: No, not for networking. Check the stable queues as per above first +to see if it is already queued. If not, then send a mail to netdev, +listing the upstream commit ID and why you think it should be a stable +candidate. + +Before you jump to go do the above, do note that the normal stable rules +in :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>` +still apply. So you need to explicitly indicate why it is a critical +fix and exactly what users are impacted. In addition, you need to +convince yourself that you *really* think it has been overlooked, +vs. having been considered and rejected. + +Generally speaking, the longer it has had a chance to "soak" in +mainline, the better the odds that it is an OK candidate for stable. So +scrambling to request a commit be added the day after it appears should +be avoided. + +Q: I have created a network patch and I think it should be backported to stable. +-------------------------------------------------------------------------------- +Q: Should I add a Cc: stable@vger.kernel.org like the references in the +kernel's Documentation/ directory say? +A: No. See above answer. In short, if you think it really belongs in +stable, then ensure you write a decent commit log that describes who +gets impacted by the bug fix and how it manifests itself, and when the +bug was introduced. If you do that properly, then the commit will get +handled appropriately and most likely get put in the patchworks stable +queue if it really warrants it. + +If you think there is some valid information relating to it being in +stable that does *not* belong in the commit log, then use the three dash +marker line as described in +:ref:`Documentation/process/submitting-patches.rst <the_canonical_patch_format>` +to temporarily embed that information into the patch that you send. + +Q: Are all networking bug fixes backported to all stable releases? +------------------------------------------------------------------ +A: Due to capacity, Dave could only take care of the backports for the +last two stable releases. For earlier stable releases, each stable +branch maintainer is supposed to take care of them. If you find any +patch is missing from an earlier stable branch, please notify +stable@vger.kernel.org with either a commit ID or a formal patch +backported, and CC Dave and other relevant networking developers. + +Q: Is the comment style convention different for the networking content? +------------------------------------------------------------------------ +A: Yes, in a largely trivial way. Instead of this:: + + /* + * foobar blah blah blah + * another line of text + */ + +it is requested that you make it look like this:: + + /* foobar blah blah blah + * another line of text + */ + +Q: I am working in existing code that has the former comment style and not the latter. +-------------------------------------------------------------------------------------- +Q: Should I submit new code in the former style or the latter? +A: Make it the latter style, so that eventually all code in the domain +of netdev is of this format. + +Q: I found a bug that might have possible security implications or similar. +--------------------------------------------------------------------------- +Q: Should I mail the main netdev maintainer off-list?** +A: No. The current netdev maintainer has consistently requested that +people use the mailing lists and not reach out directly. If you aren't +OK with that, then perhaps consider mailing security@kernel.org or +reading about http://oss-security.openwall.org/wiki/mailing-lists/distros +as possible alternative mechanisms. + +Q: What level of testing is expected before I submit my change? +--------------------------------------------------------------- +A: If your changes are against ``net-next``, the expectation is that you +have tested by layering your changes on top of ``net-next``. Ideally +you will have done run-time testing specific to your change, but at a +minimum, your changes should survive an ``allyesconfig`` and an +``allmodconfig`` build without new warnings or failures. + +Q: Any other tips to help ensure my net/net-next patch gets OK'd? +----------------------------------------------------------------- +A: Attention to detail. Re-read your own work as if you were the +reviewer. You can start with using ``checkpatch.pl``, perhaps even with +the ``--strict`` flag. But do not be mindlessly robotic in doing so. +If your change is a bug fix, make sure your commit log indicates the +end-user visible symptom, the underlying reason as to why it happens, +and then if necessary, explain why the fix proposed is the best way to +get things done. Don't mangle whitespace, and as is common, don't +mis-indent function arguments that span multiple lines. If it is your +first patch, mail it to yourself so you can test apply it to an +unpatched tree to confirm infrastructure didn't mangle it. + +Finally, go back and read +:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` +to be sure you are not repeating some common mistake documented there. diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt new file mode 100644 index 000000000..c4a54c162 --- /dev/null +++ b/Documentation/networking/netdev-features.txt @@ -0,0 +1,181 @@ +Netdev features mess and how to get out from it alive +===================================================== + +Author: + Michał Mirosław <mirq-linux@rere.qmqm.pl> + + + + Part I: Feature sets +====================== + +Long gone are the days when a network card would just take and give packets +verbatim. Today's devices add multiple features and bugs (read: offloads) +that relieve an OS of various tasks like generating and checking checksums, +splitting packets, classifying them. Those capabilities and their state +are commonly referred to as netdev features in Linux kernel world. + +There are currently three sets of features relevant to the driver, and +one used internally by network core: + + 1. netdev->hw_features set contains features whose state may possibly + be changed (enabled or disabled) for a particular device by user's + request. This set should be initialized in ndo_init callback and not + changed later. + + 2. netdev->features set contains features which are currently enabled + for a device. This should be changed only by network core or in + error paths of ndo_set_features callback. + + 3. netdev->vlan_features set contains features whose state is inherited + by child VLAN devices (limits netdev->features set). This is currently + used for all VLAN devices whether tags are stripped or inserted in + hardware or software. + + 4. netdev->wanted_features set contains feature set requested by user. + This set is filtered by ndo_fix_features callback whenever it or + some device-specific conditions change. This set is internal to + networking core and should not be referenced in drivers. + + + + Part II: Controlling enabled features +======================================= + +When current feature set (netdev->features) is to be changed, new set +is calculated and filtered by calling ndo_fix_features callback +and netdev_fix_features(). If the resulting set differs from current +set, it is passed to ndo_set_features callback and (if the callback +returns success) replaces value stored in netdev->features. +NETDEV_FEAT_CHANGE notification is issued after that whenever current +set might have changed. + +The following events trigger recalculation: + 1. device's registration, after ndo_init returned success + 2. user requested changes in features state + 3. netdev_update_features() is called + +ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks +are treated as always returning success. + +A driver that wants to trigger recalculation must do so by calling +netdev_update_features() while holding rtnl_lock. This should not be done +from ndo_*_features callbacks. netdev->features should not be modified by +driver except by means of ndo_fix_features callback. + + + + Part III: Implementation hints +================================ + + * ndo_fix_features: + +All dependencies between features should be resolved here. The resulting +set can be reduced further by networking core imposed limitations (as coded +in netdev_fix_features()). For this reason it is safer to disable a feature +when its dependencies are not met instead of forcing the dependency on. + +This callback should not modify hardware nor driver state (should be +stateless). It can be called multiple times between successive +ndo_set_features calls. + +Callback must not alter features contained in NETIF_F_SOFT_FEATURES or +NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but +care must be taken as the change won't affect already configured VLANs. + + * ndo_set_features: + +Hardware should be reconfigured to match passed feature set. The set +should not be altered unless some error condition happens that can't +be reliably detected in ndo_fix_features. In this case, the callback +should update netdev->features to match resulting hardware state. +Errors returned are not (and cannot be) propagated anywhere except dmesg. +(Note: successful return is zero, >0 means silent error.) + + + + Part IV: Features +=================== + +For current list of features, see include/linux/netdev_features.h. +This section describes semantics of some of them. + + * Transmit checksumming + +For complete description, see comments near the top of include/linux/skbuff.h. + +Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. +It means that device can fill TCP/UDP-like checksum anywhere in the packets +whatever headers there might be. + + * Transmit TCP segmentation offload + +NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit +set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). + + * Transmit UDP segmentation offload + +NETIF_F_GSO_UDP_GSO_L4 accepts a single UDP header with a payload that exceeds +gso_size. On segmentation, it segments the payload on gso_size boundaries and +replicates the network and UDP headers (fixing up the last one if less than +gso_size). + + * Transmit DMA from high memory + +On platforms where this is relevant, NETIF_F_HIGHDMA signals that +ndo_start_xmit can handle skbs with frags in high memory. + + * Transmit scatter-gather + +Those features say that ndo_start_xmit can handle fragmented skbs: +NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- +chained skbs (skb->next/prev list). + + * Software features + +Features contained in NETIF_F_SOFT_FEATURES are features of networking +stack. Driver should not change behaviour based on them. + + * LLTX driver (deprecated for hardware drivers) + +NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, +e.g. software tunnels. + +This is also used in a few legacy drivers that implement their +own locking, don't use it for new (hardware) drivers. + + * netns-local device + +NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between +network namespaces (e.g. loopback). + +Don't use it in drivers. + + * VLAN challenged + +NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN +headers. Some drivers set this because the cards can't handle the bigger MTU. +[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU +VLANs. This may be not useful, though.] + +* rx-fcs + +This requests that the NIC append the Ethernet Frame Checksum (FCS) +to the end of the skb data. This allows sniffers and other tools to +read the CRC recorded by the NIC on receipt of the packet. + +* rx-all + +This requests that the NIC receive all possible frames, including errored +frames (such as bad FCS, etc). This can be helpful when sniffing a link with +bad packets on it. Some NICs may receive more packets if also put into normal +PROMISC mode. + +* rx-gro-hw + +This requests that the NIC enables Hardware GRO (generic receive offload). +Hardware GRO is basically the exact reverse of TSO, and is generally +stricter than Hardware LRO. A packet stream merged by Hardware GRO must +be re-segmentable by GSO or TSO back to the exact original packet stream. +Hardware GRO is dependent on RXCSUM since every packet successfully merged +by hardware must also have the checksum verified by hardware. diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt new file mode 100644 index 000000000..7fec2061a --- /dev/null +++ b/Documentation/networking/netdevices.txt @@ -0,0 +1,104 @@ + +Network Devices, the Kernel, and You! + + +Introduction +============ +The following is a random collection of documentation regarding +network devices. + +struct net_device allocation rules +================================== +Network device structures need to persist even after module is unloaded and +must be allocated with alloc_netdev_mqs() and friends. +If device has registered successfully, it will be freed on last use +by free_netdev(). This is required to handle the pathologic case cleanly +(example: rmmod mydriver </sys/class/net/myeth/mtu ) + +alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver +private data which gets freed when the network device is freed. If +separately allocated data is attached to the network device +(netdev_priv(dev)) then it is up to the module exit handler to free that. + +MTU +=== +Each network device has a Maximum Transfer Unit. The MTU does not +include any link layer protocol overhead. Upper layer protocols must +not pass a socket buffer (skb) to a device to transmit with more data +than the mtu. The MTU does not include link layer header overhead, so +for example on Ethernet if the standard MTU is 1500 bytes used, the +actual skb will contain up to 1514 bytes because of the Ethernet +header. Devices should allow for the 4 byte VLAN header as well. + +Segmentation Offload (GSO, TSO) is an exception to this rule. The +upper layer protocol may pass a large socket buffer to the device +transmit routine, and the device will break that up into separate +packets based on the current MTU. + +MTU is symmetrical and applies both to receive and transmit. A device +must be able to receive at least the maximum size packet allowed by +the MTU. A network device may use the MTU as mechanism to size receive +buffers, but the device should allow packets with VLAN header. With +standard Ethernet mtu of 1500 bytes, the device should allow up to +1518 byte packets (1500 + 14 header + 4 tag). The device may either: +drop, truncate, or pass up oversize packets, but dropping oversize +packets is preferred. + + +struct net_device synchronization rules +======================================= +ndo_open: + Synchronization: rtnl_lock() semaphore. + Context: process + +ndo_stop: + Synchronization: rtnl_lock() semaphore. + Context: process + Note: netif_running() is guaranteed false + +ndo_do_ioctl: + Synchronization: rtnl_lock() semaphore. + Context: process + +ndo_get_stats: + Synchronization: dev_base_lock rwlock. + Context: nominally process, but don't sleep inside an rwlock + +ndo_start_xmit: + Synchronization: __netif_tx_lock spinlock. + + When the driver sets NETIF_F_LLTX in dev->features this will be + called without holding netif_tx_lock. In this case the driver + has to lock by itself when needed. + The locking there should also properly protect against + set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated. + Don't use it for new drivers. + + Context: Process with BHs disabled or BH (timer), + will be called with interrupts disabled by netconsole. + + Return codes: + o NETDEV_TX_OK everything ok. + o NETDEV_TX_BUSY Cannot transmit packet, try later + Usually a bug, means queue start/stop flow control is broken in + the driver. Note: the driver must NOT put the skb in its DMA ring. + +ndo_tx_timeout: + Synchronization: netif_tx_lock spinlock; all TX queues frozen. + Context: BHs disabled + Notes: netif_queue_stopped() is guaranteed true + +ndo_set_rx_mode: + Synchronization: netif_addr_lock spinlock. + Context: BHs disabled + +struct napi_struct synchronization rules +======================================== +napi->poll: + Synchronization: NAPI_STATE_SCHED bit in napi->state. Device + driver's ndo_stop method will invoke napi_disable() on + all NAPI instances which will do a sleeping poll on the + NAPI_STATE_SCHED napi->state bit, waiting for all pending + NAPI activity to cease. + Context: softirq + will be called with interrupts disabled by netconsole. diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.txt new file mode 100644 index 000000000..55791e50e --- /dev/null +++ b/Documentation/networking/netfilter-sysctl.txt @@ -0,0 +1,10 @@ +/proc/sys/net/netfilter/* Variables: + +nf_log_all_netns - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + By default, only init_net namespace can log packets into kernel log + with LOG target; this aims to prevent containers from flooding host + kernel log. If enabled, this target also works in other network + namespaces. This variable is only accessible from init_net. diff --git a/Documentation/networking/netif-msg.txt b/Documentation/networking/netif-msg.txt new file mode 100644 index 000000000..c967ddb90 --- /dev/null +++ b/Documentation/networking/netif-msg.txt @@ -0,0 +1,79 @@ + +________________ +NETIF Msg Level + +The design of the network interface message level setting. + +History + + The design of the debugging message interface was guided and + constrained by backwards compatibility previous practice. It is useful + to understand the history and evolution in order to understand current + practice and relate it to older driver source code. + + From the beginning of Linux, each network device driver has had a local + integer variable that controls the debug message level. The message + level ranged from 0 to 7, and monotonically increased in verbosity. + + The message level was not precisely defined past level 3, but were + always implemented within +-1 of the specified level. Drivers tended + to shed the more verbose level messages as they matured. + 0 Minimal messages, only essential information on fatal errors. + 1 Standard messages, initialization status. No run-time messages + 2 Special media selection messages, generally timer-driver. + 3 Interface starts and stops, including normal status messages + 4 Tx and Rx frame error messages, and abnormal driver operation + 5 Tx packet queue information, interrupt events. + 6 Status on each completed Tx packet and received Rx packets + 7 Initial contents of Tx and Rx packets + + Initially this message level variable was uniquely named in each driver + e.g. "lance_debug", so that a kernel symbolic debugger could locate and + modify the setting. When kernel modules became common, the variables + were consistently renamed to "debug" and allowed to be set as a module + parameter. + + This approach worked well. However there is always a demand for + additional features. Over the years the following emerged as + reasonable and easily implemented enhancements + Using an ioctl() call to modify the level. + Per-interface rather than per-driver message level setting. + More selective control over the type of messages emitted. + + The netif_msg recommendation adds these features with only a minor + complexity and code size increase. + + The recommendation is the following points + Retaining the per-driver integer variable "debug" as a module + parameter with a default level of '1'. + + Adding a per-interface private variable named "msg_enable". The + variable is a bit map rather than a level, and is initialized as + 1 << debug + Or more precisely + debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug) + + Messages should changes from + if (debug > 1) + printk(MSG_DEBUG "%s: ... + to + if (np->msg_enable & NETIF_MSG_LINK) + printk(MSG_DEBUG "%s: ... + + +The set of message levels is named + Old level Name Bit position + 0 NETIF_MSG_DRV 0x0001 + 1 NETIF_MSG_PROBE 0x0002 + 2 NETIF_MSG_LINK 0x0004 + 2 NETIF_MSG_TIMER 0x0004 + 3 NETIF_MSG_IFDOWN 0x0008 + 3 NETIF_MSG_IFUP 0x0008 + 4 NETIF_MSG_RX_ERR 0x0010 + 4 NETIF_MSG_TX_ERR 0x0010 + 5 NETIF_MSG_TX_QUEUED 0x0020 + 5 NETIF_MSG_INTR 0x0020 + 6 NETIF_MSG_TX_DONE 0x0040 + 6 NETIF_MSG_RX_STATUS 0x0040 + 7 NETIF_MSG_PKTDATA 0x0080 + diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt new file mode 100644 index 000000000..92f5b3139 --- /dev/null +++ b/Documentation/networking/netvsc.txt @@ -0,0 +1,75 @@ +Hyper-V network driver +====================== + +Compatibility +============= + +This driver is compatible with Windows Server 2012 R2, 2016 and +Windows 10. + +Features +======== + + Checksum offload + ---------------- + The netvsc driver supports checksum offload as long as the + Hyper-V host version does. Windows Server 2016 and Azure + support checksum offload for TCP and UDP for both IPv4 and + IPv6. Windows Server 2012 only supports checksum offload for TCP. + + Receive Side Scaling + -------------------- + Hyper-V supports receive side scaling. For TCP & UDP, packets can + be distributed among available queues based on IP address and port + number. + + For TCP & UDP, we can switch hash level between L3 and L4 by ethtool + command. TCP/UDP over IPv4 and v6 can be set differently. The default + hash level is L4. We currently only allow switching TX hash level + from within the guests. + + On Azure, fragmented UDP packets have high loss rate with L4 + hashing. Using L3 hashing is recommended in this case. + + For example, for UDP over IPv4 on eth0: + To include UDP port numbers in hashing: + ethtool -N eth0 rx-flow-hash udp4 sdfn + To exclude UDP port numbers in hashing: + ethtool -N eth0 rx-flow-hash udp4 sd + To show UDP hash level: + ethtool -n eth0 rx-flow-hash udp4 + + Generic Receive Offload, aka GRO + -------------------------------- + The driver supports GRO and it is enabled by default. GRO coalesces + like packets and significantly reduces CPU usage under heavy Rx + load. + + SR-IOV support + -------------- + Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV + is enabled in both the vSwitch and the guest configuration, then the + Virtual Function (VF) device is passed to the guest as a PCI + device. In this case, both a synthetic (netvsc) and VF device are + visible in the guest OS and both NIC's have the same MAC address. + + The VF is enslaved by netvsc device. The netvsc driver will transparently + switch the data path to the VF when it is available and up. + Network state (addresses, firewall, etc) should be applied only to the + netvsc device; the slave device should not be accessed directly in + most cases. The exceptions are if some special queue discipline or + flow direction is desired, these should be applied directly to the + VF slave device. + + Receive Buffer + -------------- + Packets are received into a receive area which is created when device + is probed. The receive area is broken into MTU sized chunks and each may + contain one or more packets. The number of receive sections may be changed + via ethtool Rx ring parameters. + + There is a similar send buffer which is used to aggregate packets for sending. + The send area is broken into chunks of 6144 bytes, each of section may + contain one or more packets. The send buffer is an optimization, the driver + will use slower method to handle very large packets or if the send buffer + area is exhausted. diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt new file mode 100644 index 000000000..1669dc241 --- /dev/null +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -0,0 +1,163 @@ +/proc/sys/net/netfilter/nf_conntrack_* Variables: + +nf_conntrack_acct - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable connection tracking flow accounting. 64-bit byte and packet + counters per flow are added. + +nf_conntrack_buckets - INTEGER + Size of hash table. If not specified as parameter during module + loading, the default size is calculated by dividing total memory + by 16384 to determine the number of buckets but the hash table will + never have fewer than 32 and limited to 16384 buckets. For systems + with more than 4GB of memory it will be 65536 buckets. + This sysctl is only writeable in the initial net namespace. + +nf_conntrack_checksum - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + Verify checksum of incoming packets. Packets with bad checksums are + in INVALID state. If this is enabled, such packets will not be + considered for connection tracking. + +nf_conntrack_count - INTEGER (read-only) + Number of currently allocated flow entries. + +nf_conntrack_events - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If this option is enabled, the connection tracking code will + provide userspace with connection tracking events via ctnetlink. + +nf_conntrack_expect_max - INTEGER + Maximum size of expectation table. Default value is + nf_conntrack_buckets / 256. Minimum is 1. + +nf_conntrack_frag6_high_thresh - INTEGER + default 262144 + + Maximum memory used to reassemble IPv6 fragments. When + nf_conntrack_frag6_high_thresh bytes of memory is allocated for this + purpose, the fragment handler will toss packets until + nf_conntrack_frag6_low_thresh is reached. + +nf_conntrack_frag6_low_thresh - INTEGER + default 196608 + + See nf_conntrack_frag6_low_thresh + +nf_conntrack_frag6_timeout - INTEGER (seconds) + default 60 + + Time to keep an IPv6 fragment in memory. + +nf_conntrack_generic_timeout - INTEGER (seconds) + default 600 + + Default for generic timeout. This refers to layer 4 unknown/unsupported + protocols. + +nf_conntrack_helper - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable automatic conntrack helper assignment. + If disabled it is required to set up iptables rules to assign + helpers to connections. See the CT target description in the + iptables-extensions(8) man page for further information. + +nf_conntrack_icmp_timeout - INTEGER (seconds) + default 30 + + Default for ICMP timeout. + +nf_conntrack_icmpv6_timeout - INTEGER (seconds) + default 30 + + Default for ICMP6 timeout. + +nf_conntrack_log_invalid - INTEGER + 0 - disable (default) + 1 - log ICMP packets + 6 - log TCP packets + 17 - log UDP packets + 33 - log DCCP packets + 41 - log ICMPv6 packets + 136 - log UDPLITE packets + 255 - log packets of any protocol + + Log invalid packets of a type specified by value. + +nf_conntrack_max - INTEGER + Size of connection tracking table. Default value is + nf_conntrack_buckets value * 4. + +nf_conntrack_tcp_be_liberal - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Be conservative in what you do, be liberal in what you accept from others. + If it's non-zero, we mark only out of window RST segments as INVALID. + +nf_conntrack_tcp_loose - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If it is set to zero, we disable picking up already established + connections. + +nf_conntrack_tcp_max_retrans - INTEGER + default 3 + + Maximum number of packets that can be retransmitted without + received an (acceptable) ACK from the destination. If this number + is reached, a shorter timer will be started. + +nf_conntrack_tcp_timeout_close - INTEGER (seconds) + default 10 + +nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_established - INTEGER (seconds) + default 432000 (5 days) + +nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds) + default 30 + +nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds) + default 300 + +nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds) + default 300 + +nf_conntrack_timestamp - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable connection tracking flow timestamping. + +nf_conntrack_udp_timeout - INTEGER (seconds) + default 30 + +nf_conntrack_udp_timeout_stream - INTEGER (seconds) + default 180 + + This extended timeout will be used in case there is an UDP stream + detected. diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.txt new file mode 100644 index 000000000..b01c91893 --- /dev/null +++ b/Documentation/networking/nf_flowtable.txt @@ -0,0 +1,112 @@ +Netfilter's flowtable infrastructure +==================================== + +This documentation describes the software flowtable infrastructure available in +Netfilter since Linux kernel 4.16. + +Overview +-------- + +Initial packets follow the classic forwarding path, once the flow enters the +established state according to the conntrack semantics (ie. we have seen traffic +in both directions), then you can decide to offload the flow to the flowtable +from the forward chain via the 'flow offload' action available in nftables. + +Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the +output netdevice via neigh_xmit(), hence, they bypass the classic forwarding +path (the visible effect is that you do not see these packets from any of the +netfilter hooks coming after the ingress). In case of flowtable miss, the packet +follows the classic forward path. + +The flowtable uses a resizable hashtable, lookups are based on the following +7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source +and destination ports and the input interface (useful in case there are several +conntrack zones in place). + +Flowtables are populated via the 'flow offload' nftables action, so the user can +selectively specify what flows are placed into the flow table. Hence, packets +follow the classic forwarding path unless the user explicitly instruct packets +to use this new alternative forwarding path via nftables policy. + +This is represented in Fig.1, which describes the classic forwarding path +including the Netfilter hooks and the flowtable fastpath bypass. + + userspace process + ^ | + | | + _____|____ ____\/___ + / \ / \ + | input | | output | + \__________/ \_________/ + ^ | + | | + _________ __________ --------- _____\/_____ + / \ / \ |Routing | / \ + --> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit + \_________/ \__________/ ---------- \____________/ ^ + | ^ | | ^ | + flowtable | | ____\/___ | | + | | | / \ | | + __\/___ | --------->| forward |------------ | + |-----| | \_________/ | + |-----| | 'flow offload' rule | + |-----| | adds entry to | + |_____| | flowtable | + | | | + / \ | | + /hit\_no_| | + \ ? / | + \ / | + |__yes_________________fastpath bypass ____________________________| + + Fig.1 Netfilter hooks and flowtable interactions + +The flowtable entry also stores the NAT configuration, so all packets are +mangled according to the NAT policy that matches the initial packets that went +through the classic forwarding path. The TTL is decremented before calling +neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding +path given that the transport selectors are missing, therefore flowtable lookup +is not possible. + +Example configuration +--------------------- + +Enabling the flowtable bypass is relatively easy, you only need to create a +flowtable and add one rule to your forward chain. + + table inet x { + flowtable f { + hook ingress priority 0; devices = { eth0, eth1 }; + } + chain y { + type filter hook forward priority 0; policy accept; + ip protocol tcp flow offload @f + counter packets 0 bytes 0 + } + } + +This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1 +netdevices. You can create as many flowtables as you want in case you need to +perform resource partitioning. The flowtable priority defines the order in which +hooks are run in the pipeline, this is convenient in case you already have a +nftables ingress chain (make sure the flowtable priority is smaller than the +nftables ingress chain hence the flowtable runs before in the pipeline). + +The 'flow offload' action from the forward chain 'y' adds an entry to the +flowtable for the TCP syn-ack packet coming in the reply direction. Once the +flow is offloaded, you will observe that the counter rule in the example above +does not get updated for the packets that are being forwarded through the +forwarding bypass. + +More reading +------------ + +This documentation is based on the LWN.net articles [1][2]. Rafal Milecki also +made a very complete and comprehensive summary called "A state of network +acceleration" that describes how things were before this infrastructure was +mailined [3] and it also makes a rough summary of this work [4]. + +[1] https://lwn.net/Articles/738214/ +[2] https://lwn.net/Articles/742164/ +[3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html +[4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html diff --git a/Documentation/networking/nfc.txt b/Documentation/networking/nfc.txt new file mode 100644 index 000000000..b24c29bda --- /dev/null +++ b/Documentation/networking/nfc.txt @@ -0,0 +1,128 @@ +Linux NFC subsystem +=================== + +The Near Field Communication (NFC) subsystem is required to standardize the +NFC device drivers development and to create an unified userspace interface. + +This document covers the architecture overview, the device driver interface +description and the userspace interface description. + +Architecture overview +--------------------- + +The NFC subsystem is responsible for: + - NFC adapters management; + - Polling for targets; + - Low-level data exchange; + +The subsystem is divided in some parts. The 'core' is responsible for +providing the device driver interface. On the other side, it is also +responsible for providing an interface to control operations and low-level +data exchange. + +The control operations are available to userspace via generic netlink. + +The low-level data exchange interface is provided by the new socket family +PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets. + + + +--------------------------------------+ + | USER SPACE | + +--------------------------------------+ + ^ ^ + | low-level | control + | data exchange | operations + | | + | v + | +-----------+ + | AF_NFC | netlink | + | socket +-----------+ + | raw ^ + | | + v v + +---------+ +-----------+ + | rawsock | <--------> | core | + +---------+ +-----------+ + ^ + | + v + +-----------+ + | driver | + +-----------+ + +Device Driver Interface +----------------------- + +When registering on the NFC subsystem, the device driver must inform the core +of the set of supported NFC protocols and the set of ops callbacks. The ops +callbacks that must be implemented are the following: + +* start_poll - setup the device to poll for targets +* stop_poll - stop on progress polling operation +* activate_target - select and initialize one of the targets found +* deactivate_target - deselect and deinitialize the selected target +* data_exchange - send data and receive the response (transceive operation) + +Userspace interface +-------------------- + +The userspace interface is divided in control operations and low-level data +exchange operation. + +CONTROL OPERATIONS: + +Generic netlink is used to implement the interface to the control operations. +The operations are composed by commands and events, all listed below: + +* NFC_CMD_GET_DEVICE - get specific device info or dump the device list +* NFC_CMD_START_POLL - setup a specific device to polling for targets +* NFC_CMD_STOP_POLL - stop the polling operation in a specific device +* NFC_CMD_GET_TARGET - dump the list of targets found by a specific device + +* NFC_EVENT_DEVICE_ADDED - reports an NFC device addition +* NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal +* NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets +are found + +The user must call START_POLL to poll for NFC targets, passing the desired NFC +protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling +state until it finds any target. However, the user can stop the polling +operation by calling STOP_POLL command. In this case, it will be checked if +the requester of STOP_POLL is the same of START_POLL. + +If the polling operation finds one or more targets, the event TARGETS_FOUND is +sent (including the device id). The user must call GET_TARGET to get the list of +all targets found by such device. Each reply message has target attributes with +relevant information such as the supported NFC protocols. + +All polling operations requested through one netlink socket are stopped when +it's closed. + +LOW-LEVEL DATA EXCHANGE: + +The userspace must use PF_NFC sockets to perform any data communication with +targets. All NFC sockets use AF_NFC: + +struct sockaddr_nfc { + sa_family_t sa_family; + __u32 dev_idx; + __u32 target_idx; + __u32 nfc_protocol; +}; + +To establish a connection with one target, the user must create an +NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc +struct correctly filled. All information comes from NFC_EVENT_TARGETS_FOUND +netlink event. As a target can support more than one NFC protocol, the user +must inform which protocol it wants to use. + +Internally, 'connect' will result in an activate_target call to the driver. +When the socket is closed, the target is deactivated. + +The data format exchanged through the sockets is NFC protocol dependent. For +instance, when communicating with MIFARE tags, the data exchanged are MIFARE +commands and their responses. + +The first received package is the response to the first sent package and so +on. In order to allow valid "empty" responses, every data received has a NULL +header of 1 byte. diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt new file mode 100644 index 000000000..b3b9ac61d --- /dev/null +++ b/Documentation/networking/openvswitch.txt @@ -0,0 +1,248 @@ +Open vSwitch datapath developer documentation +============================================= + +The Open vSwitch kernel module allows flexible userspace control over +flow-level packet processing on selected network devices. It can be +used to implement a plain Ethernet switch, network device bonding, +VLAN processing, network access control, flow-based network control, +and so on. + +The kernel module implements multiple "datapaths" (analogous to +bridges), each of which can have multiple "vports" (analogous to ports +within a bridge). Each datapath also has associated with it a "flow +table" that userspace populates with "flows" that map from keys based +on packet headers and metadata to sets of actions. The most common +action forwards the packet to another vport; other actions are also +implemented. + +When a packet arrives on a vport, the kernel module processes it by +extracting its flow key and looking it up in the flow table. If there +is a matching flow, it executes the associated actions. If there is +no match, it queues the packet to userspace for processing (as part of +its processing, userspace will likely set up a flow to handle further +packets of the same type entirely in-kernel). + + +Flow key compatibility +---------------------- + +Network protocols evolve over time. New protocols become important +and existing protocols lose their prominence. For the Open vSwitch +kernel module to remain relevant, it must be possible for newer +versions to parse additional protocols as part of the flow key. It +might even be desirable, someday, to drop support for parsing +protocols that have become obsolete. Therefore, the Netlink interface +to Open vSwitch is designed to allow carefully written userspace +applications to work with any version of the flow key, past or future. + +To support this forward and backward compatibility, whenever the +kernel module passes a packet to userspace, it also passes along the +flow key that it parsed from the packet. Userspace then extracts its +own notion of a flow key from the packet and compares it against the +kernel-provided version: + + - If userspace's notion of the flow key for the packet matches the + kernel's, then nothing special is necessary. + + - If the kernel's flow key includes more fields than the userspace + version of the flow key, for example if the kernel decoded IPv6 + headers but userspace stopped at the Ethernet type (because it + does not understand IPv6), then again nothing special is + necessary. Userspace can still set up a flow in the usual way, + as long as it uses the kernel-provided flow key to do it. + + - If the userspace flow key includes more fields than the + kernel's, for example if userspace decoded an IPv6 header but + the kernel stopped at the Ethernet type, then userspace can + forward the packet manually, without setting up a flow in the + kernel. This case is bad for performance because every packet + that the kernel considers part of the flow must go to userspace, + but the forwarding behavior is correct. (If userspace can + determine that the values of the extra fields would not affect + forwarding behavior, then it could set up a flow anyway.) + +How flow keys evolve over time is important to making this work, so +the following sections go into detail. + + +Flow key format +--------------- + +A flow key is passed over a Netlink socket as a sequence of Netlink +attributes. Some attributes represent packet metadata, defined as any +information about a packet that cannot be extracted from the packet +itself, e.g. the vport on which the packet was received. Most +attributes, however, are extracted from headers within the packet, +e.g. source and destination addresses from Ethernet, IP, or TCP +headers. + +The <linux/openvswitch.h> header file defines the exact format of the +flow key attributes. For informal explanatory purposes here, we write +them as comma-separated strings, with parentheses indicating arguments +and nesting. For example, the following could represent a flow key +corresponding to a TCP packet that arrived on vport 1: + + in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), + eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0, + frag=no), tcp(src=49163, dst=80) + +Often we ellipsize arguments not important to the discussion, e.g.: + + in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) + + +Wildcarded flow key format +-------------------------- + +A wildcarded flow is described with two sequences of Netlink attributes +passed over the Netlink socket. A flow key, exactly as described above, and an +optional corresponding flow mask. + +A wildcarded flow can represent a group of exact match flows. Each '1' bit +in the mask specifies a exact match with the corresponding bit in the flow key. +A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit +of a incoming packet. Using wildcarded flow can improve the flow set up rate +by reduce the number of new flows need to be processed by the user space program. + +Support for the mask Netlink attribute is optional for both the kernel and user +space program. The kernel can ignore the mask attribute, installing an exact +match flow, or reduce the number of don't care bits in the kernel to less than +what was specified by the user space program. In this case, variations in bits +that the kernel does not implement will simply result in additional flow setups. +The kernel module will also work with user space programs that neither support +nor supply flow mask attributes. + +Since the kernel may ignore or modify wildcard bits, it can be difficult for +the userspace program to know exactly what matches are installed. There are +two possible approaches: reactively install flows as they miss the kernel +flow table (and therefore not attempt to determine wildcard changes at all) +or use the kernel's response messages to determine the installed wildcards. + +When interacting with userspace, the kernel should maintain the match portion +of the key exactly as originally installed. This will provides a handle to +identify the flow for all future operations. However, when reporting the +mask of an installed flow, the mask should include any restrictions imposed +by the kernel. + +The behavior when using overlapping wildcarded flows is undefined. It is the +responsibility of the user space program to ensure that any incoming packet +can match at most one flow, wildcarded or not. The current implementation +performs best-effort detection of overlapping wildcarded flows and may reject +some but not all of them. However, this behavior may change in future versions. + + +Unique flow identifiers +----------------------- + +An alternative to using the original match portion of a key as the handle for +flow identification is a unique flow identifier, or "UFID". UFIDs are optional +for both the kernel and user space program. + +User space programs that support UFID are expected to provide it during flow +setup in addition to the flow, then refer to the flow using the UFID for all +future operations. The kernel is not required to index flows by the original +flow key if a UFID is specified. + + +Basic rule for evolving flow keys +--------------------------------- + +Some care is needed to really maintain forward and backward +compatibility for applications that follow the rules listed under +"Flow key compatibility" above. + +The basic rule is obvious: + + ------------------------------------------------------------------ + New network protocol support must only supplement existing flow + key attributes. It must not change the meaning of already defined + flow key attributes. + ------------------------------------------------------------------ + +This rule does have less-obvious consequences so it is worth working +through a few examples. Suppose, for example, that the kernel module +did not already implement VLAN parsing. Instead, it just interpreted +the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the +packet. The flow key for any packet with an 802.1Q header would look +essentially like this, ignoring metadata: + + eth(...), eth_type(0x8100) + +Naively, to add VLAN support, it makes sense to add a new "vlan" flow +key attribute to contain the VLAN tag, then continue to decode the +encapsulated headers beyond the VLAN tag using the existing field +definitions. With this change, a TCP packet in VLAN 10 would have a +flow key much like this: + + eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) + +But this change would negatively affect a userspace application that +has not been updated to understand the new "vlan" flow key attribute. +The application could, following the flow compatibility rules above, +ignore the "vlan" attribute that it does not understand and therefore +assume that the flow contained IP packets. This is a bad assumption +(the flow only contains IP packets if one parses and skips over the +802.1Q header) and it could cause the application's behavior to change +across kernel versions even though it follows the compatibility rules. + +The solution is to use a set of nested attributes. This is, for +example, why 802.1Q support uses nested attributes. A TCP packet in +VLAN 10 is actually expressed as: + + eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), + ip(proto=6, ...), tcp(...))) + +Notice how the "eth_type", "ip", and "tcp" flow key attributes are +nested inside the "encap" attribute. Thus, an application that does +not understand the "vlan" key will not see either of those attributes +and therefore will not misinterpret them. (Also, the outer eth_type +is still 0x8100, not changed to 0x0800.) + +Handling malformed packets +-------------------------- + +Don't drop packets in the kernel for malformed protocol headers, bad +checksums, etc. This would prevent userspace from implementing a +simple Ethernet switch that forwards every packet. + +Instead, in such a case, include an attribute with "empty" content. +It doesn't matter if the empty content could be valid protocol values, +as long as those values are rarely seen in practice, because userspace +can always forward all packets with those values to userspace and +handle them individually. + +For example, consider a packet that contains an IP header that +indicates protocol 6 for TCP, but which is truncated just after the IP +header, so that the TCP header is missing. The flow key for this +packet would include a tcp attribute with all-zero src and dst, like +this: + + eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) + +As another example, consider a packet with an Ethernet type of 0x8100, +indicating that a VLAN TCI should follow, but which is truncated just +after the Ethernet type. The flow key for this packet would include +an all-zero-bits vlan and an empty encap attribute, like this: + + eth(...), eth_type(0x8100), vlan(0), encap() + +Unlike a TCP packet with source and destination ports 0, an +all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka +VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan +attribute expressly to allow this situation to be distinguished. +Thus, the flow key in this second example unambiguously indicates a +missing or malformed VLAN TCI. + +Other rules +----------- + +The other rules for flow keys are much less subtle: + + - Duplicate attributes are not allowed at a given nesting level. + + - Ordering of attributes is not significant. + + - When the kernel sends a given flow key to userspace, it always + composes it the same way. This allows userspace to hash and + compare entire flow keys that it may not be able to fully + interpret. diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt new file mode 100644 index 000000000..355c6d8ef --- /dev/null +++ b/Documentation/networking/operstates.txt @@ -0,0 +1,162 @@ + +1. Introduction + +Linux distinguishes between administrative and operational state of an +interface. Administrative state is the result of "ip link set dev +<dev> up or down" and reflects whether the administrator wants to use +the device for traffic. + +However, an interface is not usable just because the admin enabled it +- ethernet requires to be plugged into the switch and, depending on +a site's networking policy and configuration, an 802.1X authentication +to be performed before user data can be transferred. Operational state +shows the ability of an interface to transmit this user data. + +Thanks to 802.1X, userspace must be granted the possibility to +influence operational state. To accommodate this, operational state is +split into two parts: Two flags that can be set by the driver only, and +a RFC2863 compatible state that is derived from these flags, a policy, +and changeable from userspace under certain rules. + + +2. Querying from userspace + +Both admin and operational state can be queried via the netlink +operation RTM_GETLINK. It is also possible to subscribe to RTMGRP_LINK +to be notified of updates. This is important for setting from userspace. + +These values contain interface state: + +ifinfomsg::if_flags & IFF_UP: + Interface is admin up +ifinfomsg::if_flags & IFF_RUNNING: + Interface is in RFC2863 operational state UP or UNKNOWN. This is for + backward compatibility, routing daemons, dhcp clients can use this + flag to determine whether they should use the interface. +ifinfomsg::if_flags & IFF_LOWER_UP: + Driver has signaled netif_carrier_on() +ifinfomsg::if_flags & IFF_DORMANT: + Driver has signaled netif_dormant_on() + +TLV IFLA_OPERSTATE + +contains RFC2863 state of the interface in numeric representation: + +IF_OPER_UNKNOWN (0): + Interface is in unknown state, neither driver nor userspace has set + operational state. Interface must be considered for user data as + setting operational state has not been implemented in every driver. +IF_OPER_NOTPRESENT (1): + Unused in current kernel (notpresent interfaces normally disappear), + just a numerical placeholder. +IF_OPER_DOWN (2): + Interface is unable to transfer data on L1, f.e. ethernet is not + plugged or interface is ADMIN down. +IF_OPER_LOWERLAYERDOWN (3): + Interfaces stacked on an interface that is IF_OPER_DOWN show this + state (f.e. VLAN). +IF_OPER_TESTING (4): + Unused in current kernel. +IF_OPER_DORMANT (5): + Interface is L1 up, but waiting for an external event, f.e. for a + protocol to establish. (802.1X) +IF_OPER_UP (6): + Interface is operational up and can be used. + +This TLV can also be queried via sysfs. + +TLV IFLA_LINKMODE + +contains link policy. This is needed for userspace interaction +described below. + +This TLV can also be queried via sysfs. + + +3. Kernel driver API + +Kernel drivers have access to two flags that map to IFF_LOWER_UP and +IFF_DORMANT. These flags can be set from everywhere, even from +interrupts. It is guaranteed that only the driver has write access, +however, if different layers of the driver manipulate the same flag, +the driver has to provide the synchronisation needed. + +__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP: + +The driver uses netif_carrier_on() to clear and netif_carrier_off() to +set this flag. On netif_carrier_off(), the scheduler stops sending +packets. The name 'carrier' and the inversion are historical, think of +it as lower layer. + +Note that for certain kind of soft-devices, which are not managing any +real hardware, it is possible to set this bit from userspace. One +should use TVL IFLA_CARRIER to do so. + +netif_carrier_ok() can be used to query that bit. + +__LINK_STATE_DORMANT, maps to IFF_DORMANT: + +Set by the driver to express that the device cannot yet be used +because some driver controlled protocol establishment has to +complete. Corresponding functions are netif_dormant_on() to set the +flag, netif_dormant_off() to clear it and netif_dormant() to query. + +On device allocation, networking core sets the flags equivalent to +netif_carrier_ok() and !netif_dormant(). + + +Whenever the driver CHANGES one of these flags, a workqueue event is +scheduled to translate the flag combination to IFLA_OPERSTATE as +follows: + +!netif_carrier_ok(): + IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN + otherwise. Kernel can recognise stacked interfaces because their + ifindex != iflink. + +netif_carrier_ok() && netif_dormant(): + IF_OPER_DORMANT + +netif_carrier_ok() && !netif_dormant(): + IF_OPER_UP if userspace interaction is disabled. Otherwise + IF_OPER_DORMANT with the possibility for userspace to initiate the + IF_OPER_UP transition afterwards. + + +4. Setting from userspace + +Applications have to use the netlink interface to influence the +RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1 +via RTM_SETLINK instructs the kernel that an interface should go to +IF_OPER_DORMANT instead of IF_OPER_UP when the combination +netif_carrier_ok() && !netif_dormant() is set by the +driver. Afterwards, the userspace application can set IFLA_OPERSTATE +to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set +netif_carrier_off() or netif_dormant_on(). Changes made by userspace +are multicasted on the netlink group RTMGRP_LINK. + +So basically a 802.1X supplicant interacts with the kernel like this: + +-subscribe to RTMGRP_LINK +-set IFLA_LINKMODE to 1 via RTM_SETLINK +-query RTM_GETLINK once to get initial state +-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until + netlink multicast signals this state +-do 802.1X, eventually abort if flags go down again +-send RTM_SETLINK to set operstate to IF_OPER_UP if authentication + succeeds, IF_OPER_DORMANT otherwise +-see how operstate and IFF_RUNNING is echoed via netlink multicast +-set interface back to IF_OPER_DORMANT if 802.1X reauthentication + fails +-restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag + +if supplicant goes down, bring back IFLA_LINKMODE to 0 and +IFLA_OPERSTATE to a sane value. + +A routing daemon or dhcp client just needs to care for IFF_RUNNING or +waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before +considering the interface / querying a DHCP address. + + +For technical questions and/or comments please e-mail to Stefan Rompf +(stefan at loplof.de). diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt new file mode 100644 index 000000000..999eb41da --- /dev/null +++ b/Documentation/networking/packet_mmap.txt @@ -0,0 +1,1061 @@ +-------------------------------------------------------------------------------- ++ ABSTRACT +-------------------------------------------------------------------------------- + +This file documents the mmap() facility available with the PACKET +socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for +i) capture network traffic with utilities like tcpdump, ii) transmit network +traffic, or any other that needs raw access to network interface. + +Howto can be found at: + https://sites.google.com/site/packetmmap/ + +Please send your comments to + Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> + Johann Baudy + +------------------------------------------------------------------------------- ++ Why use PACKET_MMAP +-------------------------------------------------------------------------------- + +In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very +inefficient. It uses very limited buffers and requires one system call to +capture each packet, it requires two if you want to get packet's timestamp +(like libpcap always does). + +In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size +configurable circular buffer mapped in user space that can be used to either +send or receive packets. This way reading packets just needs to wait for them, +most of the time there is no need to issue a single system call. Concerning +transmission, multiple packets can be sent through one system call to get the +highest bandwidth. By using a shared buffer between the kernel and the user +also has the benefit of minimizing packet copies. + +It's fine to use PACKET_MMAP to improve the performance of the capture and +transmission process, but it isn't everything. At least, if you are capturing +at high speeds (this is relative to the cpu speed), you should check if the +device driver of your network interface card supports some sort of interrupt +load mitigation or (even better) if it supports NAPI, also make sure it is +enabled. For transmission, check the MTU (Maximum Transmission Unit) used and +supported by devices of your network. CPU IRQ pinning of your network interface +card can also be an advantage. + +-------------------------------------------------------------------------------- ++ How to use mmap() to improve capture process +-------------------------------------------------------------------------------- + +From the user standpoint, you should use the higher level libpcap library, which +is a de facto standard, portable across nearly all operating systems +including Win32. + +Packet MMAP support was integrated into libpcap around the time of version 1.3.0; +TPACKET_V3 support was added in version 1.5.0 + +-------------------------------------------------------------------------------- ++ How to use mmap() directly to improve capture process +-------------------------------------------------------------------------------- + +From the system calls stand point, the use of PACKET_MMAP involves +the following process: + + +[setup] socket() -------> creation of the capture socket + setsockopt() ---> allocation of the circular buffer (ring) + option: PACKET_RX_RING + mmap() ---------> mapping of the allocated buffer to the + user process + +[capture] poll() ---------> to wait for incoming packets + +[shutdown] close() --------> destruction of the capture socket and + deallocation of all associated + resources. + + +socket creation and destruction is straight forward, and is done +the same way with or without PACKET_MMAP: + + int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); + +where mode is SOCK_RAW for the raw interface were link level +information can be captured or SOCK_DGRAM for the cooked +interface where link level information capture is not +supported and a link level pseudo-header is provided +by the kernel. + +The destruction of the socket and all associated resources +is done by a simple call to close(fd). + +Similarly as without PACKET_MMAP, it is possible to use one socket +for capture and transmission. This can be done by mapping the +allocated RX and TX buffer ring with a single mmap() call. +See "Mapping and use of the circular buffer (ring)". + +Next I will describe PACKET_MMAP settings and its constraints, +also the mapping of the circular buffer in the user process and +the use of this buffer. + +-------------------------------------------------------------------------------- ++ How to use mmap() directly to improve transmission process +-------------------------------------------------------------------------------- +Transmission process is similar to capture as shown below. + +[setup] socket() -------> creation of the transmission socket + setsockopt() ---> allocation of the circular buffer (ring) + option: PACKET_TX_RING + bind() ---------> bind transmission socket with a network interface + mmap() ---------> mapping of the allocated buffer to the + user process + +[transmission] poll() ---------> wait for free packets (optional) + send() ---------> send all packets that are set as ready in + the ring + The flag MSG_DONTWAIT can be used to return + before end of transfer. + +[shutdown] close() --------> destruction of the transmission socket and + deallocation of all associated resources. + +Socket creation and destruction is also straight forward, and is done +the same way as in capturing described in the previous paragraph: + + int fd = socket(PF_PACKET, mode, 0); + +The protocol can optionally be 0 in case we only want to transmit +via this socket, which avoids an expensive call to packet_rcv(). +In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 +set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example. + +Binding the socket to your network interface is mandatory (with zero copy) to +know the header size of frames used in the circular buffer. + +As capture, each frame contains two parts: + + -------------------- +| struct tpacket_hdr | Header. It contains the status of +| | of this frame +|--------------------| +| data buffer | +. . Data that will be sent over the network interface. +. . + -------------------- + + bind() associates the socket to your network interface thanks to + sll_ifindex parameter of struct sockaddr_ll. + + Initialization example: + + struct sockaddr_ll my_addr; + struct ifreq s_ifr; + ... + + strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); + + /* get interface index of eth0 */ + ioctl(this->socket, SIOCGIFINDEX, &s_ifr); + + /* fill sockaddr_ll struct to prepare binding */ + my_addr.sll_family = AF_PACKET; + my_addr.sll_protocol = htons(ETH_P_ALL); + my_addr.sll_ifindex = s_ifr.ifr_ifindex; + + /* bind socket to eth0 */ + bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); + + A complete tutorial is available at: https://sites.google.com/site/packetmmap/ + +By default, the user should put data at : + frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) + +So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), +the beginning of the user data will be at : + frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +If you wish to put user data at a custom offset from the beginning of +the frame (for payload alignment with SOCK_RAW mode for instance) you +can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order +to make this work it must be enabled previously with setsockopt() +and the PACKET_TX_HAS_OFF option. + +-------------------------------------------------------------------------------- ++ PACKET_MMAP settings +-------------------------------------------------------------------------------- + +To setup PACKET_MMAP from user level code is done with a call like + + - Capture process + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) + - Transmission process + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) + +The most significant argument in the previous call is the req parameter, +this parameter must to have the following structure: + + struct tpacket_req + { + unsigned int tp_block_size; /* Minimal size of contiguous block */ + unsigned int tp_block_nr; /* Number of blocks */ + unsigned int tp_frame_size; /* Size of frame */ + unsigned int tp_frame_nr; /* Total number of frames */ + }; + +This structure is defined in /usr/include/linux/if_packet.h and establishes a +circular buffer (ring) of unswappable memory. +Being mapped in the capture process allows reading the captured frames and +related meta-information like timestamps without requiring a system call. + +Frames are grouped in blocks. Each block is a physically contiguous +region of memory and holds tp_block_size/tp_frame_size frames. The total number +of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because + + frames_per_block = tp_block_size/tp_frame_size + +indeed, packet_set_ring checks that the following condition is true + + frames_per_block * tp_block_nr == tp_frame_nr + +Lets see an example, with the following values: + + tp_block_size= 4096 + tp_frame_size= 2048 + tp_block_nr = 4 + tp_frame_nr = 8 + +we will get the following buffer structure: + + block #1 block #2 ++---------+---------+ +---------+---------+ +| frame 1 | frame 2 | | frame 3 | frame 4 | ++---------+---------+ +---------+---------+ + + block #3 block #4 ++---------+---------+ +---------+---------+ +| frame 5 | frame 6 | | frame 7 | frame 8 | ++---------+---------+ +---------+---------+ + +A frame can be of any size with the only condition it can fit in a block. A block +can only hold an integer number of frames, or in other words, a frame cannot +be spawned across two blocks, so there are some details you have to take into +account when choosing the frame_size. See "Mapping and use of the circular +buffer (ring)". + +-------------------------------------------------------------------------------- ++ PACKET_MMAP setting constraints +-------------------------------------------------------------------------------- + +In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), +the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or +16384 in a 64 bit architecture. For information on these kernel versions +see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt + + Block size limit +------------------ + +As stated earlier, each block is a contiguous physical region of memory. These +memory regions are allocated with calls to the __get_free_pages() function. As +the name indicates, this function allocates pages of memory, and the second +argument is "order" or a power of two number of pages, that is +(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, +order=2 ==> 16384 bytes, etc. The maximum size of a +region allocated by __get_free_pages is determined by the MAX_ORDER macro. More +precisely the limit can be calculated as: + + PAGE_SIZE << MAX_ORDER + + In a i386 architecture PAGE_SIZE is 4096 bytes + In a 2.4/i386 kernel MAX_ORDER is 10 + In a 2.6/i386 kernel MAX_ORDER is 11 + +So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel +respectively, with an i386 architecture. + +User space programs can include /usr/include/sys/user.h and +/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. + +The pagesize can also be determined dynamically with the getpagesize (2) +system call. + + Block number limit +-------------------- + +To understand the constraints of PACKET_MMAP, we have to see the structure +used to hold the pointers to each block. + +Currently, this structure is a dynamically allocated vector with kmalloc +called pg_vec, its size limits the number of blocks that can be allocated. + + +---+---+---+---+ + | x | x | x | x | + +---+---+---+---+ + | | | | + | | | v + | | v block #4 + | v block #3 + v block #2 + block #1 + +kmalloc allocates any number of bytes of physically contiguous memory from +a pool of pre-determined sizes. This pool of memory is maintained by the slab +allocator which is at the end the responsible for doing the allocation and +hence which imposes the maximum memory that kmalloc can allocate. + +In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The +predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" +entries of /proc/slabinfo + +In a 32 bit architecture, pointers are 4 bytes long, so the total number of +pointers to blocks is + + 131072/4 = 32768 blocks + + PACKET_MMAP buffer size calculator +------------------------------------ + +Definitions: + +<size-max> : is the maximum size of allocable with kmalloc (see /proc/slabinfo) +<pointer size>: depends on the architecture -- sizeof(void *) +<page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2) +<max-order> : is the value defined with MAX_ORDER +<frame size> : it's an upper bound of frame's capture size (more on this later) + +from these definitions we will derive + + <block number> = <size-max>/<pointer size> + <block size> = <pagesize> << <max-order> + +so, the max buffer size is + + <block number> * <block size> + +and, the number of frames be + + <block number> * <block size> / <frame size> + +Suppose the following parameters, which apply for 2.6 kernel and an +i386 architecture: + + <size-max> = 131072 bytes + <pointer size> = 4 bytes + <pagesize> = 4096 bytes + <max-order> = 11 + +and a value for <frame size> of 2048 bytes. These parameters will yield + + <block number> = 131072/4 = 32768 blocks + <block size> = 4096 << 11 = 8 MiB. + +and hence the buffer will have a 262144 MiB size. So it can hold +262144 MiB / 2048 bytes = 134217728 frames + +Actually, this buffer size is not possible with an i386 architecture. +Remember that the memory is allocated in kernel space, in the case of +an i386 kernel's memory size is limited to 1GiB. + +All memory allocations are not freed until the socket is closed. The memory +allocations are done with GFP_KERNEL priority, this basically means that +the allocation can wait and swap other process' memory in order to allocate +the necessary memory, so normally limits can be reached. + + Other constraints +------------------- + +If you check the source code you will see that what I draw here as a frame +is not only the link level frame. At the beginning of each frame there is a +header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame +meta information like timestamp. So what we draw here a frame it's really +the following (from include/linux/if_packet.h): + +/* + Frame structure: + + - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 + - struct tpacket_hdr + - pad to TPACKET_ALIGNMENT=16 + - struct sockaddr_ll + - Gap, chosen so that packet data (Start+tp_net) aligns to + TPACKET_ALIGNMENT=16 + - Start+tp_mac: [ Optional MAC header ] + - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. + - Pad to align to TPACKET_ALIGNMENT=16 + */ + + The following are conditions that are checked in packet_set_ring + + tp_block_size must be a multiple of PAGE_SIZE (1) + tp_frame_size must be greater than TPACKET_HDRLEN (obvious) + tp_frame_size must be a multiple of TPACKET_ALIGNMENT + tp_frame_nr must be exactly frames_per_block*tp_block_nr + +Note that tp_block_size should be chosen to be a power of two or there will +be a waste of memory. + +-------------------------------------------------------------------------------- ++ Mapping and use of the circular buffer (ring) +-------------------------------------------------------------------------------- + +The mapping of the buffer in the user process is done with the conventional +mmap function. Even the circular buffer is compound of several physically +discontiguous blocks of memory, they are contiguous to the user space, hence +just one call to mmap is needed: + + mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + +If tp_frame_size is a divisor of tp_block_size frames will be +contiguously spaced by tp_frame_size bytes. If not, each +tp_block_size/tp_frame_size frames there will be a gap between +the frames. This is because a frame cannot be spawn across two +blocks. + +To use one socket for capture and transmission, the mapping of both the +RX and TX buffer ring has to be done with one call to mmap: + + ... + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); + ... + rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + tx_ring = rx_ring + size; + +RX must be the first as the kernel maps the TX ring memory right +after the RX one. + +At the beginning of each frame there is an status field (see +struct tpacket_hdr). If this field is 0 means that the frame is ready +to be used for the kernel, If not, there is a frame the user can read +and the following flags apply: + ++++ Capture process: + from include/linux/if_packet.h + + #define TP_STATUS_COPY (1 << 1) + #define TP_STATUS_LOSING (1 << 2) + #define TP_STATUS_CSUMNOTREADY (1 << 3) + #define TP_STATUS_CSUM_VALID (1 << 7) + +TP_STATUS_COPY : This flag indicates that the frame (and associated + meta information) has been truncated because it's + larger than tp_frame_size. This packet can be + read entirely with recvfrom(). + + In order to make this work it must to be + enabled previously with setsockopt() and + the PACKET_COPY_THRESH option. + + The number of frames that can be buffered to + be read with recvfrom is limited like a normal socket. + See the SO_RCVBUF option in the socket (7) man page. + +TP_STATUS_LOSING : indicates there were packet drops from last time + statistics where checked with getsockopt() and + the PACKET_STATISTICS option. + +TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which + its checksum will be done in hardware. So while + reading the packet we should not try to check the + checksum. + +TP_STATUS_CSUM_VALID : This flag indicates that at least the transport + header checksum of the packet has been already + validated on the kernel side. If the flag is not set + then we are free to check the checksum by ourselves + provided that TP_STATUS_CSUMNOTREADY is also not set. + +for convenience there are also the following defines: + + #define TP_STATUS_KERNEL 0 + #define TP_STATUS_USER 1 + +The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel +receives a packet it puts in the buffer and updates the status with +at least the TP_STATUS_USER flag. Then the user can read the packet, +once the packet is read the user must zero the status field, so the kernel +can use again that frame buffer. + +The user can use poll (any other variant should apply too) to check if new +packets are in the ring: + + struct pollfd pfd; + + pfd.fd = fd; + pfd.revents = 0; + pfd.events = POLLIN|POLLRDNORM|POLLERR; + + if (status == TP_STATUS_KERNEL) + retval = poll(&pfd, 1, timeout); + +It doesn't incur in a race condition to first check the status value and +then poll for frames. + +++ Transmission process +Those defines are also used for transmission: + + #define TP_STATUS_AVAILABLE 0 // Frame is available + #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() + #define TP_STATUS_SENDING 2 // Frame is currently in transmission + #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct + +First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a +packet, the user fills a data buffer of an available frame, sets tp_len to +current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. +This can be done on multiple frames. Once the user is ready to transmit, it +calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are +forwarded to the network device. The kernel updates each status of sent +frames with TP_STATUS_SENDING until the end of transfer. +At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. + + header->tp_len = in_i_size; + header->tp_status = TP_STATUS_SEND_REQUEST; + retval = send(this->socket, NULL, 0, 0); + +The user can also use poll() to check if a buffer is available: +(status == TP_STATUS_SENDING) + + struct pollfd pfd; + pfd.fd = fd; + pfd.revents = 0; + pfd.events = POLLOUT; + retval = poll(&pfd, 1, timeout); + +------------------------------------------------------------------------------- ++ What TPACKET versions are available and when to use them? +------------------------------------------------------------------------------- + + int val = tpacket_version; + setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + +where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. + +TPACKET_V1: + - Default if not otherwise specified by setsockopt(2) + - RX_RING, TX_RING available + +TPACKET_V1 --> TPACKET_V2: + - Made 64 bit clean due to unsigned long usage in TPACKET_V1 + structures, thus this also works on 64 bit kernel with 32 bit + userspace and the like + - Timestamp resolution in nanoseconds instead of microseconds + - RX_RING, TX_RING available + - VLAN metadata information available for packets + (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), + in the tpacket2_hdr structure: + - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates + that the tp_vlan_tci field has valid VLAN TCI value + - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field + indicates that the tp_vlan_tpid field has valid VLAN TPID value + - How to switch to TPACKET_V2: + 1. Replace struct tpacket_hdr by struct tpacket2_hdr + 2. Query header len and save + 3. Set protocol version to 2, set up ring as usual + 4. For getting the sockaddr_ll, + use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of + (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +TPACKET_V2 --> TPACKET_V3: + - Flexible buffer implementation for RX_RING: + 1. Blocks can be configured with non-static frame-size + 2. Read/poll is at a block-level (as opposed to packet-level) + 3. Added poll timeout to avoid indefinite user-space wait + on idle links + 4. Added user-configurable knobs: + 4.1 block::timeout + 4.2 tpkt_hdr::sk_rxhash + - RX Hash data available in user space + - TX_RING semantics are conceptually similar to TPACKET_V2; + use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN + instead of TPACKET2_HDRLEN. In the current implementation, + the tp_next_offset field in the tpacket3_hdr MUST be set to + zero, indicating that the ring does not hold variable sized frames. + Packets with non-zero values of tp_next_offset will be dropped. + +------------------------------------------------------------------------------- ++ AF_PACKET fanout mode +------------------------------------------------------------------------------- + +In the AF_PACKET fanout mode, packet reception can be load balanced among +processes. This also works in combination with mmap(2) on packet sockets. + +Currently implemented fanout policies are: + + - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash + - PACKET_FANOUT_LB: schedule to socket by round-robin + - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on + - PACKET_FANOUT_RND: schedule to socket by random selection + - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another + - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping + +Minimal example code by David S. Miller (try things like "./test eth0 hash", +"./test eth0 lb", etc.): + +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +#include <sys/types.h> +#include <sys/wait.h> +#include <sys/socket.h> +#include <sys/ioctl.h> + +#include <unistd.h> + +#include <linux/if_ether.h> +#include <linux/if_packet.h> + +#include <net/if.h> + +static const char *device_name; +static int fanout_type; +static int fanout_id; + +#ifndef PACKET_FANOUT +# define PACKET_FANOUT 18 +# define PACKET_FANOUT_HASH 0 +# define PACKET_FANOUT_LB 1 +#endif + +static int setup_socket(void) +{ + int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); + struct sockaddr_ll ll; + struct ifreq ifr; + int fanout_arg; + + if (fd < 0) { + perror("socket"); + return EXIT_FAILURE; + } + + memset(&ifr, 0, sizeof(ifr)); + strcpy(ifr.ifr_name, device_name); + err = ioctl(fd, SIOCGIFINDEX, &ifr); + if (err < 0) { + perror("SIOCGIFINDEX"); + return EXIT_FAILURE; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = AF_PACKET; + ll.sll_ifindex = ifr.ifr_ifindex; + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + return EXIT_FAILURE; + } + + fanout_arg = (fanout_id | (fanout_type << 16)); + err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, + &fanout_arg, sizeof(fanout_arg)); + if (err) { + perror("setsockopt"); + return EXIT_FAILURE; + } + + return fd; +} + +static void fanout_thread(void) +{ + int fd = setup_socket(); + int limit = 10000; + + if (fd < 0) + exit(fd); + + while (limit-- > 0) { + char buf[1600]; + int err; + + err = read(fd, buf, sizeof(buf)); + if (err < 0) { + perror("read"); + exit(EXIT_FAILURE); + } + if ((limit % 10) == 0) + fprintf(stdout, "(%d) \n", getpid()); + } + + fprintf(stdout, "%d: Received 10000 packets\n", getpid()); + + close(fd); + exit(0); +} + +int main(int argc, char **argp) +{ + int fd, err; + int i; + + if (argc != 3) { + fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); + return EXIT_FAILURE; + } + + if (!strcmp(argp[2], "hash")) + fanout_type = PACKET_FANOUT_HASH; + else if (!strcmp(argp[2], "lb")) + fanout_type = PACKET_FANOUT_LB; + else { + fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); + exit(EXIT_FAILURE); + } + + device_name = argp[1]; + fanout_id = getpid() & 0xffff; + + for (i = 0; i < 4; i++) { + pid_t pid = fork(); + + switch (pid) { + case 0: + fanout_thread(); + + case -1: + perror("fork"); + exit(EXIT_FAILURE); + } + } + + for (i = 0; i < 4; i++) { + int status; + + wait(&status); + } + + return 0; +} + +------------------------------------------------------------------------------- ++ AF_PACKET TPACKET_V3 example +------------------------------------------------------------------------------- + +AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame +sizes by doing it's own memory management. It is based on blocks where polling +works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. + +It is said that TPACKET_V3 brings the following benefits: + *) ~15 - 20% reduction in CPU-usage + *) ~20% increase in packet capture rate + *) ~2x increase in packet density + *) Port aggregation analysis + *) Non static frame size to capture entire packet payload + +So it seems to be a good candidate to be used with packet fanout. + +Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile +it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): + +/* Written from scratch, but kernel-to-user space API usage + * dissected from lolpcap: + * Copyright 2011, Chetan Loke <loke.chetan@gmail.com> + * License: GPL, version 2.0 + */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <string.h> +#include <assert.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <netdb.h> +#include <poll.h> +#include <unistd.h> +#include <signal.h> +#include <inttypes.h> +#include <sys/socket.h> +#include <sys/mman.h> +#include <linux/if_packet.h> +#include <linux/if_ether.h> +#include <linux/ip.h> + +#ifndef likely +# define likely(x) __builtin_expect(!!(x), 1) +#endif +#ifndef unlikely +# define unlikely(x) __builtin_expect(!!(x), 0) +#endif + +struct block_desc { + uint32_t version; + uint32_t offset_to_priv; + struct tpacket_hdr_v1 h1; +}; + +struct ring { + struct iovec *rd; + uint8_t *map; + struct tpacket_req3 req; +}; + +static unsigned long packets_total = 0, bytes_total = 0; +static sig_atomic_t sigint = 0; + +static void sighandler(int num) +{ + sigint = 1; +} + +static int setup_socket(struct ring *ring, char *netdev) +{ + int err, i, fd, v = TPACKET_V3; + struct sockaddr_ll ll; + unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; + unsigned int blocknum = 64; + + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (fd < 0) { + perror("socket"); + exit(1); + } + + err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + memset(&ring->req, 0, sizeof(ring->req)); + ring->req.tp_block_size = blocksiz; + ring->req.tp_frame_size = framesiz; + ring->req.tp_block_nr = blocknum; + ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; + ring->req.tp_retire_blk_tov = 60; + ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; + + err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, + sizeof(ring->req)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); + if (ring->map == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); + assert(ring->rd); + for (i = 0; i < ring->req.tp_block_nr; ++i) { + ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); + ring->rd[i].iov_len = ring->req.tp_block_size; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = PF_PACKET; + ll.sll_protocol = htons(ETH_P_ALL); + ll.sll_ifindex = if_nametoindex(netdev); + ll.sll_hatype = 0; + ll.sll_pkttype = 0; + ll.sll_halen = 0; + + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + exit(1); + } + + return fd; +} + +static void display(struct tpacket3_hdr *ppd) +{ + struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); + struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); + + if (eth->h_proto == htons(ETH_P_IP)) { + struct sockaddr_in ss, sd; + char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; + + memset(&ss, 0, sizeof(ss)); + ss.sin_family = PF_INET; + ss.sin_addr.s_addr = ip->saddr; + getnameinfo((struct sockaddr *) &ss, sizeof(ss), + sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); + + memset(&sd, 0, sizeof(sd)); + sd.sin_family = PF_INET; + sd.sin_addr.s_addr = ip->daddr; + getnameinfo((struct sockaddr *) &sd, sizeof(sd), + dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); + + printf("%s -> %s, ", sbuff, dbuff); + } + + printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); +} + +static void walk_block(struct block_desc *pbd, const int block_num) +{ + int num_pkts = pbd->h1.num_pkts, i; + unsigned long bytes = 0; + struct tpacket3_hdr *ppd; + + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + + pbd->h1.offset_to_first_pkt); + for (i = 0; i < num_pkts; ++i) { + bytes += ppd->tp_snaplen; + display(ppd); + + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + + ppd->tp_next_offset); + } + + packets_total += num_pkts; + bytes_total += bytes; +} + +static void flush_block(struct block_desc *pbd) +{ + pbd->h1.block_status = TP_STATUS_KERNEL; +} + +static void teardown_socket(struct ring *ring, int fd) +{ + munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); + free(ring->rd); + close(fd); +} + +int main(int argc, char **argp) +{ + int fd, err; + socklen_t len; + struct ring ring; + struct pollfd pfd; + unsigned int block_num = 0, blocks = 64; + struct block_desc *pbd; + struct tpacket_stats_v3 stats; + + if (argc != 2) { + fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); + return EXIT_FAILURE; + } + + signal(SIGINT, sighandler); + + memset(&ring, 0, sizeof(ring)); + fd = setup_socket(&ring, argp[argc - 1]); + assert(fd > 0); + + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = fd; + pfd.events = POLLIN | POLLERR; + pfd.revents = 0; + + while (likely(!sigint)) { + pbd = (struct block_desc *) ring.rd[block_num].iov_base; + + if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { + poll(&pfd, 1, -1); + continue; + } + + walk_block(pbd, block_num); + flush_block(pbd); + block_num = (block_num + 1) % blocks; + } + + len = sizeof(stats); + err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); + if (err < 0) { + perror("getsockopt"); + exit(1); + } + + fflush(stdout); + printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", + stats.tp_packets, bytes_total, stats.tp_drops, + stats.tp_freeze_q_cnt); + + teardown_socket(&ring, fd); + return 0; +} + +------------------------------------------------------------------------------- ++ PACKET_QDISC_BYPASS +------------------------------------------------------------------------------- + +If there is a requirement to load the network with many packets in a similar +fashion as pktgen does, you might set the following option after socket +creation: + + int one = 1; + setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); + +This has the side-effect, that packets sent through PF_PACKET will bypass the +kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, +packet are not buffered, tc disciplines are ignored, increased loss can occur +and such packets are also not visible to other PF_PACKET sockets anymore. So, +you have been warned; generally, this can be useful for stress testing various +components of a system. + +On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled +on PF_PACKET sockets. + +------------------------------------------------------------------------------- ++ PACKET_TIMESTAMP +------------------------------------------------------------------------------- + +The PACKET_TIMESTAMP setting determines the source of the timestamp in +the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your +NIC is capable of timestamping packets in hardware, you can request those +hardware timestamps to be used. Note: you may need to enable the generation +of hardware timestamps with SIOCSHWTSTAMP (see related information from +Documentation/networking/timestamping.txt). + +PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING: + + int req = SOF_TIMESTAMPING_RAW_HARDWARE; + setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) + +For the mmap(2)ed ring buffers, such timestamps are stored in the +tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine +what kind of timestamp has been reported, the tp_status field is binary |'ed +with the following possible bits ... + + TP_STATUS_TS_RAW_HARDWARE + TP_STATUS_TS_SOFTWARE + +... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the +RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a +software fallback was invoked *within* PF_PACKET's processing code (less +precise). + +Getting timestamps for the TX_RING works as follows: i) fill the ring frames, +ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant +frames to be updated resp. the frame handed over to the application, iv) walk +through the frames to pick up the individual hw/sw timestamps. + +Only (!) if transmit timestamping is enabled, then these bits are combined +with binary | with TP_STATUS_AVAILABLE, so you must check for that in your +application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) +in a first step to see if the frame belongs to the application, and then +one can extract the type of timestamp in a second step from tp_status)! + +If you don't care about them, thus having it disabled, checking for +TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the +TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec +members do not contain a valid value. For TX_RINGs, by default no timestamp +is generated! + +See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt +for more information on hardware timestamps. + +------------------------------------------------------------------------------- ++ Miscellaneous bits +------------------------------------------------------------------------------- + +- Packet sockets work well together with Linux socket filters, thus you also + might want to have a look at Documentation/networking/filter.txt + +-------------------------------------------------------------------------------- ++ THANKS +-------------------------------------------------------------------------------- + + Jesse Brandeburg, for fixing my grammathical/spelling errors + diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.txt new file mode 100644 index 000000000..81003581f --- /dev/null +++ b/Documentation/networking/phonet.txt @@ -0,0 +1,214 @@ +Linux Phonet protocol family +============================ + +Introduction +------------ + +Phonet is a packet protocol used by Nokia cellular modems for both IPC +and RPC. With the Linux Phonet socket family, Linux host processes can +receive and send messages from/to the modem, or any other external +device attached to the modem. The modem takes care of routing. + +Phonet packets can be exchanged through various hardware connections +depending on the device, such as: + - USB with the CDC Phonet interface, + - infrared, + - Bluetooth, + - an RS232 serial port (with a dedicated "FBUS" line discipline), + - the SSI bus with some TI OMAP processors. + + +Packets format +-------------- + +Phonet packets have a common header as follows: + + struct phonethdr { + uint8_t pn_media; /* Media type (link-layer identifier) */ + uint8_t pn_rdev; /* Receiver device ID */ + uint8_t pn_sdev; /* Sender device ID */ + uint8_t pn_res; /* Resource ID or function */ + uint16_t pn_length; /* Big-endian message byte length (minus 6) */ + uint8_t pn_robj; /* Receiver object ID */ + uint8_t pn_sobj; /* Sender object ID */ + }; + +On Linux, the link-layer header includes the pn_media byte (see below). +The next 7 bytes are part of the network-layer header. + +The device ID is split: the 6 higher-order bits constitute the device +address, while the 2 lower-order bits are used for multiplexing, as are +the 8-bit object identifiers. As such, Phonet can be considered as a +network layer with 6 bits of address space and 10 bits for transport +protocol (much like port numbers in IP world). + +The modem always has address number zero. All other device have a their +own 6-bit address. + + +Link layer +---------- + +Phonet links are always point-to-point links. The link layer header +consists of a single Phonet media type byte. It uniquely identifies the +link through which the packet is transmitted, from the modem's +perspective. Each Phonet network device shall prepend and set the media +type byte as appropriate. For convenience, a common phonet_header_ops +link-layer header operations structure is provided. It sets the +media type according to the network device hardware address. + +Linux Phonet network interfaces support a dedicated link layer packets +type (ETH_P_PHONET) which is out of the Ethernet type range. They can +only send and receive Phonet packets. + +The virtual TUN tunnel device driver can also be used for Phonet. This +requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case, +there is no link-layer header, so there is no Phonet media type byte. + +Note that Phonet interfaces are not allowed to re-order packets, so +only the (default) Linux FIFO qdisc should be used with them. + + +Network layer +------------- + +The Phonet socket address family maps the Phonet packet header: + + struct sockaddr_pn { + sa_family_t spn_family; /* AF_PHONET */ + uint8_t spn_obj; /* Object ID */ + uint8_t spn_dev; /* Device ID */ + uint8_t spn_resource; /* Resource or function */ + uint8_t spn_zero[...]; /* Padding */ + }; + +The resource field is only used when sending and receiving; +It is ignored by bind() and getsockname(). + + +Low-level datagram protocol +--------------------------- + +Applications can send Phonet messages using the Phonet datagram socket +protocol from the PF_PHONET family. Each socket is bound to one of the +2^10 object IDs available, and can send and receive packets with any +other peer. + + struct sockaddr_pn addr = { .spn_family = AF_PHONET, }; + ssize_t len; + socklen_t addrlen = sizeof(addr); + int fd; + + fd = socket(PF_PHONET, SOCK_DGRAM, 0); + bind(fd, (struct sockaddr *)&addr, sizeof(addr)); + /* ... */ + + sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr)); + len = recvfrom(fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addrlen); + +This protocol follows the SOCK_DGRAM connection-less semantics. +However, connect() and getpeername() are not supported, as they did +not seem useful with Phonet usages (could be added easily). + + +Resource subscription +--------------------- + +A Phonet datagram socket can be subscribed to any number of 8-bits +Phonet resources, as follow: + + uint32_t res = 0xXX; + ioctl(fd, SIOCPNADDRESOURCE, &res); + +Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O +control request, or when the socket is closed. + +Note that no more than one socket can be subcribed to any given +resource at a time. If not, ioctl() will return EBUSY. + + +Phonet Pipe protocol +-------------------- + +The Phonet Pipe protocol is a simple sequenced packets protocol +with end-to-end congestion control. It uses the passive listening +socket paradigm. The listening socket is bound to an unique free object +ID. Each listening socket can handle up to 255 simultaneous +connections, one per accept()'d socket. + + int lfd, cfd; + + lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + listen (lfd, INT_MAX); + + /* ... */ + cfd = accept(lfd, NULL, NULL); + for (;;) + { + char buf[...]; + ssize_t len = read(cfd, buf, sizeof(buf)); + + /* ... */ + + write(cfd, msg, msglen); + } + +Connections are traditionally established between two endpoints by a +"third party" application. This means that both endpoints are passive. + + +As of Linux kernel version 2.6.39, it is also possible to connect +two endpoints directly, using connect() on the active side. This is +intended to support the newer Nokia Wireless Modem API, as found in +e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform: + + struct sockaddr_spn spn; + int fd; + + fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + memset(&spn, 0, sizeof(spn)); + spn.spn_family = AF_PHONET; + spn.spn_obj = ...; + spn.spn_dev = ...; + spn.spn_resource = 0xD9; + connect(fd, (struct sockaddr *)&spn, sizeof(spn)); + /* normal I/O here ... */ + close(fd); + + +WARNING: +When polling a connected pipe socket for writability, there is an +intrinsic race condition whereby writability might be lost between the +polling and the writing system calls. In this case, the socket will +block until write becomes possible again, unless non-blocking mode +is enabled. + + +The pipe protocol provides two socket options at the SOL_PNPIPE level: + + PNPIPE_ENCAP accepts one integer value (int) of: + + PNPIPE_ENCAP_NONE: The socket operates normally (default). + + PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP + interface. This requires CAP_NET_ADMIN capability. GPRS data + support on Nokia modems can use this. Note that the socket cannot + be reliably poll()'d or read() from while in this mode. + + PNPIPE_IFINDEX is a read-only integer value. It contains the + interface index of the network interface created by PNPIPE_ENCAP, + or zero if encapsulation is off. + + PNPIPE_HANDLE is a read-only integer value. It contains the underlying + identifier ("pipe handle") of the pipe. This is only defined for + socket descriptors that are already connected or being connected. + + +Authors +------- + +Linux Phonet was initially written by Sakari Ailus. +Other contributors include Mikä Liljeberg, Andras Domokos, +Carlos Chinea and Rémi Denis-Courmont. +Copyright (C) 2008 Nokia Corporation. diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt new file mode 100644 index 000000000..bdec0f700 --- /dev/null +++ b/Documentation/networking/phy.txt @@ -0,0 +1,427 @@ + +------- +PHY Abstraction Layer +(Updated 2008-04-08) + +Purpose + + Most network devices consist of set of registers which provide an interface + to a MAC layer, which communicates with the physical connection through a + PHY. The PHY concerns itself with negotiating link parameters with the link + partner on the other side of the network connection (typically, an ethernet + cable), and provides a register interface to allow drivers to determine what + settings were chosen, and to configure what settings are allowed. + + While these devices are distinct from the network devices, and conform to a + standard layout for the registers, it has been common practice to integrate + the PHY management code with the network driver. This has resulted in large + amounts of redundant code. Also, on embedded systems with multiple (and + sometimes quite different) ethernet controllers connected to the same + management bus, it is difficult to ensure safe use of the bus. + + Since the PHYs are devices, and the management busses through which they are + accessed are, in fact, busses, the PHY Abstraction Layer treats them as such. + In doing so, it has these goals: + + 1) Increase code-reuse + 2) Increase overall code-maintainability + 3) Speed development time for new network drivers, and for new systems + + Basically, this layer is meant to provide an interface to PHY devices which + allows network driver writers to write as little code as possible, while + still providing a full feature set. + +The MDIO bus + + Most network devices are connected to a PHY by means of a management bus. + Different devices use different busses (though some share common interfaces). + In order to take advantage of the PAL, each bus interface needs to be + registered as a distinct device. + + 1) read and write functions must be implemented. Their prototypes are: + + int write(struct mii_bus *bus, int mii_id, int regnum, u16 value); + int read(struct mii_bus *bus, int mii_id, int regnum); + + mii_id is the address on the bus for the PHY, and regnum is the register + number. These functions are guaranteed not to be called from interrupt + time, so it is safe for them to block, waiting for an interrupt to signal + the operation is complete + + 2) A reset function is optional. This is used to return the bus to an + initialized state. + + 3) A probe function is needed. This function should set up anything the bus + driver needs, setup the mii_bus structure, and register with the PAL using + mdiobus_register. Similarly, there's a remove function to undo all of + that (use mdiobus_unregister). + + 4) Like any driver, the device_driver structure must be configured, and init + exit functions are used to register the driver. + + 5) The bus must also be declared somewhere as a device, and registered. + + As an example for how one driver implemented an mdio bus driver, see + drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file + for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/") + +(RG)MII/electrical interface considerations + + The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin + electrical signal interface using a synchronous 125Mhz clock signal and several + data lines. Due to this design decision, a 1.5ns to 2ns delay must be added + between the clock line (RXC or TXC) and the data lines to let the PHY (clock + sink) have enough setup and hold times to sample the data lines correctly. The + PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let + the PHY driver and optionally the MAC driver, implement the required delay. The + values of phy_interface_t must be understood from the perspective of the PHY + device itself, leading to the following: + + * PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any + internal delay by itself, it assumes that either the Ethernet MAC (if capable + or the PCB traces) insert the correct 1.5-2ns delay + + * PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay + for the transmit data lines (TXD[3:0]) processed by the PHY device + + * PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay + for the receive data lines (RXD[3:0]) processed by the PHY device + + * PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for + both transmit AND receive data lines from/to the PHY device + + Whenever possible, use the PHY side RGMII delay for these reasons: + + * PHY devices may offer sub-nanosecond granularity in how they allow a + receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such + precision may be required to account for differences in PCB trace lengths + + * PHY devices are typically qualified for a large range of applications + (industrial, medical, automotive...), and they provide a constant and + reliable delay across temperature/pressure/voltage ranges + + * PHY device drivers in PHYLIB being reusable by nature, being able to + configure correctly a specified delay enables more designs with similar delay + requirements to be operate correctly + + For cases where the PHY is not capable of providing this delay, but the + Ethernet MAC driver is capable of doing so, the correct phy_interface_t value + should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be + configured correctly in order to provide the required transmit and/or receive + side delay from the perspective of the PHY device. Conversely, if the Ethernet + MAC driver looks at the phy_interface_t value, for any other mode but + PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are + disabled. + + In case neither the Ethernet MAC, nor the PHY are capable of providing the + required delays, as defined per the RGMII standard, several options may be + available: + + * Some SoCs may offer a pin pad/mux/controller capable of configuring a given + set of pins'strength, delays, and voltage; and it may be a suitable + option to insert the expected 2ns RGMII delay. + + * Modifying the PCB design to include a fixed delay (e.g: using a specifically + designed serpentine), which may not require software configuration at all. + +Common problems with RGMII delay mismatch + + When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this + will most likely result in the clock and data line signals to be unstable when + the PHY or MAC take a snapshot of these signals to translate them into logical + 1 or 0 states and reconstruct the data being transmitted/received. Typical + symptoms include: + + * Transmission/reception partially works, and there is frequent or occasional + packet loss observed + + * Ethernet MAC may report some or all packets ingressing with a FCS/CRC error, + or just discard them all + + * Switching to lower speeds such as 10/100Mbits/sec makes the problem go away + (since there is enough setup/hold time in that case) + + +Connecting to a PHY + + Sometime during startup, the network driver needs to establish a connection + between the PHY device, and the network device. At this time, the PHY's bus + and drivers need to all have been loaded, so it is ready for the connection. + At this point, there are several ways to connect to the PHY: + + 1) The PAL handles everything, and only calls the network driver when + the link state changes, so it can react. + + 2) The PAL handles everything except interrupts (usually because the + controller has the interrupt registers). + + 3) The PAL handles everything, but checks in with the driver every second, + allowing the network driver to react first to any changes before the PAL + does. + + 4) The PAL serves only as a library of functions, with the network device + manually calling functions to update status, and configure the PHY + + +Letting the PHY Abstraction Layer do Everything + + If you choose option 1 (The hope is that every driver can, but to still be + useful to drivers that can't), connecting to the PHY is simple: + + First, you need a function to react to changes in the link state. This + function follows this protocol: + + static void adjust_link(struct net_device *dev); + + Next, you need to know the device name of the PHY connected to this device. + The name will look something like, "0:00", where the first number is the + bus id, and the second is the PHY's address on that bus. Typically, + the bus is responsible for making its ID unique. + + Now, to connect, just call this function: + + phydev = phy_connect(dev, phy_name, &adjust_link, interface); + + phydev is a pointer to the phy_device structure which represents the PHY. If + phy_connect is successful, it will return the pointer. dev, here, is the + pointer to your net_device. Once done, this function will have started the + PHY's software state machine, and registered for the PHY's interrupt, if it + has one. The phydev structure will be populated with information about the + current state, though the PHY will not yet be truly operational at this + point. + + PHY-specific flags should be set in phydev->dev_flags prior to the call + to phy_connect() such that the underlying PHY driver can check for flags + and perform specific operations based on them. + This is useful if the system has put hardware restrictions on + the PHY/controller, of which the PHY needs to be aware. + + interface is a u32 which specifies the connection type used + between the controller and the PHY. Examples are GMII, MII, + RGMII, and SGMII. For a full list, see include/linux/phy.h + + Now just make sure that phydev->supported and phydev->advertising have any + values pruned from them which don't make sense for your controller (a 10/100 + controller may be connected to a gigabit capable PHY, so you would need to + mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions + for these bitfields. Note that you should not SET any bits, except the + SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get + put into an unsupported state. + + Lastly, once the controller is ready to handle network traffic, you call + phy_start(phydev). This tells the PAL that you are ready, and configures the + PHY to connect to the network. If you want to handle your own interrupts, + just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start. + Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL. + + When you want to disconnect from the network (even if just briefly), you call + phy_stop(phydev). + +Pause frames / flow control + + The PHY does not participate directly in flow control/pause frames except by + making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in + MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC + controller supports such a thing. Since flow control/pause frames generation + involves the Ethernet MAC driver, it is recommended that this driver takes care + of properly indicating advertisement and support for such features by setting + the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done + either before or after phy_connect() and/or as a result of implementing the + ethtool::set_pauseparam feature. + + +Keeping Close Tabs on the PAL + + It is possible that the PAL's built-in state machine needs a little help to + keep your network device and the PHY properly in sync. If so, you can + register a helper function when connecting to the PHY, which will be called + every second before the state machine reacts to any changes. To do this, you + need to manually call phy_attach() and phy_prepare_link(), and then call + phy_start_machine() with the second argument set to point to your special + handler. + + Currently there are no examples of how to use this functionality, and testing + on it has been limited because the author does not have any drivers which use + it (they all use option 1). So Caveat Emptor. + +Doing it all yourself + + There's a remote chance that the PAL's built-in state machine cannot track + the complex interactions between the PHY and your network device. If this is + so, you can simply call phy_attach(), and not call phy_start_machine or + phy_prepare_link(). This will mean that phydev->state is entirely yours to + handle (phy_start and phy_stop toggle between some of the states, so you + might need to avoid them). + + An effort has been made to make sure that useful functionality can be + accessed without the state-machine running, and most of these functions are + descended from functions which did not interact with a complex state-machine. + However, again, no effort has been made so far to test running without the + state machine, so tryer beware. + + Here is a brief rundown of the functions: + + int phy_read(struct phy_device *phydev, u16 regnum); + int phy_write(struct phy_device *phydev, u16 regnum, u16 val); + + Simple read/write primitives. They invoke the bus's read/write function + pointers. + + void phy_print_status(struct phy_device *phydev); + + A convenience function to print out the PHY status neatly. + + int phy_start_interrupts(struct phy_device *phydev); + int phy_stop_interrupts(struct phy_device *phydev); + + Requests the IRQ for the PHY interrupts, then enables them for + start, or disables then frees them for stop. + + struct phy_device * phy_attach(struct net_device *dev, const char *phy_id, + phy_interface_t interface); + + Attaches a network device to a particular PHY, binding the PHY to a generic + driver if none was found during bus initialization. + + int phy_start_aneg(struct phy_device *phydev); + + Using variables inside the phydev structure, either configures advertising + and resets autonegotiation, or disables autonegotiation, and configures + forced settings. + + static inline int phy_read_status(struct phy_device *phydev); + + Fills the phydev structure with up-to-date information about the current + settings in the PHY. + + int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); + + Ethtool convenience functions. + + int phy_mii_ioctl(struct phy_device *phydev, + struct mii_ioctl_data *mii_data, int cmd); + + The MII ioctl. Note that this function will completely screw up the state + machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to + use this only to write registers which are not standard, and don't set off + a renegotiation. + + +PHY Device Drivers + + With the PHY Abstraction Layer, adding support for new PHYs is + quite easy. In some cases, no work is required at all! However, + many PHYs require a little hand-holding to get up-and-running. + +Generic PHY driver + + If the desired PHY doesn't have any errata, quirks, or special + features you want to support, then it may be best to not add + support, and let the PHY Abstraction Layer's Generic PHY Driver + do all of the work. + +Writing a PHY driver + + If you do need to write a PHY driver, the first thing to do is + make sure it can be matched with an appropriate PHY device. + This is done during bus initialization by reading the device's + UID (stored in registers 2 and 3), then comparing it to each + driver's phy_id field by ANDing it with each driver's + phy_id_mask field. Also, it needs a name. Here's an example: + + static struct phy_driver dm9161_driver = { + .phy_id = 0x0181b880, + .name = "Davicom DM9161E", + .phy_id_mask = 0x0ffffff0, + ... + } + + Next, you need to specify what features (speed, duplex, autoneg, + etc) your PHY device and driver support. Most PHYs support + PHY_BASIC_FEATURES, but you can look in include/mii.h for other + features. + + Each driver consists of a number of function pointers, documented + in include/linux/phy.h under the phy_driver structure. + + Of these, only config_aneg and read_status are required to be + assigned by the driver code. The rest are optional. Also, it is + preferred to use the generic phy driver's versions of these two + functions if at all possible: genphy_read_status and + genphy_config_aneg. If this is not possible, it is likely that + you only need to perform some actions before and after invoking + these functions, and so your functions will wrap the generic + ones. + + Feel free to look at the Marvell, Cicada, and Davicom drivers in + drivers/net/phy/ for examples (the lxt and qsemi drivers have + not been tested as of this writing). + + The PHY's MMD register accesses are handled by the PAL framework + by default, but can be overridden by a specific PHY driver if + required. This could be the case if a PHY was released for + manufacturing before the MMD PHY register definitions were + standardized by the IEEE. Most modern PHYs will be able to use + the generic PAL framework for accessing the PHY's MMD registers. + An example of such usage is for Energy Efficient Ethernet support, + implemented in the PAL. This support uses the PAL to access MMD + registers for EEE query and configuration if the PHY supports + the IEEE standard access mechanisms, or can use the PHY's specific + access interfaces if overridden by the specific PHY driver. See + the Micrel driver in drivers/net/phy/ for an example of how this + can be implemented. + +Board Fixups + + Sometimes the specific interaction between the platform and the PHY requires + special handling. For instance, to change where the PHY's clock input is, + or to add a delay to account for latency issues in the data path. In order + to support such contingencies, the PHY Layer allows platform code to register + fixups to be run when the PHY is brought up (or subsequently reset). + + When the PHY Layer brings up a PHY it checks to see if there are any fixups + registered for it, matching based on UID (contained in the PHY device's phy_id + field) and the bus identifier (contained in phydev->dev.bus_id). Both must + match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as + wildcards for the bus ID and UID, respectively. + + When a match is found, the PHY layer will invoke the run function associated + with the fixup. This function is passed a pointer to the phy_device of + interest. It should therefore only operate on that PHY. + + The platform code can either register the fixup using phy_register_fixup(): + + int phy_register_fixup(const char *phy_id, + u32 phy_uid, u32 phy_uid_mask, + int (*run)(struct phy_device *)); + + Or using one of the two stubs, phy_register_fixup_for_uid() and + phy_register_fixup_for_id(): + + int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask, + int (*run)(struct phy_device *)); + int phy_register_fixup_for_id(const char *phy_id, + int (*run)(struct phy_device *)); + + The stubs set one of the two matching criteria, and set the other one to + match anything. + + When phy_register_fixup() or *_for_uid()/*_for_id() is called at module, + unregister fixup and free allocate memory are required. + + Call one of following function before unloading module. + + int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask); + int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask); + int phy_register_fixup_for_id(const char *phy_id); + +Standards + + IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two: + http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf + + RGMII v1.3: + http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf + + RGMII v2.0: + http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt new file mode 100644 index 000000000..d2fd78f85 --- /dev/null +++ b/Documentation/networking/pktgen.txt @@ -0,0 +1,400 @@ + + + HOWTO for the linux packet generator + ------------------------------------ + +Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel +or as a module. A module is preferred; modprobe pktgen if needed. Once +running, pktgen creates a thread for each CPU with affinity to that CPU. +Monitoring and controlling is done via /proc. It is easiest to select a +suitable sample script and configure that. + +On a dual CPU: + +ps aux | grep pkt +root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0] +root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1] + + +For monitoring and control pktgen creates: + /proc/net/pktgen/pgctrl + /proc/net/pktgen/kpktgend_X + /proc/net/pktgen/ethX + + +Tuning NIC for max performance +============================== + +The default NIC settings are (likely) not tuned for pktgen's artificial +overload type of benchmarking, as this could hurt the normal use-case. + +Specifically increasing the TX ring buffer in the NIC: + # ethtool -G ethX tx 1024 + +A larger TX ring can improve pktgen's performance, while it can hurt +in the general case, 1) because the TX ring buffer might get larger +than the CPU's L1/L2 cache, 2) because it allows more queueing in the +NIC HW layer (which is bad for bufferbloat). + +One should hesitate to conclude that packets/descriptors in the HW +TX ring cause delay. Drivers usually delay cleaning up the +ring-buffers for various performance reasons, and packets stalling +the TX ring might just be waiting for cleanup. + +This cleanup issue is specifically the case for the driver ixgbe +(Intel 82599 chip). This driver (ixgbe) combines TX+RX ring cleanups, +and the cleanup interval is affected by the ethtool --coalesce setting +of parameter "rx-usecs". + +For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6): + # ethtool -C ethX rx-usecs 30 + + +Kernel threads +============== +Pktgen creates a thread for each CPU with affinity to that CPU. +Which is controlled through procfile /proc/net/pktgen/kpktgend_X. + +Example: /proc/net/pktgen/kpktgend_0 + + Running: + Stopped: eth4@0 + Result: OK: add_device=eth4@0 + +Most important are the devices assigned to the thread. + +The two basic thread commands are: + * add_device DEVICE@NAME -- adds a single device + * rem_device_all -- remove all associated devices + +When adding a device to a thread, a corresponding procfile is created +which is used for configuring this device. Thus, device names need to +be unique. + +To support adding the same device to multiple threads, which is useful +with multi queue NICs, the device naming scheme is extended with "@": + device@something + +The part after "@" can be anything, but it is custom to use the thread +number. + +Viewing devices +=============== + +The Params section holds configured information. The Current section +holds running statistics. The Result is printed after a run or after +interruption. Example: + +/proc/net/pktgen/eth4@0 + + Params: count 100000 min_pkt_size: 60 max_pkt_size: 60 + frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0 + flows: 0 flowlen: 0 + queue_map_min: 0 queue_map_max: 0 + dst_min: 192.168.81.2 dst_max: + src_min: src_max: + src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8 + udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9 + src_mac_count: 0 dst_mac_count: 0 + Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU + Current: + pkts-sofar: 100000 errors: 0 + started: 623913381008us stopped: 623913396439us idle: 25us + seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0 + cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2 + cur_udp_dst: 9 cur_udp_src: 42 + cur_queue_map: 0 + flows: 0 + Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags) + 6480562pps 3110Mb/sec (3110669760bps) errors: 0 + + +Configuring devices +=================== +This is done via the /proc interface, and most easily done via pgset +as defined in the sample scripts. +You need to specify PGDEV environment variable to use functions from sample +scripts, i.e.: +export PGDEV=/proc/net/pktgen/eth4@0 +source samples/pktgen/functions.sh + +Examples: + + pg_ctrl start starts injection. + pg_ctrl stop aborts injection. Also, ^C aborts generator. + + pgset "clone_skb 1" sets the number of copies of the same packet + pgset "clone_skb 0" use single SKB for all transmits + pgset "burst 8" uses xmit_more API to queue 8 copies of the same + packet and update HW tx queue tail pointer once. + "burst 1" is the default + pgset "pkt_size 9014" sets packet size to 9014 + pgset "frags 5" packet will consist of 5 fragments + pgset "count 200000" sets number of packets to send, set to zero + for continuous sends until explicitly stopped. + + pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds + + pgset "dst 10.0.0.1" sets IP destination address + (BEWARE! This generator is very aggressive!) + + pgset "dst_min 10.0.0.1" Same as dst + pgset "dst_max 10.0.0.254" Set the maximum destination IP. + pgset "src_min 10.0.0.1" Set the minimum (or only) source IP. + pgset "src_max 10.0.0.254" Set the maximum source IP. + pgset "dst6 fec0::1" IPV6 destination address + pgset "src6 fec0::2" IPV6 source address + pgset "dstmac 00:00:00:00:00:00" sets MAC destination address + pgset "srcmac 00:00:00:00:00:00" sets MAC source address + + pgset "queue_map_min 0" Sets the min value of tx queue interval + pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices + To select queue 1 of a given device, + use queue_map_min=1 and queue_map_max=1 + + pgset "src_mac_count 1" Sets the number of MACs we'll range through. + The 'minimum' MAC is what you set with srcmac. + + pgset "dst_mac_count 1" Sets the number of MACs we'll range through. + The 'minimum' MAC is what you set with dstmac. + + pgset "flag [name]" Set a flag to determine behaviour. Current flags + are: IPSRC_RND # IP source is random (between min/max) + IPDST_RND # IP destination is random + UDPSRC_RND, UDPDST_RND, + MACSRC_RND, MACDST_RND + TXSIZE_RND, IPV6, + MPLS_RND, VID_RND, SVID_RND + FLOW_SEQ, + QUEUE_MAP_RND # queue map random + QUEUE_MAP_CPU # queue map mirrors smp_processor_id() + UDPCSUM, + IPSEC # IPsec encapsulation (needs CONFIG_XFRM) + NODE_ALLOC # node specific memory allocation + NO_TIMESTAMP # disable timestamping + pgset 'flag ![name]' Clear a flag to determine behaviour. + Note that you might need to use single quote in + interactive mode, so that your shell wouldn't expand + the specified flag as a history command. + + pgset "spi [SPI_VALUE]" Set specific SA used to transform packet. + + pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then + cycle through the port range. + + pgset "udp_src_max 9" set UDP source port max. + pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then + cycle through the port range. + pgset "udp_dst_max 9" set UDP destination port max. + + pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example + outer label=16,middle label=32, + inner label=0 (IPv4 NULL)) Note that + there must be no spaces between the + arguments. Leading zeros are required. + Do not set the bottom of stack bit, + that's done automatically. If you do + set the bottom of stack bit, that + indicates that you want to randomly + generate that address and the flag + MPLS_RND will be turned on. You + can have any mix of random and fixed + labels in the label stack. + + pgset "mpls 0" turn off mpls (or any invalid argument works too!) + + pgset "vlan_id 77" set VLAN ID 0-4095 + pgset "vlan_p 3" set priority bit 0-7 (default 0) + pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "svlan_id 22" set SVLAN ID 0-4095 + pgset "svlan_p 3" set priority bit 0-7 (default 0) + pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "vlan_id 9999" > 4095 remove vlan and svlan tags + pgset "svlan 9999" > 4095 remove svlan tag + + + pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) + pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) + + pgset "rate 300M" set rate to 300 Mb/s + pgset "ratep 1000000" set rate to 1Mpps + + pgset "xmit_mode netif_receive" RX inject into stack netif_receive_skb() + Works with "burst" but not with "clone_skb". + Default xmit_mode is "start_xmit". + +Sample scripts +============== + +A collection of tutorial scripts and helpers for pktgen is in the +samples/pktgen directory. The helper parameters.sh file support easy +and consistent parameter parsing across the sample scripts. + +Usage example and help: + ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2 + +Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX + -i : ($DEV) output interface/device (required) + -s : ($PKT_SIZE) packet size + -d : ($DEST_IP) destination IP + -m : ($DST_MAC) destination MAC-addr + -t : ($THREADS) threads to start + -c : ($SKB_CLONE) SKB clones send before alloc new SKB + -b : ($BURST) HW level bursting of SKBs + -v : ($VERBOSE) verbose + -x : ($DEBUG) debug + +The global variables being set are also listed. E.g. the required +interface/device parameter "-i" sets variable $DEV. Copy the +pktgen_sampleXX scripts and modify them to fit your own needs. + +The old scripts: + +pktgen.conf-1-2 # 1 CPU 2 dev +pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS +pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6 +pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS +pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows. + + +Interrupt affinity +=================== +Note that when adding devices to a specific CPU it is a good idea to +also assign /proc/irq/XX/smp_affinity so that the TX interrupts are bound +to the same CPU. This reduces cache bouncing when freeing skbs. + +Plus using the device flag QUEUE_MAP_CPU, which maps the SKBs TX queue +to the running threads CPU (directly from smp_processor_id()). + +Enable IPsec +============ +Default IPsec transformation with ESP encapsulation plus transport mode +can be enabled by simply setting: + +pgset "flag IPSEC" +pgset "flows 1" + +To avoid breaking existing testbed scripts for using AH type and tunnel mode, +you can use "pgset spi SPI_VALUE" to specify which transformation mode +to employ. + + +Current commands and configuration options +========================================== + +** Pgcontrol commands: + +start +stop +reset + +** Thread commands: + +add_device +rem_device_all + + +** Device commands: + +count +clone_skb +burst +debug + +frags +delay + +src_mac_count +dst_mac_count + +pkt_size +min_pkt_size +max_pkt_size + +queue_map_min +queue_map_max +skb_priority + +tos (ipv4) +traffic_class (ipv6) + +mpls + +udp_src_min +udp_src_max + +udp_dst_min +udp_dst_max + +node + +flag + IPSRC_RND + IPDST_RND + UDPSRC_RND + UDPDST_RND + MACSRC_RND + MACDST_RND + TXSIZE_RND + IPV6 + MPLS_RND + VID_RND + SVID_RND + FLOW_SEQ + QUEUE_MAP_RND + QUEUE_MAP_CPU + UDPCSUM + IPSEC + NODE_ALLOC + NO_TIMESTAMP + +spi (ipsec) + +dst_min +dst_max + +src_min +src_max + +dst_mac +src_mac + +clear_counters + +src6 +dst6 +dst6_max +dst6_min + +flows +flowlen + +rate +ratep + +xmit_mode <start_xmit|netif_receive> + +vlan_cfi +vlan_id +vlan_p + +svlan_cfi +svlan_id +svlan_p + + +References: +ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/ +ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ + +Paper from Linux-Kongress in Erlangen 2004. +ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf + +Thanks to: +Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek +Stephen Hemminger, Andi Kleen, Dave Miller and many others. + + +Good luck with the linux net-development. diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt new file mode 100644 index 000000000..61daf4b39 --- /dev/null +++ b/Documentation/networking/ppp_generic.txt @@ -0,0 +1,426 @@ + PPP Generic Driver and Channel Interface + ---------------------------------------- + + Paul Mackerras + paulus@samba.org + 7 Feb 2002 + +The generic PPP driver in linux-2.4 provides an implementation of the +functionality which is of use in any PPP implementation, including: + +* the network interface unit (ppp0 etc.) +* the interface to the networking code +* PPP multilink: splitting datagrams between multiple links, and + ordering and combining received fragments +* the interface to pppd, via a /dev/ppp character device +* packet compression and decompression +* TCP/IP header compression and decompression +* detecting network traffic for demand dialling and for idle timeouts +* simple packet filtering + +For sending and receiving PPP frames, the generic PPP driver calls on +the services of PPP `channels'. A PPP channel encapsulates a +mechanism for transporting PPP frames from one machine to another. A +PPP channel implementation can be arbitrarily complex internally but +has a very simple interface with the generic PPP code: it merely has +to be able to send PPP frames, receive PPP frames, and optionally +handle ioctl requests. Currently there are PPP channel +implementations for asynchronous serial ports, synchronous serial +ports, and for PPP over ethernet. + +This architecture makes it possible to implement PPP multilink in a +natural and straightforward way, by allowing more than one channel to +be linked to each ppp network interface unit. The generic layer is +responsible for splitting datagrams on transmit and recombining them +on receive. + + +PPP channel API +--------------- + +See include/linux/ppp_channel.h for the declaration of the types and +functions used to communicate between the generic PPP layer and PPP +channels. + +Each channel has to provide two functions to the generic PPP layer, +via the ppp_channel.ops pointer: + +* start_xmit() is called by the generic layer when it has a frame to + send. The channel has the option of rejecting the frame for + flow-control reasons. In this case, start_xmit() should return 0 + and the channel should call the ppp_output_wakeup() function at a + later time when it can accept frames again, and the generic layer + will then attempt to retransmit the rejected frame(s). If the frame + is accepted, the start_xmit() function should return 1. + +* ioctl() provides an interface which can be used by a user-space + program to control aspects of the channel's behaviour. This + procedure will be called when a user-space program does an ioctl + system call on an instance of /dev/ppp which is bound to the + channel. (Usually it would only be pppd which would do this.) + +The generic PPP layer provides seven functions to channels: + +* ppp_register_channel() is called when a channel has been created, to + notify the PPP generic layer of its presence. For example, setting + a serial port to the PPPDISC line discipline causes the ppp_async + channel code to call this function. + +* ppp_unregister_channel() is called when a channel is to be + destroyed. For example, the ppp_async channel code calls this when + a hangup is detected on the serial port. + +* ppp_output_wakeup() is called by a channel when it has previously + rejected a call to its start_xmit function, and can now accept more + packets. + +* ppp_input() is called by a channel when it has received a complete + PPP frame. + +* ppp_input_error() is called by a channel when it has detected that a + frame has been lost or dropped (for example, because of a FCS (frame + check sequence) error). + +* ppp_channel_index() returns the channel index assigned by the PPP + generic layer to this channel. The channel should provide some way + (e.g. an ioctl) to transmit this back to user-space, as user-space + will need it to attach an instance of /dev/ppp to this channel. + +* ppp_unit_number() returns the unit number of the ppp network + interface to which this channel is connected, or -1 if the channel + is not connected. + +Connecting a channel to the ppp generic layer is initiated from the +channel code, rather than from the generic layer. The channel is +expected to have some way for a user-level process to control it +independently of the ppp generic layer. For example, with the +ppp_async channel, this is provided by the file descriptor to the +serial port. + +Generally a user-level process will initialize the underlying +communications medium and prepare it to do PPP. For example, with an +async tty, this can involve setting the tty speed and modes, issuing +modem commands, and then going through some sort of dialog with the +remote system to invoke PPP service there. We refer to this process +as `discovery'. Then the user-level process tells the medium to +become a PPP channel and register itself with the generic PPP layer. +The channel then has to report the channel number assigned to it back +to the user-level process. From that point, the PPP negotiation code +in the PPP daemon (pppd) can take over and perform the PPP +negotiation, accessing the channel through the /dev/ppp interface. + +At the interface to the PPP generic layer, PPP frames are stored in +skbuff structures and start with the two-byte PPP protocol number. +The frame does *not* include the 0xff `address' byte or the 0x03 +`control' byte that are optionally used in async PPP. Nor is there +any escaping of control characters, nor are there any FCS or framing +characters included. That is all the responsibility of the channel +code, if it is needed for the particular medium. That is, the skbuffs +presented to the start_xmit() function contain only the 2-byte +protocol number and the data, and the skbuffs presented to ppp_input() +must be in the same format. + +The channel must provide an instance of a ppp_channel struct to +represent the channel. The channel is free to use the `private' field +however it wishes. The channel should initialize the `mtu' and +`hdrlen' fields before calling ppp_register_channel() and not change +them until after ppp_unregister_channel() returns. The `mtu' field +represents the maximum size of the data part of the PPP frames, that +is, it does not include the 2-byte protocol number. + +If the channel needs some headroom in the skbuffs presented to it for +transmission (i.e., some space free in the skbuff data area before the +start of the PPP frame), it should set the `hdrlen' field of the +ppp_channel struct to the amount of headroom required. The generic +PPP layer will attempt to provide that much headroom but the channel +should still check if there is sufficient headroom and copy the skbuff +if there isn't. + +On the input side, channels should ideally provide at least 2 bytes of +headroom in the skbuffs presented to ppp_input(). The generic PPP +code does not require this but will be more efficient if this is done. + + +Buffering and flow control +-------------------------- + +The generic PPP layer has been designed to minimize the amount of data +that it buffers in the transmit direction. It maintains a queue of +transmit packets for the PPP unit (network interface device) plus a +queue of transmit packets for each attached channel. Normally the +transmit queue for the unit will contain at most one packet; the +exceptions are when pppd sends packets by writing to /dev/ppp, and +when the core networking code calls the generic layer's start_xmit() +function with the queue stopped, i.e. when the generic layer has +called netif_stop_queue(), which only happens on a transmit timeout. +The start_xmit function always accepts and queues the packet which it +is asked to transmit. + +Transmit packets are dequeued from the PPP unit transmit queue and +then subjected to TCP/IP header compression and packet compression +(Deflate or BSD-Compress compression), as appropriate. After this +point the packets can no longer be reordered, as the decompression +algorithms rely on receiving compressed packets in the same order that +they were generated. + +If multilink is not in use, this packet is then passed to the attached +channel's start_xmit() function. If the channel refuses to take +the packet, the generic layer saves it for later transmission. The +generic layer will call the channel's start_xmit() function again +when the channel calls ppp_output_wakeup() or when the core +networking code calls the generic layer's start_xmit() function +again. The generic layer contains no timeout and retransmission +logic; it relies on the core networking code for that. + +If multilink is in use, the generic layer divides the packet into one +or more fragments and puts a multilink header on each fragment. It +decides how many fragments to use based on the length of the packet +and the number of channels which are potentially able to accept a +fragment at the moment. A channel is potentially able to accept a +fragment if it doesn't have any fragments currently queued up for it +to transmit. The channel may still refuse a fragment; in this case +the fragment is queued up for the channel to transmit later. This +scheme has the effect that more fragments are given to higher- +bandwidth channels. It also means that under light load, the generic +layer will tend to fragment large packets across all the channels, +thus reducing latency, while under heavy load, packets will tend to be +transmitted as single fragments, thus reducing the overhead of +fragmentation. + + +SMP safety +---------- + +The PPP generic layer has been designed to be SMP-safe. Locks are +used around accesses to the internal data structures where necessary +to ensure their integrity. As part of this, the generic layer +requires that the channels adhere to certain requirements and in turn +provides certain guarantees to the channels. Essentially the channels +are required to provide the appropriate locking on the ppp_channel +structures that form the basis of the communication between the +channel and the generic layer. This is because the channel provides +the storage for the ppp_channel structure, and so the channel is +required to provide the guarantee that this storage exists and is +valid at the appropriate times. + +The generic layer requires these guarantees from the channel: + +* The ppp_channel object must exist from the time that + ppp_register_channel() is called until after the call to + ppp_unregister_channel() returns. + +* No thread may be in a call to any of ppp_input(), ppp_input_error(), + ppp_output_wakeup(), ppp_channel_index() or ppp_unit_number() for a + channel at the time that ppp_unregister_channel() is called for that + channel. + +* ppp_register_channel() and ppp_unregister_channel() must be called + from process context, not interrupt or softirq/BH context. + +* The remaining generic layer functions may be called at softirq/BH + level but must not be called from a hardware interrupt handler. + +* The generic layer may call the channel start_xmit() function at + softirq/BH level but will not call it at interrupt level. Thus the + start_xmit() function may not block. + +* The generic layer will only call the channel ioctl() function in + process context. + +The generic layer provides these guarantees to the channels: + +* The generic layer will not call the start_xmit() function for a + channel while any thread is already executing in that function for + that channel. + +* The generic layer will not call the ioctl() function for a channel + while any thread is already executing in that function for that + channel. + +* By the time a call to ppp_unregister_channel() returns, no thread + will be executing in a call from the generic layer to that channel's + start_xmit() or ioctl() function, and the generic layer will not + call either of those functions subsequently. + + +Interface to pppd +----------------- + +The PPP generic layer exports a character device interface called +/dev/ppp. This is used by pppd to control PPP interface units and +channels. Although there is only one /dev/ppp, each open instance of +/dev/ppp acts independently and can be attached either to a PPP unit +or a PPP channel. This is achieved using the file->private_data field +to point to a separate object for each open instance of /dev/ppp. In +this way an effect similar to Solaris' clone open is obtained, +allowing us to control an arbitrary number of PPP interfaces and +channels without having to fill up /dev with hundreds of device names. + +When /dev/ppp is opened, a new instance is created which is initially +unattached. Using an ioctl call, it can then be attached to an +existing unit, attached to a newly-created unit, or attached to an +existing channel. An instance attached to a unit can be used to send +and receive PPP control frames, using the read() and write() system +calls, along with poll() if necessary. Similarly, an instance +attached to a channel can be used to send and receive PPP frames on +that channel. + +In multilink terms, the unit represents the bundle, while the channels +represent the individual physical links. Thus, a PPP frame sent by a +write to the unit (i.e., to an instance of /dev/ppp attached to the +unit) will be subject to bundle-level compression and to fragmentation +across the individual links (if multilink is in use). In contrast, a +PPP frame sent by a write to the channel will be sent as-is on that +channel, without any multilink header. + +A channel is not initially attached to any unit. In this state it can +be used for PPP negotiation but not for the transfer of data packets. +It can then be connected to a PPP unit with an ioctl call, which +makes it available to send and receive data packets for that unit. + +The ioctl calls which are available on an instance of /dev/ppp depend +on whether it is unattached, attached to a PPP interface, or attached +to a PPP channel. The ioctl calls which are available on an +unattached instance are: + +* PPPIOCNEWUNIT creates a new PPP interface and makes this /dev/ppp + instance the "owner" of the interface. The argument should point to + an int which is the desired unit number if >= 0, or -1 to assign the + lowest unused unit number. Being the owner of the interface means + that the interface will be shut down if this instance of /dev/ppp is + closed. + +* PPPIOCATTACH attaches this instance to an existing PPP interface. + The argument should point to an int containing the unit number. + This does not make this instance the owner of the PPP interface. + +* PPPIOCATTCHAN attaches this instance to an existing PPP channel. + The argument should point to an int containing the channel number. + +The ioctl calls available on an instance of /dev/ppp attached to a +channel are: + +* PPPIOCCONNECT connects this channel to a PPP interface. The + argument should point to an int containing the interface unit + number. It will return an EINVAL error if the channel is already + connected to an interface, or ENXIO if the requested interface does + not exist. + +* PPPIOCDISCONN disconnects this channel from the PPP interface that + it is connected to. It will return an EINVAL error if the channel + is not connected to an interface. + +* All other ioctl commands are passed to the channel ioctl() function. + +The ioctl calls that are available on an instance that is attached to +an interface unit are: + +* PPPIOCSMRU sets the MRU (maximum receive unit) for the interface. + The argument should point to an int containing the new MRU value. + +* PPPIOCSFLAGS sets flags which control the operation of the + interface. The argument should be a pointer to an int containing + the new flags value. The bits in the flags value that can be set + are: + SC_COMP_TCP enable transmit TCP header compression + SC_NO_TCP_CCID disable connection-id compression for + TCP header compression + SC_REJ_COMP_TCP disable receive TCP header decompression + SC_CCP_OPEN Compression Control Protocol (CCP) is + open, so inspect CCP packets + SC_CCP_UP CCP is up, may (de)compress packets + SC_LOOP_TRAFFIC send IP traffic to pppd + SC_MULTILINK enable PPP multilink fragmentation on + transmitted packets + SC_MP_SHORTSEQ expect short multilink sequence + numbers on received multilink fragments + SC_MP_XSHORTSEQ transmit short multilink sequence nos. + + The values of these flags are defined in <linux/ppp-ioctl.h>. Note + that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and + SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option + is not selected. + +* PPPIOCGFLAGS returns the value of the status/control flags for the + interface unit. The argument should point to an int where the ioctl + will store the flags value. As well as the values listed above for + PPPIOCSFLAGS, the following bits may be set in the returned value: + SC_COMP_RUN CCP compressor is running + SC_DECOMP_RUN CCP decompressor is running + SC_DC_ERROR CCP decompressor detected non-fatal error + SC_DC_FERROR CCP decompressor detected fatal error + +* PPPIOCSCOMPRESS sets the parameters for packet compression or + decompression. The argument should point to a ppp_option_data + structure (defined in <linux/ppp-ioctl.h>), which contains a + pointer/length pair which should describe a block of memory + containing a CCP option specifying a compression method and its + parameters. The ppp_option_data struct also contains a `transmit' + field. If this is 0, the ioctl will affect the receive path, + otherwise the transmit path. + +* PPPIOCGUNIT returns, in the int pointed to by the argument, the unit + number of this interface unit. + +* PPPIOCSDEBUG sets the debug flags for the interface to the value in + the int pointed to by the argument. Only the least significant bit + is used; if this is 1 the generic layer will print some debug + messages during its operation. This is only intended for debugging + the generic PPP layer code; it is generally not helpful for working + out why a PPP connection is failing. + +* PPPIOCGDEBUG returns the debug flags for the interface in the int + pointed to by the argument. + +* PPPIOCGIDLE returns the time, in seconds, since the last data + packets were sent and received. The argument should point to a + ppp_idle structure (defined in <linux/ppp_defs.h>). If the + CONFIG_PPP_FILTER option is enabled, the set of packets which reset + the transmit and receive idle timers is restricted to those which + pass the `active' packet filter. + +* PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the + number of connection slots) for the TCP header compressor and + decompressor. The lower 16 bits of the int pointed to by the + argument specify the maximum connection-ID for the compressor. If + the upper 16 bits of that int are non-zero, they specify the maximum + connection-ID for the decompressor, otherwise the decompressor's + maximum connection-ID is set to 15. + +* PPPIOCSNPMODE sets the network-protocol mode for a given network + protocol. The argument should point to an npioctl struct (defined + in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol + number for the protocol to be affected, and the `mode' field + specifies what to do with packets for that protocol: + + NPMODE_PASS normal operation, transmit and receive packets + NPMODE_DROP silently drop packets for this protocol + NPMODE_ERROR drop packets and return an error on transmit + NPMODE_QUEUE queue up packets for transmit, drop received + packets + + At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as + NPMODE_DROP. + +* PPPIOCGNPMODE returns the network-protocol mode for a given + protocol. The argument should point to an npioctl struct with the + `protocol' field set to the PPP protocol number for the protocol of + interest. On return the `mode' field will be set to the network- + protocol mode for that protocol. + +* PPPIOCSPASS and PPPIOCSACTIVE set the `pass' and `active' packet + filters. These ioctls are only available if the CONFIG_PPP_FILTER + option is selected. The argument should point to a sock_fprog + structure (defined in <linux/filter.h>) containing the compiled BPF + instructions for the filter. Packets are dropped if they fail the + `pass' filter; otherwise, if they fail the `active' filter they are + passed but they do not reset the transmit or receive idle timer. + +* PPPIOCSMRRU enables or disables multilink processing for received + packets and sets the multilink MRRU (maximum reconstructed receive + unit). The argument should point to an int containing the new MRRU + value. If the MRRU value is 0, processing of received multilink + fragments is disabled. This ioctl is only available if the + CONFIG_PPP_MULTILINK option is selected. + +Last modified: 7-feb-2002 diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.txt new file mode 100644 index 000000000..4a79209e7 --- /dev/null +++ b/Documentation/networking/proc_net_tcp.txt @@ -0,0 +1,48 @@ +This document describes the interfaces /proc/net/tcp and /proc/net/tcp6. +Note that these interfaces are deprecated in favor of tcp_diag. + +These /proc interfaces provide information about currently active TCP +connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c +and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively. + +It will first list all listening TCP sockets, and next list all established +TCP connections. A typical entry of /proc/net/tcp would look like this (split +up into 3 parts because of the length of the line): + + 46: 010310AC:9C4C 030310AC:1770 01 + | | | | | |--> connection state + | | | | |------> remote TCP port number + | | | |-------------> remote IPv4 address + | | |--------------------> local TCP port number + | |---------------------------> local IPv4 address + |----------------------------------> number of entry + + 00000150:00000000 01:00000019 00000000 + | | | | |--> number of unrecovered RTO timeouts + | | | |----------> number of jiffies until timer expires + | | |----------------> timer_active (see below) + | |----------------------> receive-queue + |-------------------------------> transmit-queue + + 1000 0 54165785 4 cd1e6040 25 4 27 3 -1 + | | | | | | | | | |--> slow start size threshold, + | | | | | | | | | or -1 if the threshold + | | | | | | | | | is >= 0xFFFF + | | | | | | | | |----> sending congestion window + | | | | | | | |-------> (ack.quick<<1)|ack.pingpong + | | | | | | |---------> Predicted tick of soft clock + | | | | | | (delayed ACK control data) + | | | | | |------------> retransmit timeout + | | | | |------------------> location of socket in memory + | | | |-----------------------> socket reference count + | | |-----------------------------> inode + | |----------------------------------> unanswered 0-window probes + |---------------------------------------------> uid + +timer_active: + 0 no timer is pending + 1 retransmit-timer is pending + 2 another timer (e.g. delayed ack or keepalive) is pending + 3 this is a socket in TIME_WAIT state. Not all fields will contain + data (or even exist) + 4 zero window probe timer is pending diff --git a/Documentation/networking/radiotap-headers.txt b/Documentation/networking/radiotap-headers.txt new file mode 100644 index 000000000..953331c79 --- /dev/null +++ b/Documentation/networking/radiotap-headers.txt @@ -0,0 +1,152 @@ +How to use radiotap headers +=========================== + +Pointer to the radiotap include file +------------------------------------ + +Radiotap headers are variable-length and extensible, you can get most of the +information you need to know on them from: + +./include/net/ieee80211_radiotap.h + +This document gives an overview and warns on some corner cases. + + +Structure of the header +----------------------- + +There is a fixed portion at the start which contains a u32 bitmap that defines +if the possible argument associated with that bit is present or not. So if b0 +of the it_present member of ieee80211_radiotap_header is set, it means that +the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the +argument area. + + < 8-byte ieee80211_radiotap_header > + [ <possible argument bitmap extensions ... > ] + [ <argument> ... ] + +At the moment there are only 13 possible argument indexes defined, but in case +we run out of space in the u32 it_present member, it is defined that b31 set +indicates that there is another u32 bitmap following (shown as "possible +argument bitmap extensions..." above), and the start of the arguments is moved +forward 4 bytes each time. + +Note also that the it_len member __le16 is set to the total number of bytes +covered by the ieee80211_radiotap_header and any arguments following. + + +Requirements for arguments +-------------------------- + +After the fixed part of the header, the arguments follow for each argument +index whose matching bit is set in the it_present member of +ieee80211_radiotap_header. + + - the arguments are all stored little-endian! + + - the argument payload for a given argument index has a fixed size. So + IEEE80211_RADIOTAP_TSFT being present always indicates an 8-byte argument is + present. See the comments in ./include/net/ieee80211_radiotap.h for a nice + breakdown of all the argument sizes + + - the arguments must be aligned to a boundary of the argument size using + padding. So a u16 argument must start on the next u16 boundary if it isn't + already on one, a u32 must start on the next u32 boundary and so on. + + - "alignment" is relative to the start of the ieee80211_radiotap_header, ie, + the first byte of the radiotap header. The absolute alignment of that first + byte isn't defined. So even if the whole radiotap header is starting at, eg, + address 0x00000003, still the first byte of the radiotap header is treated as + 0 for alignment purposes. + + - the above point that there may be no absolute alignment for multibyte + entities in the fixed radiotap header or the argument region means that you + have to take special evasive action when trying to access these multibyte + entities. Some arches like Blackfin cannot deal with an attempt to + dereference, eg, a u16 pointer that is pointing to an odd address. Instead + you have to use a kernel API get_unaligned() to dereference the pointer, + which will do it bytewise on the arches that require that. + + - The arguments for a given argument index can be a compound of multiple types + together. For example IEEE80211_RADIOTAP_CHANNEL has an argument payload + consisting of two u16s of total length 4. When this happens, the padding + rule is applied dealing with a u16, NOT dealing with a 4-byte single entity. + + +Example valid radiotap header +----------------------------- + + 0x00, 0x00, // <-- radiotap version + pad byte + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate (in 500kHz units) + 0x0c, //<-- tx power + 0x01 //<-- antenna + + +Using the Radiotap Parser +------------------------- + +If you are having to parse a radiotap struct, you can radically simplify the +job by using the radiotap parser that lives in net/wireless/radiotap.c and has +its prototypes available in include/net/cfg80211.h. You use it like this: + +#include <net/cfg80211.h> + +/* buf points to the start of the radiotap header part */ + +int MyFunction(u8 * buf, int buflen) +{ + int pkt_rate_100kHz = 0, antenna = 0, pwr = 0; + struct ieee80211_radiotap_iterator iterator; + int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen); + + while (!ret) { + + ret = ieee80211_radiotap_iterator_next(&iterator); + + if (ret) + continue; + + /* see if this argument is something we can use */ + + switch (iterator.this_arg_index) { + /* + * You must take care when dereferencing iterator.this_arg + * for multibyte types... the pointer is not aligned. Use + * get_unaligned((type *)iterator.this_arg) to dereference + * iterator.this_arg for type "type" safely on all arches. + */ + case IEEE80211_RADIOTAP_RATE: + /* radiotap "rate" u8 is in + * 500kbps units, eg, 0x02=1Mbps + */ + pkt_rate_100kHz = (*iterator.this_arg) * 5; + break; + + case IEEE80211_RADIOTAP_ANTENNA: + /* radiotap uses 0 for 1st ant */ + antenna = *iterator.this_arg); + break; + + case IEEE80211_RADIOTAP_DBM_TX_POWER: + pwr = *iterator.this_arg; + break; + + default: + break; + } + } /* while more rt headers */ + + if (ret != -ENOENT) + return TXRX_DROP; + + /* discard the radiotap header part */ + buf += iterator.max_length; + buflen -= iterator.max_length; + + ... + +} + +Andy Green <andy@warmcat.com> diff --git a/Documentation/networking/ray_cs.txt b/Documentation/networking/ray_cs.txt new file mode 100644 index 000000000..c0c12307e --- /dev/null +++ b/Documentation/networking/ray_cs.txt @@ -0,0 +1,150 @@ +September 21, 1999 + +Copyright (c) 1998 Corey Thomas (corey@world.std.com) + +This file is the documentation for the Raylink Wireless LAN card driver for +Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE +802.11 compatible wireless network connectivity at 1 and 2 megabits/second. +See http://www.raytheon.com/micro/raylink/ for more information on the Raylink +card. This driver is in early development and does have bugs. See the known +bugs and limitations at the end of this document for more information. +This driver also works with WebGear's Aviator 2.4 and Aviator Pro +wireless LAN cards. + +As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel +source. My web page for the development of ray_cs is at +http://web.ralinktech.com/ralink/Home/Support/Linux.html +and I can be emailed at corey@world.std.com + +The kernel driver is based on ray_cs-1.62.tgz + +The driver at my web page is intended to be used as an add on to +David Hinds pcmcia package. All the command line parameters are +available when compiled as a module. When built into the kernel, only +the essid= string parameter is available via the kernel command line. +This will change after the method of sorting out parameters for all +the PCMCIA drivers is agreed upon. If you must have a built in driver +with nondefault parameters, they can be edited in +/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param +will find them all. + +Information on card services is available at: + http://pcmcia-cs.sourceforge.net/ + + +Card services user programs are still required for PCMCIA devices. +pcmcia-cs-3.1.1 or greater is required for the kernel version of +the driver. + +Currently, ray_cs is not part of David Hinds card services package, +so the following magic is required. + +At the end of the /etc/pcmcia/config.opts file, add the line: +source ./ray_cs.opts +This will make card services read the ray_cs.opts file +when starting. Create the file /etc/pcmcia/ray_cs.opts containing the +following: + +#### start of /etc/pcmcia/ray_cs.opts ################### +# Configuration options for Raylink Wireless LAN PCMCIA card +device "ray_cs" + class "network" module "misc/ray_cs" + +card "RayLink PC Card WLAN Adapter" + manfid 0x01a6, 0x0000 + bind "ray_cs" + +module "misc/ray_cs" opts "" +#### end of /etc/pcmcia/ray_cs.opts ##################### + + +To join an existing network with +different parameters, contact the network administrator for the +configuration information, and edit /etc/pcmcia/ray_cs.opts. +Add the parameters below between the empty quotes. + +Parameters for ray_cs driver which may be specified in ray_cs.opts: + +bc integer 0 = normal mode (802.11 timing) + 1 = slow down inter frame timing to allow + operation with older breezecom access + points. + +beacon_period integer beacon period in Kilo-microseconds + legal values = must be integer multiple + of hop dwell + default = 256 + +country integer 1 = USA (default) + 2 = Europe + 3 = Japan + 4 = Korea + 5 = Spain + 6 = France + 7 = Israel + 8 = Australia + +essid string ESS ID - network name to join + string with maximum length of 32 chars + default value = "ADHOC_ESSID" + +hop_dwell integer hop dwell time in Kilo-microseconds + legal values = 16,32,64,128(default),256 + +irq_mask integer linux standard 16 bit value 1bit/IRQ + lsb is IRQ 0, bit 1 is IRQ 1 etc. + Used to restrict choice of IRQ's to use. + Recommended method for controlling + interrupts is in /etc/pcmcia/config.opts + +net_type integer 0 (default) = adhoc network, + 1 = infrastructure + +phy_addr string string containing new MAC address in + hex, must start with x eg + x00008f123456 + +psm integer 0 = continuously active + 1 = power save mode (not useful yet) + +pc_debug integer (0-5) larger values for more verbose + logging. Replaces ray_debug. + +ray_debug integer Replaced with pc_debug + +ray_mem_speed integer defaults to 500 + +sniffer integer 0 = not sniffer (default) + 1 = sniffer which can be used to record all + network traffic using tcpdump or similar, + but no normal network use is allowed. + +translate integer 0 = no translation (encapsulate frames) + 1 = translation (RFC1042/802.1) + + +More on sniffer mode: + +tcpdump does not understand 802.11 headers, so it can't +interpret the contents, but it can record to a file. This is only +useful for debugging 802.11 lowlevel protocols that are not visible to +linux. If you want to watch ftp xfers, or do similar things, you +don't need to use sniffer mode. Also, some packet types are never +sent up by the card, so you will never see them (ack, rts, cts, probe +etc.) There is a simple program (showcap) included in the ray_cs +package which parses the 802.11 headers. + +Known Problems and missing features + + Does not work with non x86 + + Does not work with SMP + + Support for defragmenting frames is not yet debugged, and in + fact is known to not work. I have never encountered a net set + up to fragment, but still, it should be fixed. + + The ioctl support is incomplete. The hardware address cannot be set + using ifconfig yet. If a different hardware address is needed, it may + be set using the phy_addr parameter in ray_cs.opts. This requires + a card insertion to take effect. diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt new file mode 100644 index 000000000..0235ae69a --- /dev/null +++ b/Documentation/networking/rds.txt @@ -0,0 +1,423 @@ + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not Infiniband-specific; it was designed to support different +transports. The current implementation used to support RDS over TCP as well +as IB. + +The high-level semantics of RDS from the application's point of view are + + * Addressing + RDS uses IPv4 addresses and 16bit port numbers to identify + the end point of a connection. All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + transport has to be IP-based. In fact, RDS over IB uses a + reliable IB connection; the IP address is used exclusively to + locate the remote node's GID (by ARPing for the given IP). + + The port space is entirely independent of UDP, TCP or any other + protocol. + + * Socket interface + RDS sockets work *mostly* as you would expect from a BSD + socket. The next section will cover the details. At any rate, + all I/O is performed through the standard BSD socket API. + Some additions like zerocopy support are implemented through + control messages, while other extensions use the getsockopt/ + setsockopt calls. + + Sockets must be bound before you can send or receive data. + This is needed because binding also selects a transport and + attaches it to the socket. Once bound, the transport assignment + does not change. RDS will tolerate IPs moving around (eg in + a active-active HA scenario), but only as long as the address + doesn't move to a different transport. + + * sysctls + RDS supports a number of sysctls in /proc/sys/net/rds + + +Socket Interface +================ + + AF_RDS, PF_RDS, SOL_RDS + AF_RDS and PF_RDS are the domain type to be used with socket(2) + to create RDS sockets. SOL_RDS is the socket-level to be used + with setsockopt(2) and getsockopt(2) for RDS specific socket + options. + + fd = socket(PF_RDS, SOCK_SEQPACKET, 0); + This creates a new, unbound RDS socket. + + setsockopt(SOL_SOCKET): send and receive buffer size + RDS honors the send and receive buffer size socket options. + You are not allowed to queue more than SO_SNDSIZE bytes to + a socket. A message is queued when sendmsg is called, and + it leaves the queue when the remote system acknowledges + its arrival. + + The SO_RCVSIZE option controls the maximum receive queue length. + This is a soft limit rather than a hard limit - RDS will + continue to accept and queue incoming messages, even if that + takes the queue length over the limit. However, it will also + mark the port as "congested" and send a congestion update to + the source node. The source node is supposed to throttle any + processes sending to this congested port. + + bind(fd, &sockaddr_in, ...) + This binds the socket to a local IP address and port, and a + transport, if one has not already been selected via the + SO_RDS_TRANSPORT socket option + + sendmsg(fd, ...) + Sends a message to the indicated recipient. The kernel will + transparently establish the underlying reliable connection + if it isn't up yet. + + An attempt to send a message that exceeds SO_SNDSIZE will + return with -EMSGSIZE + + An attempt to send a message that would take the total number + of queued bytes over the SO_SNDSIZE threshold will return + EAGAIN. + + An attempt to send a message to a destination that is marked + as "congested" will return ENOBUFS. + + recvmsg(fd, ...) + Receives a message that was queued to this socket. The sockets + recv queue accounting is adjusted, and if the queue length + drops below SO_SNDSIZE, the port is marked uncongested, and + a congestion update is sent to all peers. + + Applications can ask the RDS kernel module to receive + notifications via control messages (for instance, there is a + notification when a congestion update arrived, or when a RDMA + operation completes). These notifications are received through + the msg.msg_control buffer of struct msghdr. The format of the + messages is described in manpages. + + poll(fd) + RDS supports the poll interface to allow the application + to implement async I/O. + + POLLIN handling is pretty straightforward. When there's an + incoming message queued to the socket, or a pending notification, + we signal POLLIN. + + POLLOUT is a little harder. Since you can essentially send + to any destination, RDS will always signal POLLOUT as long as + there's room on the send queue (ie the number of bytes queued + is less than the sendbuf size). + + However, the kernel will refuse to accept messages to + a destination marked congested - in this case you will loop + forever if you rely on poll to tell you what to do. + This isn't a trivial problem, but applications can deal with + this - by using congestion notifications, and by checking for + ENOBUFS errors returned by sendmsg. + + setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) + This allows the application to discard all messages queued to a + specific destination on this particular socket. + + This allows the application to cancel outstanding messages if + it detects a timeout. For instance, if it tried to send a message, + and the remote host is unreachable, RDS will keep trying forever. + The application may decide it's not worth it, and cancel the + operation. In this case, it would use RDS_CANCEL_SENT_TO to + nuke any pending messages. + + setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) + getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) + Set or read an integer defining the underlying + encapsulating transport to be used for RDS packets on the + socket. When setting the option, integer argument may be + one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the + value, RDS_TRANS_NONE will be returned on an unbound socket. + This socket option may only be set exactly once on the socket, + prior to binding it via the bind(2) system call. Attempts to + set SO_RDS_TRANSPORT on a socket for which the transport has + been previously attached explicitly (by SO_RDS_TRANSPORT) or + implicitly (via bind(2)) will return an error of EOPNOTSUPP. + An attempt to set SO_RDS_TRANSPPORT to RDS_TRANS_NONE will + always return EINVAL. + +RDMA for RDS +============ + + see rds-rdma(7) manpage (available in rds-tools) + + +Congestion Notifications +======================== + + see rds(7) manpage + + +RDS Protocol +============ + + Message header + + The message header is a 'struct rds_header' (see rds.h): + Fields: + h_sequence: + per-packet sequence number + h_ack: + piggybacked acknowledgment of last packet received + h_len: + length of data, not including header + h_sport: + source port + h_dport: + destination port + h_flags: + CONG_BITMAP - this is a congestion update bitmap + ACK_REQUIRED - receiver must ack this packet + RETRANSMITTED - packet has previously been sent + h_credit: + indicate to other end of connection that + it has more credits available (i.e. there is + more send room) + h_padding[4]: + unused, for future use + h_csum: + header checksum + h_exthdr: + optional data can be passed here. This is currently used for + passing RDMA-related information. + + ACK and retransmit handling + + One might think that with reliable IB connections you wouldn't need + to ack messages that have been received. The problem is that IB + hardware generates an ack message before it has DMAed the message + into memory. This creates a potential message loss if the HCA is + disabled for any reason between when it sends the ack and before + the message is DMAed and processed. This is only a potential issue + if another HCA is available for fail-over. + + Sending an ack immediately would allow the sender to free the sent + message from their send queue quickly, but could cause excessive + traffic to be used for acks. RDS piggybacks acks on sent data + packets. Ack-only packets are reduced by only allowing one to be + in flight at a time, and by the sender only asking for acks when + its send buffers start to fill up. All retransmissions are also + acked. + + Flow Control + + RDS's IB transport uses a credit-based mechanism to verify that + there is space in the peer's receive buffers for more data. This + eliminates the need for hardware retries on the connection. + + Congestion + + Messages waiting in the receive queue on the receiving socket + are accounted against the sockets SO_RCVBUF option value. Only + the payload bytes in the message are accounted for. If the + number of bytes queued equals or exceeds rcvbuf then the socket + is congested. All sends attempted to this socket's address + should return block or return -EWOULDBLOCK. + + Applications are expected to be reasonably tuned such that this + situation very rarely occurs. An application encountering this + "back-pressure" is considered a bug. + + This is implemented by having each node maintain bitmaps which + indicate which ports on bound addresses are congested. As the + bitmap changes it is sent through all the connections which + terminate in the local address of the bitmap which changed. + + The bitmaps are allocated as connections are brought up. This + avoids allocation in the interrupt handling path which queues + sages on sockets. The dense bitmaps let transports send the + entire bitmap on any bitmap change reasonably efficiently. This + is much easier to implement than some finer-grained + communication of per-port congestion. The sender does a very + inexpensive bit test to test if the port it's about to send to + is congested or not. + + +RDS Transport Layer +================== + + As mentioned above, RDS is not IB-specific. Its code is divided + into a general RDS layer and a transport layer. + + The general layer handles the socket API, congestion handling, + loopback, stats, usermem pinning, and the connection state machine. + + The transport layer handles the details of the transport. The IB + transport, for example, handles all the queue pairs, work requests, + CM event handlers, and other Infiniband details. + + +RDS Kernel Structures +===================== + + struct rds_message + aka possibly "rds_outgoing", the generic RDS layer copies data to + be sent and sets header fields as needed, based on the socket API. + This is then queued for the individual connection and sent by the + connection's transport. + struct rds_incoming + a generic struct referring to incoming data that can be handed from + the transport to the general code and queued by the general code + while the socket is awoken. It is then passed back to the transport + code to handle the actual copy-to-user. + struct rds_socket + per-socket information + struct rds_connection + per-connection information + struct rds_transport + pointers to transport-specific functions + struct rds_statistics + non-transport-specific statistics + struct rds_cong_map + wraps the raw congestion bitmap, contains rbnode, waitq, etc. + +Connection management +===================== + + Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and + ERROR states. + + The first time an attempt is made by an RDS socket to send data to + a node, a connection is allocated and connected. That connection is + then maintained forever -- if there are transport errors, the + connection will be dropped and re-established. + + Dropping a connection while packets are queued will cause queued or + partially-sent datagrams to be retransmitted when the connection is + re-established. + + +The send path +============= + + rds_sendmsg() + struct rds_message built from incoming data + CMSGs parsed (e.g. RDMA ops) + transport connection alloced and connected if not already + rds_message placed on send queue + send worker awoken + rds_send_worker() + calls rds_send_xmit() until queue is empty + rds_send_xmit() + transmits congestion map if one is pending + may set ACK_REQUIRED + calls transport to send either non-RDMA or RDMA message + (RDMA ops never retransmitted) + rds_ib_xmit() + allocs work requests from send ring + adds any new send credits available to peer (h_credits) + maps the rds_message's sg list + piggybacks ack + populates work requests + post send to connection's queue pair + +The recv path +============= + + rds_ib_recv_cq_comp_handler() + looks at write completions + unmaps recv buffer from device + no errors, call rds_ib_process_recv() + refill recv ring + rds_ib_process_recv() + validate header checksum + copy header to rds_ib_incoming struct if start of a new datagram + add to ibinc's fraglist + if competed datagram: + update cong map if datagram was cong update + call rds_recv_incoming() otherwise + note if ack is required + rds_recv_incoming() + drop duplicate packets + respond to pings + find the sock associated with this datagram + add to sock queue + wake up sock + do some congestion calculations + rds_recvmsg + copy data into user iovec + handle CMSGs + return to application + +Multipath RDS (mprds) +===================== + Mprds is multipathed-RDS, primarily intended for RDS-over-TCP + (though the concept can be extended to other transports). The classical + implementation of RDS-over-TCP is implemented by demultiplexing multiple + PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, + port]) over a single TCP socket between the 2 IP addresses involved. This + has the limitation that it ends up funneling multiple RDS flows over a + single TCP flow, thus it is + (a) upper-bounded to the single-flow bandwidth, + (b) suffers from head-of-line blocking for all the RDS sockets. + + Better throughput (for a fixed small packet size, MTU) can be achieved + by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed + RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp + connection. RDS sockets will be attached to a path based on some hash + (e.g., of local address and RDS port number) and packets for that RDS + socket will be sent over the attached path using TCP to segment/reassemble + RDS datagrams on that path. + + Multipathed RDS is implemented by splitting the struct rds_connection into + a common (to all paths) part, and a per-path struct rds_conn_path. All + I/O workqs and reconnect threads are driven from the rds_conn_path. + Transports such as TCP that are multipath capable may then set up a + TPC socket per rds_conn_path, and this is managed by the transport via + the transport privatee cp_transport_data pointer. + + Transports announce themselves as multipath capable by setting the + t_mp_capable bit during registration with the rds core module. When the + transport is multipath-capable, rds_sendmsg() hashes outgoing traffic + across multiple paths. The outgoing hash is computed based on the + local address and port that the PF_RDS socket is bound to. + + Additionally, even if the transport is MP capable, we may be + peering with some node that does not support mprds, or supports + a different number of paths. As a result, the peering nodes need + to agree on the number of paths to be used for the connection. + This is done by sending out a control packet exchange before the + first data packet. The control packet exchange must have completed + prior to outgoing hash completion in rds_sendmsg() when the transport + is mutlipath capable. + + The control packet is an RDS ping packet (i.e., packet to rds dest + port 0) with the ping packet having a rds extension header option of + type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the + number of paths supported by the sender. The "probe" ping packet will + get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>) + The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately + be able to compute the min(sender_paths, rcvr_paths). The pong + sent in response to a probe-ping should contain the rcvr's npaths + when the rcvr is mprds-capable. + + If the rcvr is not mprds-capable, the exthdr in the ping will be + ignored. In this case the pong will not have any exthdrs, so the sender + of the probe-ping can default to single-path mprds. + diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt new file mode 100644 index 000000000..381e5b23d --- /dev/null +++ b/Documentation/networking/regulatory.txt @@ -0,0 +1,204 @@ +Linux wireless regulatory documentation +--------------------------------------- + +This document gives a brief review over how the Linux wireless +regulatory infrastructure works. + +More up to date information can be obtained at the project's web page: + +http://wireless.kernel.org/en/developers/Regulatory + +Keeping regulatory domains in userspace +--------------------------------------- + +Due to the dynamic nature of regulatory domains we keep them +in userspace and provide a framework for userspace to upload +to the kernel one regulatory domain to be used as the central +core regulatory domain all wireless devices should adhere to. + +How to get regulatory domains to the kernel +------------------------------------------- + +When the regulatory domain is first set up, the kernel will request a +database file (regulatory.db) containing all the regulatory rules. It +will then use that database when it needs to look up the rules for a +given country. + +How to get regulatory domains to the kernel (old CRDA solution) +--------------------------------------------------------------- + +Userspace gets a regulatory domain in the kernel by having +a userspace agent build it and send it via nl80211. Only +expected regulatory domains will be respected by the kernel. + +A currently available userspace agent which can accomplish this +is CRDA - central regulatory domain agent. Its documented here: + +http://wireless.kernel.org/en/developers/Regulatory/CRDA + +Essentially the kernel will send a udev event when it knows +it needs a new regulatory domain. A udev rule can be put in place +to trigger crda to send the respective regulatory domain for a +specific ISO/IEC 3166 alpha2. + +Below is an example udev rule which can be used: + +# Example file, should be put in /etc/udev/rules.d/regulatory.rules +KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda" + +The alpha2 is passed as an environment variable under the variable COUNTRY. + +Who asks for regulatory domains? +-------------------------------- + +* Users + +Users can use iw: + +http://wireless.kernel.org/en/users/Documentation/iw + +An example: + + # set regulatory domain to "Costa Rica" + iw reg set CR + +This will request the kernel to set the regulatory domain to +the specificied alpha2. The kernel in turn will then ask userspace +to provide a regulatory domain for the alpha2 specified by the user +by sending a uevent. + +* Wireless subsystems for Country Information elements + +The kernel will send a uevent to inform userspace a new +regulatory domain is required. More on this to be added +as its integration is added. + +* Drivers + +If drivers determine they need a specific regulatory domain +set they can inform the wireless core using regulatory_hint(). +They have two options -- they either provide an alpha2 so that +crda can provide back a regulatory domain for that country or +they can build their own regulatory domain based on internal +custom knowledge so the wireless core can respect it. + +*Most* drivers will rely on the first mechanism of providing a +regulatory hint with an alpha2. For these drivers there is an additional +check that can be used to ensure compliance based on custom EEPROM +regulatory data. This additional check can be used by drivers by +registering on its struct wiphy a reg_notifier() callback. This notifier +is called when the core's regulatory domain has been changed. The driver +can use this to review the changes made and also review who made them +(driver, user, country IE) and determine what to allow based on its +internal EEPROM data. Devices drivers wishing to be capable of world +roaming should use this callback. More on world roaming will be +added to this document when its support is enabled. + +Device drivers who provide their own built regulatory domain +do not need a callback as the channels registered by them are +the only ones that will be allowed and therefore *additional* +channels cannot be enabled. + +Example code - drivers hinting an alpha2: +------------------------------------------ + +This example comes from the zd1211rw device driver. You can start +by having a mapping of your device's EEPROM country/regulatory +domain value to a specific alpha2 as follows: + +static struct zd_reg_alpha2_map reg_alpha2_map[] = { + { ZD_REGDOMAIN_FCC, "US" }, + { ZD_REGDOMAIN_IC, "CA" }, + { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */ + { ZD_REGDOMAIN_JAPAN, "JP" }, + { ZD_REGDOMAIN_JAPAN_ADD, "JP" }, + { ZD_REGDOMAIN_SPAIN, "ES" }, + { ZD_REGDOMAIN_FRANCE, "FR" }, + +Then you can define a routine to map your read EEPROM value to an alpha2, +as follows: + +static int zd_reg2alpha2(u8 regdomain, char *alpha2) +{ + unsigned int i; + struct zd_reg_alpha2_map *reg_map; + for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) { + reg_map = ®_alpha2_map[i]; + if (regdomain == reg_map->reg) { + alpha2[0] = reg_map->alpha2[0]; + alpha2[1] = reg_map->alpha2[1]; + return 0; + } + } + return 1; +} + +Lastly, you can then hint to the core of your discovered alpha2, if a match +was found. You need to do this after you have registered your wiphy. You +are expected to do this during initialization. + + r = zd_reg2alpha2(mac->regdomain, alpha2); + if (!r) + regulatory_hint(hw->wiphy, alpha2); + +Example code - drivers providing a built in regulatory domain: +-------------------------------------------------------------- + +[NOTE: This API is not currently available, it can be added when required] + +If you have regulatory information you can obtain from your +driver and you *need* to use this we let you build a regulatory domain +structure and pass it to the wireless core. To do this you should +kmalloc() a structure big enough to hold your regulatory domain +structure and you should then fill it with your data. Finally you simply +call regulatory_hint() with the regulatory domain structure in it. + +Bellow is a simple example, with a regulatory domain cached using the stack. +Your implementation may vary (read EEPROM cache instead, for example). + +Example cache of some regulatory domain + +struct ieee80211_regdomain mydriver_jp_regdom = { + .n_reg_rules = 3, + .alpha2 = "JP", + //.alpha2 = "99", /* If I have no alpha2 to map it to */ + .reg_rules = { + /* IEEE 802.11b/g, channels 1..14 */ + REG_RULE(2412-10, 2484+10, 40, 6, 20, 0), + /* IEEE 802.11a, channels 34..48 */ + REG_RULE(5170-10, 5240+10, 40, 6, 20, + NL80211_RRF_NO_IR), + /* IEEE 802.11a, channels 52..64 */ + REG_RULE(5260-10, 5320+10, 40, 6, 20, + NL80211_RRF_NO_IR| + NL80211_RRF_DFS), + } +}; + +Then in some part of your code after your wiphy has been registered: + + struct ieee80211_regdomain *rd; + int size_of_regd; + int num_rules = mydriver_jp_regdom.n_reg_rules; + unsigned int i; + + size_of_regd = sizeof(struct ieee80211_regdomain) + + (num_rules * sizeof(struct ieee80211_reg_rule)); + + rd = kzalloc(size_of_regd, GFP_KERNEL); + if (!rd) + return -ENOMEM; + + memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain)); + + for (i=0; i < num_rules; i++) + memcpy(&rd->reg_rules[i], + &mydriver_jp_regdom.reg_rules[i], + sizeof(struct ieee80211_reg_rule)); + regulatory_struct_hint(rd); + +Statically compiled regulatory database +--------------------------------------- + +When a database should be fixed into the kernel, it can be provided as a +firmware file at build time that is then linked into the kernel. diff --git a/Documentation/networking/rmnet.txt b/Documentation/networking/rmnet.txt new file mode 100644 index 000000000..6b341eaf2 --- /dev/null +++ b/Documentation/networking/rmnet.txt @@ -0,0 +1,82 @@ +1. Introduction + +rmnet driver is used for supporting the Multiplexing and aggregation +Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm +Technologies, Inc. modems. + +This driver can be used to register onto any physical network device in +IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator. + +Multiplexing allows for creation of logical netdevices (rmnet devices) to +handle multiple private data networks (PDN) like a default internet, tethering, +multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends +packets with MAP headers to rmnet. Based on the multiplexer id, rmnet +routes to the appropriate PDN after removing the MAP header. + +Aggregation is required to achieve high data rates. This involves hardware +sending aggregated bunch of MAP frames. rmnet driver will de-aggregate +these MAP frames and send them to appropriate PDN's. + +2. Packet format + +a. MAP packet (data / control) + +MAP header has the same endianness of the IP packet. + +Packet format - + +Bit 0 1 2-7 8 - 15 16 - 31 +Function Command / Data Reserved Pad Multiplexer ID Payload length +Bit 32 - x +Function Raw Bytes + +Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command +or data packet. Control packet is used for transport level flow control. Data +packets are standard IP packets. + +Reserved bits are usually zeroed out and to be ignored by receiver. + +Padding is number of bytes to be added for 4 byte alignment if required by +hardware. + +Multiplexer ID is to indicate the PDN on which data has to be sent. + +Payload length includes the padding length but does not include MAP header +length. + +b. MAP packet (command specific) + +Bit 0 1 2-7 8 - 15 16 - 31 +Function Command Reserved Pad Multiplexer ID Payload length +Bit 32 - 39 40 - 45 46 - 47 48 - 63 +Function Command name Reserved Command Type Reserved +Bit 64 - 95 +Function Transaction ID +Bit 96 - 127 +Function Command data + +Command 1 indicates disabling flow while 2 is enabling flow + +Command types - +0 for MAP command request +1 is to acknowledge the receipt of a command +2 is for unsupported commands +3 is for error during processing of commands + +c. Aggregation + +Aggregation is multiple MAP packets (can be data or command) delivered to +rmnet in a single linear skb. rmnet will process the individual +packets and either ACK the MAP command or deliver the IP packet to the +network stack as needed + +MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding.... +MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad... + +3. Userspace configuration + +rmnet userspace configuration is done through netlink library librmnetctl +and command line utility rmnetcli. Utility is hosted in codeaurora forum git. +The driver uses rtnl_link_ops for communication. + +https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt new file mode 100644 index 000000000..b5407163d --- /dev/null +++ b/Documentation/networking/rxrpc.txt @@ -0,0 +1,1149 @@ + ====================== + RxRPC NETWORK PROTOCOL + ====================== + +The RxRPC protocol driver provides a reliable two-phase transport on top of UDP +that can be used to perform RxRPC remote operations. This is done over sockets +of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and +receive data, aborts and errors. + +Contents of this document: + + (*) Overview. + + (*) RxRPC protocol summary. + + (*) AF_RXRPC driver model. + + (*) Control messages. + + (*) Socket options. + + (*) Security. + + (*) Example client usage. + + (*) Example server usage. + + (*) AF_RXRPC kernel interface. + + (*) Configurable parameters. + + +======== +OVERVIEW +======== + +RxRPC is a two-layer protocol. There is a session layer which provides +reliable virtual connections using UDP over IPv4 (or IPv6) as the transport +layer, but implements a real network protocol; and there's the presentation +layer which renders structured data to binary blobs and back again using XDR +(as does SunRPC): + + +-------------+ + | Application | + +-------------+ + | XDR | Presentation + +-------------+ + | RxRPC | Session + +-------------+ + | UDP | Transport + +-------------+ + + +AF_RXRPC provides: + + (1) Part of an RxRPC facility for both kernel and userspace applications by + making the session part of it a Linux network protocol (AF_RXRPC). + + (2) A two-phase protocol. The client transmits a blob (the request) and then + receives a blob (the reply), and the server receives the request and then + transmits the reply. + + (3) Retention of the reusable bits of the transport system set up for one call + to speed up subsequent calls. + + (4) A secure protocol, using the Linux kernel's key retention facility to + manage security on the client end. The server end must of necessity be + more active in security negotiations. + +AF_RXRPC does not provide XDR marshalling/presentation facilities. That is +left to the application. AF_RXRPC only deals in blobs. Even the operation ID +is just the first four bytes of the request blob, and as such is beyond the +kernel's interest. + + +Sockets of AF_RXRPC family are: + + (1) created as type SOCK_DGRAM; + + (2) provided with a protocol of the type of underlying transport they're going + to use - currently only PF_INET is supported. + + +The Andrew File System (AFS) is an example of an application that uses this and +that has both kernel (filesystem) and userspace (utility) components. + + +====================== +RXRPC PROTOCOL SUMMARY +====================== + +An overview of the RxRPC protocol: + + (*) RxRPC sits on top of another networking protocol (UDP is the only option + currently), and uses this to provide network transport. UDP ports, for + example, provide transport endpoints. + + (*) RxRPC supports multiple virtual "connections" from any given transport + endpoint, thus allowing the endpoints to be shared, even to the same + remote endpoint. + + (*) Each connection goes to a particular "service". A connection may not go + to multiple services. A service may be considered the RxRPC equivalent of + a port number. AF_RXRPC permits multiple services to share an endpoint. + + (*) Client-originating packets are marked, thus a transport endpoint can be + shared between client and server connections (connections have a + direction). + + (*) Up to a billion connections may be supported concurrently between one + local transport endpoint and one service on one remote endpoint. An RxRPC + connection is described by seven numbers: + + Local address } + Local port } Transport (UDP) address + Remote address } + Remote port } + Direction + Connection ID + Service ID + + (*) Each RxRPC operation is a "call". A connection may make up to four + billion calls, but only up to four calls may be in progress on a + connection at any one time. + + (*) Calls are two-phase and asymmetric: the client sends its request data, + which the service receives; then the service transmits the reply data + which the client receives. + + (*) The data blobs are of indefinite size, the end of a phase is marked with a + flag in the packet. The number of packets of data making up one blob may + not exceed 4 billion, however, as this would cause the sequence number to + wrap. + + (*) The first four bytes of the request data are the service operation ID. + + (*) Security is negotiated on a per-connection basis. The connection is + initiated by the first data packet on it arriving. If security is + requested, the server then issues a "challenge" and then the client + replies with a "response". If the response is successful, the security is + set for the lifetime of that connection, and all subsequent calls made + upon it use that same security. In the event that the server lets a + connection lapse before the client, the security will be renegotiated if + the client uses the connection again. + + (*) Calls use ACK packets to handle reliability. Data packets are also + explicitly sequenced per call. + + (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. + A hard-ACK indicates to the far side that all the data received to a point + has been received and processed; a soft-ACK indicates that the data has + been received but may yet be discarded and re-requested. The sender may + not discard any transmittable packets until they've been hard-ACK'd. + + (*) Reception of a reply data packet implicitly hard-ACK's all the data + packets that make up the request. + + (*) An call is complete when the request has been sent, the reply has been + received and the final hard-ACK on the last packet of the reply has + reached the server. + + (*) An call may be aborted by either end at any time up to its completion. + + +===================== +AF_RXRPC DRIVER MODEL +===================== + +About the AF_RXRPC driver: + + (*) The AF_RXRPC protocol transparently uses internal sockets of the transport + protocol to represent transport endpoints. + + (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC + connections are handled transparently. One client socket may be used to + make multiple simultaneous calls to the same service. One server socket + may handle calls from many clients. + + (*) Additional parallel client connections will be initiated to support extra + concurrent calls, up to a tunable limit. + + (*) Each connection is retained for a certain amount of time [tunable] after + the last call currently using it has completed in case a new call is made + that could reuse it. + + (*) Each internal UDP socket is retained [tunable] for a certain amount of + time [tunable] after the last connection using it discarded, in case a new + connection is made that could use it. + + (*) A client-side connection is only shared between calls if they have have + the same key struct describing their security (and assuming the calls + would otherwise share the connection). Non-secured calls would also be + able to share connections with each other. + + (*) A server-side connection is shared if the client says it is. + + (*) ACK'ing is handled by the protocol driver automatically, including ping + replying. + + (*) SO_KEEPALIVE automatically pings the other side to keep the connection + alive [TODO]. + + (*) If an ICMP error is received, all calls affected by that error will be + aborted with an appropriate network error passed through recvmsg(). + + +Interaction with the user of the RxRPC socket: + + (*) A socket is made into a server socket by binding an address with a + non-zero service ID. + + (*) In the client, sending a request is achieved with one or more sendmsgs, + followed by the reply being received with one or more recvmsgs. + + (*) The first sendmsg for a request to be sent from a client contains a tag to + be used in all other sendmsgs or recvmsgs associated with that call. The + tag is carried in the control data. + + (*) connect() is used to supply a default destination address for a client + socket. This may be overridden by supplying an alternate address to the + first sendmsg() of a call (struct msghdr::msg_name). + + (*) If connect() is called on an unbound client, a random local port will + bound before the operation takes place. + + (*) A server socket may also be used to make client calls. To do this, the + first sendmsg() of the call must specify the target address. The server's + transport endpoint is used to send the packets. + + (*) Once the application has received the last message associated with a call, + the tag is guaranteed not to be seen again, and so it can be used to pin + client resources. A new call can then be initiated with the same tag + without fear of interference. + + (*) In the server, a request is received with one or more recvmsgs, then the + the reply is transmitted with one or more sendmsgs, and then the final ACK + is received with a last recvmsg. + + (*) When sending data for a call, sendmsg is given MSG_MORE if there's more + data to come on that call. + + (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more + data to come for that call. + + (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg + to indicate the terminal message for that call. + + (*) A call may be aborted by adding an abort control message to the control + data. Issuing an abort terminates the kernel's use of that call's tag. + Any messages waiting in the receive queue for that call will be discarded. + + (*) Aborts, busy notifications and challenge packets are delivered by recvmsg, + and control data messages will be set to indicate the context. Receiving + an abort or a busy message terminates the kernel's use of that call's tag. + + (*) The control data part of the msghdr struct is used for a number of things: + + (*) The tag of the intended or affected call. + + (*) Sending or receiving errors, aborts and busy notifications. + + (*) Notifications of incoming calls. + + (*) Sending debug requests and receiving debug replies [TODO]. + + (*) When the kernel has received and set up an incoming call, it sends a + message to server application to let it know there's a new call awaiting + its acceptance [recvmsg reports a special control message]. The server + application then uses sendmsg to assign a tag to the new call. Once that + is done, the first part of the request data will be delivered by recvmsg. + + (*) The server application has to provide the server socket with a keyring of + secret keys corresponding to the security types it permits. When a secure + connection is being set up, the kernel looks up the appropriate secret key + in the keyring and then sends a challenge packet to the client and + receives a response packet. The kernel then checks the authorisation of + the packet and either aborts the connection or sets up the security. + + (*) The name of the key a client will use to secure its communications is + nominated by a socket option. + + +Notes on sendmsg: + + (*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is + making progress at accepting packets within a reasonable time such that we + manage to queue up all the data for transmission. This requires the + client to accept at least one packet per 2*RTT time period. + + If this isn't set, sendmsg() will return immediately, either returning + EINTR/ERESTARTSYS if nothing was consumed or returning the amount of data + consumed. + + +Notes on recvmsg: + + (*) If there's a sequence of data messages belonging to a particular call on + the receive queue, then recvmsg will keep working through them until: + + (a) it meets the end of that call's received data, + + (b) it meets a non-data message, + + (c) it meets a message belonging to a different call, or + + (d) it fills the user buffer. + + If recvmsg is called in blocking mode, it will keep sleeping, awaiting the + reception of further data, until one of the above four conditions is met. + + (2) MSG_PEEK operates similarly, but will return immediately if it has put any + data in the buffer rather than sleeping until it can fill the buffer. + + (3) If a data message is only partially consumed in filling a user buffer, + then the remainder of that message will be left on the front of the queue + for the next taker. MSG_TRUNC will never be flagged. + + (4) If there is more data to be had on a call (it hasn't copied the last byte + of the last data message in that phase yet), then MSG_MORE will be + flagged. + + +================ +CONTROL MESSAGES +================ + +AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex +calls, to invoke certain actions and to report certain conditions. These are: + + MESSAGE ID SRT DATA MEANING + ======================= === =========== =============================== + RXRPC_USER_CALL_ID sr- User ID App's call specifier + RXRPC_ABORT srt Abort code Abort code to issue/received + RXRPC_ACK -rt n/a Final ACK received + RXRPC_NET_ERROR -rt error num Network error on call + RXRPC_BUSY -rt n/a Call rejected (server busy) + RXRPC_LOCAL_ERROR -rt error num Local error encountered + RXRPC_NEW_CALL -r- n/a New call received + RXRPC_ACCEPT s-- n/a Accept new call + RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call + RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded + RXRPC_TX_LENGTH s-- data len Total length of Tx data + + (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) + + (*) RXRPC_USER_CALL_ID + + This is used to indicate the application's call ID. It's an unsigned long + that the app specifies in the client by attaching it to the first data + message or in the server by passing it in association with an RXRPC_ACCEPT + message. recvmsg() passes it in conjunction with all messages except + those of the RXRPC_NEW_CALL message. + + (*) RXRPC_ABORT + + This is can be used by an application to abort a call by passing it to + sendmsg, or it can be delivered by recvmsg to indicate a remote abort was + received. Either way, it must be associated with an RXRPC_USER_CALL_ID to + specify the call affected. If an abort is being sent, then error EBADSLT + will be returned if there is no call with that user ID. + + (*) RXRPC_ACK + + This is delivered to a server application to indicate that the final ACK + of a call was received from the client. It will be associated with an + RXRPC_USER_CALL_ID to indicate the call that's now complete. + + (*) RXRPC_NET_ERROR + + This is delivered to an application to indicate that an ICMP error message + was encountered in the process of trying to talk to the peer. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (*) RXRPC_BUSY + + This is delivered to a client application to indicate that a call was + rejected by the server due to the server being busy. It will be + associated with an RXRPC_USER_CALL_ID to indicate the rejected call. + + (*) RXRPC_LOCAL_ERROR + + This is delivered to an application to indicate that a local error was + encountered and that a call has been aborted because of it. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (*) RXRPC_NEW_CALL + + This is delivered to indicate to a server application that a new call has + arrived and is awaiting acceptance. No user ID is associated with this, + as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT. + + (*) RXRPC_ACCEPT + + This is used by a server application to attempt to accept a call and + assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID + to indicate the user ID to be assigned. If there is no call to be + accepted (it may have timed out, been aborted, etc.), then sendmsg will + return error ENODATA. If the user ID is already in use by another call, + then error EBADSLT will be returned. + + (*) RXRPC_EXCLUSIVE_CALL + + This is used to indicate that a client call should be made on a one-off + connection. The connection is discarded once the call has terminated. + + (*) RXRPC_UPGRADE_SERVICE + + This is used to make a client call to probe if the specified service ID + may be upgraded by the server. The caller must check msg_name returned to + recvmsg() for the service ID actually in use. The operation probed must + be one that takes the same arguments in both services. + + Once this has been used to establish the upgrade capability (or lack + thereof) of the server, the service ID returned should be used for all + future communication to that server and RXRPC_UPGRADE_SERVICE should no + longer be set. + + (*) RXRPC_TX_LENGTH + + This is used to inform the kernel of the total amount of data that is + going to be transmitted by a call (whether in a client request or a + service response). If given, it allows the kernel to encrypt from the + userspace buffer directly to the packet buffers, rather than copying into + the buffer and then encrypting in place. This may only be given with the + first sendmsg() providing data for a call. EMSGSIZE will be generated if + the amount of data actually given is different. + + This takes a parameter of __s64 type that indicates how much will be + transmitted. This may not be less than zero. + +The symbol RXRPC__SUPPORTED is defined as one more than the highest control +message type supported. At run time this can be queried by means of the +RXRPC_SUPPORTED_CMSG socket option (see below). + + +============== +SOCKET OPTIONS +============== + +AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: + + (*) RXRPC_SECURITY_KEY + + This is used to specify the description of the key to be used. The key is + extracted from the calling process's keyrings with request_key() and + should be of "rxrpc" type. + + The optval pointer points to the description string, and optlen indicates + how long the string is, without the NUL terminator. + + (*) RXRPC_SECURITY_KEYRING + + Similar to above but specifies a keyring of server secret keys to use (key + type "keyring"). See the "Security" section. + + (*) RXRPC_EXCLUSIVE_CONNECTION + + This is used to request that new connections should be used for each call + made subsequently on this socket. optval should be NULL and optlen 0. + + (*) RXRPC_MIN_SECURITY_LEVEL + + This is used to specify the minimum security level required for calls on + this socket. optval must point to an int containing one of the following + values: + + (a) RXRPC_SECURITY_PLAIN + + Encrypted checksum only. + + (b) RXRPC_SECURITY_AUTH + + Encrypted checksum plus packet padded and first eight bytes of packet + encrypted - which includes the actual packet length. + + (c) RXRPC_SECURITY_ENCRYPTED + + Encrypted checksum plus entire packet padded and encrypted, including + actual packet length. + + (*) RXRPC_UPGRADEABLE_SERVICE + + This is used to indicate that a service socket with two bindings may + upgrade one bound service to the other if requested by the client. optval + must point to an array of two unsigned short ints. The first is the + service ID to upgrade from and the second the service ID to upgrade to. + + (*) RXRPC_SUPPORTED_CMSG + + This is a read-only option that writes an int into the buffer indicating + the highest control message type supported. + + +======== +SECURITY +======== + +Currently, only the kerberos 4 equivalent protocol has been implemented +(security index 2 - rxkad). This requires the rxkad module to be loaded and, +on the client, tickets of the appropriate type to be obtained from the AFS +kaserver or the kerberos server and installed as "rxrpc" type keys. This is +normally done using the klog program. An example simple klog program can be +found at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +The payload provided to add_key() on the client should be of the following +form: + + struct rxrpc_key_sec2_v1 { + uint16_t security_index; /* 2 */ + uint16_t ticket_length; /* length of ticket[] */ + uint32_t expiry; /* time at which expires */ + uint8_t kvno; /* key version number */ + uint8_t __pad[3]; + uint8_t session_key[8]; /* DES session key */ + uint8_t ticket[0]; /* the encrypted ticket */ + }; + +Where the ticket blob is just appended to the above structure. + + +For the server, keys of type "rxrpc_s" must be made available to the server. +They have a description of "<serviceID>:<securityIndex>" (eg: "52:2" for an +rxkad key for the AFS VL service). When such a key is created, it should be +given the server's secret key as the instantiation data (see the example +below). + + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + +A keyring is passed to the server socket by naming it in a sockopt. The server +socket then looks the server secret keys up in this keyring when secure +incoming connections are made. This can be seen in an example program that can +be found at: + + http://people.redhat.com/~dhowells/rxrpc/listen.c + + +==================== +EXAMPLE CLIENT USAGE +==================== + +A client would issue an operation by: + + (1) An RxRPC socket is set up by: + + client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the protocol family of the transport + socket used - usually IPv4 but it can also be IPv6 [TODO]. + + (2) A local address can optionally be bound: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = 0, /* we're a client */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(client, &srx, sizeof(srx)); + + This specifies the local UDP port to be used. If not given, a random + non-privileged port will be used. A UDP port may be shared between + several unrelated RxRPC sockets. Security is handled on a basis of + per-RxRPC virtual connection. + + (3) The security is set: + + const char *key = "AFS:cambridge.redhat.com"; + setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key)); + + This issues a request_key() to get the key representing the security + context. The minimum security level can be set: + + unsigned int sec = RXRPC_SECURITY_ENCRYPTED; + setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL, + &sec, sizeof(sec)); + + (4) The server to be contacted can then be specified (alternatively this can + be done through sendmsg): + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7005), /* AFS volume manager */ + .transport.sin_address = ..., + }; + connect(client, &srx, sizeof(srx)); + + (5) The request data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + MSG_MORE should be set in msghdr::msg_flags on all but the last part of + the request. Multiple requests may be made simultaneously. + + An RXRPC_TX_LENGTH control message can also be specified on the first + sendmsg() call. + + If a call is intended to go to a destination other than the default + specified through connect(), then msghdr::msg_name should be set on the + first request message of that call. + + (6) The reply data will then be posted to the server socket for recvmsg() to + pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data + for a particular call to be read. MSG_EOR will be set on the terminal + read for a call. + + All data will be delivered with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + If an abort or error occurred, this will be returned in the control data + buffer instead, and MSG_EOR will be flagged to indicate the end of that + call. + +A client may ask for a service ID it knows and ask that this be upgraded to a +better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the +first sendmsg() of a call. The client should then check srx_service in the +msg_name filled in by recvmsg() when collecting the result. srx_service will +hold the same value as given to sendmsg() if the upgrade request was ignored by +the service - otherwise it will be altered to indicate the service ID the +server upgraded to. Note that the upgraded service ID is chosen by the server. +The caller has to wait until it sees the service ID in the reply before sending +any more calls (further calls to the same destination will be blocked until the +probe is concluded). + + +==================== +EXAMPLE SERVER USAGE +==================== + +A server would be set up to accept operations in the following manner: + + (1) An RxRPC socket is created by: + + server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the address type of the transport + socket used - usually IPv4. + + (2) Security is set up if desired by giving the socket a keyring with server + secret keys in it: + + keyring = add_key("keyring", "AFSkeys", NULL, 0, + KEY_SPEC_PROCESS_KEYRING); + + const char secret_key[8] = { + 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 }; + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + + setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7); + + The keyring can be manipulated after it has been given to the socket. This + permits the server to add more keys, replace keys, etc. whilst it is live. + + (3) A local address must then be bound: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, /* RxRPC service ID */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(server, &srx, sizeof(srx)); + + More than one service ID may be bound to a socket, provided the transport + parameters are the same. The limit is currently two. To do this, bind() + should be called twice. + + (4) If service upgrading is required, first two service IDs must have been + bound and then the following option must be set: + + unsigned short service_ids[2] = { from_ID, to_ID }; + setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, + service_ids, sizeof(service_ids)); + + This will automatically upgrade connections on service from_ID to service + to_ID if they request it. This will be reflected in msg_name obtained + through recvmsg() when the request data is delivered to userspace. + + (5) The server is then set to listen out for incoming calls: + + listen(server, 100); + + (6) The kernel notifies the server of pending incoming connections by sending + it a message for each. This is received with recvmsg() on the server + socket. It has no data, and has a single dataless control message + attached: + + RXRPC_NEW_CALL + + The address that can be passed back by recvmsg() at this point should be + ignored since the call for which the message was posted may have gone by + the time it is accepted - in which case the first call still on the queue + will be accepted. + + (7) The server then accepts the new call by issuing a sendmsg() with two + pieces of control data and no actual data: + + RXRPC_ACCEPT - indicate connection acceptance + RXRPC_USER_CALL_ID - specify user ID for this call + + (8) The first request data packet will then be posted to the server socket for + recvmsg() to pick up. At that point, the RxRPC address for the call can + be read from the address fields in the msghdr struct. + + Subsequent request data will be posted to the server socket for recvmsg() + to collect as it arrives. All but the last piece of the request data will + be delivered with MSG_MORE flagged. + + All data will be delivered with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + (9) The reply data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control messages attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + MSG_MORE should be set in msghdr::msg_flags on all but the last message + for a particular call. + +(10) The final ACK from the client will be posted for retrieval by recvmsg() + when it is received. It will take the form of a dataless message with two + control messages attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + RXRPC_ACK - indicates final ACK (no data) + + MSG_EOR will be flagged to indicate that this is the final message for + this call. + +(11) Up to the point the final packet of reply data is sent, the call can be + aborted by calling sendmsg() with a dataless message with the following + control messages attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + RXRPC_ABORT - indicates abort code (4 byte data) + + Any packets waiting in the socket's receive queue will be discarded if + this is issued. + +Note that all the communications for a particular service take place through +the one server socket, using control messages on sendmsg() and recvmsg() to +determine the call affected. + + +========================= +AF_RXRPC KERNEL INTERFACE +========================= + +The AF_RXRPC module also provides an interface for use by in-kernel utilities +such as the AFS filesystem. This permits such a utility to: + + (1) Use different keys directly on individual client calls on one socket + rather than having to open a whole slew of sockets, one for each key it + might want to use. + + (2) Avoid having RxRPC call request_key() at the point of issue of a call or + opening of a socket. Instead the utility is responsible for requesting a + key at the appropriate point. AFS, for instance, would do this during VFS + operations such as open() or unlink(). The key is then handed through + when the call is initiated. + + (3) Request the use of something other than GFP_KERNEL to allocate memory. + + (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be + intercepted before they get put into the socket Rx queue and the socket + buffers manipulated directly. + +To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket, +bind an address as appropriate and listen if it's to be a server socket, but +then it passes this to the kernel interface functions. + +The kernel interface functions are as follows: + + (*) Begin a new client call. + + struct rxrpc_call * + rxrpc_kernel_begin_call(struct socket *sock, + struct sockaddr_rxrpc *srx, + struct key *key, + unsigned long user_call_ID, + s64 tx_total_len, + gfp_t gfp, + rxrpc_notify_rx_t notify_rx, + bool upgrade); + + This allocates the infrastructure to make a new RxRPC call and assigns + call and connection numbers. The call will be made on the UDP port that + the socket is bound to. The call will go to the destination address of a + connected client socket unless an alternative is supplied (srx is + non-NULL). + + If a key is supplied then this will be used to secure the call instead of + the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls + secured in this way will still share connections if at all possible. + + The user_call_ID is equivalent to that supplied to sendmsg() in the + control data buffer. It is entirely feasible to use this to point to a + kernel data structure. + + tx_total_len is the amount of data the caller is intending to transmit + with this call (or -1 if unknown at this point). Setting the data size + allows the kernel to encrypt directly to the packet buffers, thereby + saving a copy. The value may not be less than -1. + + notify_rx is a pointer to a function to be called when events such as + incoming data packets or remote aborts happen. + + upgrade should be set to true if a client operation should request that + the server upgrade the service to a better one. The resultant service ID + is returned by rxrpc_kernel_recv_data(). + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (*) End a client call. + + void rxrpc_kernel_end_call(struct socket *sock, + struct rxrpc_call *call); + + This is used to end a previously begun call. The user_call_ID is expunged + from AF_RXRPC's knowledge and will not be seen again in association with + the specified call. + + (*) Send data through a call. + + typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + int rxrpc_kernel_send_data(struct socket *sock, + struct rxrpc_call *call, + struct msghdr *msg, + size_t len, + rxrpc_notify_end_tx_t notify_end_rx); + + This is used to supply either the request part of a client call or the + reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the + data buffers to be used. msg_iov may not be NULL and must point + exclusively to in-kernel virtual addresses. msg.msg_flags may be given + MSG_MORE if there will be subsequent data sends for this call. + + The msg must not specify a destination address, control data or any flags + other than MSG_MORE. len is the total amount of data to transmit. + + notify_end_rx can be NULL or it can be used to specify a function to be + called when the call changes state to end the Tx phase. This function is + called with the call-state spinlock held to prevent any reply or final ACK + from being delivered first. + + (*) Receive data from a call. + + int rxrpc_kernel_recv_data(struct socket *sock, + struct rxrpc_call *call, + void *buf, + size_t size, + size_t *_offset, + bool want_more, + u32 *_abort, + u16 *_service) + + This is used to receive data from either the reply part of a client call + or the request part of a service call. buf and size specify how much + data is desired and where to store it. *_offset is added on to buf and + subtracted from size internally; the amount copied into the buffer is + added to *_offset before returning. + + want_more should be true if further data will be required after this is + satisfied and false if this is the last item of the receive phase. + + There are three normal returns: 0 if the buffer was filled and want_more + was true; 1 if the buffer was filled, the last DATA packet has been + emptied and want_more was false; and -EAGAIN if the function needs to be + called again. + + If the last DATA packet is processed but the buffer contains less than + the amount requested, EBADMSG is returned. If want_more wasn't set, but + more data was available, EMSGSIZE is returned. + + If a remote ABORT is detected, the abort code received will be stored in + *_abort and ECONNABORTED will be returned. + + The service ID that the call ended up with is returned into *_service. + This can be used to see if a call got a service upgrade. + + (*) Abort a call. + + void rxrpc_kernel_abort_call(struct socket *sock, + struct rxrpc_call *call, + u32 abort_code); + + This is used to abort a call if it's still in an abortable state. The + abort code specified will be placed in the ABORT message sent. + + (*) Intercept received RxRPC messages. + + typedef void (*rxrpc_interceptor_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + void + rxrpc_kernel_intercept_rx_messages(struct socket *sock, + rxrpc_interceptor_t interceptor); + + This installs an interceptor function on the specified AF_RXRPC socket. + All messages that would otherwise wind up in the socket's Rx queue are + then diverted to this function. Note that care must be taken to process + the messages in the right order to maintain DATA message sequentiality. + + The interceptor function itself is provided with the address of the socket + and handling the incoming message, the ID assigned by the kernel utility + to the call and the socket buffer containing the message. + + The skb->mark field indicates the type of message: + + MARK MEANING + =============================== ======================================= + RXRPC_SKB_MARK_DATA Data message + RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call + RXRPC_SKB_MARK_BUSY Client call rejected as server busy + RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer + RXRPC_SKB_MARK_NET_ERROR Network error detected + RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered + RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance + + The remote abort message can be probed with rxrpc_kernel_get_abort_code(). + The two error messages can be probed with rxrpc_kernel_get_error_number(). + A new call can be accepted with rxrpc_kernel_accept_call(). + + Data messages can have their contents extracted with the usual bunch of + socket buffer manipulation functions. A data message can be determined to + be the last one in a sequence with rxrpc_kernel_is_data_last(). When a + data message has been used up, rxrpc_kernel_data_consumed() should be + called on it. + + Messages should be handled to rxrpc_kernel_free_skb() to dispose of. It + is possible to get extra refs on all types of message for later freeing, + but this may pin the state of a call until the message is finally freed. + + (*) Accept an incoming call. + + struct rxrpc_call * + rxrpc_kernel_accept_call(struct socket *sock, + unsigned long user_call_ID); + + This is used to accept an incoming call and to assign it a call ID. This + function is similar to rxrpc_kernel_begin_call() and calls accepted must + be ended in the same way. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (*) Reject an incoming call. + + int rxrpc_kernel_reject_call(struct socket *sock); + + This is used to reject the first incoming call on the socket's queue with + a BUSY message. -ENODATA is returned if there were no incoming calls. + Other errors may be returned if the call had been aborted (-ECONNABORTED) + or had timed out (-ETIME). + + (*) Allocate a null key for doing anonymous security. + + struct key *rxrpc_get_null_key(const char *keyname); + + This is used to allocate a null RxRPC key that can be used to indicate + anonymous security for a particular domain. + + (*) Get the peer address of a call. + + void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call, + struct sockaddr_rxrpc *_srx); + + This is used to find the remote peer address of a call. + + (*) Set the total transmit data size on a call. + + void rxrpc_kernel_set_tx_length(struct socket *sock, + struct rxrpc_call *call, + s64 tx_total_len); + + This sets the amount of data that the caller is intending to transmit on a + call. It's intended to be used for setting the reply size as the request + size should be set when the call is begun. tx_total_len may not be less + than zero. + + (*) Check to see the completion state of a call so that the caller can assess + whether it needs to be retried. + + enum rxrpc_call_completion { + RXRPC_CALL_SUCCEEDED, + RXRPC_CALL_REMOTELY_ABORTED, + RXRPC_CALL_LOCALLY_ABORTED, + RXRPC_CALL_LOCAL_ERROR, + RXRPC_CALL_NETWORK_ERROR, + }; + + int rxrpc_kernel_check_call(struct socket *sock, struct rxrpc_call *call, + enum rxrpc_call_completion *_compl, + u32 *_abort_code); + + On return, -EINPROGRESS will be returned if the call is still ongoing; if + it is finished, *_compl will be set to indicate the manner of completion, + *_abort_code will be set to any abort code that occurred. 0 will be + returned on a successful completion, -ECONNABORTED will be returned if the + client failed due to a remote abort and anything else will return an + appropriate error code. + + The caller should look at this information to decide if it's worth + retrying the call. + + (*) Retry a client call. + + int rxrpc_kernel_retry_call(struct socket *sock, + struct rxrpc_call *call, + struct sockaddr_rxrpc *srx, + struct key *key); + + This attempts to partially reinitialise a call and submit it again whilst + reusing the original call's Tx queue to avoid the need to repackage and + re-encrypt the data to be sent. call indicates the call to retry, srx the + new address to send it to and key the encryption key to use for signing or + encrypting the packets. + + For this to work, the first Tx data packet must still be in the transmit + queue, and currently this is only permitted for local and network errors + and the call must not have been aborted. Any partially constructed Tx + packet is left as is and can continue being filled afterwards. + + It returns 0 if the call was requeued and an error otherwise. + + (*) Get call RTT. + + u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call); + + Get the RTT time to the peer in use by a call. The value returned is in + nanoseconds. + + (*) Check call still alive. + + u32 rxrpc_kernel_check_life(struct socket *sock, + struct rxrpc_call *call); + + This returns a number that is updated when ACKs are received from the peer + (notably including PING RESPONSE ACKs which we can elicit by sending PING + ACKs to see if the call still exists on the server). The caller should + compare the numbers of two calls to see if the call is still alive after + waiting for a suitable interval. + + This allows the caller to work out if the server is still contactable and + if the call is still alive on the server whilst waiting for the server to + process a client operation. + + This function may transmit a PING ACK. + + +======================= +CONFIGURABLE PARAMETERS +======================= + +The RxRPC protocol driver has a number of configurable parameters that can be +adjusted through sysctls in /proc/net/rxrpc/: + + (*) req_ack_delay + + The amount of time in milliseconds after receiving a packet with the + request-ack flag set before we honour the flag and actually send the + requested ack. + + Usually the other side won't stop sending packets until the advertised + reception window is full (to a maximum of 255 packets), so delaying the + ACK permits several packets to be ACK'd in one go. + + (*) soft_ack_delay + + The amount of time in milliseconds after receiving a new packet before we + generate a soft-ACK to tell the sender that it doesn't need to resend. + + (*) idle_ack_delay + + The amount of time in milliseconds after all the packets currently in the + received queue have been consumed before we generate a hard-ACK to tell + the sender it can free its buffers, assuming no other reason occurs that + we would send an ACK. + + (*) resend_timeout + + The amount of time in milliseconds after transmitting a packet before we + transmit it again, assuming no ACK is received from the receiver telling + us they got it. + + (*) max_call_lifetime + + The maximum amount of time in seconds that a call may be in progress + before we preemptively kill it. + + (*) dead_call_expiry + + The amount of time in seconds before we remove a dead call from the call + list. Dead calls are kept around for a little while for the purpose of + repeating ACK and ABORT packets. + + (*) connection_expiry + + The amount of time in seconds after a connection was last used before we + remove it from the connection list. Whilst a connection is in existence, + it serves as a placeholder for negotiated security; when it is deleted, + the security must be renegotiated. + + (*) transport_expiry + + The amount of time in seconds after a transport was last used before we + remove it from the transport list. Whilst a transport is in existence, it + serves to anchor the peer data and keeps the connection ID counter. + + (*) rxrpc_rx_window_size + + The size of the receive window in packets. This is the maximum number of + unconsumed received packets we're willing to hold in memory for any + particular call. + + (*) rxrpc_rx_mtu + + The maximum packet MTU size that we're willing to receive in bytes. This + indicates to the peer whether we're willing to accept jumbo packets. + + (*) rxrpc_rx_jumbo_max + + The maximum number of packets that we're willing to accept in a jumbo + packet. Non-terminal packets in a jumbo packet must contain a four byte + header plus exactly 1412 bytes of data. The terminal packet must contain + a four byte header plus any amount of data. In any event, a jumbo packet + may not exceed rxrpc_rx_mtu in size. diff --git a/Documentation/networking/s2io.txt b/Documentation/networking/s2io.txt new file mode 100644 index 000000000..0362a42f7 --- /dev/null +++ b/Documentation/networking/s2io.txt @@ -0,0 +1,141 @@ +Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver. + +Contents +======= +- 1. Introduction +- 2. Identifying the adapter/interface +- 3. Features supported +- 4. Command line parameters +- 5. Performance suggestions +- 6. Available Downloads + + +1. Introduction: +This Linux driver supports Neterion's Xframe I PCI-X 1.0 and +Xframe II PCI-X 2.0 adapters. It supports several features +such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on. +See below for complete list of features. +All features are supported for both IPv4 and IPv6. + +2. Identifying the adapter/interface: +a. Insert the adapter(s) in your system. +b. Build and load driver +# insmod s2io.ko +c. View log messages +# dmesg | tail -40 +You will see messages similar to: +eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA +eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA +eth4: Device is on 64 bit 133MHz PCIX(M1) bus + +The above messages identify the adapter type(Xframe I/II), adapter revision, +driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X). +In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed +as well. + +To associate an interface with a physical adapter use "ethtool -p <ethX>". +The corresponding adapter's LED will blink multiple times. + +3. Features supported: +a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes, +modifiable using ip command. + +b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit +and receive, TSO. + +c. Multi-buffer receive mode. Scattering of packet across multiple +buffers. Currently driver supports 2-buffer mode which yields +significant performance improvement on certain platforms(SGI Altix, +IBM xSeries). + +d. MSI/MSI-X. Can be enabled on platforms which support this feature +(IA64, Xeon) resulting in noticeable performance improvement(up to 7% +on certain platforms). + +e. Statistics. Comprehensive MAC-level and software statistics displayed +using "ethtool -S" option. + +f. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings, +with multiple steering options. + +4. Command line parameters +a. tx_fifo_num +Number of transmit queues +Valid range: 1-8 +Default: 1 + +b. rx_ring_num +Number of receive rings +Valid range: 1-8 +Default: 1 + +c. tx_fifo_len +Size of each transmit queue +Valid range: Total length of all queues should not exceed 8192 +Default: 4096 + +d. rx_ring_sz +Size of each receive ring(in 4K blocks) +Valid range: Limited by memory on system +Default: 30 + +e. intr_type +Specifies interrupt type. Possible values 0(INTA), 2(MSI-X) +Valid values: 0, 2 +Default: 2 + +5. Performance suggestions +General: +a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration) +b. Set TCP windows size to optimal value. +For instance, for MTU=1500 a value of 210K has been observed to result in +good performance. +# sysctl -w net.ipv4.tcp_rmem="210000 210000 210000" +# sysctl -w net.ipv4.tcp_wmem="210000 210000 210000" +For MTU=9000, TCP window size of 10 MB is recommended. +# sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" +# sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" + +Transmit performance: +a. By default, the driver respects BIOS settings for PCI bus parameters. +However, you may want to experiment with PCI bus parameters +max-split-transactions(MOST) and MMRBC (use setpci command). +A MOST value of 2 has been found optimal for Opterons and 3 for Itanium. +It could be different for your hardware. +Set MMRBC to 4K**. + +For example you can set +For opteron +#setpci -d 17d5:* 62=1d +For Itanium +#setpci -d 17d5:* 62=3d + +For detailed description of the PCI registers, please see Xframe User Guide. + +b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this +parameter. +c. Turn on TSO(using "ethtool -K") +# ethtool -K <ethX> tso on + +Receive performance: +a. By default, the driver respects BIOS settings for PCI bus parameters. +However, you may want to set PCI latency timer to 248. +#setpci -d 17d5:* LATENCY_TIMER=f8 +For detailed description of the PCI registers, please see Xframe User Guide. +b. Use 2-buffer mode. This results in large performance boost on +certain platforms(eg. SGI Altix, IBM xSeries). +c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to +set/verify this option. +d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network +device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to +bring down CPU utilization. + +** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are +recommended as safe parameters. +For more information, please review the AMD8131 errata at +http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/ +26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf + +6. Support +For further support please contact either your 10GbE Xframe NIC vendor (IBM, +HP, SGI etc.) diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt new file mode 100644 index 000000000..b7056a8a0 --- /dev/null +++ b/Documentation/networking/scaling.txt @@ -0,0 +1,484 @@ +Scaling in the Linux Networking Stack + + +Introduction +============ + +This document describes a set of complementary techniques in the Linux +networking stack to increase parallelism and improve performance for +multi-processor systems. + +The following technologies are described: + + RSS: Receive Side Scaling + RPS: Receive Packet Steering + RFS: Receive Flow Steering + Accelerated Receive Flow Steering + XPS: Transmit Packet Steering + + +RSS: Receive Side Scaling +========================= + +Contemporary NICs support multiple receive and transmit descriptor queues +(multi-queue). On reception, a NIC can send different packets to different +queues to distribute processing among CPUs. The NIC distributes packets by +applying a filter to each packet that assigns it to one of a small number +of logical flows. Packets for each flow are steered to a separate receive +queue, which in turn can be processed by separate CPUs. This mechanism is +generally known as “Receive-side Scaling” (RSS). The goal of RSS and +the other scaling techniques is to increase performance uniformly. +Multi-queue distribution can also be used for traffic prioritization, but +that is not the focus of these techniques. + +The filter used in RSS is typically a hash function over the network +and/or transport layer headers-- for example, a 4-tuple hash over +IP addresses and TCP ports of a packet. The most common hardware +implementation of RSS uses a 128-entry indirection table where each entry +stores a queue number. The receive queue for a packet is determined +by masking out the low order seven bits of the computed hash for the +packet (usually a Toeplitz hash), taking this number as a key into the +indirection table and reading the corresponding value. + +Some advanced NICs allow steering packets to queues based on +programmable filters. For example, webserver bound TCP port 80 packets +can be directed to their own receive queue. Such “n-tuple” filters can +be configured from ethtool (--config-ntuple). + +==== RSS Configuration + +The driver for a multi-queue capable NIC typically provides a kernel +module parameter for specifying the number of hardware queues to +configure. In the bnx2x driver, for instance, this parameter is called +num_queues. A typical RSS configuration would be to have one receive queue +for each CPU if the device supports enough queues, or otherwise at least +one for each memory domain, where a memory domain is a set of CPUs that +share a particular memory level (L1, L2, NUMA node, etc.). + +The indirection table of an RSS device, which resolves a queue by masked +hash, is usually programmed by the driver at initialization. The +default mapping is to distribute the queues evenly in the table, but the +indirection table can be retrieved and modified at runtime using ethtool +commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the +indirection table could be done to give different queues different +relative weights. + +== RSS IRQ Configuration + +Each receive queue has a separate IRQ associated with it. The NIC triggers +this to notify a CPU when new packets arrive on the given queue. The +signaling path for PCIe devices uses message signaled interrupts (MSI-X), +that can route each interrupt to a particular CPU. The active mapping +of queues to IRQs can be determined from /proc/interrupts. By default, +an IRQ may be handled on any CPU. Because a non-negligible part of packet +processing takes place in receive interrupt handling, it is advantageous +to spread receive interrupts between CPUs. To manually adjust the IRQ +affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems +will be running irqbalance, a daemon that dynamically optimizes IRQ +assignments and as a result may override any manual settings. + +== Suggested Configuration + +RSS should be enabled when latency is a concern or whenever receive +interrupt processing forms a bottleneck. Spreading load between CPUs +decreases queue length. For low latency networking, the optimal setting +is to allocate as many queues as there are CPUs in the system (or the +NIC maximum, if lower). The most efficient high-rate configuration +is likely the one with the smallest number of receive queues where no +receive queue overflows due to a saturated CPU, because in default +mode with interrupt coalescing enabled, the aggregate number of +interrupts (and thus work) grows with each additional queue. + +Per-cpu load can be observed using the mpstat utility, but note that on +processors with hyperthreading (HT), each hyperthread is represented as +a separate CPU. For interrupt handling, HT has shown no benefit in +initial tests, so limit the number of queues to the number of CPU cores +in the system. + + +RPS: Receive Packet Steering +============================ + +Receive Packet Steering (RPS) is logically a software implementation of +RSS. Being in software, it is necessarily called later in the datapath. +Whereas RSS selects the queue and hence CPU that will run the hardware +interrupt handler, RPS selects the CPU to perform protocol processing +above the interrupt handler. This is accomplished by placing the packet +on the desired CPU’s backlog queue and waking up the CPU for processing. +RPS has some advantages over RSS: 1) it can be used with any NIC, +2) software filters can easily be added to hash over new protocols, +3) it does not increase hardware device interrupt rate (although it does +introduce inter-processor interrupts (IPIs)). + +RPS is called during bottom half of the receive interrupt handler, when +a driver sends a packet up the network stack with netif_rx() or +netif_receive_skb(). These call the get_rps_cpu() function, which +selects the queue that should process a packet. + +The first step in determining the target CPU for RPS is to calculate a +flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash +depending on the protocol). This serves as a consistent hash of the +associated flow of the packet. The hash is either provided by hardware +or will be computed in the stack. Capable hardware can pass the hash in +the receive descriptor for the packet; this would usually be the same +hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in +skb->hash and can be used elsewhere in the stack as a hash of the +packet’s flow. + +Each receive hardware queue has an associated list of CPUs to which +RPS may enqueue packets for processing. For each received packet, +an index into the list is computed from the flow hash modulo the size +of the list. The indexed CPU is the target for processing the packet, +and the packet is queued to the tail of that CPU’s backlog queue. At +the end of the bottom half routine, IPIs are sent to any CPUs for which +packets have been queued to their backlog queue. The IPI wakes backlog +processing on the remote CPU, and any queued packets are then processed +up the networking stack. + +==== RPS Configuration + +RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on +by default for SMP). Even when compiled in, RPS remains disabled until +explicitly configured. The list of CPUs to which RPS may forward traffic +can be configured for each receive queue using a sysfs file entry: + + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus + +This file implements a bitmap of CPUs. RPS is disabled when it is zero +(the default), in which case packets are processed on the interrupting +CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to +the bitmap. + +== Suggested Configuration + +For a single queue device, a typical RPS configuration would be to set +the rps_cpus to the CPUs in the same memory domain of the interrupting +CPU. If NUMA locality is not an issue, this could also be all CPUs in +the system. At high interrupt rate, it might be wise to exclude the +interrupting CPU from the map since that already performs much work. + +For a multi-queue system, if RSS is configured so that a hardware +receive queue is mapped to each CPU, then RPS is probably redundant +and unnecessary. If there are fewer hardware queues than CPUs, then +RPS might be beneficial if the rps_cpus for each queue are the ones that +share the same memory domain as the interrupting CPU for that queue. + +==== RPS Flow Limit + +RPS scales kernel receive processing across CPUs without introducing +reordering. The trade-off to sending all packets from the same flow +to the same CPU is CPU load imbalance if flows vary in packet rate. +In the extreme case a single flow dominates traffic. Especially on +common server workloads with many concurrent connections, such +behavior indicates a problem such as a misconfiguration or spoofed +source Denial of Service attack. + +Flow Limit is an optional RPS feature that prioritizes small flows +during CPU contention by dropping packets from large flows slightly +ahead of those from small flows. It is active only when an RPS or RFS +destination CPU approaches saturation. Once a CPU's input packet +queue exceeds half the maximum queue length (as set by sysctl +net.core.netdev_max_backlog), the kernel starts a per-flow packet +count over the last 256 packets. If a flow exceeds a set ratio (by +default, half) of these packets when a new packet arrives, then the +new packet is dropped. Packets from other flows are still only +dropped once the input packet queue reaches netdev_max_backlog. +No packets are dropped when the input packet queue length is below +the threshold, so flow limit does not sever connections outright: +even large flows maintain connectivity. + +== Interface + +Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not +turned on. It is implemented for each CPU independently (to avoid lock +and cache contention) and toggled per CPU by setting the relevant bit +in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU +bitmap interface as rps_cpus (see above) when called from procfs: + + /proc/sys/net/core/flow_limit_cpu_bitmap + +Per-flow rate is calculated by hashing each packet into a hashtable +bucket and incrementing a per-bucket counter. The hash function is +the same that selects a CPU in RPS, but as the number of buckets can +be much larger than the number of CPUs, flow limit has finer-grained +identification of large flows and fewer false positives. The default +table has 4096 buckets. This value can be modified through sysctl + + net.core.flow_limit_table_len + +The value is only consulted when a new table is allocated. Modifying +it does not update active tables. + +== Suggested Configuration + +Flow limit is useful on systems with many concurrent connections, +where a single connection taking up 50% of a CPU indicates a problem. +In such environments, enable the feature on all CPUs that handle +network rx interrupts (as set in /proc/irq/N/smp_affinity). + +The feature depends on the input packet queue length to exceed +the flow limit threshold (50%) + the flow history length (256). +Setting net.core.netdev_max_backlog to either 1000 or 10000 +performed well in experiments. + + +RFS: Receive Flow Steering +========================== + +While RPS steers packets solely based on hash, and thus generally +provides good load distribution, it does not take into account +application locality. This is accomplished by Receive Flow Steering +(RFS). The goal of RFS is to increase datacache hitrate by steering +kernel processing of packets to the CPU where the application thread +consuming the packet is running. RFS relies on the same RPS mechanisms +to enqueue packets onto the backlog of another CPU and to wake up that +CPU. + +In RFS, packets are not forwarded directly by the value of their hash, +but the hash is used as index into a flow lookup table. This table maps +flows to the CPUs where those flows are being processed. The flow hash +(see RPS section above) is used to calculate the index into this table. +The CPU recorded in each entry is the one which last processed the flow. +If an entry does not hold a valid CPU, then packets mapped to that entry +are steered using plain RPS. Multiple table entries may point to the +same CPU. Indeed, with many flows and few CPUs, it is very likely that +a single application thread handles flows with many different flow hashes. + +rps_sock_flow_table is a global flow table that contains the *desired* CPU +for flows: the CPU that is currently processing the flow in userspace. +Each table value is a CPU index that is updated during calls to recvmsg +and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() +and tcp_splice_read()). + +When the scheduler moves a thread to a new CPU while it has outstanding +receive packets on the old CPU, packets may arrive out of order. To +avoid this, RFS uses a second flow table to track outstanding packets +for each flow: rps_dev_flow_table is a table specific to each hardware +receive queue of each device. Each table value stores a CPU index and a +counter. The CPU index represents the *current* CPU onto which packets +for this flow are enqueued for further kernel processing. Ideally, kernel +and userspace processing occur on the same CPU, and hence the CPU index +in both tables is identical. This is likely false if the scheduler has +recently migrated a userspace thread while the kernel still has packets +enqueued for kernel processing on the old CPU. + +The counter in rps_dev_flow_table values records the length of the current +CPU's backlog when a packet in this flow was last enqueued. Each backlog +queue has a head counter that is incremented on dequeue. A tail counter +is computed as head counter + queue length. In other words, the counter +in rps_dev_flow[i] records the last element in flow i that has +been enqueued onto the currently designated CPU for flow i (of course, +entry i is actually selected by hash and multiple flows may hash to the +same entry i). + +And now the trick for avoiding out of order packets: when selecting the +CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table +and the rps_dev_flow table of the queue that the packet was received on +are compared. If the desired CPU for the flow (found in the +rps_sock_flow table) matches the current CPU (found in the rps_dev_flow +table), the packet is enqueued onto that CPU’s backlog. If they differ, +the current CPU is updated to match the desired CPU if one of the +following is true: + +- The current CPU's queue head counter >= the recorded tail counter + value in rps_dev_flow[i] +- The current CPU is unset (>= nr_cpu_ids) +- The current CPU is offline + +After this check, the packet is sent to the (possibly updated) current +CPU. These rules aim to ensure that a flow only moves to a new CPU when +there are no packets outstanding on the old CPU, as the outstanding +packets could arrive later than those about to be processed on the new +CPU. + +==== RFS Configuration + +RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on +by default for SMP). The functionality remains disabled until explicitly +configured. The number of entries in the global flow table is set through: + + /proc/sys/net/core/rps_sock_flow_entries + +The number of entries in the per-queue flow table are set through: + + /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt + +== Suggested Configuration + +Both of these need to be set before RFS is enabled for a receive queue. +Values for both are rounded up to the nearest power of two. The +suggested flow count depends on the expected number of active connections +at any given time, which may be significantly less than the number of open +connections. We have found that a value of 32768 for rps_sock_flow_entries +works fairly well on a moderately loaded server. + +For a single queue device, the rps_flow_cnt value for the single queue +would normally be configured to the same value as rps_sock_flow_entries. +For a multi-queue device, the rps_flow_cnt for each queue might be +configured as rps_sock_flow_entries / N, where N is the number of +queues. So for instance, if rps_sock_flow_entries is set to 32768 and there +are 16 configured receive queues, rps_flow_cnt for each queue might be +configured as 2048. + + +Accelerated RFS +=============== + +Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load +balancing mechanism that uses soft state to steer flows based on where +the application thread consuming the packets of each flow is running. +Accelerated RFS should perform better than RFS since packets are sent +directly to a CPU local to the thread consuming the data. The target CPU +will either be the same CPU where the application runs, or at least a CPU +which is local to the application thread’s CPU in the cache hierarchy. + +To enable accelerated RFS, the networking stack calls the +ndo_rx_flow_steer driver function to communicate the desired hardware +queue for packets matching a particular flow. The network stack +automatically calls this function every time a flow entry in +rps_dev_flow_table is updated. The driver in turn uses a device specific +method to program the NIC to steer the packets. + +The hardware queue for a flow is derived from the CPU recorded in +rps_dev_flow_table. The stack consults a CPU to hardware queue map which +is maintained by the NIC driver. This is an auto-generated reverse map of +the IRQ affinity table shown by /proc/interrupts. Drivers can use +functions in the cpu_rmap (“CPU affinity reverse map”) kernel library +to populate the map. For each CPU, the corresponding queue in the map is +set to be one whose processing CPU is closest in cache locality. + +==== Accelerated RFS Configuration + +Accelerated RFS is only available if the kernel is compiled with +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. +It also requires that ntuple filtering is enabled via ethtool. The map +of CPU to queues is automatically deduced from the IRQ affinities +configured for each receive queue by the driver, so no additional +configuration should be necessary. + +== Suggested Configuration + +This technique should be enabled whenever one wants to use RFS and the +NIC supports hardware acceleration. + +XPS: Transmit Packet Steering +============================= + +Transmit Packet Steering is a mechanism for intelligently selecting +which transmit queue to use when transmitting a packet on a multi-queue +device. This can be accomplished by recording two kinds of maps, either +a mapping of CPU to hardware queue(s) or a mapping of receive queue(s) +to hardware transmit queue(s). + +1. XPS using CPUs map + +The goal of this mapping is usually to assign queues +exclusively to a subset of CPUs, where the transmit completions for +these queues are processed on a CPU within this set. This choice +provides two benefits. First, contention on the device queue lock is +significantly reduced since fewer CPUs contend for the same queue +(contention can be eliminated completely if each CPU has its own +transmit queue). Secondly, cache miss rate on transmit completion is +reduced, in particular for data cache lines that hold the sk_buff +structures. + +2. XPS using receive queues map + +This mapping is used to pick transmit queue based on the receive +queue(s) map configuration set by the administrator. A set of receive +queues can be mapped to a set of transmit queues (many:many), although +the common use case is a 1:1 mapping. This will enable sending packets +on the same queue associations for transmit and receive. This is useful for +busy polling multi-threaded workloads where there are challenges in +associating a given CPU to a given application thread. The application +threads are not pinned to CPUs and each thread handles packets +received on a single queue. The receive queue number is cached in the +socket for the connection. In this model, sending the packets on the same +transmit queue corresponding to the associated receive queue has benefits +in keeping the CPU overhead low. Transmit completion work is locked into +the same queue-association that a given application is polling on. This +avoids the overhead of triggering an interrupt on another CPU. When the +application cleans up the packets during the busy poll, transmit completion +may be processed along with it in the same thread context and so result in +reduced latency. + +XPS is configured per transmit queue by setting a bitmap of +CPUs/receive-queues that may use that queue to transmit. The reverse +mapping, from CPUs to transmit queues or from receive-queues to transmit +queues, is computed and maintained for each network device. When +transmitting the first packet in a flow, the function get_xps_queue() is +called to select a queue. This function uses the ID of the receive queue +for the socket connection for a match in the receive queue-to-transmit queue +lookup table. Alternatively, this function can also use the ID of the +running CPU as a key into the CPU-to-queue lookup table. If the +ID matches a single queue, that is used for transmission. If multiple +queues match, one is selected by using the flow hash to compute an index +into the set. When selecting the transmit queue based on receive queue(s) +map, the transmit device is not validated against the receive device as it +requires expensive lookup operation in the datapath. + +The queue chosen for transmitting a particular flow is saved in the +corresponding socket structure for the flow (e.g. a TCP connection). +This transmit queue is used for subsequent packets sent on the flow to +prevent out of order (ooo) packets. The choice also amortizes the cost +of calling get_xps_queues() over all packets in the flow. To avoid +ooo packets, the queue for a flow can subsequently only be changed if +skb->ooo_okay is set for a packet in the flow. This flag indicates that +there are no outstanding packets in the flow, so the transmit queue can +change without the risk of generating out of order packets. The +transport layer is responsible for setting ooo_okay appropriately. TCP, +for instance, sets the flag when all data for a connection has been +acknowledged. + +==== XPS Configuration + +XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by +default for SMP). The functionality remains disabled until explicitly +configured. To enable XPS, the bitmap of CPUs/receive-queues that may +use a transmit queue is configured using the sysfs file entry: + +For selection based on CPUs map: +/sys/class/net/<dev>/queues/tx-<n>/xps_cpus + +For selection based on receive-queues map: +/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs + +== Suggested Configuration + +For a network device with a single transmission queue, XPS configuration +has no effect, since there is no choice in this case. In a multi-queue +system, XPS is preferably configured so that each CPU maps onto one queue. +If there are as many queues as there are CPUs in the system, then each +queue can also map onto one CPU, resulting in exclusive pairings that +experience no contention. If there are fewer queues than CPUs, then the +best CPUs to share a given queue are probably those that share the cache +with the CPU that processes transmit completions for that queue +(transmit interrupts). + +For transmit queue selection based on receive queue(s), XPS has to be +explicitly configured mapping receive-queue(s) to transmit queue(s). If the +user configuration for receive-queue map does not apply, then the transmit +queue is selected based on the CPUs map. + +Per TX Queue rate limitation: +============================= + +These are rate-limitation mechanisms implemented by HW, where currently +a max-rate attribute is supported, by setting a Mbps value to + +/sys/class/net/<dev>/queues/tx-<n>/tx_maxrate + +A value of zero means disabled, and this is the default. + +Further Information +=================== +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into +2.6.38. Original patches were submitted by Tom Herbert +(therbert@google.com) + +Accelerated RFS was introduced in 2.6.35. Original patches were +submitted by Ben Hutchings (bwh@kernel.org) + +Authors: +Tom Herbert (therbert@google.com) +Willem de Bruijn (willemb@google.com) diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt new file mode 100644 index 000000000..97b810ca9 --- /dev/null +++ b/Documentation/networking/sctp.txt @@ -0,0 +1,35 @@ +Linux Kernel SCTP + +This is the current BETA release of the Linux Kernel SCTP reference +implementation. + +SCTP (Stream Control Transmission Protocol) is a IP based, message oriented, +reliable transport protocol, with congestion control, support for +transparent multi-homing, and multiple ordered streams of messages. +RFC2960 defines the core protocol. The IETF SIGTRAN working group originally +developed the SCTP protocol and later handed the protocol over to the +Transport Area (TSVWG) working group for the continued evolvement of SCTP as a +general purpose transport. + +See the IETF website (http://www.ietf.org) for further documents on SCTP. +See http://www.ietf.org/rfc/rfc2960.txt + +The initial project goal is to create an Linux kernel reference implementation +of SCTP that is RFC 2960 compliant and provides an programming interface +referred to as the UDP-style API of the Sockets Extensions for SCTP, as +proposed in IETF Internet-Drafts. + +Caveats: + +-lksctp can be built as statically or as a module. However, be aware that +module removal of lksctp is not yet a safe activity. + +-There is tentative support for IPv6, but most work has gone towards +implementation and testing lksctp on IPv4. + + +For more information, please visit the lksctp project website: + http://www.sf.net/projects/lksctp + +Or contact the lksctp developers through the mailing list: + <linux-sctp@vger.kernel.org> diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.txt new file mode 100644 index 000000000..95ea06784 --- /dev/null +++ b/Documentation/networking/secid.txt @@ -0,0 +1,14 @@ +flowi structure: + +The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate +the label of the flow. This label of the flow is currently used in selecting +matching labeled xfrm(s). + +If this is an outbound flow, the label is derived from the socket, if any, or +the incoming packet this flow is being generated as a response to (e.g. tcp +resets, timewait ack, etc.). It is also conceivable that the label could be +derived from other sources such as process context, device, etc., in special +cases, as may be appropriate. + +If this is an inbound flow, the label is derived from the IPSec security +associations, if any, used by the packet. diff --git a/Documentation/networking/seg6-sysctl.txt b/Documentation/networking/seg6-sysctl.txt new file mode 100644 index 000000000..bdbde23b1 --- /dev/null +++ b/Documentation/networking/seg6-sysctl.txt @@ -0,0 +1,18 @@ +/proc/sys/net/conf/<iface>/seg6_* variables: + +seg6_enabled - BOOL + Accept or drop SR-enabled IPv6 packets on this interface. + + Relevant packets are those with SRH present and DA = local. + + 0 - disabled (default) + not 0 - enabled + +seg6_require_hmac - INTEGER + Define HMAC policy for ingress SR-enabled packets on this interface. + + -1 - Ignore HMAC field + 0 - Accept SR packets without HMAC, validate SR packets with HMAC + 1 - Drop SR packets without HMAC, validate SR packets with HMAC + + Default is 0. diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt new file mode 100644 index 000000000..aca542ec1 --- /dev/null +++ b/Documentation/networking/segmentation-offloads.txt @@ -0,0 +1,170 @@ +Segmentation Offloads in the Linux Networking Stack + +Introduction +============ + +This document describes a set of techniques in the Linux networking stack +to take advantage of segmentation offload capabilities of various NICs. + +The following technologies are described: + * TCP Segmentation Offload - TSO + * UDP Fragmentation Offload - UFO + * IPIP, SIT, GRE, and UDP Tunnel Offloads + * Generic Segmentation Offload - GSO + * Generic Receive Offload - GRO + * Partial Generic Segmentation Offload - GSO_PARTIAL + * SCTP accelleration with GSO - GSO_BY_FRAGS + +TCP Segmentation Offload +======================== + +TCP segmentation allows a device to segment a single frame into multiple +frames with a data payload size specified in skb_shinfo()->gso_size. +When TCP segmentation requested the bit for either SKB_GSO_TCPV4 or +SKB_GSO_TCPV6 should be set in skb_shinfo()->gso_type and +skb_shinfo()->gso_size should be set to a non-zero value. + +TCP segmentation is dependent on support for the use of partial checksum +offload. For this reason TSO is normally disabled if the Tx checksum +offload for a given device is disabled. + +In order to support TCP segmentation offload it is necessary to populate +the network and transport header offsets of the skbuff so that the device +drivers will be able determine the offsets of the IP or IPv6 header and the +TCP header. In addition as CHECKSUM_PARTIAL is required csum_start should +also point to the TCP header of the packet. + +For IPv4 segmentation we support one of two types in terms of the IP ID. +The default behavior is to increment the IP ID with every segment. If the +GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP +ID and all segments will use the same IP ID. If a device has +NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO +and we will either increment the IP ID for all frames, or leave it at a +static value based on driver preference. + +UDP Fragmentation Offload +========================= + +UDP fragmentation offload allows a device to fragment an oversized UDP +datagram into multiple IPv4 fragments. Many of the requirements for UDP +fragmentation offload are the same as TSO. However the IPv4 ID for +fragments should not increment as a single IPv4 datagram is fragmented. + +UFO is deprecated: modern kernels will no longer generate UFO skbs, but can +still receive them from tuntap and similar devices. Offload of UDP-based +tunnel protocols is still supported. + +IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads +======================================================== + +In addition to the offloads described above it is possible for a frame to +contain additional headers such as an outer tunnel. In order to account +for such instances an additional set of segmentation offload types were +introduced including SKB_GSO_IPXIP4, SKB_GSO_IPXIP6, SKB_GSO_GRE, and +SKB_GSO_UDP_TUNNEL. These extra segmentation types are used to identify +cases where there are more than just 1 set of headers. For example in the +case of IPIP and SIT we should have the network and transport headers moved +from the standard list of headers to "inner" header offsets. + +Currently only two levels of headers are supported. The convention is to +refer to the tunnel headers as the outer headers, while the encapsulated +data is normally referred to as the inner headers. Below is the list of +calls to access the given headers: + +IPIP/SIT Tunnel: + Outer Inner +MAC skb_mac_header +Network skb_network_header skb_inner_network_header +Transport skb_transport_header + +UDP/GRE Tunnel: + Outer Inner +MAC skb_mac_header skb_inner_mac_header +Network skb_network_header skb_inner_network_header +Transport skb_transport_header skb_inner_transport_header + +In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and +SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the +fact that the outer header also requests to have a non-zero checksum +included in the outer header. + +Finally there is SKB_GSO_TUNNEL_REMCSUM which indicates that a given tunnel +header has requested a remote checksum offload. In this case the inner +headers will be left with a partial checksum and only the outer header +checksum will be computed. + +Generic Segmentation Offload +============================ + +Generic segmentation offload is a pure software offload that is meant to +deal with cases where device drivers cannot perform the offloads described +above. What occurs in GSO is that a given skbuff will have its data broken +out over multiple skbuffs that have been resized to match the MSS provided +via skb_shinfo()->gso_size. + +Before enabling any hardware segmentation offload a corresponding software +offload is required in GSO. Otherwise it becomes possible for a frame to +be re-routed between devices and end up being unable to be transmitted. + +Generic Receive Offload +======================= + +Generic receive offload is the complement to GSO. Ideally any frame +assembled by GRO should be segmented to create an identical sequence of +frames using GSO, and any sequence of frames segmented by GSO should be +able to be reassembled back to the original by GRO. The only exception to +this is IPv4 ID in the case that the DF bit is set for a given IP header. +If the value of the IPv4 ID is not sequentially incrementing it will be +altered so that it is when a frame assembled via GRO is segmented via GSO. + +Partial Generic Segmentation Offload +==================================== + +Partial generic segmentation offload is a hybrid between TSO and GSO. What +it effectively does is take advantage of certain traits of TCP and tunnels +so that instead of having to rewrite the packet headers for each segment +only the inner-most transport header and possibly the outer-most network +header need to be updated. This allows devices that do not support tunnel +offloads or tunnel offloads with checksum to still make use of segmentation. + +With the partial offload what occurs is that all headers excluding the +inner transport header are updated such that they will contain the correct +values for if the header was simply duplicated. The one exception to this +is the outer IPv4 ID field. It is up to the device drivers to guarantee +that the IPv4 ID field is incremented in the case that a given header does +not have the DF bit set. + +SCTP accelleration with GSO +=========================== + +SCTP - despite the lack of hardware support - can still take advantage of +GSO to pass one large packet through the network stack, rather than +multiple small packets. + +This requires a different approach to other offloads, as SCTP packets +cannot be just segmented to (P)MTU. Rather, the chunks must be contained in +IP segments, padding respected. So unlike regular GSO, SCTP can't just +generate a big skb, set gso_size to the fragmentation point and deliver it +to IP layer. + +Instead, the SCTP protocol layer builds an skb with the segments correctly +padded and stored as chained skbs, and skb_segment() splits based on those. +To signal this, gso_size is set to the special value GSO_BY_FRAGS. + +Therefore, any code in the core networking stack must be aware of the +possibility that gso_size will be GSO_BY_FRAGS and handle that case +appropriately. + +There are some helpers to make this easier: + + - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if + an skb is an SCTP GSO skb. + + - For size checks, the skb_gso_validate_*_len family of helpers correctly + considers GSO_BY_FRAGS. + + - For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size + will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs. + +This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits +set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE. diff --git a/Documentation/networking/skfp.txt b/Documentation/networking/skfp.txt new file mode 100644 index 000000000..203ec66c9 --- /dev/null +++ b/Documentation/networking/skfp.txt @@ -0,0 +1,220 @@ +(C)Copyright 1998-2000 SysKonnect, +=========================================================================== + +skfp.txt created 11-May-2000 + +Readme File for skfp.o v2.06 + + +This file contains +(1) OVERVIEW +(2) SUPPORTED ADAPTERS +(3) GENERAL INFORMATION +(4) INSTALLATION +(5) INCLUSION OF THE ADAPTER IN SYSTEM START +(6) TROUBLESHOOTING +(7) FUNCTION OF THE ADAPTER LEDS +(8) HISTORY + +=========================================================================== + + + +(1) OVERVIEW +============ + +This README explains how to use the driver 'skfp' for Linux with your +network adapter. + +Chapter 2: Contains a list of all network adapters that are supported by + this driver. + +Chapter 3: Gives some general information. + +Chapter 4: Describes common problems and solutions. + +Chapter 5: Shows the changed functionality of the adapter LEDs. + +Chapter 6: History of development. + +*** + + +(2) SUPPORTED ADAPTERS +====================== + +The network driver 'skfp' supports the following network adapters: +SysKonnect adapters: + - SK-5521 (SK-NET FDDI-UP) + - SK-5522 (SK-NET FDDI-UP DAS) + - SK-5541 (SK-NET FDDI-FP) + - SK-5543 (SK-NET FDDI-LP) + - SK-5544 (SK-NET FDDI-LP DAS) + - SK-5821 (SK-NET FDDI-UP64) + - SK-5822 (SK-NET FDDI-UP64 DAS) + - SK-5841 (SK-NET FDDI-FP64) + - SK-5843 (SK-NET FDDI-LP64) + - SK-5844 (SK-NET FDDI-LP64 DAS) +Compaq adapters (not tested): + - Netelligent 100 FDDI DAS Fibre SC + - Netelligent 100 FDDI SAS Fibre SC + - Netelligent 100 FDDI DAS UTP + - Netelligent 100 FDDI SAS UTP + - Netelligent 100 FDDI SAS Fibre MIC +*** + + +(3) GENERAL INFORMATION +======================= + +From v2.01 on, the driver is integrated in the linux kernel sources. +Therefore, the installation is the same as for any other adapter +supported by the kernel. +Refer to the manual of your distribution about the installation +of network adapters. +Makes my life much easier :-) +*** + + +(4) TROUBLESHOOTING +=================== + +If you run into problems during installation, check those items: + +Problem: The FDDI adapter cannot be found by the driver. +Reason: Look in /proc/pci for the following entry: + 'FDDI network controller: SysKonnect SK-FDDI-PCI ...' + If this entry exists, then the FDDI adapter has been + found by the system and should be able to be used. + If this entry does not exist or if the file '/proc/pci' + is not there, then you may have a hardware problem or PCI + support may not be enabled in your kernel. + The adapter can be checked using the diagnostic program + which is available from the SysKonnect web site: + www.syskonnect.de + Some COMPAQ machines have a problem with PCI under + Linux. This is described in the 'PCI howto' document + (included in some distributions or available from the + www, e.g. at 'www.linux.org') and no workaround is available. + +Problem: You want to use your computer as a router between + multiple IP subnetworks (using multiple adapters), but + you cannot reach computers in other subnetworks. +Reason: Either the router's kernel is not configured for IP + forwarding or there is a problem with the routing table + and gateway configuration in at least one of the + computers. + +If your problem is not listed here, please contact our +technical support for help. +You can send email to: + linux@syskonnect.de +When contacting our technical support, +please ensure that the following information is available: +- System Manufacturer and Model +- Boards in your system +- Distribution +- Kernel version + +*** + + +(5) FUNCTION OF THE ADAPTER LEDS +================================ + + The functionality of the LED's on the FDDI network adapters was + changed in SMT version v2.82. With this new SMT version, the yellow + LED works as a ring operational indicator. An active yellow LED + indicates that the ring is down. The green LED on the adapter now + works as a link indicator where an active GREEN LED indicates that + the respective port has a physical connection. + + With versions of SMT prior to v2.82 a ring up was indicated if the + yellow LED was off while the green LED(s) showed the connection + status of the adapter. During a ring down the green LED was off and + the yellow LED was on. + + All implementations indicate that a driver is not loaded if + all LEDs are off. + +*** + + +(6) HISTORY +=========== + +v2.06 (20000511) (In-Kernel version) + New features: + - 64 bit support + - new pci dma interface + - in kernel 2.3.99 + +v2.05 (20000217) (In-Kernel version) + New features: + - Changes for 2.3.45 kernel + +v2.04 (20000207) (Standalone version) + New features: + - Added rx/tx byte counter + +v2.03 (20000111) (Standalone version) + Problems fixed: + - Fixed printk statements from v2.02 + +v2.02 (991215) (Standalone version) + Problems fixed: + - Removed unnecessary output + - Fixed path for "printver.sh" in makefile + +v2.01 (991122) (In-Kernel version) + New features: + - Integration in Linux kernel sources + - Support for memory mapped I/O. + +v2.00 (991112) + New features: + - Full source released under GPL + +v1.05 (991023) + Problems fixed: + - Compilation with kernel version 2.2.13 failed + +v1.04 (990427) + Changes: + - New SMT module included, changing LED functionality + Problems fixed: + - Synchronization on SMP machines was buggy + +v1.03 (990325) + Problems fixed: + - Interrupt routing on SMP machines could be incorrect + +v1.02 (990310) + New features: + - Support for kernel versions 2.2.x added + - Kernel patch instead of private duplicate of kernel functions + +v1.01 (980812) + Problems fixed: + Connection hangup with telnet + Slow telnet connection + +v1.00 beta 01 (980507) + New features: + None. + Problems fixed: + None. + Known limitations: + - tar archive instead of standard package format (rpm). + - FDDI statistic is empty. + - not tested with 2.1.xx kernels + - integration in kernel not tested + - not tested simultaneously with FDDI adapters from other vendors. + - only X86 processors supported. + - SBA (Synchronous Bandwidth Allocator) parameters can + not be configured. + - does not work on some COMPAQ machines. See the PCI howto + document for details about this problem. + - data corruption with kernel versions below 2.0.33. + +*** End of information file *** diff --git a/Documentation/networking/smc9.txt b/Documentation/networking/smc9.txt new file mode 100644 index 000000000..d1e15074e --- /dev/null +++ b/Documentation/networking/smc9.txt @@ -0,0 +1,42 @@ + +SMC 9xxxx Driver +Revision 0.12 +3/5/96 +Copyright 1996 Erik Stahlman +Released under terms of the GNU General Public License. + +This file contains the instructions and caveats for my SMC9xxx driver. You +should not be using the driver without reading this file. + +Things to note about installation: + + 1. The driver should work on all kernels from 1.2.13 until 1.3.71. + (A kernel patch is supplied for 1.3.71 ) + + 2. If you include this into the kernel, you might need to change some + options, such as for forcing IRQ. + + + 3. To compile as a module, run 'make' . + Make will give you the appropriate options for various kernel support. + + 4. Loading the driver as a module : + + use: insmod smc9194.o + optional parameters: + io=xxxx : your base address + irq=xx : your irq + ifport=x : 0 for whatever is default + 1 for twisted pair + 2 for AUI ( or BNC on some cards ) + +How to obtain the latest version? + +FTP: + ftp://fenris.campus.vt.edu/smc9/smc9-12.tar.gz + ftp://sfbox.vt.edu/filebox/F/fenris/smc9/smc9-12.tar.gz + + +Contacting me: + erik@mail.vt.edu + diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/spider_net.txt new file mode 100644 index 000000000..b0b75f846 --- /dev/null +++ b/Documentation/networking/spider_net.txt @@ -0,0 +1,204 @@ + + The Spidernet Device Driver + =========================== + +Written by Linas Vepstas <linas@austin.ibm.com> + +Version of 7 June 2007 + +Abstract +======== +This document sketches the structure of portions of the spidernet +device driver in the Linux kernel tree. The spidernet is a gigabit +ethernet device built into the Toshiba southbridge commonly used +in the SONY Playstation 3 and the IBM QS20 Cell blade. + +The Structure of the RX Ring. +============================= +The receive (RX) ring is a circular linked list of RX descriptors, +together with three pointers into the ring that are used to manage its +contents. + +The elements of the ring are called "descriptors" or "descrs"; they +describe the received data. This includes a pointer to a buffer +containing the received data, the buffer size, and various status bits. + +There are three primary states that a descriptor can be in: "empty", +"full" and "not-in-use". An "empty" or "ready" descriptor is ready +to receive data from the hardware. A "full" descriptor has data in it, +and is waiting to be emptied and processed by the OS. A "not-in-use" +descriptor is neither empty or full; it is simply not ready. It may +not even have a data buffer in it, or is otherwise unusable. + +During normal operation, on device startup, the OS (specifically, the +spidernet device driver) allocates a set of RX descriptors and RX +buffers. These are all marked "empty", ready to receive data. This +ring is handed off to the hardware, which sequentially fills in the +buffers, and marks them "full". The OS follows up, taking the full +buffers, processing them, and re-marking them empty. + +This filling and emptying is managed by three pointers, the "head" +and "tail" pointers, managed by the OS, and a hardware current +descriptor pointer (GDACTDPA). The GDACTDPA points at the descr +currently being filled. When this descr is filled, the hardware +marks it full, and advances the GDACTDPA by one. Thus, when there is +flowing RX traffic, every descr behind it should be marked "full", +and everything in front of it should be "empty". If the hardware +discovers that the current descr is not empty, it will signal an +interrupt, and halt processing. + +The tail pointer tails or trails the hardware pointer. When the +hardware is ahead, the tail pointer will be pointing at a "full" +descr. The OS will process this descr, and then mark it "not-in-use", +and advance the tail pointer. Thus, when there is flowing RX traffic, +all of the descrs in front of the tail pointer should be "full", and +all of those behind it should be "not-in-use". When RX traffic is not +flowing, then the tail pointer can catch up to the hardware pointer. +The OS will then note that the current tail is "empty", and halt +processing. + +The head pointer (somewhat mis-named) follows after the tail pointer. +When traffic is flowing, then the head pointer will be pointing at +a "not-in-use" descr. The OS will perform various housekeeping duties +on this descr. This includes allocating a new data buffer and +dma-mapping it so as to make it visible to the hardware. The OS will +then mark the descr as "empty", ready to receive data. Thus, when there +is flowing RX traffic, everything in front of the head pointer should +be "not-in-use", and everything behind it should be "empty". If no +RX traffic is flowing, then the head pointer can catch up to the tail +pointer, at which point the OS will notice that the head descr is +"empty", and it will halt processing. + +Thus, in an idle system, the GDACTDPA, tail and head pointers will +all be pointing at the same descr, which should be "empty". All of the +other descrs in the ring should be "empty" as well. + +The show_rx_chain() routine will print out the locations of the +GDACTDPA, tail and head pointers. It will also summarize the contents +of the ring, starting at the tail pointer, and listing the status +of the descrs that follow. + +A typical example of the output, for a nearly idle system, might be + +net eth1: Total number of descrs=256 +net eth1: Chain tail located at descr=20 +net eth1: Chain head is at 20 +net eth1: HW curr desc (GDACTDPA) is at 21 +net eth1: Have 1 descrs with stat=x40800101 +net eth1: HW next desc (GDACNEXTDA) is at 22 +net eth1: Last 255 descrs with stat=xa0800000 + +In the above, the hardware has filled in one descr, number 20. Both +head and tail are pointing at 20, because it has not yet been emptied. +Meanwhile, hw is pointing at 21, which is free. + +The "Have nnn decrs" refers to the descr starting at the tail: in this +case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers +to all of the rest of the descrs, from the last status change. The "nnn" +is a count of how many descrs have exactly the same status. + +The status x4... corresponds to "full" and status xa... corresponds +to "empty". The actual value printed is RXCOMST_A. + +In the device driver source code, a different set of names are +used for these same concepts, so that + +"empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa +"full" == SPIDER_NET_DESCR_FRAME_END == 0x4 +"not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf + + +The RX RAM full bug/feature +=========================== + +As long as the OS can empty out the RX buffers at a rate faster than +the hardware can fill them, there is no problem. If, for some reason, +the OS fails to empty the RX ring fast enough, the hardware GDACTDPA +pointer will catch up to the head, notice the not-empty condition, +ad stop. However, RX packets may still continue arriving on the wire. +The spidernet chip can save some limited number of these in local RAM. +When this local ram fills up, the spider chip will issue an interrupt +indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit +will be set in GHIINT1STS). When the RX ram full condition occurs, +a certain bug/feature is triggered that has to be specially handled. +This section describes the special handling for this condition. + +When the OS finally has a chance to run, it will empty out the RX ring. +In particular, it will clear the descriptor on which the hardware had +stopped. However, once the hardware has decided that a certain +descriptor is invalid, it will not restart at that descriptor; instead +it will restart at the next descr. This potentially will lead to a +deadlock condition, as the tail pointer will be pointing at this descr, +which, from the OS point of view, is empty; the OS will be waiting for +this descr to be filled. However, the hardware has skipped this descr, +and is filling the next descrs. Since the OS doesn't see this, there +is a potential deadlock, with the OS waiting for one descr to fill, +while the hardware is waiting for a different set of descrs to become +empty. + +A call to show_rx_chain() at this point indicates the nature of the +problem. A typical print when the network is hung shows the following: + +net eth1: Spider RX RAM full, incoming packets might be discarded! +net eth1: Total number of descrs=256 +net eth1: Chain tail located at descr=255 +net eth1: Chain head is at 255 +net eth1: HW curr desc (GDACTDPA) is at 0 +net eth1: Have 1 descrs with stat=xa0800000 +net eth1: HW next desc (GDACNEXTDA) is at 1 +net eth1: Have 127 descrs with stat=x40800101 +net eth1: Have 1 descrs with stat=x40800001 +net eth1: Have 126 descrs with stat=x40800101 +net eth1: Last 1 descrs with stat=xa0800000 + +Both the tail and head pointers are pointing at descr 255, which is +marked xa... which is "empty". Thus, from the OS point of view, there +is nothing to be done. In particular, there is the implicit assumption +that everything in front of the "empty" descr must surely also be empty, +as explained in the last section. The OS is waiting for descr 255 to +become non-empty, which, in this case, will never happen. + +The HW pointer is at descr 0. This descr is marked 0x4.. or "full". +Since its already full, the hardware can do nothing more, and thus has +halted processing. Notice that descrs 0 through 254 are all marked +"full", while descr 254 and 255 are empty. (The "Last 1 descrs" is +descr 254, since tail was at 255.) Thus, the system is deadlocked, +and there can be no forward progress; the OS thinks there's nothing +to do, and the hardware has nowhere to put incoming data. + +This bug/feature is worked around with the spider_net_resync_head_ptr() +routine. When the driver receives RX interrupts, but an examination +of the RX chain seems to show it is empty, then it is probable that +the hardware has skipped a descr or two (sometimes dozens under heavy +network conditions). The spider_net_resync_head_ptr() subroutine will +search the ring for the next full descr, and the driver will resume +operations there. Since this will leave "holes" in the ring, there +is also a spider_net_resync_tail_ptr() that will skip over such holes. + +As of this writing, the spider_net_resync() strategy seems to work very +well, even under heavy network loads. + + +The TX ring +=========== +The TX ring uses a low-watermark interrupt scheme to make sure that +the TX queue is appropriately serviced for large packet sizes. + +For packet sizes greater than about 1KBytes, the kernel can fill +the TX ring quicker than the device can drain it. Once the ring +is full, the netdev is stopped. When there is room in the ring, +the netdev needs to be reawakened, so that more TX packets are placed +in the ring. The hardware can empty the ring about four times per jiffy, +so its not appropriate to wait for the poll routine to refill, since +the poll routine runs only once per jiffy. The low-watermark mechanism +marks a descr about 1/4th of the way from the bottom of the queue, so +that an interrupt is generated when the descr is processed. This +interrupt wakes up the netdev, which can then refill the queue. +For large packets, this mechanism generates a relatively small number +of interrupts, about 1K/sec. For smaller packets, this will drop to zero +interrupts, as the hardware can empty the queue faster than the kernel +can fill it. + + + ======= END OF DOCUMENT ======== + diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt new file mode 100644 index 000000000..2bb07078f --- /dev/null +++ b/Documentation/networking/stmmac.txt @@ -0,0 +1,401 @@ + STMicroelectronics 10/100/1000 Synopsys Ethernet driver + +Copyright (C) 2007-2015 STMicroelectronics Ltd +Author: Giuseppe Cavallaro <peppe.cavallaro@st.com> + +This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers +(Synopsys IP blocks). + +Currently this network device driver is for all STi embedded MAC/GMAC +(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 +FF1152AMT0221 D1215994A VIRTEX FPGA board. + +DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether +MAC 10/100 Universal version 4.0 have been used for developing this driver. + +This driver supports both the platform bus and PCI. + +Please, for more information also visit: www.stlinux.com + +1) Kernel Configuration +The kernel configuration option is STMMAC_ETH: + Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) ---> + STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH) + +CONFIG_STMMAC_PLATFORM: is to enable the platform driver. +CONFIG_STMMAC_PCI: is to enable the pci driver. + +2) Driver parameters list: + debug: message level (0: no output, 16: all); + phyaddr: to manually provide the physical address to the PHY device; + buf_sz: DMA buffer size; + tc: control the HW FIFO threshold; + watchdog: transmit timeout (in milliseconds); + flow_ctrl: Flow control ability [on/off]; + pause: Flow Control Pause Time; + eee_timer: tx EEE timer; + chain_mode: select chain mode instead of ring. + +3) Command line options +Driver parameters can be also passed in command line by using: + stmmaceth=watchdog:100,chain_mode=1 + +4) Driver information and notes + +4.1) Transmit process +The xmit method is invoked when the kernel needs to transmit a packet; it sets +the descriptors in the ring and informs the DMA engine, that there is a packet +ready to be transmitted. +By default, the driver sets the NETIF_F_SG bit in the features field of the +net_device structure, enabling the scatter-gather feature. This is true on +chips and configurations where the checksum can be done in hardware. +Once the controller has finished transmitting the packet, timer will be +scheduled to release the transmit resources. + +4.2) Receive process +When one or more packets are received, an interrupt happens. The interrupts +are not queued, so the driver has to scan all the descriptors in the ring during +the receive process. +This is based on NAPI, so the interrupt handler signals only if there is work +to be done, and it exits. +Then the poll method will be scheduled at some future point. +The incoming packets are stored, by the DMA, in a list of pre-allocated socket +buffers in order to avoid the memcpy (zero-copy). + +4.3) Interrupt mitigation +The driver is able to mitigate the number of its DMA interrupts +using NAPI for the reception on chips older than the 3.50. +New chips have an HW RX-Watchdog used for this mitigation. +Mitigation parameters can be tuned by ethtool. + +4.4) WOL +Wake up on Lan feature through Magic and Unicast frames are supported for the +GMAC core. + +4.5) DMA descriptors +Driver handles both normal and alternate descriptors. The latter has been only +tested on DWC Ether MAC 10/100/1000 Universal version 3.41a and later. + +STMMAC supports DMA descriptor to operate both in dual buffer (RING) +and linked-list(CHAINED) mode. In RING each descriptor points to two +data buffer pointers whereas in CHAINED mode they point to only one data +buffer pointer. RING mode is the default. + +In CHAINED mode each descriptor will have pointer to next descriptor in +the list, hence creating the explicit chaining in the descriptor itself, +whereas such explicit chaining is not possible in RING mode. + +4.5.1) Extended descriptors +The extended descriptors give us information about the Ethernet payload +when it is carrying PTP packets or TCP/UDP/ICMP over IP. +These are not available on GMAC Synopsys chips older than the 3.50. +At probe time the driver will decide if these can be actually used. +This support also is mandatory for PTPv2 because the extra descriptors +are used for saving the hardware timestamps and Extended Status. + +4.6) Ethtool support +Ethtool is supported. + +For example, driver statistics (including RMON), internal errors can be taken +using: + # ethtool -S ethX +command + +4.7) Jumbo and Segmentation Offloading +Jumbo frames are supported and tested for the GMAC. +The GSO has been also added but it's performed in software. +LRO is not supported. + +4.8) Physical +The driver is compatible with Physical Abstraction Layer to be connected with +PHY and GPHY devices. + +4.9) Platform information +Several information can be passed through the platform and device-tree. + +struct plat_stmmacenet_data { + char *phy_bus_name; + int bus_id; + int phy_addr; + int interface; + struct stmmac_mdio_bus_data *mdio_bus_data; + struct stmmac_dma_cfg *dma_cfg; + int clk_csr; + int has_gmac; + int enh_desc; + int tx_coe; + int rx_coe; + int bugged_jumbo; + int pmt; + int force_sf_dma_mode; + int force_thresh_dma_mode; + int riwt_off; + int max_speed; + int maxmtu; + void (*fix_mac_speed)(void *priv, unsigned int speed); + void (*bus_setup)(void __iomem *ioaddr); + int (*init)(struct platform_device *pdev, void *priv); + void (*exit)(struct platform_device *pdev, void *priv); + void *bsp_priv; + int has_gmac4; + bool tso_en; +}; + +Where: + o phy_bus_name: phy bus name to attach to the stmmac. + o bus_id: bus identifier. + o phy_addr: the physical address can be passed from the platform. + If it is set to -1 the driver will automatically + detect it at run-time by probing all the 32 addresses. + o interface: PHY device's interface. + o mdio_bus_data: specific platform fields for the MDIO bus. + o dma_cfg: internal DMA parameters + o pbl: the Programmable Burst Length is maximum number of beats to + be transferred in one DMA transaction. + GMAC also enables the 4xPBL by default. (8xPBL for GMAC 3.50 and newer) + o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx. + o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default. + o fixed_burst/mixed_burst/aal + o clk_csr: fixed CSR Clock range selection. + o has_gmac: uses the GMAC core. + o enh_desc: if sets the MAC will use the enhanced descriptor structure. + o tx_coe: core is able to perform the tx csum in HW. + o rx_coe: the supports three check sum offloading engine types: + type_1, type_2 (full csum) and no RX coe. + o bugged_jumbo: some HWs are not able to perform the csum in HW for + over-sized frames due to limited buffer sizes. + Setting this flag the csum will be done in SW on + JUMBO frames. + o pmt: core has the embedded power module (optional). + o force_sf_dma_mode: force DMA to use the Store and Forward mode + instead of the Threshold. + o force_thresh_dma_mode: force DMA to use the Threshold mode other than + the Store and Forward mode. + o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode. + o fix_mac_speed: this callback is used for modifying some syscfg registers + (on ST SoCs) according to the link speed negotiated by the + physical layer . + o bus_setup: perform HW setup of the bus. For example, on some ST platforms + this field is used to configure the AMBA bridge to generate more + efficient STBus traffic. + o init/exit: callbacks used for calling a custom initialization; + this is sometime necessary on some platforms (e.g. ST boxes) + where the HW needs to have set some PIO lines or system cfg + registers. init/exit callbacks should not use or modify + platform data. + o bsp_priv: another private pointer. + o has_gmac4: uses GMAC4 core. + o tso_en: Enables TSO (TCP Segmentation Offload) feature. + +For MDIO bus The we have: + + struct stmmac_mdio_bus_data { + int (*phy_reset)(void *priv); + unsigned int phy_mask; + int *irqs; + int probed_phy_irq; + }; + +Where: + o phy_reset: hook to reset the phy device attached to the bus. + o phy_mask: phy mask passed when register the MDIO bus within the driver. + o irqs: list of IRQs, one per PHY. + o probed_phy_irq: if irqs is NULL, use this for probed PHY. + +For DMA engine we have the following internal fields that should be +tuned according to the HW capabilities. + +struct stmmac_dma_cfg { + int pbl; + int txpbl; + int rxpbl; + bool pblx8; + int fixed_burst; + int mixed_burst; + bool aal; +}; + +Where: + o pbl: Programmable Burst Length (tx and rx) + o txpbl: Transmit Programmable Burst Length. Only for GMAC and newer. + If set, DMA tx will use this value rather than pbl. + o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer. + If set, DMA rx will use this value rather than pbl. + o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default. + o fixed_burst: program the DMA to use the fixed burst mode + o mixed_burst: program the DMA to use the mixed burst mode + o aal: Address-Aligned Beats + +--- + +Below an example how the structures above are using on ST platforms. + + static struct plat_stmmacenet_data stxYYY_ethernet_platform_data = { + .has_gmac = 0, + .enh_desc = 0, + .fix_mac_speed = stxYYY_ethernet_fix_mac_speed, + | + |-> to write an internal syscfg + | on this platform when the + | link speed changes from 10 to + | 100 and viceversa + .init = &stmmac_claim_resource, + | + |-> On ST SoC this calls own "PAD" + | manager framework to claim + | all the resources necessary + | (GPIO ...). The .custom_cfg field + | is used to pass a custom config. +}; + +Below the usage of the stmmac_mdio_bus_data: on this SoC, in fact, +there are two MAC cores: one MAC is for MDIO Bus/PHY emulation +with fixed_link support. + +static struct stmmac_mdio_bus_data stmmac1_mdio_bus = { + .phy_reset = phy_reset; + | + |-> function to provide the phy_reset on this board + .phy_mask = 0, +}; + +static struct fixed_phy_status stmmac0_fixed_phy_status = { + .link = 1, + .speed = 100, + .duplex = 1, +}; + +During the board's device_init we can configure the first +MAC for fixed_link by calling: + fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status, -1); +and the second one, with a real PHY device attached to the bus, +by using the stmmac_mdio_bus_data structure (to provide the id, the +reset procedure etc). + +Note that, starting from new chips, where it is available the HW capability +register, many configurations are discovered at run-time for example to +understand if EEE, HW csum, PTP, enhanced descriptor etc are actually +available. As strategy adopted in this driver, the information from the HW +capability register can replace what has been passed from the platform. + +4.10) Device-tree support. + +Please see the following document: + Documentation/devicetree/bindings/net/stmmac.txt + +4.11) This is a summary of the content of some relevant files: + o stmmac_main.c: implements the main network device driver; + o stmmac_mdio.c: provides MDIO functions; + o stmmac_pci: this is the PCI driver; + o stmmac_platform.c: this the platform driver (OF supported); + o stmmac_ethtool.c: implements the ethtool support; + o stmmac.h: private driver structure; + o common.h: common definitions and VFTs; + o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP; + o stmmac_ptp.c: PTP 1588 clock; + o stmmac_pcs.h: Physical Coding Sublayer common implementation; + o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c + for STMicroelectronics SoCs. + +- GMAC 3.x + o descs.h: descriptor structure definitions; + o dwmac1000_core.c: dwmac GiGa core functions; + o dwmac1000_dma.c: dma functions for the GMAC chip; + o dwmac1000.h: specific header file for the dwmac GiGa; + o dwmac100_core: dwmac 100 core code; + o dwmac100_dma.c: dma functions for the dwmac 100 chip; + o dwmac1000.h: specific header file for the MAC; + o dwmac_lib.c: generic DMA functions; + o enh_desc.c: functions for handling enhanced descriptors; + o norm_desc.c: functions for handling normal descriptors; + o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; + +- GMAC4.x generation + o dwmac4_core.c: dwmac GMAC4.x core functions; + o dwmac4_desc.c: functions for handling GMAC4.x descriptors; + o dwmac4_descs.h: descriptor definitions; + o dwmac4_dma.c: dma functions for the GMAC4.x chip; + o dwmac4_dma.h: dma definitions for the GMAC4.x chip; + o dwmac4.h: core definitions for the GMAC4.x chip; + o dwmac4_lib.c: generic GMAC4.x functions; + +4.12) TSO support (GMAC4.x) + +TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family. +When a packet is sent through TCP protocol, the TCP stack ensures that +the SKB provided to the low level driver (stmmac in our case) matches with +the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for +MTU set to 1500)). It means that if an application using TCP want to send a +packet which will have a length (after adding headers) > 1514 the packet +will be split in several TCP packets: The data payload is split and headers +(TCP/IP ..) are added. It is done by software. + +When TSO is enabled, the TCP stack doesn't care about the maximum frame +length and provide SKB packet to stmmac as it is. The GMAC IP will have to +perform the segmentation by it self to match with maximum frame length. + +This feature can be enabled in device tree through "snps,tso" entry. + +5) Debug Information + +The driver exports many information i.e. internal statistics, +debug information, MAC and DMA registers etc. + +These can be read in several ways depending on the +type of the information actually needed. + +For example a user can be use the ethtool support +to get statistics: e.g. using: ethtool -S ethX +(that shows the Management counters (MMC) if supported) +or sees the MAC/DMA registers: e.g. using: ethtool -d ethX + +Compiling the Kernel with CONFIG_DEBUG_FS the driver will export the following +debugfs entries: + +/sys/kernel/debug/stmmaceth/descriptors_status + To show the DMA TX/RX descriptor rings + +Developer can also use the "debug" module parameter to get further debug +information (please see: NETIF Msg Level). + +6) Energy Efficient Ethernet + +Energy Efficient Ethernet(EEE) enables IEEE 802.3 MAC sublayer along +with a family of Physical layer to operate in the Low power Idle(LPI) +mode. The EEE mode supports the IEEE 802.3 MAC operation at 100Mbps, +1000Mbps & 10Gbps. + +The LPI mode allows power saving by switching off parts of the +communication device functionality when there is no data to be +transmitted & received. The system on both the side of the link can +disable some functionalities & save power during the period of low-link +utilization. The MAC controls whether the system should enter or exit +the LPI mode & communicate this to PHY. + +As soon as the interface is opened, the driver verifies if the EEE can +be supported. This is done by looking at both the DMA HW capability +register and the PHY devices MCD registers. +To enter in Tx LPI mode the driver needs to have a software timer +that enable and disable the LPI mode when there is nothing to be +transmitted. + +7) Precision Time Protocol (PTP) +The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), +which enables precise synchronization of clocks in measurement and +control systems implemented with technologies such as network +communication. + +In addition to the basic timestamp features mentioned in IEEE 1588-2002 +Timestamps, new GMAC cores support the advanced timestamp features. +IEEE 1588-2008 that can be enabled when configure the Kernel. + +8) SGMII/RGMII support +New GMAC devices provide own way to manage RGMII/SGMII. +This information is available at run-time by looking at the +HW capability register. This means that the stmmac can manage +auto-negotiation and link status w/o using the PHYLIB stuff. +In fact, the HW provides a subset of extended registers to +restart the ANE, verify Full/Half duplex mode and Speed. +Thanks to these registers, it is possible to look at the +Auto-negotiated Link Parter Ability. diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt new file mode 100644 index 000000000..a7d354ddd --- /dev/null +++ b/Documentation/networking/strparser.txt @@ -0,0 +1,207 @@ +Stream Parser (strparser) + +Introduction +============ + +The stream parser (strparser) is a utility that parses messages of an +application layer protocol running over a data stream. The stream +parser works in conjunction with an upper layer in the kernel to provide +kernel support for application layer messages. For instance, Kernel +Connection Multiplexor (KCM) uses the Stream Parser to parse messages +using a BPF program. + +The strparser works in one of two modes: receive callback or general +mode. + +In receive callback mode, the strparser is called from the data_ready +callback of a TCP socket. Messages are parsed and delivered as they are +received on the socket. + +In general mode, a sequence of skbs are fed to strparser from an +outside source. Message are parsed and delivered as the sequence is +processed. This modes allows strparser to be applied to arbitrary +streams of data. + +Interface +========= + +The API includes a context structure, a set of callbacks, utility +functions, and a data_ready function for receive callback mode. The +callbacks include a parse_msg function that is called to perform +parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function +that is called when a full message has been completed. + +Functions +========= + +strp_init(struct strparser *strp, struct sock *sk, + const struct strp_callbacks *cb) + + Called to initialize a stream parser. strp is a struct of type + strparser that is allocated by the upper layer. sk is the TCP + socket associated with the stream parser for use with receive + callback mode; in general mode this is set to NULL. Callbacks + are called by the stream parser (the callbacks are listed below). + +void strp_pause(struct strparser *strp) + + Temporarily pause a stream parser. Message parsing is suspended + and no new messages are delivered to the upper layer. + +void strp_unpause(struct strparser *strp) + + Unpause a paused stream parser. + +void strp_stop(struct strparser *strp); + + strp_stop is called to completely stop stream parser operations. + This is called internally when the stream parser encounters an + error, and it is called from the upper layer to stop parsing + operations. + +void strp_done(struct strparser *strp); + + strp_done is called to release any resources held by the stream + parser instance. This must be called after the stream processor + has been stopped. + +int strp_process(struct strparser *strp, struct sk_buff *orig_skb, + unsigned int orig_offset, size_t orig_len, + size_t max_msg_size, long timeo) + + strp_process is called in general mode for a stream parser to + parse an sk_buff. The number of bytes processed or a negative + error number is returned. Note that strp_process does not + consume the sk_buff. max_msg_size is maximum size the stream + parser will parse. timeo is timeout for completing a message. + +void strp_data_ready(struct strparser *strp); + + The upper layer calls strp_tcp_data_ready when data is ready on + the lower socket for strparser to process. This should be called + from a data_ready callback that is set on the socket. Note that + maximum messages size is the limit of the receive socket + buffer and message timeout is the receive timeout for the socket. + +void strp_check_rcv(struct strparser *strp); + + strp_check_rcv is called to check for new messages on the socket. + This is normally called at initialization of a stream parser + instance or after strp_unpause. + +Callbacks +========= + +There are six callbacks: + +int (*parse_msg)(struct strparser *strp, struct sk_buff *skb); + + parse_msg is called to determine the length of the next message + in the stream. The upper layer must implement this function. It + should parse the sk_buff as containing the headers for the + next application layer message in the stream. + + The skb->cb in the input skb is a struct strp_msg. Only + the offset field is relevant in parse_msg and gives the offset + where the message starts in the skb. + + The return values of this function are: + + >0 : indicates length of successfully parsed message + 0 : indicates more data must be received to parse the message + -ESTRPIPE : current message should not be processed by the + kernel, return control of the socket to userspace which + can proceed to read the messages itself + other < 0 : Error in parsing, give control back to userspace + assuming that synchronization is lost and the stream + is unrecoverable (application expected to close TCP socket) + + In the case that an error is returned (return value is less than + zero) and the parser is in receive callback mode, then it will set + the error on TCP socket and wake it up. If parse_msg returned + -ESTRPIPE and the stream parser had previously read some bytes for + the current message, then the error set on the attached socket is + ENODATA since the stream is unrecoverable in that case. + +void (*lock)(struct strparser *strp) + + The lock callback is called to lock the strp structure when + the strparser is performing an asynchronous operation (such as + processing a timeout). In receive callback mode the default + function is to lock_sock for the associated socket. In general + mode the callback must be set appropriately. + +void (*unlock)(struct strparser *strp) + + The unlock callback is called to release the lock obtained + by the lock callback. In receive callback mode the default + function is release_sock for the associated socket. In general + mode the callback must be set appropriately. + +void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb); + + rcv_msg is called when a full message has been received and + is queued. The callee must consume the sk_buff; it can + call strp_pause to prevent any further messages from being + received in rcv_msg (see strp_pause above). This callback + must be set. + + The skb->cb in the input skb is a struct strp_msg. This + struct contains two fields: offset and full_len. Offset is + where the message starts in the skb, and full_len is the + the length of the message. skb->len - offset may be greater + then full_len since strparser does not trim the skb. + +int (*read_sock_done)(struct strparser *strp, int err); + + read_sock_done is called when the stream parser is done reading + the TCP socket in receive callback mode. The stream parser may + read multiple messages in a loop and this function allows cleanup + to occur when exiting the loop. If the callback is not set (NULL + in strp_init) a default function is used. + +void (*abort_parser)(struct strparser *strp, int err); + + This function is called when stream parser encounters an error + in parsing. The default function stops the stream parser and + sets the error in the socket if the parser is in receive callback + mode. The default function can be changed by setting the callback + to non-NULL in strp_init. + +Statistics +========== + +Various counters are kept for each stream parser instance. These are in +the strp_stats structure. strp_aggr_stats is a convenience structure for +accumulating statistics for multiple stream parser instances. +save_strp_stats and aggregate_strp_stats are helper functions to save +and aggregate statistics. + +Message assembly limits +======================= + +The stream parser provide mechanisms to limit the resources consumed by +message assembly. + +A timer is set when assembly starts for a new message. In receive +callback mode the message timeout is taken from rcvtime for the +associated TCP socket. In general mode, the timeout is passed as an +argument in strp_process. If the timer fires before assembly completes +the stream parser is aborted and the ETIMEDOUT error is set on the TCP +socket if in receive callback mode. + +In receive callback mode, message length is limited to the receive +buffer size of the associated TCP socket. If the length returned by +parse_msg is greater than the socket buffer size then the stream parser +is aborted with EMSGSIZE error set on the TCP socket. Note that this +makes the maximum size of receive skbuffs for a socket with a stream +parser to be 2*sk_rcvbuf of the TCP socket. + +In general mode the message length limit is passed in as an argument +to strp_process. + +Author +====== + +Tom Herbert (tom@quantonium.net) + diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt new file mode 100644 index 000000000..82236a17b --- /dev/null +++ b/Documentation/networking/switchdev.txt @@ -0,0 +1,394 @@ +Ethernet switch device driver model (switchdev) +=============================================== +Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us> +Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com> + + +The Ethernet switch device driver model (switchdev) is an in-kernel driver +model for switch devices which offload the forwarding (data) plane from the +kernel. + +Figure 1 is a block diagram showing the components of the switchdev model for +an example setup using a data-center-class switch ASIC chip. Other setups +with SR-IOV or soft switches, such as OVS, are possible. + + + User-space tools + + user space | + +-------------------------------------------------------------------+ + kernel | Netlink + | + +--------------+-------------------------------+ + | Network stack | + | (Linux) | + | | + +----------------------------------------------+ + + sw1p2 sw1p4 sw1p6 + sw1p1 + sw1p3 + sw1p5 + eth1 + + | + | + | + + | | | | | | | + +--+----+----+----+----+----+---+ +-----+-----+ + | Switch driver | | mgmt | + | (this document) | | driver | + | | | | + +--------------+----------------+ +-----------+ + | + kernel | HW bus (eg PCI) + +-------------------------------------------------------------------+ + hardware | + +--------------+----------------+ + | Switch device (sw1) | + | +----+ +--------+ + | | v offloaded data path | mgmt port + | | | | + +--|----|----+----+----+----+---+ + | | | | | | + + + + + + + + p1 p2 p3 p4 p5 p6 + + front-panel ports + + + Fig 1. + + +Include Files +------------- + +#include <linux/netdevice.h> +#include <net/switchdev.h> + + +Configuration +------------- + +Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model +support is built for driver. + + +Switch Ports +------------ + +On switchdev driver initialization, the driver will allocate and register a +struct net_device (using register_netdev()) for each enumerated physical switch +port, called the port netdev. A port netdev is the software representation of +the physical port and provides a conduit for control traffic to/from the +controller (the kernel) and the network, as well as an anchor point for higher +level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using +standard netdev tools (iproute2, ethtool, etc), the port netdev can also +provide to the user access to the physical properties of the switch port such +as PHY link state and I/O statistics. + +There is (currently) no higher-level kernel object for the switch beyond the +port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. + +A switch management port is outside the scope of the switchdev driver model. +Typically, the management port is not participating in offloaded data plane and +is loaded with a different driver, such as a NIC driver, on the management port +device. + +Switch ID +^^^^^^^^^ + +The switchdev driver must implement the switchdev op switchdev_port_attr_get +for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same +physical ID for each port of a switch. The ID must be unique between switches +on the same system. The ID does not need to be unique between switches on +different systems. + +The switch ID is used to locate ports on a switch and to know if aggregated +ports belong to the same switch. + +Port Netdev Naming +^^^^^^^^^^^^^^^^^^ + +Udev rules should be used for port netdev naming, using some unique attribute +of the port as a key, for example the port MAC address or the port PHYS name. +Hard-coding of kernel netdev names within the driver is discouraged; let the +kernel pick the default netdev name, and let udev set the final name based on a +port attribute. + +Using port PHYS name (ndo_get_phys_port_name) for the key is particularly +useful for dynamically-named ports where the device names its ports based on +external configuration. For example, if a physical 40G port is split logically +into 4 10G ports, resulting in 4 port netdevs, the device can give a unique +name for each port using port PHYS name. The udev rule would be: + +SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \ + ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" + +Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y +is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 +would be sub-port 0 on port 1 on switch 1. + +Port Features +^^^^^^^^^^^^^ + +NETIF_F_NETNS_LOCAL + +If the switchdev driver (and device) only supports offloading of the default +network namespace (netns), the driver should set this feature flag to prevent +the port netdev from being moved out of the default netns. A netns-aware +driver/device would not set this flag and be responsible for partitioning +hardware to preserve netns containment. This means hardware cannot forward +traffic from a port in one namespace to another port in another namespace. + +Port Topology +^^^^^^^^^^^^^ + +The port netdevs representing the physical switch ports can be organized into +higher-level switching constructs. The default construct is a standalone +router port, used to offload L3 forwarding. Two or more ports can be bonded +together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge +L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 +tunnels can be built on ports. These constructs are built using standard Linux +tools such as the bridge driver, the bonding/team drivers, and netlink-based +tools such as iproute2. + +The switchdev driver can know a particular port's position in the topology by +monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a +bond will see it's upper master change. If that bond is moved into a bridge, +the bond's upper master will change. And so on. The driver will track such +movements to know what position a port is in in the overall topology by +registering for netdevice events and acting on NETDEV_CHANGEUPPER. + +L2 Forwarding Offload +--------------------- + +The idea is to offload the L2 data forwarding (switching) path from the kernel +to the switchdev device by mirroring bridge FDB entries down to the device. An +FDB entry is the {port, MAC, VLAN} tuple forwarding destination. + +To offloading L2 bridging, the switchdev driver/device should support: + + - Static FDB entries installed on a bridge port + - Notification of learned/forgotten src mac/vlans from device + - STP state changes on the port + - VLAN flooding of multicast/broadcast and unknown unicast packets + +Static FDB Entries +^^^^^^^^^^^^^^^^^^ + +The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump +to support static FDB entries installed to the device. Static bridge FDB +entries are installed, for example, using iproute2 bridge cmd: + + bridge fdb add ADDR dev DEV [vlan VID] [self] + +The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx +ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using +switchdev_port_obj_xxx ops. + +XXX: what should be done if offloading this rule to hardware fails (for +example, due to full capacity in hardware tables) ? + +Note: by default, the bridge does not filter on VLAN and only bridges untagged +traffic. To enable VLAN support, turn on VLAN filtering: + + echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering + +Notification of Learned/Forgotten Source MAC/VLANs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switch device will learn/forget source MAC address/VLAN on ingress packets +and notify the switch driver of the mac/vlan/port tuples. The switch driver, +in turn, will notify the bridge driver using the switchdev notifier call: + + err = call_switchdev_notifiers(val, dev, info); + +Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when +forgetting, and info points to a struct switchdev_notifier_fdb_info. On +SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the +bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge +command will label these entries "offload": + + $ bridge fdb + 52:54:00:12:35:01 dev sw1p1 master br0 permanent + 00:02:00:00:02:00 dev sw1p1 master br0 offload + 00:02:00:00:02:00 dev sw1p1 self + 52:54:00:12:35:02 dev sw1p2 master br0 permanent + 00:02:00:00:03:00 dev sw1p2 master br0 offload + 00:02:00:00:03:00 dev sw1p2 self + 33:33:00:00:00:01 dev eth0 self permanent + 01:00:5e:00:00:01 dev eth0 self permanent + 33:33:ff:00:00:00 dev eth0 self permanent + 01:80:c2:00:00:0e dev eth0 self permanent + 33:33:00:00:00:01 dev br0 self permanent + 01:00:5e:00:00:01 dev br0 self permanent + 33:33:ff:12:35:01 dev br0 self permanent + +Learning on the port should be disabled on the bridge using the bridge command: + + bridge link set dev DEV learning off + +Learning on the device port should be enabled, as well as learning_sync: + + bridge link set dev DEV learning on self + bridge link set dev DEV learning_sync on self + +Learning_sync attribute enables syncing of the learned/forgotten FDB entry to +the bridge's FDB. It's possible, but not optimal, to enable learning on the +device port and on the bridge port, and disable learning_sync. + +To support learning and learning_sync port attributes, the driver implements +switchdev op switchdev_port_attr_get/set for +SWITCHDEV_ATTR_PORT_ID_BRIDGE_FLAGS. The driver should initialize the attributes +to the hardware defaults. + +FDB Ageing +^^^^^^^^^^ + +The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is +the responsibility of the port driver/device to age out these entries. If the +port device supports ageing, when the FDB entry expires, it will notify the +driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the +device does not support ageing, the driver can simulate ageing using a +garbage collection timer to monitor FDB entries. Expired entries will be +notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for +example of driver running ageing timer. + +To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB +entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The +notification will reset the FDB entry's last-used time to now. The driver +should rate limit refresh notifications, for example, no more than once a +second. (The last-used time is visible using the bridge -s fdb option). + +STP State Change on Port +^^^^^^^^^^^^^^^^^^^^^^^^ + +Internally or with a third-party STP protocol implementation (e.g. mstpd), the +bridge driver maintains the STP state for ports, and will notify the switch +driver of STP state change on a port using the switchdev op +switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. + +State is one of BR_STATE_*. The switch driver can use STP state updates to +update ingress packet filter list for the port. For example, if port is +DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs +and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. + +Note that STP BDPUs are untagged and STP state applies to all VLANs on the port +so packet filters should be applied consistently across untagged and tagged +VLANs on the port. + +Flooding L2 domain +^^^^^^^^^^^^^^^^^^ + +For a given L2 VLAN domain, the switch device should flood multicast/broadcast +and unknown unicast packets to all ports in domain, if allowed by port's +current STP state. The switch driver, knowing which ports are within which +vlan L2 domain, can program the switch device for flooding. The packet may +be sent to the port netdev for processing by the bridge driver. The +bridge should not reflood the packet to the same ports the device flooded, +otherwise there will be duplicate packets on the wire. + +To avoid duplicate packets, the switch driver should mark a packet as already +forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark +the skb using the ingress bridge port's mark and prevent it from being forwarded +through any bridge port with the same mark. + +It is possible for the switch device to not handle flooding and push the +packets up to the bridge driver for flooding. This is not ideal as the number +of ports scale in the L2 domain as the device is much more efficient at +flooding packets that software. + +If supported by the device, flood control can be offloaded to it, preventing +certain netdevs from flooding unicast traffic for which there is no FDB entry. + +IGMP Snooping +^^^^^^^^^^^^^ + +In order to support IGMP snooping, the port netdevs should trap to the bridge +driver all IGMP join and leave messages. +The bridge multicast module will notify port netdevs on every multicast group +changed whether it is static configured or dynamically joined/leave. +The hardware implementation should be forwarding all registered multicast +traffic groups only to the configured ports. + +L3 Routing Offload +------------------ + +Offloading L3 routing requires that device be programmed with FIB entries from +the kernel, with the device doing the FIB lookup and forwarding. The device +does a longest prefix match (LPM) on FIB entries matching route prefix and +forwards the packet to the matching FIB entry's nexthop(s) egress ports. + +To program the device, the driver has to register a FIB notifier handler +using register_fib_notifier. The following events are available: +FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device, + or modifying an existing entry on the device. +FIB_EVENT_ENTRY_DEL: used for removing a FIB entry +FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes + +FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass: + + struct fib_entry_notifier_info { + struct fib_notifier_info info; /* must be first */ + u32 dst; + int dst_len; + struct fib_info *fi; + u8 tos; + u8 type; + u32 tb_id; + u32 nlflags; + }; + +to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi +structure holds details on the route and route's nexthops. *dev is one of the +port netdevs mentioned in the route's next hop list. + +Routes offloaded to the device are labeled with "offload" in the ip route +listing: + + $ ip route show + default via 192.168.0.2 dev eth0 + 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload + 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload + 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 12.0.0.2 proto zebra metric 30 offload + nexthop via 11.0.0.1 dev sw1p1 weight 1 + nexthop via 11.0.0.9 dev sw1p2 weight 1 + 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 + +The "offload" flag is set in case at least one device offloads the FIB entry. + +XXX: add/mod/del IPv6 FIB API + +Nexthop Resolution +^^^^^^^^^^^^^^^^^^ + +The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for +the switch device to forward the packet with the correct dst mac address, the +nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac +address discovery comes via the ARP (or ND) process and is available via the +arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver +should trigger the kernel's neighbor resolution process. See the rocker +driver's rocker_port_ipv4_resolve() for an example. + +The driver can monitor for updates to arp_tbl using the netevent notifier +NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops +for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy +to know when arp_tbl neighbor entries are purged from the port. + +Transaction item queue +^^^^^^^^^^^^^^^^^^^^^^ + +For switchdev ops attr_set and obj_add, there is a 2 phase transaction model +used. First phase is to "prepare" anything needed, including various checks, +memory allocation, etc. The goal is to handle the stuff that is not unlikely +to fail here. The second phase is to "commit" the actual changes. + +Switchdev provides an infrastructure for sharing items (for example memory +allocations) between the two phases. + +The object created by a driver in "prepare" phase and it is queued up by: +switchdev_trans_item_enqueue() +During the "commit" phase, the driver gets the object by: +switchdev_trans_item_dequeue() + +If a transaction is aborted during "prepare" phase, switchdev code will handle +cleanup of the queued-up objects. diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt new file mode 100644 index 000000000..f37814693 --- /dev/null +++ b/Documentation/networking/tc-actions-env-rules.txt @@ -0,0 +1,24 @@ + +The "environmental" rules for authors of any new tc actions are: + +1) If you stealeth or borroweth any packet thou shalt be branching +from the righteous path and thou shalt cloneth. + +For example if your action queues a packet to be processed later, +or intentionally branches by redirecting a packet, then you need to +clone the packet. + +2) If you munge any packet thou shalt call pskb_expand_head in the case +someone else is referencing the skb. After that you "own" the skb. + +3) Dropping packets you don't own is a no-no. You simply return +TC_ACT_SHOT to the caller and they will drop it. + +The "environmental" rules for callers of actions (qdiscs etc) are: + +*) Thou art responsible for freeing anything returned as being +TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is +returned, then all is great and you don't need to do anything. + +Post on netdev if something is unclear. + diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.txt new file mode 100644 index 000000000..151e22998 --- /dev/null +++ b/Documentation/networking/tcp-thin.txt @@ -0,0 +1,47 @@ +Thin-streams and TCP +==================== +A wide range of Internet-based services that use reliable transport +protocols display what we call thin-stream properties. This means +that the application sends data with such a low rate that the +retransmission mechanisms of the transport protocol are not fully +effective. In time-dependent scenarios (like online games, control +systems, stock trading etc.) where the user experience depends +on the data delivery latency, packet loss can be devastating for +the service quality. Extreme latencies are caused by TCP's +dependency on the arrival of new data from the application to trigger +retransmissions effectively through fast retransmit instead of +waiting for long timeouts. + +After analysing a large number of time-dependent interactive +applications, we have seen that they often produce thin streams +and also stay with this traffic pattern throughout its entire +lifespan. The combination of time-dependency and the fact that the +streams provoke high latencies when using TCP is unfortunate. + +In order to reduce application-layer latency when packets are lost, +a set of mechanisms has been made, which address these latency issues +for thin streams. In short, if the kernel detects a thin stream, +the retransmission mechanisms are modified in the following manner: + +1) If the stream is thin, fast retransmit on the first dupACK. +2) If the stream is thin, do not apply exponential backoff. + +These enhancements are applied only if the stream is detected as +thin. This is accomplished by defining a threshold for the number +of packets in flight. If there are less than 4 packets in flight, +fast retransmissions can not be triggered, and the stream is prone +to experience high retransmission latencies. + +Since these mechanisms are targeted at time-dependent applications, +they must be specifically activated by the application using the +TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the +tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both +modifications are turned off by default. + +References +========== +More information on the modifications, as well as a wide range of +experimental data can be found here: +"Improving latency for interactive, thin-stream applications over +reliable transport" +http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt new file mode 100644 index 000000000..9c7139d57 --- /dev/null +++ b/Documentation/networking/tcp.txt @@ -0,0 +1,101 @@ +TCP protocol +============ + +Last updated: 3 June 2017 + +Contents +======== + +- Congestion control +- How the new TCP output machine [nyi] works + +Congestion control +================== + +The following variables are used in the tcp_sock for congestion control: +snd_cwnd The size of the congestion window +snd_ssthresh Slow start threshold. We are in slow start if + snd_cwnd is less than this. +snd_cwnd_cnt A counter used to slow down the rate of increase + once we exceed slow start threshold. +snd_cwnd_clamp This is the maximum size that snd_cwnd can grow to. +snd_cwnd_stamp Timestamp for when congestion window last validated. +snd_cwnd_used Used as a highwater mark for how much of the + congestion window is in use. It is used to adjust + snd_cwnd down when the link is limited by the + application rather than the network. + +As of 2.6.13, Linux supports pluggable congestion control algorithms. +A congestion control mechanism can be registered through functions in +tcp_cong.c. The functions used by the congestion control mechanism are +registered via passing a tcp_congestion_ops struct to +tcp_register_congestion_control. As a minimum, the congestion control +mechanism must provide a valid name and must implement either ssthresh, +cong_avoid and undo_cwnd hooks or the "omnipotent" cong_control hook. + +Private data for a congestion control mechanism is stored in tp->ca_priv. +tcp_ca(tp) returns a pointer to this space. This is preallocated space - it +is important to check the size of your private data will fit this space, or +alternatively, space could be allocated elsewhere and a pointer to it could +be stored here. + +There are three kinds of congestion control algorithms currently: The +simplest ones are derived from TCP reno (highspeed, scalable) and just +provide an alternative congestion window calculation. More complex +ones like BIC try to look at other events to provide better +heuristics. There are also round trip time based algorithms like +Vegas and Westwood+. + +Good TCP congestion control is a complex problem because the algorithm +needs to maintain fairness and performance. Please review current +research and RFC's before developing new modules. + +The default congestion control mechanism is chosen based on the +DEFAULT_TCP_CONG Kconfig parameter. If you really want a particular default +value then you can set it using sysctl net.ipv4.tcp_congestion_control. The +module will be autoloaded if needed and you will get the expected protocol. If +you ask for an unknown congestion method, then the sysctl attempt will fail. + +If you remove a TCP congestion control module, then you will get the next +available one. Since reno cannot be built as a module, and cannot be +removed, it will always be available. + +How the new TCP output machine [nyi] works. +=========================================== + +Data is kept on a single queue. The skb->users flag tells us if the frame is +one that has been queued already. To add a frame we throw it on the end. Ack +walks down the list from the start. + +We keep a set of control flags + + + sk->tcp_pend_event + + TCP_PEND_ACK Ack needed + TCP_ACK_NOW Needed now + TCP_WINDOW Window update check + TCP_WINZERO Zero probing + + + sk->transmit_queue The transmission frame begin + sk->transmit_new First new frame pointer + sk->transmit_end Where to add frames + + sk->tcp_last_tx_ack Last ack seen + sk->tcp_dup_ack Dup ack count for fast retransmit + + +Frames are queued for output by tcp_write. We do our best to send the frames +off immediately if possible, but otherwise queue and compute the body +checksum in the copy. + +When a write is done we try to clear any pending events and piggy back them. +If the window is full we queue full sized frames. On the first timeout in +zero window we split this. + +On a timer we walk the retransmit list to send any retransmits, update the +backoff timers etc. A change of route table stamp causes a change of header +and recompute. We add any new tcp level headers and refinish the checksum +before sending. + diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt new file mode 100644 index 000000000..5a013686b --- /dev/null +++ b/Documentation/networking/team.txt @@ -0,0 +1,2 @@ +Team devices are driven from userspace via libteam library which is here: + https://github.com/jpirko/libteam diff --git a/Documentation/networking/ti-cpsw.txt b/Documentation/networking/ti-cpsw.txt new file mode 100644 index 000000000..d4d4c0751 --- /dev/null +++ b/Documentation/networking/ti-cpsw.txt @@ -0,0 +1,541 @@ +* Texas Instruments CPSW ethernet driver + +Multiqueue & CBS & MQPRIO +===================================================================== +===================================================================== + +The cpsw has 3 CBS shapers for each external ports. This document +describes MQPRIO and CBS Qdisc offload configuration for cpsw driver +based on examples. It potentially can be used in audio video bridging +(AVB) and time sensitive networking (TSN). + +The following examples were tested on AM572x EVM and BBB boards. + +Test setup +========== + +Under consideration two examples with AM572x EVM running cpsw driver +in dual_emac mode. + +Several prerequisites: +- TX queues must be rated starting from txq0 that has highest priority +- Traffic classes are used starting from 0, that has highest priority +- CBS shapers should be used with rated queues +- The bandwidth for CBS shapers has to be set a little bit more then + potential incoming rate, thus, rate of all incoming tx queues has + to be a little less +- Real rates can differ, due to discreetness +- Map skb-priority to txq is not enough, also skb-priority to l2 prio + map has to be created with ip or vconfig tool +- Any l2/socket prio (0 - 7) for classes can be used, but for + simplicity default values are used: 3 and 2 +- only 2 classes tested: A and B, but checked and can work with more, + maximum allowed 4, but only for 3 rate can be set. + +Test setup for examples +======================= + +-------------------------------+ + |--+ | + | | Workstation0 | + |E | MAC 18:03:73:66:87:42 | ++-----------------------------+ +--|t | | +| | 1 | E | | |h |./tsn_listener -d \ | +| Target board: | 0 | t |--+ |0 | 18:03:73:66:87:42 -i eth0 \| +| AM572x EVM | 0 | h | | | -s 1500 | +| | 0 | 0 | |--+ | +| Only 2 classes: |Mb +---| +-------------------------------+ +| class A, class B | | +| | +---| +-------------------------------+ +| | 1 | E | |--+ | +| | 0 | t | | | Workstation1 | +| | 0 | h |--+ |E | MAC 20:cf:30:85:7d:fd | +| |Mb | 1 | +--|t | | ++-----------------------------+ |h |./tsn_listener -d \ | + |0 | 20:cf:30:85:7d:fd -i eth0 \| + | | -s 1500 | + |--+ | + +-------------------------------+ + +********************************************************************* +********************************************************************* +********************************************************************* +Example 1: One port tx AVB configuration scheme for target board +---------------------------------------------------------------------- +(prints and scheme for AM572x evm, applicable for single port boards) + +tc - traffic class +txq - transmit queue +p - priority +f - fifo (cpsw fifo) +S - shaper configured + ++------------------------------------------------------------------+ u +| +---------------+ +---------------+ +------+ +------+ | s +| | | | | | | | | | e +| | App 1 | | App 2 | | Apps | | Apps | | r +| | Class A | | Class B | | Rest | | Rest | | +| | Eth0 | | Eth0 | | Eth0 | | Eth1 | | s +| | VLAN100 | | VLAN100 | | | | | | | | p +| | 40 Mb/s | | 20 Mb/s | | | | | | | | a +| | SO_PRIORITY=3 | | SO_PRIORITY=2 | | | | | | | | c +| | | | | | | | | | | | | | e +| +---|-----------+ +---|-----------+ +---|--+ +---|--+ | ++-----|------------------|------------------|--------|-------------+ + +-+ +------------+ | | + | | +-----------------+ +--+ + | | | | ++---|-------|-------------|-----------------------|----------------+ +| +----+ +----+ +----+ +----+ +----+ | +| | p3 | | p2 | | p1 | | p0 | | p0 | | k +| \ / \ / \ / \ / \ / | e +| \ / \ / \ / \ / \ / | r +| \/ \/ \/ \/ \/ | n +| | | | | | e +| | | +-----+ | | l +| | | | | | +| +----+ +----+ +----+ +----+ | s +| |tc0 | |tc1 | |tc2 | |tc0 | | p +| \ / \ / \ / \ / | a +| \ / \ / \ / \ / | c +| \/ \/ \/ \/ | e +| | | +-----+ | | +| | | | | | | +| | | | | | | +| | | | | | | +| +----+ +----+ +----+ +----+ +----+ | +| |txq0| |txq1| |txq2| |txq3| |txq4| | +| \ / \ / \ / \ / \ / | +| \ / \ / \ / \ / \ / | +| \/ \/ \/ \/ \/ | +| +-|------|------|------|--+ +--|--------------+ | +| | | | | | | Eth0.100 | | Eth1 | | ++---|------|------|------|------------------------|----------------+ + | | | | | + p p p p | + 3 2 0-1, 4-7 <- L2 priority | + | | | | | + | | | | | ++---|------|------|------|------------------------|----------------+ +| | | | | |----------+ | +| +----+ +----+ +----+ +----+ +----+ | +| |dma7| |dma6| |dma5| |dma4| |dma3| | +| \ / \ / \ / \ / \ / | c +| \S / \S / \ / \ / \ / | p +| \/ \/ \/ \/ \/ | s +| | | | +----- | | w +| | | | | | | +| | | | | | | d +| +----+ +----+ +----+p p+----+ | r +| | | | | | |o o| | | i +| | f3 | | f2 | | f0 |r r| f0 | | v +| |tc0 | |tc1 | |tc2 |t t|tc0 | | e +| \CBS / \CBS / \CBS /1 2\CBS / | r +| \S / \S / \ / \ / | +| \/ \/ \/ \/ | ++------------------------------------------------------------------+ +========================================Eth==========================> + +1) +// Add 4 tx queues, for interface Eth0, and 1 tx queue for Eth1 +$ ethtool -L eth0 rx 1 tx 5 +rx unmodified, ignoring + +2) +// Check if num of queues is set correctly: +$ ethtool -l eth0 +Channel parameters for eth0: +Pre-set maximums: +RX: 8 +TX: 8 +Other: 0 +Combined: 0 +Current hardware settings: +RX: 1 +TX: 5 +Other: 0 +Combined: 0 + +3) +// TX queues must be rated starting from 0, so set bws for tx0 and tx1 +// Set rates 40 and 20 Mb/s appropriately. +// Pay attention, real speed can differ a bit due to discreetness. +// Leave last 2 tx queues not rated. +$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate +$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate + +4) +// Check maximum rate of tx (cpdma) queues: +$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate +40 +20 +0 +0 +0 + +5) +// Map skb->priority to traffic class: +// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2 +// Map traffic class to transmit queue: +// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq2, txq3) +$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \ +map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1 + +5a) +// As two interface sharing same set of tx queues, assign all traffic +// coming to interface Eth1 to separate queue in order to not mix it +// with traffic from interface Eth0, so use separate txq to send +// packets to Eth1, so all prio -> tc0 and tc0 -> txq4 +// Here hw 0, so here still default configuration for eth1 in hw +$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 1 \ +map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@4 hw 0 + +6) +// Check classes settings +$ tc -g class show dev eth0 ++---(100:ffe2) mqprio +| +---(100:3) mqprio +| +---(100:4) mqprio +| ++---(100:ffe1) mqprio +| +---(100:2) mqprio +| ++---(100:ffe0) mqprio + +---(100:1) mqprio + +$ tc -g class show dev eth1 ++---(100:ffe0) mqprio + +---(100:5) mqprio + +7) +// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc +// Set it +1 Mb for reserve (important!) +// here only idle slope is important, others arg are ignored +// Pay attention, real speed can differ a bit due to discreetness +$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1438 \ +hicredit 62 sendslope -959000 idleslope 41000 offload 1 +net eth0: set FIFO3 bw = 50 + +8) +// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc: +// Set it +1 Mb for reserve (important!) +$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1468 \ +hicredit 65 sendslope -979000 idleslope 21000 offload 1 +net eth0: set FIFO2 bw = 30 + +9) +// Create vlan 100 to map sk->priority to vlan qos +$ ip link add link eth0 name eth0.100 type vlan id 100 +8021q: 802.1Q VLAN Support v1.8 +8021q: adding VLAN 0 to HW filter on device eth0 +8021q: adding VLAN 0 to HW filter on device eth1 +net eth0: Adding vlanid 100 to vlan filter + +10) +// Map skb->priority to L2 prio, 1 to 1 +$ ip link set eth0.100 type vlan \ +egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 + +11) +// Check egress map for vlan 100 +$ cat /proc/net/vlan/eth0.100 +[...] +INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 +EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 + +12) +// Run your appropriate tools with socket option "SO_PRIORITY" +// to 3 for class A and/or to 2 for class B +// (I took at https://www.spinics.net/lists/netdev/msg460869.html) +./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500& +./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500& + +13) +// run your listener on workstation (should be in same vlan) +// (I took at https://www.spinics.net/lists/netdev/msg460869.html) +./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500 +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39000 kbps + +14) +// Restore default configuration if needed +$ ip link del eth0.100 +$ tc qdisc del dev eth1 root +$ tc qdisc del dev eth0 root +net eth0: Prev FIFO2 is shaped +net eth0: set FIFO3 bw = 0 +net eth0: set FIFO2 bw = 0 +$ ethtool -L eth0 rx 1 tx 1 + +********************************************************************* +********************************************************************* +********************************************************************* +Example 2: Two port tx AVB configuration scheme for target board +---------------------------------------------------------------------- +(prints and scheme for AM572x evm, for dual emac boards only) + ++------------------------------------------------------------------+ u +| +----------+ +----------+ +------+ +----------+ +----------+ | s +| | | | | | | | | | | | e +| | App 1 | | App 2 | | Apps | | App 3 | | App 4 | | r +| | Class A | | Class B | | Rest | | Class B | | Class A | | +| | Eth0 | | Eth0 | | | | | Eth1 | | Eth1 | | s +| | VLAN100 | | VLAN100 | | | | | VLAN100 | | VLAN100 | | p +| | 40 Mb/s | | 20 Mb/s | | | | | 10 Mb/s | | 30 Mb/s | | a +| | SO_PRI=3 | | SO_PRI=2 | | | | | SO_PRI=3 | | SO_PRI=2 | | c +| | | | | | | | | | | | | | | | | e +| +---|------+ +---|------+ +---|--+ +---|------+ +---|------+ | ++-----|-------------|-------------|---------|-------------|--------+ + +-+ +-------+ | +----------+ +----+ + | | +-------+------+ | | + | | | | | | ++---|-------|-------------|--------------|-------------|-------|---+ +| +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | +| | p3 | | p2 | | p1 | | p0 | | p0 | | p1 | | p2 | | p3 | | k +| \ / \ / \ / \ / \ / \ / \ / \ / | e +| \ / \ / \ / \ / \ / \ / \ / \ / | r +| \/ \/ \/ \/ \/ \/ \/ \/ | n +| | | | | | | | e +| | | +----+ +----+ | | | l +| | | | | | | | +| +----+ +----+ +----+ +----+ +----+ +----+ | s +| |tc0 | |tc1 | |tc2 | |tc2 | |tc1 | |tc0 | | p +| \ / \ / \ / \ / \ / \ / | a +| \ / \ / \ / \ / \ / \ / | c +| \/ \/ \/ \/ \/ \/ | e +| | | +-----+ +-----+ | | | +| | | | | | | | | | +| | | | | | | | | | +| | | | | E E | | | | | +| +----+ +----+ +----+ +----+ t t +----+ +----+ +----+ +----+ | +| |txq0| |txq1| |txq4| |txq5| h h |txq6| |txq7| |txq3| |txq2| | +| \ / \ / \ / \ / 0 1 \ / \ / \ / \ / | +| \ / \ / \ / \ / . . \ / \ / \ / \ / | +| \/ \/ \/ \/ 1 1 \/ \/ \/ \/ | +| +-|------|------|------|--+ 0 0 +-|------|------|------|--+ | +| | | | | | | 0 0 | | | | | | | ++---|------|------|------|---------------|------|------|------|----+ + | | | | | | | | + p p p p p p p p + 3 2 0-1, 4-7 <-L2 pri-> 0-1, 4-7 2 3 + | | | | | | | | + | | | | | | | | ++---|------|------|------|---------------|------|------|------|----+ +| | | | | | | | | | +| +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | +| |dma7| |dma6| |dma3| |dma2| |dma1| |dma0| |dma4| |dma5| | +| \ / \ / \ / \ / \ / \ / \ / \ / | c +| \S / \S / \ / \ / \ / \ / \S / \S / | p +| \/ \/ \/ \/ \/ \/ \/ \/ | s +| | | | +----- | | | | | w +| | | | | +----+ | | | | +| | | | | | | | | | d +| +----+ +----+ +----+p p+----+ +----+ +----+ | r +| | | | | | |o o| | | | | | | i +| | f3 | | f2 | | f0 |r CPSW r| f3 | | f2 | | f0 | | v +| |tc0 | |tc1 | |tc2 |t t|tc0 | |tc1 | |tc2 | | e +| \CBS / \CBS / \CBS /1 2\CBS / \CBS / \CBS / | r +| \S / \S / \ / \S / \S / \ / | +| \/ \/ \/ \/ \/ \/ | ++------------------------------------------------------------------+ +========================================Eth==========================> + +1) +// Add 8 tx queues, for interface Eth0, but they are common, so are accessed +// by two interfaces Eth0 and Eth1. +$ ethtool -L eth1 rx 1 tx 8 +rx unmodified, ignoring + +2) +// Check if num of queues is set correctly: +$ ethtool -l eth0 +Channel parameters for eth0: +Pre-set maximums: +RX: 8 +TX: 8 +Other: 0 +Combined: 0 +Current hardware settings: +RX: 1 +TX: 8 +Other: 0 +Combined: 0 + +3) +// TX queues must be rated starting from 0, so set bws for tx0 and tx1 for Eth0 +// and for tx2 and tx3 for Eth1. That is, rates 40 and 20 Mb/s appropriately +// for Eth0 and 30 and 10 Mb/s for Eth1. +// Real speed can differ a bit due to discreetness +// Leave last 4 tx queues as not rated +$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate +$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate +$ echo 30 > /sys/class/net/eth1/queues/tx-2/tx_maxrate +$ echo 10 > /sys/class/net/eth1/queues/tx-3/tx_maxrate + +4) +// Check maximum rate of tx (cpdma) queues: +$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate +40 +20 +30 +10 +0 +0 +0 +0 + +5) +// Map skb->priority to traffic class for Eth0: +// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2 +// Map traffic class to transmit queue: +// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq4, txq5) +$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \ +map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@4 hw 1 + +6) +// Check classes settings +$ tc -g class show dev eth0 ++---(100:ffe2) mqprio +| +---(100:5) mqprio +| +---(100:6) mqprio +| ++---(100:ffe1) mqprio +| +---(100:2) mqprio +| ++---(100:ffe0) mqprio + +---(100:1) mqprio + +7) +// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc for Eth0 +// here only idle slope is important, others ignored +// Real speed can differ a bit due to discreetness +$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1470 \ +hicredit 62 sendslope -959000 idleslope 41000 offload 1 +net eth0: set FIFO3 bw = 50 + +8) +// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc for Eth0 +$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \ +hicredit 65 sendslope -979000 idleslope 21000 offload 1 +net eth0: set FIFO2 bw = 30 + +9) +// Create vlan 100 to map sk->priority to vlan qos for Eth0 +$ ip link add link eth0 name eth0.100 type vlan id 100 +net eth0: Adding vlanid 100 to vlan filter + +10) +// Map skb->priority to L2 prio for Eth0.100, one to one +$ ip link set eth0.100 type vlan \ +egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 + +11) +// Check egress map for vlan 100 +$ cat /proc/net/vlan/eth0.100 +[...] +INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 +EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 + +12) +// Map skb->priority to traffic class for Eth1: +// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2 +// Map traffic class to transmit queue: +// tc0 -> txq2, tc1 -> txq3, tc2 -> (txq6, txq7) +$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 3 \ +map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@2 1@3 2@6 hw 1 + +13) +// Check classes settings +$ tc -g class show dev eth1 ++---(100:ffe2) mqprio +| +---(100:7) mqprio +| +---(100:8) mqprio +| ++---(100:ffe1) mqprio +| +---(100:4) mqprio +| ++---(100:ffe0) mqprio + +---(100:3) mqprio + +14) +// Set rate for class A - 31 Mbit (tc0, txq2) using CBS Qdisc for Eth1 +// here only idle slope is important, others ignored, but calculated +// for interface speed - 100Mb for eth1 port. +// Set it +1 Mb for reserve (important!) +$ tc qdisc add dev eth1 parent 100:3 cbs locredit -1035 \ +hicredit 465 sendslope -69000 idleslope 31000 offload 1 +net eth1: set FIFO3 bw = 31 + +15) +// Set rate for class B - 11 Mbit (tc1, txq3) using CBS Qdisc for Eth1 +// Set it +1 Mb for reserve (important!) +$ tc qdisc add dev eth1 parent 100:4 cbs locredit -1335 \ +hicredit 405 sendslope -89000 idleslope 11000 offload 1 +net eth1: set FIFO2 bw = 11 + +16) +// Create vlan 100 to map sk->priority to vlan qos for Eth1 +$ ip link add link eth1 name eth1.100 type vlan id 100 +net eth1: Adding vlanid 100 to vlan filter + +17) +// Map skb->priority to L2 prio for Eth1.100, one to one +$ ip link set eth1.100 type vlan \ +egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 + +18) +// Check egress map for vlan 100 +$ cat /proc/net/vlan/eth1.100 +[...] +INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 +EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 + +19) +// Run appropriate tools with socket option "SO_PRIORITY" to 3 +// for class A and to 2 for class B. For both interfaces +./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500& +./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500& +./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p2 -s 1500& +./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p3 -s 1500& + +20) +// run your listener on workstation (should be in same vlan) +// (I took at https://www.spinics.net/lists/netdev/msg460869.html) +./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500 +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39012 kbps +Receiving data rate: 39000 kbps + +21) +// Restore default configuration if needed +$ ip link del eth1.100 +$ ip link del eth0.100 +$ tc qdisc del dev eth1 root +net eth1: Prev FIFO2 is shaped +net eth1: set FIFO3 bw = 0 +net eth1: set FIFO2 bw = 0 +$ tc qdisc del dev eth0 root +net eth0: Prev FIFO2 is shaped +net eth0: set FIFO3 bw = 0 +net eth0: set FIFO2 bw = 0 +$ ethtool -L eth0 rx 1 tx 1 diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt new file mode 100644 index 000000000..1be0b6f9e --- /dev/null +++ b/Documentation/networking/timestamping.txt @@ -0,0 +1,538 @@ + +1. Control Interfaces + +The interfaces for receiving network packages timestamps are: + +* SO_TIMESTAMP + Generates a timestamp for each incoming packet in (not necessarily + monotonic) system time. Reports the timestamp via recvmsg() in a + control message as struct timeval (usec resolution). + +* SO_TIMESTAMPNS + Same timestamping mechanism as SO_TIMESTAMP, but reports the + timestamp as struct timespec (nsec resolution). + +* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] + Only for multicast:approximate transmit timestamp obtained by + reading the looped packet receive timestamp. + +* SO_TIMESTAMPING + Generates timestamps on reception, transmission or both. Supports + multiple timestamp sources, including hardware. Supports generating + timestamps for stream sockets. + + +1.1 SO_TIMESTAMP: + +This socket option enables timestamping of datagrams on the reception +path. Because the destination socket, if any, is not known early in +the network stack, the feature has to be enabled for all packets. The +same is true for all early receive timestamp options. + +For interface details, see `man 7 socket`. + + +1.2 SO_TIMESTAMPNS: + +This option is identical to SO_TIMESTAMP except for the returned data type. +Its struct timespec allows for higher resolution (ns) timestamps than the +timeval of SO_TIMESTAMP (ms). + + +1.3 SO_TIMESTAMPING: + +Supports multiple types of timestamp requests. As a result, this +socket option takes a bitmap of flags, not a boolean. In + + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); + +val is an integer with any of the following bits set. Setting other +bit returns EINVAL and does not change the current state. + +The socket option configures timestamp generation for individual +sk_buffs (1.3.1), timestamp reporting to the socket's error +queue (1.3.2) and options (1.3.3). Timestamp generation can also +be enabled for individual sendmsg calls using cmsg (1.3.4). + + +1.3.1 Timestamp Generation + +Some bits are requests to the stack to try to generate timestamps. Any +combination of them is valid. Changes to these bits apply to newly +created packets, not to packets already in the stack. As a result, it +is possible to selectively request timestamps for a subset of packets +(e.g., for sampling) by embedding an send() call within two setsockopt +calls, one to enable timestamp generation and one to disable it. +Timestamps may also be generated for reasons other than being +requested by a particular socket, such as when receive timestamping is +enabled system wide, as explained earlier. + +SOF_TIMESTAMPING_RX_HARDWARE: + Request rx timestamps generated by the network adapter. + +SOF_TIMESTAMPING_RX_SOFTWARE: + Request rx timestamps when data enters the kernel. These timestamps + are generated just after a device driver hands a packet to the + kernel receive stack. + +SOF_TIMESTAMPING_TX_HARDWARE: + Request tx timestamps generated by the network adapter. This flag + can be enabled via both socket options and control messages. + +SOF_TIMESTAMPING_TX_SOFTWARE: + Request tx timestamps when data leaves the kernel. These timestamps + are generated in the device driver as close as possible, but always + prior to, passing the packet to the network interface. Hence, they + require driver support and may not be available for all devices. + This flag can be enabled via both socket options and control messages. + + +SOF_TIMESTAMPING_TX_SCHED: + Request tx timestamps prior to entering the packet scheduler. Kernel + transmit latency is, if long, often dominated by queuing delay. The + difference between this timestamp and one taken at + SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent + of protocol processing. The latency incurred in protocol + processing, if any, can be computed by subtracting a userspace + timestamp taken immediately before send() from this timestamp. On + machines with virtual devices where a transmitted packet travels + through multiple devices and, hence, multiple packet schedulers, + a timestamp is generated at each layer. This allows for fine + grained measurement of queuing delay. This flag can be enabled + via both socket options and control messages. + +SOF_TIMESTAMPING_TX_ACK: + Request tx timestamps when all data in the send buffer has been + acknowledged. This only makes sense for reliable protocols. It is + currently only implemented for TCP. For that protocol, it may + over-report measurement, because the timestamp is generated when all + data up to and including the buffer at send() was acknowledged: the + cumulative acknowledgment. The mechanism ignores SACK and FACK. + This flag can be enabled via both socket options and control messages. + + +1.3.2 Timestamp Reporting + +The other three bits control which timestamps will be reported in a +generated control message. Changes to the bits take immediate +effect at the timestamp reporting locations in the stack. Timestamps +are only reported for packets that also have the relevant timestamp +generation request set. + +SOF_TIMESTAMPING_SOFTWARE: + Report any software timestamps when available. + +SOF_TIMESTAMPING_SYS_HARDWARE: + This option is deprecated and ignored. + +SOF_TIMESTAMPING_RAW_HARDWARE: + Report hardware timestamps as generated by + SOF_TIMESTAMPING_TX_HARDWARE when available. + + +1.3.3 Timestamp Options + +The interface supports the options + +SOF_TIMESTAMPING_OPT_ID: + + Generate a unique identifier along with each packet. A process can + have multiple concurrent timestamping requests outstanding. Packets + can be reordered in the transmit path, for instance in the packet + scheduler. In that case timestamps will be queued onto the error + queue out of order from the original send() calls. It is not always + possible to uniquely match timestamps to the original send() calls + based on timestamp order or payload inspection alone, then. + + This option associates each packet at send() with a unique + identifier and returns that along with the timestamp. The identifier + is derived from a per-socket u32 counter (that wraps). For datagram + sockets, the counter increments with each sent packet. For stream + sockets, it increments with every byte. + + The counter starts at zero. It is initialized the first time that + the socket option is enabled. It is reset each time the option is + enabled after having been disabled. Resetting the counter does not + change the identifiers of existing packets in the system. + + This option is implemented only for transmit timestamps. There, the + timestamp is always looped along with a struct sock_extended_err. + The option modifies field ee_data to pass an id that is unique + among all possibly concurrently outstanding timestamp requests for + that socket. + + +SOF_TIMESTAMPING_OPT_CMSG: + + Support recv() cmsg for all timestamped packets. Control messages + are already supported unconditionally on all packets with receive + timestamps and on IPv6 packets with transmit timestamp. This option + extends them to IPv4 packets with transmit timestamp. One use case + is to correlate packets with their egress device, by enabling socket + option IP_PKTINFO simultaneously. + + +SOF_TIMESTAMPING_OPT_TSONLY: + + Applies to transmit timestamps only. Makes the kernel return the + timestamp as a cmsg alongside an empty packet, as opposed to + alongside the original packet. This reduces the amount of memory + charged to the socket's receive budget (SO_RCVBUF) and delivers + the timestamp even if sysctl net.core.tstamp_allow_data is 0. + This option disables SOF_TIMESTAMPING_OPT_CMSG. + +SOF_TIMESTAMPING_OPT_STATS: + + Optional stats that are obtained along with the transmit timestamps. + It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the + transmit timestamp is available, the stats are available in a + separate control message of type SCM_TIMESTAMPING_OPT_STATS, as a + list of TLVs (struct nlattr) of types. These stats allow the + application to associate various transport layer stats with + the transmit timestamps, such as how long a certain block of + data was limited by peer's receiver window. + +SOF_TIMESTAMPING_OPT_PKTINFO: + + Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming + packets with hardware timestamps. The message contains struct + scm_ts_pktinfo, which supplies the index of the real interface which + received the packet and its length at layer 2. A valid (non-zero) + interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is + enabled and the driver is using NAPI. The struct contains also two + other fields, but they are reserved and undefined. + +SOF_TIMESTAMPING_OPT_TX_SWHW: + + Request both hardware and software timestamps for outgoing packets + when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE + are enabled at the same time. If both timestamps are generated, + two separate messages will be looped to the socket's error queue, + each containing just one timestamp. + +New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to +disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate +regardless of the setting of sysctl net.core.tstamp_allow_data. + +An exception is when a process needs additional cmsg data, for +instance SOL_IP/IP_PKTINFO to detect the egress network interface. +Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on +having access to the contents of the original packet, so cannot be +combined with SOF_TIMESTAMPING_OPT_TSONLY. + + +1.3.4. Enabling timestamps via control messages + +In addition to socket options, timestamp generation can be requested +per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). +Using this feature, applications can sample timestamps per sendmsg() +without paying the overhead of enabling and disabling timestamps via +setsockopt: + + struct msghdr *msg; + ... + cmsg = CMSG_FIRSTHDR(msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SO_TIMESTAMPING; + cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); + *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | + SOF_TIMESTAMPING_TX_SOFTWARE | + SOF_TIMESTAMPING_TX_ACK; + err = sendmsg(fd, msg, 0); + +The SOF_TIMESTAMPING_TX_* flags set via cmsg will override +the SOF_TIMESTAMPING_TX_* flags set via setsockopt. + +Moreover, applications must still enable timestamp reporting via +setsockopt to receive timestamps: + + __u32 val = SOF_TIMESTAMPING_SOFTWARE | + SOF_TIMESTAMPING_OPT_ID /* or any other flag */; + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); + + +1.4 Bytestream Timestamps + +The SO_TIMESTAMPING interface supports timestamping of bytes in a +bytestream. Each request is interpreted as a request for when the +entire contents of the buffer has passed a timestamping point. That +is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record +when all bytes have reached the device driver, regardless of how +many packets the data has been converted into. + +In general, bytestreams have no natural delimiters and therefore +correlating a timestamp with data is non-trivial. A range of bytes +may be split across segments, any segments may be merged (possibly +coalescing sections of previously segmented buffers associated with +independent send() calls). Segments can be reordered and the same +byte range can coexist in multiple segments for protocols that +implement retransmissions. + +It is essential that all timestamps implement the same semantics, +regardless of these possible transformations, as otherwise they are +incomparable. Handling "rare" corner cases differently from the +simple case (a 1:1 mapping from buffer to skb) is insufficient +because performance debugging often needs to focus on such outliers. + +In practice, timestamps can be correlated with segments of a +bytestream consistently, if both semantics of the timestamp and the +timing of measurement are chosen correctly. This challenge is no +different from deciding on a strategy for IP fragmentation. There, the +definition is that only the first fragment is timestamped. For +bytestreams, we chose that a timestamp is generated only when all +bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to +implement and reason about. An implementation that has to take into +account SACK would be more complex due to possible transmission holes +and out of order arrival. + +On the host, TCP can also break the simple 1:1 mapping from buffer to +skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The +implementation ensures correctness in all cases by tracking the +individual last byte passed to send(), even if it is no longer the +last byte after an skbuff extend or merge operation. It stores the +relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff +has only one such field, only one timestamp can be generated. + +In rare cases, a timestamp request can be missed if two requests are +collapsed onto the same skb. A process can detect this situation by +enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at +send time with the value returned for each timestamp. It can prevent +the situation by always flushing the TCP stack in between requests, +for instance by enabling TCP_NODELAY and disabling TCP_CORK and +autocork. + +These precautions ensure that the timestamp is generated only when all +bytes have passed a timestamp point, assuming that the network stack +itself does not reorder the segments. The stack indeed tries to avoid +reordering. The one exception is under administrator control: it is +possible to construct a packet scheduler configuration that delays +segments from the same stream differently. Such a setup would be +unusual. + + +2 Data Interfaces + +Timestamps are read using the ancillary data feature of recvmsg(). +See `man 3 cmsg` for details of this interface. The socket manual +page (`man 7 socket`) describes how timestamps generated with +SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved. + + +2.1 SCM_TIMESTAMPING records + +These timestamps are returned in a control message with cmsg_level +SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type + +struct scm_timestamping { + struct timespec ts[3]; +}; + +The structure can return up to three timestamps. This is a legacy +feature. At least one field is non-zero at any time. Most timestamps +are passed in ts[0]. Hardware timestamps are passed in ts[2]. + +ts[1] used to hold hardware timestamps converted to system time. +Instead, expose the hardware clock device on the NIC directly as +a HW PTP clock source, to allow time conversion in userspace and +optionally synchronize system time with a userspace PTP stack such +as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt. + +Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled +together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false +software timestamp will be generated in the recvmsg() call and passed +in ts[0] when a real software timestamp is missing. This happens also +on hardware transmit timestamps. + +2.1.1 Transmit timestamps with MSG_ERRQUEUE + +For transmit timestamps the outgoing packet is looped back to the +socket's error queue with the send timestamp(s) attached. A process +receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE +set and with a msg_control buffer sufficiently large to receive the +relevant metadata structures. The recvmsg call returns the original +outgoing data packet with two ancillary messages attached. + +A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR +embeds a struct sock_extended_err. This defines the error type. For +timestamps, the ee_errno field is ENOMSG. The other ancillary message +will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This +embeds the struct scm_timestamping. + + +2.1.1.2 Timestamp types + +The semantics of the three struct timespec are defined by field +ee_info in the extended error structure. It contains a value of +type SCM_TSTAMP_* to define the actual timestamp passed in +scm_timestamping. + +The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_* +control fields discussed previously, with one exception. For legacy +reasons, SCM_TSTAMP_SND is equal to zero and can be set for both +SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It +is the first if ts[2] is non-zero, the second otherwise, in which +case the timestamp is stored in ts[0]. + + +2.1.1.3 Fragmentation + +Fragmentation of outgoing datagrams is rare, but is possible, e.g., by +explicitly disabling PMTU discovery. If an outgoing packet is fragmented, +then only the first fragment is timestamped and returned to the sending +socket. + + +2.1.1.4 Packet Payload + +The calling application is often not interested in receiving the whole +packet payload that it passed to the stack originally: the socket +error queue mechanism is just a method to piggyback the timestamp on. +In this case, the application can choose to read datagrams with a +smaller buffer, possibly even of length 0. The payload is truncated +accordingly. Until the process calls recvmsg() on the error queue, +however, the full packet is queued, taking up budget from SO_RCVBUF. + + +2.1.1.5 Blocking Read + +Reading from the error queue is always a non-blocking operation. To +block waiting on a timestamp, use poll or select. poll() will return +POLLERR in pollfd.revents if any data is ready on the error queue. +There is no need to pass this flag in pollfd.events. This flag is +ignored on request. See also `man 2 poll`. + + +2.1.2 Receive timestamps + +On reception, there is no reason to read from the socket error queue. +The SCM_TIMESTAMPING ancillary data is sent along with the packet data +on a normal recvmsg(). Since this is not a socket error, it is not +accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case, +the meaning of the three fields in struct scm_timestamping is +implicitly defined. ts[0] holds a software timestamp if set, ts[1] +is again deprecated and ts[2] holds a hardware timestamp if set. + + +3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP + +Hardware time stamping must also be initialized for each device driver +that is expected to do hardware time stamping. The parameter is defined in +/include/linux/net_tstamp.h as: + +struct hwtstamp_config { + int flags; /* no flags defined right now, must be zero */ + int tx_type; /* HWTSTAMP_TX_* */ + int rx_filter; /* HWTSTAMP_FILTER_* */ +}; + +Desired behavior is passed into the kernel and to a specific device by +calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose +ifr_data points to a struct hwtstamp_config. The tx_type and +rx_filter are hints to the driver what it is expected to do. If +the requested fine-grained filtering for incoming packets is not +supported, the driver may time stamp more than just the requested types +of packets. + +Drivers are free to use a more permissive configuration than the requested +configuration. It is expected that drivers should only implement directly the +most generic mode that can be supported. For example if the hardware can +support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale +HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT +is more generic (and more useful to applications). + +A driver which supports hardware time stamping shall update the struct +with the actual, possibly more permissive configuration. If the +requested packets cannot be time stamped, then nothing should be +changed and ERANGE shall be returned (in contrast to EINVAL, which +indicates that SIOCSHWTSTAMP is not supported at all). + +Only a processes with admin rights may change the configuration. User +space is responsible to ensure that multiple processes don't interfere +with each other and that the settings are reset. + +Any process can read the actual configuration by passing this +structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has +not been implemented in all drivers. + +/* possible values for hwtstamp_config->tx_type */ +enum { + /* + * no outgoing packet will need hardware time stamping; + * should a packet arrive which asks for it, no hardware + * time stamping will be done + */ + HWTSTAMP_TX_OFF, + + /* + * enables hardware time stamping for outgoing packets; + * the sender of the packet decides which are to be + * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE + * before sending the packet + */ + HWTSTAMP_TX_ON, +}; + +/* possible values for hwtstamp_config->rx_filter */ +enum { + /* time stamp no incoming packet at all */ + HWTSTAMP_FILTER_NONE, + + /* time stamp any incoming packet */ + HWTSTAMP_FILTER_ALL, + + /* return value: time stamp all packets requested plus some others */ + HWTSTAMP_FILTER_SOME, + + /* PTP v1, UDP, any kind of event packet */ + HWTSTAMP_FILTER_PTP_V1_L4_EVENT, + + /* for the complete list of values, please check + * the include file /include/linux/net_tstamp.h + */ +}; + +3.1 Hardware Timestamping Implementation: Device Drivers + +A driver which supports hardware time stamping must support the +SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with +the actual values as described in the section on SIOCSHWTSTAMP. It +should also support SIOCGHWTSTAMP. + +Time stamps for received packets must be stored in the skb. To get a pointer +to the shared time stamp structure of the skb call skb_hwtstamps(). Then +set the time stamps in the structure: + +struct skb_shared_hwtstamps { + /* hardware time stamp transformed into duration + * since arbitrary point in time + */ + ktime_t hwtstamp; +}; + +Time stamps for outgoing packets are to be generated as follows: +- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) + is set no-zero. If yes, then the driver is expected to do hardware time + stamping. +- If this is possible for the skb and requested, then declare + that the driver is doing the time stamping by setting the flag + SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with + + skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; + + You might want to keep a pointer to the associated skb for the next step + and not free the skb. A driver not supporting hardware time stamping doesn't + do that. A driver must never touch sk_buff::tstamp! It is used to store + software generated time stamps by the network subsystem. +- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware + as possible. skb_tx_timestamp() provides a software time stamp if requested + and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). +- As soon as the driver has sent the packet and/or obtained a + hardware time stamp for it, it passes the time stamp back by + calling skb_hwtstamp_tx() with the original skb, the raw + hardware time stamp. skb_hwtstamp_tx() clones the original skb and + adds the timestamps, therefore the original skb has to be freed now. + If obtaining the hardware time stamp somehow fails, then the driver + should not fall back to software time stamping. The rationale is that + this would occur at a later time in the processing pipeline than other + software time stamping and therefore could lead to unexpected deltas + between time stamps. diff --git a/Documentation/networking/tlan.txt b/Documentation/networking/tlan.txt new file mode 100644 index 000000000..34550dfce --- /dev/null +++ b/Documentation/networking/tlan.txt @@ -0,0 +1,117 @@ +(C) 1997-1998 Caldera, Inc. +(C) 1998 James Banks +(C) 1999-2001 Torben Mathiasen <tmm@image.dk, torben.mathiasen@compaq.com> + +For driver information/updates visit http://www.compaq.com + + +TLAN driver for Linux, version 1.14a +README + + +I. Supported Devices. + + Only PCI devices will work with this driver. + + Supported: + Vendor ID Device ID Name + 0e11 ae32 Compaq Netelligent 10/100 TX PCI UTP + 0e11 ae34 Compaq Netelligent 10 T PCI UTP + 0e11 ae35 Compaq Integrated NetFlex 3/P + 0e11 ae40 Compaq Netelligent Dual 10/100 TX PCI UTP + 0e11 ae43 Compaq Netelligent Integrated 10/100 TX UTP + 0e11 b011 Compaq Netelligent 10/100 TX Embedded UTP + 0e11 b012 Compaq Netelligent 10 T/2 PCI UTP/Coax + 0e11 b030 Compaq Netelligent 10/100 TX UTP + 0e11 f130 Compaq NetFlex 3/P + 0e11 f150 Compaq NetFlex 3/P + 108d 0012 Olicom OC-2325 + 108d 0013 Olicom OC-2183 + 108d 0014 Olicom OC-2326 + + + Caveats: + + I am not sure if 100BaseTX daughterboards (for those cards which + support such things) will work. I haven't had any solid evidence + either way. + + However, if a card supports 100BaseTx without requiring an add + on daughterboard, it should work with 100BaseTx. + + The "Netelligent 10 T/2 PCI UTP/Coax" (b012) device is untested, + but I do not expect any problems. + + +II. Driver Options + 1. You can append debug=x to the end of the insmod line to get + debug messages, where x is a bit field where the bits mean + the following: + + 0x01 Turn on general debugging messages. + 0x02 Turn on receive debugging messages. + 0x04 Turn on transmit debugging messages. + 0x08 Turn on list debugging messages. + + 2. You can append aui=1 to the end of the insmod line to cause + the adapter to use the AUI interface instead of the 10 Base T + interface. This is also what to do if you want to use the BNC + connector on a TLAN based device. (Setting this option on a + device that does not have an AUI/BNC connector will probably + cause it to not function correctly.) + + 3. You can set duplex=1 to force half duplex, and duplex=2 to + force full duplex. + + 4. You can set speed=10 to force 10Mbs operation, and speed=100 + to force 100Mbs operation. (I'm not sure what will happen + if a card which only supports 10Mbs is forced into 100Mbs + mode.) + + 5. You have to use speed=X duplex=Y together now. If you just + do "insmod tlan.o speed=100" the driver will do Auto-Neg. + To force a 10Mbps Half-Duplex link do "insmod tlan.o speed=10 + duplex=1". + + 6. If the driver is built into the kernel, you can use the 3rd + and 4th parameters to set aui and debug respectively. For + example: + + ether=0,0,0x1,0x7,eth0 + + This sets aui to 0x1 and debug to 0x7, assuming eth0 is a + supported TLAN device. + + The bits in the third byte are assigned as follows: + + 0x01 = aui + 0x02 = use half duplex + 0x04 = use full duplex + 0x08 = use 10BaseT + 0x10 = use 100BaseTx + + You also need to set both speed and duplex settings when forcing + speeds with kernel-parameters. + ether=0,0,0x12,0,eth0 will force link to 100Mbps Half-Duplex. + + 7. If you have more than one tlan adapter in your system, you can + use the above options on a per adapter basis. To force a 100Mbit/HD + link with your eth1 adapter use: + + insmod tlan speed=0,100 duplex=0,1 + + Now eth0 will use auto-neg and eth1 will be forced to 100Mbit/HD. + Note that the tlan driver supports a maximum of 8 adapters. + + +III. Things to try if you have problems. + 1. Make sure your card's PCI id is among those listed in + section I, above. + 2. Make sure routing is correct. + 3. Try forcing different speed/duplex settings + + +There is also a tlan mailing list which you can join by sending "subscribe tlan" +in the body of an email to majordomo@vuser.vu.union.edu. +There is also a tlan website at http://www.compaq.com + diff --git a/Documentation/networking/tls.txt b/Documentation/networking/tls.txt new file mode 100644 index 000000000..58b5ef75f --- /dev/null +++ b/Documentation/networking/tls.txt @@ -0,0 +1,197 @@ +Overview +======== + +Transport Layer Security (TLS) is a Upper Layer Protocol (ULP) that runs over +TCP. TLS provides end-to-end data integrity and confidentiality. + +User interface +============== + +Creating a TLS connection +------------------------- + +First create a new TCP socket and set the TLS ULP. + + sock = socket(AF_INET, SOCK_STREAM, 0); + setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")); + +Setting the TLS ULP allows us to set/get TLS socket options. Currently +only the symmetric encryption is handled in the kernel. After the TLS +handshake is complete, we have all the parameters required to move the +data-path to the kernel. There is a separate socket option for moving +the transmit and the receive into the kernel. + + /* From linux/tls.h */ + struct tls_crypto_info { + unsigned short version; + unsigned short cipher_type; + }; + + struct tls12_crypto_info_aes_gcm_128 { + struct tls_crypto_info info; + unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE]; + unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE]; + unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE]; + unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE]; + }; + + + struct tls12_crypto_info_aes_gcm_128 crypto_info; + + crypto_info.info.version = TLS_1_2_VERSION; + crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128; + memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE); + memcpy(crypto_info.rec_seq, seq_number_write, + TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); + memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE); + memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE); + + setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)); + +Transmit and receive are set separately, but the setup is the same, using either +TLS_TX or TLS_RX. + +Sending TLS application data +---------------------------- + +After setting the TLS_TX socket option all application data sent over this +socket is encrypted using TLS and the parameters provided in the socket option. +For example, we can send an encrypted hello world record as follows: + + const char *msg = "hello world\n"; + send(sock, msg, strlen(msg)); + +send() data is directly encrypted from the userspace buffer provided +to the encrypted kernel send buffer if possible. + +The sendfile system call will send the file's data over TLS records of maximum +length (2^14). + + file = open(filename, O_RDONLY); + fstat(file, &stat); + sendfile(sock, file, &offset, stat.st_size); + +TLS records are created and sent after each send() call, unless +MSG_MORE is passed. MSG_MORE will delay creation of a record until +MSG_MORE is not passed, or the maximum record size is reached. + +The kernel will need to allocate a buffer for the encrypted data. +This buffer is allocated at the time send() is called, such that +either the entire send() call will return -ENOMEM (or block waiting +for memory), or the encryption will always succeed. If send() returns +-ENOMEM and some data was left on the socket buffer from a previous +call using MSG_MORE, the MSG_MORE data is left on the socket buffer. + +Receiving TLS application data +------------------------------ + +After setting the TLS_RX socket option, all recv family socket calls +are decrypted using TLS parameters provided. A full TLS record must +be received before decryption can happen. + + char buffer[16384]; + recv(sock, buffer, 16384); + +Received data is decrypted directly in to the user buffer if it is +large enough, and no additional allocations occur. If the userspace +buffer is too small, data is decrypted in the kernel and copied to +userspace. + +EINVAL is returned if the TLS version in the received message does not +match the version passed in setsockopt. + +EMSGSIZE is returned if the received message is too big. + +EBADMSG is returned if decryption failed for any other reason. + +Send TLS control messages +------------------------- + +Other than application data, TLS has control messages such as alert +messages (record type 21) and handshake messages (record type 22), etc. +These messages can be sent over the socket by providing the TLS record type +via a CMSG. For example the following function sends @data of @length bytes +using a record of type @record_type. + +/* send TLS control message using record_type */ + static int klts_send_ctrl_message(int sock, unsigned char record_type, + void *data, size_t length) + { + struct msghdr msg = {0}; + int cmsg_len = sizeof(record_type); + struct cmsghdr *cmsg; + char buf[CMSG_SPACE(cmsg_len)]; + struct iovec msg_iov; /* Vector of data to send/receive into. */ + + msg.msg_control = buf; + msg.msg_controllen = sizeof(buf); + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_TLS; + cmsg->cmsg_type = TLS_SET_RECORD_TYPE; + cmsg->cmsg_len = CMSG_LEN(cmsg_len); + *CMSG_DATA(cmsg) = record_type; + msg.msg_controllen = cmsg->cmsg_len; + + msg_iov.iov_base = data; + msg_iov.iov_len = length; + msg.msg_iov = &msg_iov; + msg.msg_iovlen = 1; + + return sendmsg(sock, &msg, 0); + } + +Control message data should be provided unencrypted, and will be +encrypted by the kernel. + +Receiving TLS control messages +------------------------------ + +TLS control messages are passed in the userspace buffer, with message +type passed via cmsg. If no cmsg buffer is provided, an error is +returned if a control message is received. Data messages may be +received without a cmsg buffer set. + + char buffer[16384]; + char cmsg[CMSG_SPACE(sizeof(unsigned char))]; + struct msghdr msg = {0}; + msg.msg_control = cmsg; + msg.msg_controllen = sizeof(cmsg); + + struct iovec msg_iov; + msg_iov.iov_base = buffer; + msg_iov.iov_len = 16384; + + msg.msg_iov = &msg_iov; + msg.msg_iovlen = 1; + + int ret = recvmsg(sock, &msg, 0 /* flags */); + + struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg); + if (cmsg->cmsg_level == SOL_TLS && + cmsg->cmsg_type == TLS_GET_RECORD_TYPE) { + int record_type = *((unsigned char *)CMSG_DATA(cmsg)); + // Do something with record_type, and control message data in + // buffer. + // + // Note that record_type may be == to application data (23). + } else { + // Buffer contains application data. + } + +recv will never return data from mixed types of TLS records. + +Integrating in to userspace TLS library +--------------------------------------- + +At a high level, the kernel TLS ULP is a replacement for the record +layer of a userspace TLS library. + +A patchset to OpenSSL to use ktls as the record layer is here: + +https://github.com/Mellanox/openssl/commits/tls_rx2 + +An example of calling send directly after a handshake using +gnutls. Since it doesn't implement a full record layer, control +messages are not supported: + +https://github.com/ktls/af_ktls-tool/commits/RX diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.txt new file mode 100644 index 000000000..b9a188823 --- /dev/null +++ b/Documentation/networking/tproxy.txt @@ -0,0 +1,104 @@ +Transparent proxy support +========================= + +This feature adds Linux 2.2-like transparent proxy support to current kernels. +To use it, enable the socket match and the TPROXY target in your kernel config. +You will need policy routing too, so be sure to enable that as well. + +From Linux 4.18 transparent proxy support is also available in nf_tables. + +1. Making non-local sockets work +================================ + +The idea is that you identify packets with destination address matching a local +socket on your box, set the packet mark to a certain value: + +# iptables -t mangle -N DIVERT +# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT +# iptables -t mangle -A DIVERT -j MARK --set-mark 1 +# iptables -t mangle -A DIVERT -j ACCEPT + +Alternatively you can do this in nft with the following commands: + +# nft add table filter +# nft add chain filter divert "{ type filter hook prerouting priority -150; }" +# nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept + +And then match on that value using policy routing to have those packets +delivered locally: + +# ip rule add fwmark 1 lookup 100 +# ip route add local 0.0.0.0/0 dev lo table 100 + +Because of certain restrictions in the IPv4 routing output code you'll have to +modify your application to allow it to send datagrams _from_ non-local IP +addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket +option before calling bind: + +fd = socket(AF_INET, SOCK_STREAM, 0); +/* - 8< -*/ +int value = 1; +setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value)); +/* - 8< -*/ +name.sin_family = AF_INET; +name.sin_port = htons(0xCAFE); +name.sin_addr.s_addr = htonl(0xDEADBEEF); +bind(fd, &name, sizeof(name)); + +A trivial patch for netcat is available here: +http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch + + +2. Redirecting traffic +====================== + +Transparent proxying often involves "intercepting" traffic on a router. This is +usually done with the iptables REDIRECT target; however, there are serious +limitations of that method. One of the major issues is that it actually +modifies the packets to change the destination address -- which might not be +acceptable in certain situations. (Think of proxying UDP for example: you won't +be able to find out the original destination address. Even in case of TCP +getting the original destination address is racy.) + +The 'TPROXY' target provides similar functionality without relying on NAT. Simply +add rules like this to the iptables ruleset above: + +# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \ + --tproxy-mark 0x1/0x1 --on-port 50080 + +Or the following rule to nft: + +# nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept + +Note that for this to work you'll have to modify the proxy to enable (SOL_IP, +IP_TRANSPARENT) for the listening socket. + +As an example implementation, tcprdr is available here: +https://git.breakpoint.cc/cgit/fw/tcprdr.git/ +This tool is written by Florian Westphal and it was used for testing during the +nf_tables implementation. + +3. Iptables and nf_tables extensions +==================================== + +To use tproxy you'll need to have the following modules compiled for iptables: + - NETFILTER_XT_MATCH_SOCKET + - NETFILTER_XT_TARGET_TPROXY + +Or the floowing modules for nf_tables: + - NFT_SOCKET + - NFT_TPROXY + +4. Application support +====================== + +4.1. Squid +---------- + +Squid 3.HEAD has support built-in. To use it, pass +'--enable-linux-netfilter' to configure and set the 'tproxy' option on +the HTTP listener you redirect traffic to with the TPROXY iptables +target. + +For more information please consult the following page on the Squid +wiki: http://wiki.squid-cache.org/Features/Tproxy4 diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt new file mode 100644 index 000000000..949d5dcdd --- /dev/null +++ b/Documentation/networking/tuntap.txt @@ -0,0 +1,227 @@ +Universal TUN/TAP device driver. +Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com> + + Linux, Solaris drivers + Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com> + + FreeBSD TAP driver + Copyright (c) 1999-2000 Maksim Yevmenkin <m_evmenkin@yahoo.com> + + Revision of this document 2002 by Florian Thiel <florian.thiel@gmx.net> + +1. Description + TUN/TAP provides packet reception and transmission for user space programs. + It can be seen as a simple Point-to-Point or Ethernet device, which, + instead of receiving packets from physical media, receives them from + user space program and instead of sending packets via physical media + writes them to the user space program. + + In order to use the driver a program has to open /dev/net/tun and issue a + corresponding ioctl() to register a network device with the kernel. A network + device will appear as tunXX or tapXX, depending on the options chosen. When + the program closes the file descriptor, the network device and all + corresponding routes will disappear. + + Depending on the type of device chosen the userspace program has to read/write + IP packets (with tun) or ethernet frames (with tap). Which one is being used + depends on the flags given with the ioctl(). + + The package from http://vtun.sourceforge.net/tun contains two simple examples + for how to use tun and tap devices. Both programs work like a bridge between + two network interfaces. + br_select.c - bridge based on select system call. + br_sigio.c - bridge based on async io and SIGIO signal. + However, the best example is VTun http://vtun.sourceforge.net :)) + +2. Configuration + Create device node: + mkdir /dev/net (if it doesn't exist already) + mknod /dev/net/tun c 10 200 + + Set permissions: + e.g. chmod 0666 /dev/net/tun + There's no harm in allowing the device to be accessible by non-root users, + since CAP_NET_ADMIN is required for creating network devices or for + connecting to network devices which aren't owned by the user in question. + If you want to create persistent devices and give ownership of them to + unprivileged users, then you need the /dev/net/tun device to be usable by + those users. + + Driver module autoloading + + Make sure that "Kernel module loader" - module auto-loading + support is enabled in your kernel. The kernel should load it on + first access. + + Manual loading + insert the module by hand: + modprobe tun + + If you do it the latter way, you have to load the module every time you + need it, if you do it the other way it will be automatically loaded when + /dev/net/tun is being opened. + +3. Program interface + 3.1 Network device allocation: + + char *dev should be the name of the device with a format string (e.g. + "tun%d"), but (as far as I can see) this can be any valid network device name. + Note that the character pointer becomes overwritten with the real device name + (e.g. "tun0") + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_alloc(char *dev) + { + struct ifreq ifr; + int fd, err; + + if( (fd = open("/dev/net/tun", O_RDWR)) < 0 ) + return tun_alloc_old(dev); + + memset(&ifr, 0, sizeof(ifr)); + + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + */ + ifr.ifr_flags = IFF_TUN; + if( *dev ) + strncpy(ifr.ifr_name, dev, IFNAMSIZ); + + if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){ + close(fd); + return err; + } + strcpy(dev, ifr.ifr_name); + return fd; + } + + 3.2 Frame format: + If flag IFF_NO_PI is not set each frame format is: + Flags [2 bytes] + Proto [2 bytes] + Raw protocol(IP, IPv6, etc) frame. + + 3.3 Multiqueue tuntap interface: + + From version 3.8, Linux supports multiqueue tuntap which can uses multiple + file descriptors (queues) to parallelize packets sending or receiving. The + device allocation is the same as before, and if user wants to create multiple + queues, TUNSETIFF with the same device name must be called many times with + IFF_MULTI_QUEUE flag. + + char *dev should be the name of the device, queues is the number of queues to + be created, fds is used to store and return the file descriptors (queues) + created to the caller. Each file descriptor were served as the interface of a + queue which could be accessed by userspace. + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_alloc_mq(char *dev, int queues, int *fds) + { + struct ifreq ifr; + int fd, err, i; + + if (!dev) + return -1; + + memset(&ifr, 0, sizeof(ifr)); + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + * IFF_MULTI_QUEUE - Create a queue of multiqueue device + */ + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE; + strcpy(ifr.ifr_name, dev); + + for (i = 0; i < queues; i++) { + if ((fd = open("/dev/net/tun", O_RDWR)) < 0) + goto err; + err = ioctl(fd, TUNSETIFF, (void *)&ifr); + if (err) { + close(fd); + goto err; + } + fds[i] = fd; + } + + return 0; + err: + for (--i; i >= 0; i--) + close(fds[i]); + return err; + } + + A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When + calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when + calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were + enabled by default after it was created through TUNSETIFF. + + fd is the file descriptor (queue) that we want to enable or disable, when + enable is true we enable it, otherwise we disable it + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_set_queue(int fd, int enable) + { + struct ifreq ifr; + + memset(&ifr, 0, sizeof(ifr)); + + if (enable) + ifr.ifr_flags = IFF_ATTACH_QUEUE; + else + ifr.ifr_flags = IFF_DETACH_QUEUE; + + return ioctl(fd, TUNSETQUEUE, (void *)&ifr); + } + +Universal TUN/TAP device driver Frequently Asked Question. + +1. What platforms are supported by TUN/TAP driver ? +Currently driver has been written for 3 Unices: + Linux kernels 2.2.x, 2.4.x + FreeBSD 3.x, 4.x, 5.x + Solaris 2.6, 7.0, 8.0 + +2. What is TUN/TAP driver used for? +As mentioned above, main purpose of TUN/TAP driver is tunneling. +It is used by VTun (http://vtun.sourceforge.net). + +Another interesting application using TUN/TAP is pipsecd +(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec +implementation that can use complete kernel routing (unlike FreeS/WAN). + +3. How does Virtual network device actually work ? +Virtual network device can be viewed as a simple Point-to-Point or +Ethernet device, which instead of receiving packets from a physical +media, receives them from user space program and instead of sending +packets via physical media sends them to the user space program. + +Let's say that you configured IPX on the tap0, then whenever +the kernel sends an IPX packet to tap0, it is passed to the application +(VTun for example). The application encrypts, compresses and sends it to +the other side over TCP or UDP. The application on the other side decompresses +and decrypts the data received and writes the packet to the TAP device, +the kernel handles the packet like it came from real physical device. + +4. What is the difference between TUN driver and TAP driver? +TUN works with IP frames. TAP works with Ethernet frames. + +This means that you have to read/write IP packets when you are using tun and +ethernet frames when using tap. + +5. What is the difference between BPF and TUN/TAP driver? +BPF is an advanced packet filter. It can be attached to existing +network interface. It does not provide a virtual network interface. +A TUN/TAP driver does provide a virtual network interface and it is possible +to attach BPF to this interface. + +6. Does TAP driver support kernel Ethernet bridging? +Yes. Linux and FreeBSD drivers support Ethernet bridging. diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.txt new file mode 100644 index 000000000..53a726855 --- /dev/null +++ b/Documentation/networking/udplite.txt @@ -0,0 +1,278 @@ + =========================================================================== + The UDP-Lite protocol (RFC 3828) + =========================================================================== + + + UDP-Lite is a Standards-Track IETF transport protocol whose characteristic + is a variable-length checksum. This has advantages for transport of multimedia + (video, VoIP) over wireless networks, as partly damaged packets can still be + fed into the codec instead of being discarded due to a failed checksum test. + + This file briefly describes the existing kernel support and the socket API. + For in-depth information, you can consult: + + o The UDP-Lite Homepage: + http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ + From here you can also download some example application source code. + + o The UDP-Lite HOWTO on + http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ + files/UDP-Lite-HOWTO.txt + + o The Wireshark UDP-Lite WiKi (with capture files): + https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol + + o The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt + + + I) APPLICATIONS + + Several applications have been ported successfully to UDP-Lite. Ethereal + (now called wireshark) has UDP-Litev4/v6 support by default. + Porting applications to UDP-Lite is straightforward: only socket level and + IPPROTO need to be changed; senders additionally set the checksum coverage + length (default = header length = 8). Details are in the next section. + + + II) PROGRAMMING API + + UDP-Lite provides a connectionless, unreliable datagram service and hence + uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is + very easy: simply add `IPPROTO_UDPLITE' as the last argument of the socket(2) + call so that the statement looks like: + + s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE); + + or, respectively, + + s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE); + + With just the above change you are able to run UDP-Lite services or connect + to UDP-Lite servers. The kernel will assume that you are not interested in + using partial checksum coverage and so emulate UDP mode (full coverage). + + To make use of the partial checksum coverage facilities requires setting a + single socket option, which takes an integer specifying the coverage length: + + * Sender checksum coverage: UDPLITE_SEND_CSCOV + + For example, + + int val = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int)); + + sets the checksum coverage length to 20 bytes (12b data + 8b header). + Of each packet only the first 20 bytes (plus the pseudo-header) will be + checksummed. This is useful for RTP applications which have a 12-byte + base header. + + + * Receiver checksum coverage: UDPLITE_RECV_CSCOV + + This option is the receiver-side analogue. It is truly optional, i.e. not + required to enable traffic with partial checksum coverage. Its function is + that of a traffic filter: when enabled, it instructs the kernel to drop + all packets which have a coverage _less_ than this value. For example, if + RTP and UDP headers are to be protected, a receiver can enforce that only + packets with a minimum coverage of 20 are admitted: + + int min = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int)); + + The calls to getsockopt(2) are analogous. Being an extension and not a stand- + alone protocol, all socket options known from UDP can be used in exactly the + same manner as before, e.g. UDP_CORK or UDP_ENCAP. + + A detailed discussion of UDP-Lite checksum coverage options is in section IV. + + + III) HEADER FILES + + The socket API requires support through header files in /usr/include: + + * /usr/include/netinet/in.h + to define IPPROTO_UDPLITE + + * /usr/include/netinet/udplite.h + for UDP-Lite header fields and protocol constants + + For testing purposes, the following can serve as a `mini' header file: + + #define IPPROTO_UDPLITE 136 + #define SOL_UDPLITE 136 + #define UDPLITE_SEND_CSCOV 10 + #define UDPLITE_RECV_CSCOV 11 + + Ready-made header files for various distros are in the UDP-Lite tarball. + + + IV) KERNEL BEHAVIOUR WITH REGARD TO THE VARIOUS SOCKET OPTIONS + + To enable debugging messages, the log level need to be set to 8, as most + messages use the KERN_DEBUG level (7). + + 1) Sender Socket Options + + If the sender specifies a value of 0 as coverage length, the module + assumes full coverage, transmits a packet with coverage length of 0 + and according checksum. If the sender specifies a coverage < 8 and + different from 0, the kernel assumes 8 as default value. Finally, + if the specified coverage length exceeds the packet length, the packet + length is used instead as coverage length. + + 2) Receiver Socket Options + + The receiver specifies the minimum value of the coverage length it + is willing to accept. A value of 0 here indicates that the receiver + always wants the whole of the packet covered. In this case, all + partially covered packets are dropped and an error is logged. + + It is not possible to specify illegal values (<0 and <8); in these + cases the default of 8 is assumed. + + All packets arriving with a coverage value less than the specified + threshold are discarded, these events are also logged. + + 3) Disabling the Checksum Computation + + On both sender and receiver, checksumming will always be performed + and cannot be disabled using SO_NO_CHECK. Thus + + setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... ); + + will always will be ignored, while the value of + + getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...); + + is meaningless (as in TCP). Packets with a zero checksum field are + illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded. + + 4) Fragmentation + + The checksum computation respects both buffersize and MTU. The size + of UDP-Lite packets is determined by the size of the send buffer. The + minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF + in include/net/sock.h), the default value is configurable as + net.core.wmem_default or via setting the SO_SNDBUF socket(7) + option. The maximum upper bound for the send buffer is determined + by net.core.wmem_max. + + Given a payload size larger than the send buffer size, UDP-Lite will + split the payload into several individual packets, filling up the + send buffer size in each case. + + The precise value also depends on the interface MTU. The interface MTU, + in turn, may trigger IP fragmentation. In this case, the generated + UDP-Lite packet is split into several IP packets, of which only the + first one contains the L4 header. + + The send buffer size has implications on the checksum coverage length. + Consider the following example: + + Payload: 1536 bytes Send Buffer: 1024 bytes + MTU: 1500 bytes Coverage Length: 856 bytes + + UDP-Lite will ship the 1536 bytes in two separate packets: + + Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes + Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes + + The coverage packet covers the UDP-Lite header and 848 bytes of the + payload in the first packet, the second packet is fully covered. Note + that for the second packet, the coverage length exceeds the packet + length. The kernel always re-adjusts the coverage length to the packet + length in such cases. + + As an example of what happens when one UDP-Lite packet is split into + several tiny fragments, consider the following example. + + Payload: 1024 bytes Send buffer size: 1024 bytes + MTU: 300 bytes Coverage length: 575 bytes + + +-+-----------+--------------+--------------+--------------+ + |8| 272 | 280 | 280 | 280 | + +-+-----------+--------------+--------------+--------------+ + 280 560 840 1032 + ^ + *****checksum coverage************* + + The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte + header). According to the interface MTU, these are split into 4 IP + packets (280 byte IP payload + 20 byte IP header). The kernel module + sums the contents of the entire first two packets, plus 15 bytes of + the last packet before releasing the fragments to the IP module. + + To see the analogous case for IPv6 fragmentation, consider a link + MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum + coverage is less than 1232 bytes (MTU minus IPv6/fragment header + lengths), only the first fragment needs to be considered. When using + larger checksum coverage lengths, each eligible fragment needs to be + checksummed. Suppose we have a checksum coverage of 3062. The buffer + of 3356 bytes will be split into the following fragments: + + Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data + + The first two fragments have to be checksummed in full, of the last + fragment only 598 (= 3062 - 2*1232) bytes are checksummed. + + While it is important that such cases are dealt with correctly, they + are (annoyingly) rare: UDP-Lite is designed for optimising multimedia + performance over wireless (or generally noisy) links and thus smaller + coverage lengths are likely to be expected. + + + V) UDP-LITE RUNTIME STATISTICS AND THEIR MEANING + + Exceptional and error conditions are logged to syslog at the KERN_DEBUG + level. Live statistics about UDP-Lite are available in /proc/net/snmp + and can (with newer versions of netstat) be viewed using + + netstat -svu + + This displays UDP-Lite statistics variables, whose meaning is as follows. + + InDatagrams: The total number of datagrams delivered to users. + + NoPorts: Number of packets received to an unknown port. + These cases are counted separately (not as InErrors). + + InErrors: Number of erroneous UDP-Lite packets. Errors include: + * internal socket queue receive errors + * packet too short (less than 8 bytes or stated + coverage length exceeds received length) + * xfrm4_policy_check() returned with error + * application has specified larger min. coverage + length than that of incoming packet + * checksum coverage violated + * bad checksum + + OutDatagrams: Total number of sent datagrams. + + These statistics derive from the UDP MIB (RFC 2013). + + + VI) IPTABLES + + There is packet match support for UDP-Lite as well as support for the LOG target. + If you copy and paste the following line into /etc/protocols, + + udplite 136 UDP-Lite # UDP-Lite [RFC 3828] + + then + iptables -A INPUT -p udplite -j LOG + + will produce logging output to syslog. Dropping and rejecting packets also works. + + + VII) MAINTAINER ADDRESS + + The UDP-Lite patch was developed at + University of Aberdeen + Electronics Research Group + Department of Engineering + Fraser Noble Building + Aberdeen AB24 3UE; UK + The current maintainer is Gerrit Renker, <gerrit@erg.abdn.ac.uk>. Initial + code was developed by William Stanislaus, <william@erg.abdn.ac.uk>. diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/vortex.txt new file mode 100644 index 000000000..ad3dead05 --- /dev/null +++ b/Documentation/networking/vortex.txt @@ -0,0 +1,448 @@ +Documentation/networking/vortex.txt +Andrew Morton +30 April 2000 + + +This document describes the usage and errata of the 3Com "Vortex" device +driver for Linux, 3c59x.c. + +The driver was written by Donald Becker <becker@scyld.com> + +Don is no longer the prime maintainer of this version of the driver. +Please report problems to one or more of: + + Andrew Morton + Netdev mailing list <netdev@vger.kernel.org> + Linux kernel mailing list <linux-kernel@vger.kernel.org> + +Please note the 'Reporting and Diagnosing Problems' section at the end +of this file. + + +Since kernel 2.3.99-pre6, this driver incorporates the support for the +3c575-series Cardbus cards which used to be handled by 3c575_cb.c. + +This driver supports the following hardware: + + 3c590 Vortex 10Mbps + 3c592 EISA 10Mbps Demon/Vortex + 3c597 EISA Fast Demon/Vortex + 3c595 Vortex 100baseTx + 3c595 Vortex 100baseT4 + 3c595 Vortex 100base-MII + 3c900 Boomerang 10baseT + 3c900 Boomerang 10Mbps Combo + 3c900 Cyclone 10Mbps TPO + 3c900 Cyclone 10Mbps Combo + 3c900 Cyclone 10Mbps TPC + 3c900B-FL Cyclone 10base-FL + 3c905 Boomerang 100baseTx + 3c905 Boomerang 100baseT4 + 3c905B Cyclone 100baseTx + 3c905B Cyclone 10/100/BNC + 3c905B-FX Cyclone 100baseFx + 3c905C Tornado + 3c920B-EMB-WNM (ATI Radeon 9100 IGP) + 3c980 Cyclone + 3c980C Python-T + 3cSOHO100-TX Hurricane + 3c555 Laptop Hurricane + 3c556 Laptop Tornado + 3c556B Laptop Hurricane + 3c575 [Megahertz] 10/100 LAN CardBus + 3c575 Boomerang CardBus + 3CCFE575BT Cyclone CardBus + 3CCFE575CT Tornado CardBus + 3CCFE656 Cyclone CardBus + 3CCFEM656B Cyclone+Winmodem CardBus + 3CXFEM656C Tornado+Winmodem CardBus + 3c450 HomePNA Tornado + 3c920 Tornado + 3c982 Hydra Dual Port A + 3c982 Hydra Dual Port B + 3c905B-T4 + 3c920B-EMB-WNM Tornado + +Module parameters +================= + +There are several parameters which may be provided to the driver when +its module is loaded. These are usually placed in /etc/modprobe.d/*.conf +configuration files. Example: + +options 3c59x debug=3 rx_copybreak=300 + +If you are using the PCMCIA tools (cardmgr) then the options may be +placed in /etc/pcmcia/config.opts: + +module "3c59x" opts "debug=3 rx_copybreak=300" + + +The supported parameters are: + +debug=N + + Where N is a number from 0 to 7. Anything above 3 produces a lot + of output in your system logs. debug=1 is default. + +options=N1,N2,N3,... + + Each number in the list provides an option to the corresponding + network card. So if you have two 3c905's and you wish to provide + them with option 0x204 you would use: + + options=0x204,0x204 + + The individual options are composed of a number of bitfields which + have the following meanings: + + Possible media type settings + 0 10baseT + 1 10Mbs AUI + 2 undefined + 3 10base2 (BNC) + 4 100base-TX + 5 100base-FX + 6 MII (Media Independent Interface) + 7 Use default setting from EEPROM + 8 Autonegotiate + 9 External MII + 10 Use default setting from EEPROM + + When generating a value for the 'options' setting, the above media + selection values may be OR'ed (or added to) the following: + + 0x8000 Set driver debugging level to 7 + 0x4000 Set driver debugging level to 2 + 0x0400 Enable Wake-on-LAN + 0x0200 Force full duplex mode. + 0x0010 Bus-master enable bit (Old Vortex cards only) + + For example: + + insmod 3c59x options=0x204 + + will force full-duplex 100base-TX, rather than allowing the usual + autonegotiation. + +global_options=N + + Sets the `options' parameter for all 3c59x NICs in the machine. + Entries in the `options' array above will override any setting of + this. + +full_duplex=N1,N2,N3... + + Similar to bit 9 of 'options'. Forces the corresponding card into + full-duplex mode. Please use this in preference to the `options' + parameter. + + In fact, please don't use this at all! You're better off getting + autonegotiation working properly. + +global_full_duplex=N1 + + Sets full duplex mode for all 3c59x NICs in the machine. Entries + in the `full_duplex' array above will override any setting of this. + +flow_ctrl=N1,N2,N3... + + Use 802.3x MAC-layer flow control. The 3com cards only support the + PAUSE command, which means that they will stop sending packets for a + short period if they receive a PAUSE frame from the link partner. + + The driver only allows flow control on a link which is operating in + full duplex mode. + + This feature does not appear to work on the 3c905 - only 3c905B and + 3c905C have been tested. + + The 3com cards appear to only respond to PAUSE frames which are + sent to the reserved destination address of 01:80:c2:00:00:01. They + do not honour PAUSE frames which are sent to the station MAC address. + +rx_copybreak=M + + The driver preallocates 32 full-sized (1536 byte) network buffers + for receiving. When a packet arrives, the driver has to decide + whether to leave the packet in its full-sized buffer, or to allocate + a smaller buffer and copy the packet across into it. + + This is a speed/space tradeoff. + + The value of rx_copybreak is used to decide when to make the copy. + If the packet size is less than rx_copybreak, the packet is copied. + The default value for rx_copybreak is 200 bytes. + +max_interrupt_work=N + + The driver's interrupt service routine can handle many receive and + transmit packets in a single invocation. It does this in a loop. + The value of max_interrupt_work governs how many times the interrupt + service routine will loop. The default value is 32 loops. If this + is exceeded the interrupt service routine gives up and generates a + warning message "eth0: Too much work in interrupt". + +hw_checksums=N1,N2,N3,... + + Recent 3com NICs are able to generate IPv4, TCP and UDP checksums + in hardware. Linux has used the Rx checksumming for a long time. + The "zero copy" patch which is planned for the 2.4 kernel series + allows you to make use of the NIC's DMA scatter/gather and transmit + checksumming as well. + + The driver is set up so that, when the zerocopy patch is applied, + all Tornado and Cyclone devices will use S/G and Tx checksums. + + This module parameter has been provided so you can override this + decision. If you think that Tx checksums are causing a problem, you + may disable the feature with `hw_checksums=0'. + + If you think your NIC should be performing Tx checksumming and the + driver isn't enabling it, you can force the use of hardware Tx + checksumming with `hw_checksums=1'. + + The driver drops a message in the logfiles to indicate whether or + not it is using hardware scatter/gather and hardware Tx checksums. + + Scatter/gather and hardware checksums provide considerable + performance improvement for the sendfile() system call, but a small + decrease in throughput for send(). There is no effect upon receive + efficiency. + +compaq_ioaddr=N +compaq_irq=N +compaq_device_id=N + + "Variables to work-around the Compaq PCI BIOS32 problem".... + +watchdog=N + + Sets the time duration (in milliseconds) after which the kernel + decides that the transmitter has become stuck and needs to be reset. + This is mainly for debugging purposes, although it may be advantageous + to increase this value on LANs which have very high collision rates. + The default value is 5000 (5.0 seconds). + +enable_wol=N1,N2,N3,... + + Enable Wake-on-LAN support for the relevant interface. Donald + Becker's `ether-wake' application may be used to wake suspended + machines. + + Also enables the NIC's power management support. + +global_enable_wol=N + + Sets enable_wol mode for all 3c59x NICs in the machine. Entries in + the `enable_wol' array above will override any setting of this. + +Media selection +--------------- + +A number of the older NICs such as the 3c590 and 3c900 series have +10base2 and AUI interfaces. + +Prior to January, 2001 this driver would autoeselect the 10base2 or AUI +port if it didn't detect activity on the 10baseT port. It would then +get stuck on the 10base2 port and a driver reload was necessary to +switch back to 10baseT. This behaviour could not be prevented with a +module option override. + +Later (current) versions of the driver _do_ support locking of the +media type. So if you load the driver module with + + modprobe 3c59x options=0 + +it will permanently select the 10baseT port. Automatic selection of +other media types does not occur. + + +Transmit error, Tx status register 82 +------------------------------------- + +This is a common error which is almost always caused by another host on +the same network being in full-duplex mode, while this host is in +half-duplex mode. You need to find that other host and make it run in +half-duplex mode or fix this host to run in full-duplex mode. + +As a last resort, you can force the 3c59x driver into full-duplex mode +with + + options 3c59x full_duplex=1 + +but this has to be viewed as a workaround for broken network gear and +should only really be used for equipment which cannot autonegotiate. + + +Additional resources +-------------------- + +Details of the device driver implementation are at the top of the source file. + +Additional documentation is available at Don Becker's Linux Drivers site: + + http://www.scyld.com/vortex.html + +Donald Becker's driver development site: + + http://www.scyld.com/network.html + +Donald's vortex-diag program is useful for inspecting the NIC's state: + + http://www.scyld.com/ethercard_diag.html + +Donald's mii-diag program may be used for inspecting and manipulating +the NIC's Media Independent Interface subsystem: + + http://www.scyld.com/ethercard_diag.html#mii-diag + +Donald's wake-on-LAN page: + + http://www.scyld.com/wakeonlan.html + +3Com's DOS-based application for setting up the NICs EEPROMs: + + ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe + + +Autonegotiation notes +--------------------- + + The driver uses a one-minute heartbeat for adapting to changes in + the external LAN environment if link is up and 5 seconds if link is down. + This means that when, for example, a machine is unplugged from a hubbed + 10baseT LAN plugged into a switched 100baseT LAN, the throughput + will be quite dreadful for up to sixty seconds. Be patient. + + Cisco interoperability note from Walter Wong <wcw+@CMU.EDU>: + + On a side note, adding HAS_NWAY seems to share a problem with the + Cisco 6509 switch. Specifically, you need to change the spanning + tree parameter for the port the machine is plugged into to 'portfast' + mode. Otherwise, the negotiation fails. This has been an issue + we've noticed for a while but haven't had the time to track down. + + Cisco switches (Jeff Busch <jbusch@deja.com>) + + My "standard config" for ports to which PC's/servers connect directly: + + interface FastEthernet0/N + description machinename + load-interval 30 + spanning-tree portfast + + If autonegotiation is a problem, you may need to specify "speed + 100" and "duplex full" as well (or "speed 10" and "duplex half"). + + WARNING: DO NOT hook up hubs/switches/bridges to these + specially-configured ports! The switch will become very confused. + + +Reporting and diagnosing problems +--------------------------------- + +Maintainers find that accurate and complete problem reports are +invaluable in resolving driver problems. We are frequently not able to +reproduce problems and must rely on your patience and efforts to get to +the bottom of the problem. + +If you believe you have a driver problem here are some of the +steps you should take: + +- Is it really a driver problem? + + Eliminate some variables: try different cards, different + computers, different cables, different ports on the switch/hub, + different versions of the kernel or of the driver, etc. + +- OK, it's a driver problem. + + You need to generate a report. Typically this is an email to the + maintainer and/or netdev@vger.kernel.org. The maintainer's + email address will be in the driver source or in the MAINTAINERS file. + +- The contents of your report will vary a lot depending upon the + problem. If it's a kernel crash then you should refer to the + admin-guide/reporting-bugs.rst file. + + But for most problems it is useful to provide the following: + + o Kernel version, driver version + + o A copy of the banner message which the driver generates when + it is initialised. For example: + + eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19 + 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface. + MII transceiver found at address 24, status 782d. + Enabling bus-master transmits and whole-frame receives. + + NOTE: You must provide the `debug=2' modprobe option to generate + a full detection message. Please do this: + + modprobe 3c59x debug=2 + + o If it is a PCI device, the relevant output from 'lspci -vx', eg: + + 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74) + Subsystem: 3Com Corporation: Unknown device 9200 + Flags: bus master, medium devsel, latency 32, IRQ 19 + I/O ports at a400 [size=128] + Memory at db000000 (32-bit, non-prefetchable) [size=128] + Expansion ROM at <unassigned> [disabled] [size=128K] + Capabilities: [dc] Power Management version 2 + 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00 + 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00 + 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10 + 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a + + o A description of the environment: 10baseT? 100baseT? + full/half duplex? switched or hubbed? + + o Any additional module parameters which you may be providing to the driver. + + o Any kernel logs which are produced. The more the merrier. + If this is a large file and you are sending your report to a + mailing list, mention that you have the logfile, but don't send + it. If you're reporting direct to the maintainer then just send + it. + + To ensure that all kernel logs are available, add the + following line to /etc/syslog.conf: + + kern.* /var/log/messages + + Then restart syslogd with: + + /etc/rc.d/init.d/syslog restart + + (The above may vary, depending upon which Linux distribution you use). + + o If your problem is reproducible then that's great. Try the + following: + + 1) Increase the debug level. Usually this is done via: + + a) modprobe driver debug=7 + b) In /etc/modprobe.d/driver.conf: + options driver debug=7 + + 2) Recreate the problem with the higher debug level, + send all logs to the maintainer. + + 3) Download you card's diagnostic tool from Donald + Becker's website <http://www.scyld.com/ethercard_diag.html>. + Download mii-diag.c as well. Build these. + + a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is + working correctly. Save the output. + + b) Run the above commands when the card is malfunctioning. Send + both sets of output. + +Finally, please be patient and be prepared to do some work. You may +end up working on this problem for a week or more as the maintainer +asks more questions, asks for more tests, asks for patches to be +applied, etc. At the end of it all, the problem may even remain +unresolved. diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt new file mode 100644 index 000000000..8ff7b4c8f --- /dev/null +++ b/Documentation/networking/vrf.txt @@ -0,0 +1,404 @@ +Virtual Routing and Forwarding (VRF) +==================================== +The VRF device combined with ip rules provides the ability to create virtual +routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the +Linux network stack. One use case is the multi-tenancy problem where each +tenant has their own unique routing tables and in the very least need +different default gateways. + +Processes can be "VRF aware" by binding a socket to the VRF device. Packets +through the socket then use the routing table associated with the VRF +device. An important feature of the VRF device implementation is that it +impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected +(ie., they do not need to be run in each VRF). The design also allows +the use of higher priority ip rules (Policy Based Routing, PBR) to take +precedence over the VRF device rules directing specific traffic as desired. + +In addition, VRF devices allow VRFs to be nested within namespaces. For +example network namespaces provide separation of network interfaces at the +device layer, VLANs on the interfaces within a namespace provide L2 separation +and then VRF devices provide L3 separation. + +Design +------ +A VRF device is created with an associated route table. Network interfaces +are then enslaved to a VRF device: + + +-----------------------------+ + | vrf-blue | ===> route table 10 + +-----------------------------+ + | | | + +------+ +------+ +-------------+ + | eth1 | | eth2 | ... | bond1 | + +------+ +------+ +-------------+ + | | + +------+ +------+ + | eth8 | | eth9 | + +------+ +------+ + +Packets received on an enslaved device and are switched to the VRF device +in the IPv4 and IPv6 processing stacks giving the impression that packets +flow through the VRF device. Similarly on egress routing rules are used to +send packets to the VRF device driver before getting sent out the actual +interface. This allows tcpdump on a VRF device to capture all packets into +and out of the VRF as a whole.[1] Similarly, netfilter[2] and tc rules can be +applied using the VRF device to specify rules that apply to the VRF domain +as a whole. + +[1] Packets in the forwarded state do not flow through the device, so those + packets are not seen by tcpdump. Will revisit this limitation in a + future release. + +[2] Iptables on ingress supports PREROUTING with skb->dev set to the real + ingress device and both INPUT and PREROUTING rules with skb->dev set to + the VRF device. For egress POSTROUTING and OUTPUT rules can be written + using either the VRF device or real egress device. + +Setup +----- +1. VRF device is created with an association to a FIB table. + e.g, ip link add vrf-blue type vrf table 10 + ip link set dev vrf-blue up + +2. An l3mdev FIB rule directs lookups to the table associated with the device. + A single l3mdev rule is sufficient for all VRFs. The VRF device adds the + l3mdev rule for IPv4 and IPv6 when the first device is created with a + default preference of 1000. Users may delete the rule if desired and add + with a different priority or install per-VRF rules. + + Prior to the v4.8 kernel iif and oif rules are needed for each VRF device: + ip ru add oif vrf-blue table 10 + ip ru add iif vrf-blue table 10 + +3. Set the default route for the table (and hence default route for the VRF). + ip route add table 10 unreachable default metric 4278198272 + + This high metric value ensures that the default unreachable route can + be overridden by a routing protocol suite. FRRouting interprets + kernel metrics as a combined admin distance (upper byte) and priority + (lower 3 bytes). Thus the above metric translates to [255/8192]. + +4. Enslave L3 interfaces to a VRF device. + ip link set dev eth1 master vrf-blue + + Local and connected routes for enslaved devices are automatically moved to + the table associated with VRF device. Any additional routes depending on + the enslaved device are dropped and will need to be reinserted to the VRF + FIB table following the enslavement. + + The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global + addresses as VRF enslavement changes. + sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 + +5. Additional VRF routes are added to associated table. + ip route add table 10 ... + + +Applications +------------ +Applications that are to work within a VRF need to bind their socket to the +VRF device: + + setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); + +or to specify the output device using cmsg and IP_PKTINFO. + +TCP & UDP services running in the default VRF context (ie., not bound +to any VRF device) can work across all VRF domains by enabling the +tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: + sysctl -w net.ipv4.tcp_l3mdev_accept=1 + sysctl -w net.ipv4.udp_l3mdev_accept=1 + +netfilter rules on the VRF device can be used to limit access to services +running in the default VRF context as well. + +The default VRF does not have limited scope with respect to port bindings. +That is, if a process does a wildcard bind to a port in the default VRF it +owns the port across all VRF domains within the network namespace. + +################################################################################ + +Using iproute2 for VRFs +======================= +iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this +section lists both commands where appropriate -- with the vrf keyword and the +older form without it. + +1. Create a VRF + + To instantiate a VRF device and associate it with a table: + $ ip link add dev NAME type vrf table ID + + As of v4.8 the kernel supports the l3mdev FIB rule where a single rule + covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first + device create. + +2. List VRFs + + To list VRFs that have been created: + $ ip [-d] link show type vrf + NOTE: The -d option is needed to show the table id + + For example: + $ ip -d link show type vrf + 11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 1 addrgenmode eui64 + 12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 10 addrgenmode eui64 + 13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 66 addrgenmode eui64 + 14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 + vrf table 81 addrgenmode eui64 + + + Or in brief output: + + $ ip -br link show type vrf + mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP> + red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP> + blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP> + green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP> + + +3. Assign a Network Interface to a VRF + + Network interfaces are assigned to a VRF by enslaving the netdevice to a + VRF device: + $ ip link set dev NAME master NAME + + On enslavement connected and local routes are automatically moved to the + table associated with the VRF device. + + For example: + $ ip link set dev eth0 master mgmt + + +4. Show Devices Assigned to a VRF + + To show devices that have been assigned to a specific VRF add the master + option to the ip command: + $ ip link show vrf NAME + $ ip link show master NAME + + For example: + $ ip link show vrf red + 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff + 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff + 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 + link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff + + + Or using the brief output: + $ ip -br link show vrf red + eth1 UP 02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP> + eth2 UP 02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP> + eth5 DOWN 02:00:00:00:02:06 <BROADCAST,MULTICAST> + + +5. Show Neighbor Entries for a VRF + + To list neighbor entries associated with devices enslaved to a VRF device + add the master option to the ip command: + $ ip [-6] neigh show vrf NAME + $ ip [-6] neigh show master NAME + + For example: + $ ip neigh show vrf red + 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE + + $ ip -6 neigh show vrf red + 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + + +6. Show Addresses for a VRF + + To show addresses for interfaces associated with a VRF add the master + option to the ip command: + $ ip addr show vrf NAME + $ ip addr show master NAME + + For example: + $ ip addr show vrf red + 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 + link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff + inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 + valid_lft forever preferred_lft forever + inet6 2002:1::2/120 scope global + valid_lft forever preferred_lft forever + inet6 fe80::ff:fe00:202/64 scope link + valid_lft forever preferred_lft forever + 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 + link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff + inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 + valid_lft forever preferred_lft forever + inet6 2002:2::2/120 scope global + valid_lft forever preferred_lft forever + inet6 fe80::ff:fe00:203/64 scope link + valid_lft forever preferred_lft forever + 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000 + link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff + + Or in brief format: + $ ip -br addr show vrf red + eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 + eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 + eth5 DOWN + + +7. Show Routes for a VRF + + To show routes for a VRF use the ip command to display the table associated + with the VRF device: + $ ip [-6] route show vrf NAME + $ ip [-6] route show table ID + + For example: + $ ip route show vrf red + unreachable default metric 4278198272 + broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 + 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 + local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 + broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 + broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2 + 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 + local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 + broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 + + $ ip -6 route show vrf red + local 2002:1:: dev lo proto none metric 0 pref medium + local 2002:1::2 dev lo proto none metric 0 pref medium + 2002:1::/120 dev eth1 proto kernel metric 256 pref medium + local 2002:2:: dev lo proto none metric 0 pref medium + local 2002:2::2 dev lo proto none metric 0 pref medium + 2002:2::/120 dev eth2 proto kernel metric 256 pref medium + local fe80:: dev lo proto none metric 0 pref medium + local fe80:: dev lo proto none metric 0 pref medium + local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium + local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium + fe80::/64 dev eth1 proto kernel metric 256 pref medium + fe80::/64 dev eth2 proto kernel metric 256 pref medium + ff00::/8 dev red metric 256 pref medium + ff00::/8 dev eth1 metric 256 pref medium + ff00::/8 dev eth2 metric 256 pref medium + unreachable default dev lo metric 4278198272 error -101 pref medium + +8. Route Lookup for a VRF + + A test route lookup can be done for a VRF: + $ ip [-6] route get vrf NAME ADDRESS + $ ip [-6] route get oif NAME ADDRESS + + For example: + $ ip route get 10.2.1.40 vrf red + 10.2.1.40 dev eth1 table red src 10.2.1.2 + cache + + $ ip -6 route get 2002:1::32 vrf red + 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium + + +9. Removing Network Interface from a VRF + + Network interfaces are removed from a VRF by breaking the enslavement to + the VRF device: + $ ip link set dev NAME nomaster + + Connected routes are moved back to the default table and local entries are + moved to the local table. + + For example: + $ ip link set dev eth0 nomaster + +-------------------------------------------------------------------------------- + +Commands used in this example: + +cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF +1 mgmt +10 red +66 blue +81 green +EOF + +function vrf_create +{ + VRF=$1 + TBID=$2 + + # create VRF device + ip link add ${VRF} type vrf table ${TBID} + + if [ "${VRF}" != "mgmt" ]; then + ip route add table ${TBID} unreachable default metric 4278198272 + fi + ip link set dev ${VRF} up +} + +vrf_create mgmt 1 +ip link set dev eth0 master mgmt + +vrf_create red 10 +ip link set dev eth1 master red +ip link set dev eth2 master red +ip link set dev eth5 master red + +vrf_create blue 66 +ip link set dev eth3 master blue + +vrf_create green 81 +ip link set dev eth4 master green + + +Interface addresses from /etc/network/interfaces: +auto eth0 +iface eth0 inet static + address 10.0.0.2 + netmask 255.255.255.0 + gateway 10.0.0.254 + +iface eth0 inet6 static + address 2000:1::2 + netmask 120 + +auto eth1 +iface eth1 inet static + address 10.2.1.2 + netmask 255.255.255.0 + +iface eth1 inet6 static + address 2002:1::2 + netmask 120 + +auto eth2 +iface eth2 inet static + address 10.2.2.2 + netmask 255.255.255.0 + +iface eth2 inet6 static + address 2002:2::2 + netmask 120 + +auto eth3 +iface eth3 inet static + address 10.2.3.2 + netmask 255.255.255.0 + +iface eth3 inet6 static + address 2002:3::2 + netmask 120 + +auto eth4 +iface eth4 inet static + address 10.2.4.2 + netmask 255.255.255.0 + +iface eth4 inet6 static + address 2002:4::2 + netmask 120 diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/vxge.txt new file mode 100644 index 000000000..abfec245f --- /dev/null +++ b/Documentation/networking/vxge.txt @@ -0,0 +1,93 @@ +Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver +============================================================================== + +Contents +-------- + +1) Introduction +2) Features supported +3) Configurable driver parameters +4) Troubleshooting + +1) Introduction: +---------------- +This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O +Virtualized Server adapters. +The X3100 series supports four modes of operation, configurable via +firmware - + Single function mode + Multi function mode + SRIOV mode + MRIOV mode +The functions share a 10GbE link and the pci-e bus, but hardly anything else +inside the ASIC. Features like independent hw reset, statistics, bandwidth/ +priority allocation and guarantees, GRO, TSO, interrupt moderation etc are +supported independently on each function. + +(See below for a complete list of features supported for both IPv4 and IPv6) + +2) Features supported: +---------------------- + +i) Single function mode (up to 17 queues) + +ii) Multi function mode (up to 17 functions) + +iii) PCI-SIG's I/O Virtualization + - Single Root mode: v1.0 (up to 17 functions) + - Multi-Root mode: v1.0 (up to 17 functions) + +iv) Jumbo frames + X3100 Series supports MTU up to 9600 bytes, modifiable using + ip command. + +v) Offloads supported: (Enabled by default) + Checksum offload (TCP/UDP/IP) on transmit and receive paths + TCP Segmentation Offload (TSO) on transmit path + Generic Receive Offload (GRO) on receive path + +vi) MSI-X: (Enabled by default) + Resulting in noticeable performance improvement (up to 7% on certain + platforms). + +vii) NAPI: (Enabled by default) + For better Rx interrupt moderation. + +viii)RTH (Receive Traffic Hash): (Enabled by default) + Receive side steering for better scaling. + +ix) Statistics + Comprehensive MAC-level and software statistics displayed using + "ethtool -S" option. + +x) Multiple hardware queues: (Enabled by default) + Up to 17 hardware based transmit and receive data channels, with + multiple steering options (transmit multiqueue enabled by default). + +3) Configurable driver parameters: +---------------------------------- + +i) max_config_dev + Specifies maximum device functions to be enabled. + Valid range: 1-8 + +ii) max_config_port + Specifies number of ports to be enabled. + Valid range: 1,2 + Default: 1 + +iii)max_config_vpath + Specifies maximum VPATH(s) configured for each device function. + Valid range: 1-17 + +iv) vlan_tag_strip + Enables/disables vlan tag stripping from all received tagged frames that + are not replicated at the internal L2 switch. + Valid range: 0,1 (disabled, enabled respectively) + Default: 1 + +v) addr_learn_en + Enable learning the mac address of the guest OS interface in + virtualization environment. + Valid range: 0,1 (disabled, enabled respectively) + Default: 0 diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt new file mode 100644 index 000000000..c28f4989c --- /dev/null +++ b/Documentation/networking/vxlan.txt @@ -0,0 +1,51 @@ +Virtual eXtensible Local Area Networking documentation +====================================================== + +The VXLAN protocol is a tunnelling protocol designed to solve the +problem of limited VLAN IDs (4096) in IEEE 802.1q. With VXLAN the +size of the identifier is expanded to 24 bits (16777216). + +VXLAN is described by IETF RFC 7348, and has been implemented by a +number of vendors. The protocol runs over UDP using a single +destination port. This document describes the Linux kernel tunnel +device, there is also a separate implementation of VXLAN for +Openvswitch. + +Unlike most tunnels, a VXLAN is a 1 to N network, not just point to +point. A VXLAN device can learn the IP address of the other endpoint +either dynamically in a manner similar to a learning bridge, or make +use of statically-configured forwarding entries. + +The management of vxlan is done in a manner similar to its two closest +neighbors GRE and VLAN. Configuring VXLAN requires the version of +iproute2 that matches the kernel release where VXLAN was first merged +upstream. + +1. Create vxlan device + # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789 + +This creates a new device named vxlan0. The device uses the multicast +group 239.1.1.1 over eth1 to handle traffic for which there is no +entry in the forwarding table. The destination port number is set to +the IANA-assigned value of 4789. The Linux implementation of VXLAN +pre-dates the IANA's selection of a standard destination port number +and uses the Linux-selected value by default to maintain backwards +compatibility. + +2. Delete vxlan device + # ip link delete vxlan0 + +3. Show vxlan info + # ip -d link show vxlan0 + +It is possible to create, destroy and display the vxlan +forwarding table using the new bridge command. + +1. Create forwarding table entry + # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0 + +2. Delete forwarding table entry + # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0 + +3. Show forwarding table + # bridge fdb show dev vxlan0 diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt new file mode 100644 index 000000000..7f213b556 --- /dev/null +++ b/Documentation/networking/x25-iface.txt @@ -0,0 +1,123 @@ + X.25 Device Driver Interface 1.1 + + Jonathan Naylor 26.12.96 + +This is a description of the messages to be passed between the X.25 Packet +Layer and the X.25 device driver. They are designed to allow for the easy +setting of the LAPB mode from within the Packet Layer. + +The X.25 device driver will be coded normally as per the Linux device driver +standards. Most X.25 device drivers will be moderately similar to the +already existing Ethernet device drivers. However unlike those drivers, the +X.25 device driver has a state associated with it, and this information +needs to be passed to and from the Packet Layer for proper operation. + +All messages are held in sk_buff's just like real data to be transmitted +over the LAPB link. The first byte of the skbuff indicates the meaning of +the rest of the skbuff, if any more information does exist. + + +Packet Layer to Device Driver +----------------------------- + +First Byte = 0x00 (X25_IFACE_DATA) + +This indicates that the rest of the skbuff contains data to be transmitted +over the LAPB link. The LAPB link should already exist before any data is +passed down. + +First Byte = 0x01 (X25_IFACE_CONNECT) + +Establish the LAPB link. If the link is already established then the connect +confirmation message should be returned as soon as possible. + +First Byte = 0x02 (X25_IFACE_DISCONNECT) + +Terminate the LAPB link. If it is already disconnected then the disconnect +confirmation message should be returned as soon as possible. + +First Byte = 0x03 (X25_IFACE_PARAMS) + +LAPB parameters. To be defined. + + +Device Driver to Packet Layer +----------------------------- + +First Byte = 0x00 (X25_IFACE_DATA) + +This indicates that the rest of the skbuff contains data that has been +received over the LAPB link. + +First Byte = 0x01 (X25_IFACE_CONNECT) + +LAPB link has been established. The same message is used for both a LAPB +link connect_confirmation and a connect_indication. + +First Byte = 0x02 (X25_IFACE_DISCONNECT) + +LAPB link has been terminated. This same message is used for both a LAPB +link disconnect_confirmation and a disconnect_indication. + +First Byte = 0x03 (X25_IFACE_PARAMS) + +LAPB parameters. To be defined. + + + +Possible Problems +================= + +(Henner Eisen, 2000-10-28) + +The X.25 packet layer protocol depends on a reliable datalink service. +The LAPB protocol provides such reliable service. But this reliability +is not preserved by the Linux network device driver interface: + +- With Linux 2.4.x (and above) SMP kernels, packet ordering is not + preserved. Even if a device driver calls netif_rx(skb1) and later + netif_rx(skb2), skb2 might be delivered to the network layer + earlier that skb1. +- Data passed upstream by means of netif_rx() might be dropped by the + kernel if the backlog queue is congested. + +The X.25 packet layer protocol will detect this and reset the virtual +call in question. But many upper layer protocols are not designed to +handle such N-Reset events gracefully. And frequent N-Reset events +will always degrade performance. + +Thus, driver authors should make netif_rx() as reliable as possible: + +SMP re-ordering will not occur if the driver's interrupt handler is +always executed on the same CPU. Thus, + +- Driver authors should use irq affinity for the interrupt handler. + +The probability of packet loss due to backlog congestion can be +reduced by the following measures or a combination thereof: + +(1) Drivers for kernel versions 2.4.x and above should always check the + return value of netif_rx(). If it returns NET_RX_DROP, the + driver's LAPB protocol must not confirm reception of the frame + to the peer. + This will reliably suppress packet loss. The LAPB protocol will + automatically cause the peer to re-transmit the dropped packet + later. + The lapb module interface was modified to support this. Its + data_indication() method should now transparently pass the + netif_rx() return value to the (lapb module) caller. +(2) Drivers for kernel versions 2.2.x should always check the global + variable netdev_dropping when a new frame is received. The driver + should only call netif_rx() if netdev_dropping is zero. Otherwise + the driver should not confirm delivery of the frame and drop it. + Alternatively, the driver can queue the frame internally and call + netif_rx() later when netif_dropping is 0 again. In that case, delivery + confirmation should also be deferred such that the internal queue + cannot grow to much. + This will not reliably avoid packet loss, but the probability + of packet loss in netif_rx() path will be significantly reduced. +(3) Additionally, driver authors might consider to support + CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up + when a previously congested backlog queue becomes empty again. + The driver could uses this for flow-controlling the peer by means + of the LAPB protocol's flow-control service. diff --git a/Documentation/networking/x25.txt b/Documentation/networking/x25.txt new file mode 100644 index 000000000..c91c6d715 --- /dev/null +++ b/Documentation/networking/x25.txt @@ -0,0 +1,44 @@ +Linux X.25 Project + +As my third year dissertation at University I have taken it upon myself to +write an X.25 implementation for Linux. My aim is to provide a complete X.25 +Packet Layer and a LAPB module to allow for "normal" X.25 to be run using +Linux. There are two sorts of X.25 cards available, intelligent ones that +implement LAPB on the card itself, and unintelligent ones that simply do +framing, bit-stuffing and checksumming. These both need to be handled by the +system. + +I therefore decided to write the implementation such that as far as the +Packet Layer is concerned, the link layer was being performed by a lower +layer of the Linux kernel and therefore it did not concern itself with +implementation of LAPB. Therefore the LAPB modules would be called by +unintelligent X.25 card drivers and not by intelligent ones, this would +provide a uniform device driver interface, and simplify configuration. + +To confuse matters a little, an 802.2 LLC implementation for Linux is being +written which will allow X.25 to be run over an Ethernet (or Token Ring) and +conform with the JNT "Pink Book", this will have a different interface to +the Packet Layer but there will be no confusion since the class of device +being served by the LLC will be completely separate from LAPB. The LLC +implementation is being done as part of another protocol project (SNA) and +by a different author. + +Just when you thought that it could not become more confusing, another +option appeared, XOT. This allows X.25 Packet Layer frames to operate over +the Internet using TCP/IP as a reliable link layer. RFC1613 specifies the +format and behaviour of the protocol. If time permits this option will also +be actively considered. + +A linux-x25 mailing list has been created at vger.kernel.org to support the +development and use of Linux X.25. It is early days yet, but interested +parties are welcome to subscribe to it. Just send a message to +majordomo@vger.kernel.org with the following in the message body: + +subscribe linux-x25 +end + +The contents of the Subject line are ignored. + +Jonathan + +g4klx@g4klx.demon.co.uk diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.txt new file mode 100644 index 000000000..50c34ca65 --- /dev/null +++ b/Documentation/networking/xfrm_device.txt @@ -0,0 +1,135 @@ + +=============================================== +XFRM device - offloading the IPsec computations +=============================================== +Shannon Nelson <shannon.nelson@oracle.com> + + +Overview +======== + +IPsec is a useful feature for securing network traffic, but the +computational cost is high: a 10Gbps link can easily be brought down +to under 1Gbps, depending on the traffic and link configuration. +Luckily, there are NICs that offer a hardware based IPsec offload which +can radically increase throughput and decrease CPU utilization. The XFRM +Device interface allows NIC drivers to offer to the stack access to the +hardware offload. + +Userland access to the offload is typically through a system such as +libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can +be handy when experimenting. An example command might look something +like this: + + ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \ + reqid 0x07 replay-window 32 \ + aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ + sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \ + offload dev eth4 dir in + +Yes, that's ugly, but that's what shell scripts and/or libreswan are for. + + + +Callbacks to implement +====================== + +/* from include/linux/netdevice.h */ +struct xfrmdev_ops { + int (*xdo_dev_state_add) (struct xfrm_state *x); + void (*xdo_dev_state_delete) (struct xfrm_state *x); + void (*xdo_dev_state_free) (struct xfrm_state *x); + bool (*xdo_dev_offload_ok) (struct sk_buff *skb, + struct xfrm_state *x); + void (*xdo_dev_state_advance_esn) (struct xfrm_state *x); +}; + +The NIC driver offering ipsec offload will need to implement these +callbacks to make the offload available to the network stack's +XFRM subsytem. Additionally, the feature bits NETIF_F_HW_ESP and +NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload. + + + +Flow +==== + +At probe time and before the call to register_netdev(), the driver should +set up local data structures and XFRM callbacks, and set the feature bits. +The XFRM code's listener will finish the setup on NETDEV_REGISTER. + + adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops; + adapter->netdev->features |= NETIF_F_HW_ESP; + adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP; + +When new SAs are set up with a request for "offload" feature, the +driver's xdo_dev_state_add() will be given the new SA to be offloaded +and an indication of whether it is for Rx or Tx. The driver should + - verify the algorithm is supported for offloads + - store the SA information (key, salt, target-ip, protocol, etc) + - enable the HW offload of the SA + +The driver can also set an offload_handle in the SA, an opaque void pointer +that can be used to convey context into the fast-path offload requests. + + xs->xso.offload_handle = context; + + +When the network stack is preparing an IPsec packet for an SA that has +been setup for offload, it first calls into xdo_dev_offload_ok() with +the skb and the intended offload state to ask the driver if the offload +will serviceable. This can check the packet information to be sure the +offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and +return true of false to signify its support. + +When ready to send, the driver needs to inspect the Tx packet for the +offload information, including the opaque context, and set up the packet +send accordingly. + + xs = xfrm_input_state(skb); + context = xs->xso.offload_handle; + set up HW for send + +The stack has already inserted the appropriate IPsec headers in the +packet data, the offload just needs to do the encryption and fix up the +header values. + + +When a packet is received and the HW has indicated that it offloaded a +decryption, the driver needs to add a reference to the decoded SA into +the packet's skb. At this point the data should be decrypted but the +IPsec headers are still in the packet data; they are removed later up +the stack in xfrm_input(). + + find and hold the SA that was used to the Rx skb + get spi, protocol, and destination IP from packet headers + xs = find xs from (spi, protocol, dest_IP) + xfrm_state_hold(xs); + + store the state information into the skb + skb->sp = secpath_dup(skb->sp); + skb->sp->xvec[skb->sp->len++] = xs; + skb->sp->olen++; + + indicate the success and/or error status of the offload + xo = xfrm_offload(skb); + xo->flags = CRYPTO_DONE; + xo->status = crypto_status; + + hand the packet to napi_gro_receive() as usual + +In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn(). +Driver will check packet seq number and update HW ESN state machine if needed. + +When the SA is removed by the user, the driver's xdo_dev_state_delete() +is asked to disable the offload. Later, xdo_dev_state_free() is called +from a garbage collection routine after all reference counts to the state +have been removed and any remaining resources can be cleared for the +offload state. How these are used by the driver will depend on specific +hardware needs. + +As a netdev is set to DOWN the XFRM stack's netdev listener will call +xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded +states. + + diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt new file mode 100644 index 000000000..2eae619ab --- /dev/null +++ b/Documentation/networking/xfrm_proc.txt @@ -0,0 +1,82 @@ +XFRM proc - /proc/net/xfrm_* files +================================== +Masahide NAKAMURA <nakam@linux-ipv6.org> + + +Transformation Statistics +------------------------- + +The xfrm_proc code is a set of statistics showing numbers of packets +dropped by the transformation code and why. These counters are defined +as part of the linux private MIB. These counters can be viewed in +/proc/net/xfrm_stat. + + +Inbound errors +~~~~~~~~~~~~~~ +XfrmInError: + All errors which is not matched others +XfrmInBufferError: + No buffer is left +XfrmInHdrError: + Header error +XfrmInNoStates: + No state is found + i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong +XfrmInStateProtoError: + Transformation protocol specific error + e.g. SA key is wrong +XfrmInStateModeError: + Transformation mode specific error +XfrmInStateSeqError: + Sequence error + i.e. Sequence number is out of window +XfrmInStateExpired: + State is expired +XfrmInStateMismatch: + State has mismatch option + e.g. UDP encapsulation type is mismatch +XfrmInStateInvalid: + State is invalid +XfrmInTmplMismatch: + No matching template for states + e.g. Inbound SAs are correct but SP rule is wrong +XfrmInNoPols: + No policy is found for states + e.g. Inbound SAs are correct but no SP is found +XfrmInPolBlock: + Policy discards +XfrmInPolError: + Policy error +XfrmAcquireError: + State hasn't been fully acquired before use +XfrmFwdHdrError: + Forward routing of a packet is not allowed + +Outbound errors +~~~~~~~~~~~~~~~ +XfrmOutError: + All errors which is not matched others +XfrmOutBundleGenError: + Bundle generation error +XfrmOutBundleCheckError: + Bundle check error +XfrmOutNoStates: + No state is found +XfrmOutStateProtoError: + Transformation protocol specific error +XfrmOutStateModeError: + Transformation mode specific error +XfrmOutStateSeqError: + Sequence error + i.e. Sequence number overflow +XfrmOutStateExpired: + State is expired +XfrmOutPolBlock: + Policy discards +XfrmOutPolDead: + Policy is dead +XfrmOutPolError: + Policy error +XfrmOutStateInvalid: + State is invalid, perhaps expired diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt new file mode 100644 index 000000000..8d88e0f2e --- /dev/null +++ b/Documentation/networking/xfrm_sync.txt @@ -0,0 +1,169 @@ + +The sync patches work is based on initial patches from +Krisztian <hidden@balabit.hu> and others and additional patches +from Jamal <hadi@cyberus.ca>. + +The end goal for syncing is to be able to insert attributes + generate +events so that the SA can be safely moved from one machine to another +for HA purposes. +The idea is to synchronize the SA so that the takeover machine can do +the processing of the SA as accurate as possible if it has access to it. + +We already have the ability to generate SA add/del/upd events. +These patches add ability to sync and have accurate lifetime byte (to +ensure proper decay of SAs) and replay counters to avoid replay attacks +with as minimal loss at failover time. +This way a backup stays as closely up-to-date as an active member. + +Because the above items change for every packet the SA receives, +it is possible for a lot of the events to be generated. +For this reason, we also add a nagle-like algorithm to restrict +the events. i.e we are going to set thresholds to say "let me +know if the replay sequence threshold is reached or 10 secs have passed" +These thresholds are set system-wide via sysctls or can be updated +per SA. + +The identified items that need to be synchronized are: +- the lifetime byte counter +note that: lifetime time limit is not important if you assume the failover +machine is known ahead of time since the decay of the time countdown +is not driven by packet arrival. +- the replay sequence for both inbound and outbound + +1) Message Structure +---------------------- + +nlmsghdr:aevent_id:optional-TLVs. + +The netlink message types are: + +XFRM_MSG_NEWAE and XFRM_MSG_GETAE. + +A XFRM_MSG_GETAE does not have TLVs. +A XFRM_MSG_NEWAE will have at least two TLVs (as is +discussed further below). + +aevent_id structure looks like: + + struct xfrm_aevent_id { + struct xfrm_usersa_id sa_id; + xfrm_address_t saddr; + __u32 flags; + __u32 reqid; + }; + +The unique SA is identified by the combination of xfrm_usersa_id, +reqid and saddr. + +flags are used to indicate different things. The possible +flags are: + XFRM_AE_RTHR=1, /* replay threshold*/ + XFRM_AE_RVAL=2, /* replay value */ + XFRM_AE_LVAL=4, /* lifetime value */ + XFRM_AE_ETHR=8, /* expiry timer threshold */ + XFRM_AE_CR=16, /* Event cause is replay update */ + XFRM_AE_CE=32, /* Event cause is timer expiry */ + XFRM_AE_CU=64, /* Event cause is policy update */ + +How these flags are used is dependent on the direction of the +message (kernel<->user) as well the cause (config, query or event). +This is described below in the different messages. + +The pid will be set appropriately in netlink to recognize direction +(0 to the kernel and pid = processid that created the event +when going from kernel to user space) + +A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS +to get notified of these events. + +2) TLVS reflect the different parameters: +----------------------------------------- + +a) byte value (XFRMA_LTIME_VAL) +This TLV carries the running/current counter for byte lifetime since +last event. + +b)replay value (XFRMA_REPLAY_VAL) +This TLV carries the running/current counter for replay sequence since +last event. + +c)replay threshold (XFRMA_REPLAY_THRESH) +This TLV carries the threshold being used by the kernel to trigger events +when the replay sequence is exceeded. + +d) expiry timer (XFRMA_ETIMER_THRESH) +This is a timer value in milliseconds which is used as the nagle +value to rate limit the events. + +3) Default configurations for the parameters: +---------------------------------------------- + +By default these events should be turned off unless there is +at least one listener registered to listen to the multicast +group XFRMNLGRP_AEVENTS. + +Programs installing SAs will need to specify the two thresholds, however, +in order to not change existing applications such as racoon +we also provide default threshold values for these different parameters +in case they are not specified. + +the two sysctls/proc entries are: +a) /proc/sys/net/core/sysctl_xfrm_aevent_etime +used to provide default values for the XFRMA_ETIMER_THRESH in incremental +units of time of 100ms. The default is 10 (1 second) + +b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth +used to provide default values for XFRMA_REPLAY_THRESH parameter +in incremental packet count. The default is two packets. + +4) Message types +---------------- + +a) XFRM_MSG_GETAE issued by user-->kernel. +XFRM_MSG_GETAE does not carry any TLVs. +The response is a XFRM_MSG_NEWAE which is formatted based on what +XFRM_MSG_GETAE queried for. +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +*if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved +*if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved + +b) XFRM_MSG_NEWAE is issued by either user space to configure +or kernel to announce events or respond to a XFRM_MSG_GETAE. + +i) user --> kernel to configure a specific SA. +any of the values or threshold parameters can be updated by passing the +appropriate TLV. +A response is issued back to the sender in user space to indicate success +or failure. +In the case of success, additionally an event with +XFRM_MSG_NEWAE is also issued to any listeners as described in iii). + +ii) kernel->user direction as a response to XFRM_MSG_GETAE +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +The threshold TLVs will be included if explicitly requested in +the XFRM_MSG_GETAE message. + +iii) kernel->user to report as event if someone sets any values or +thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above). +In such a case XFRM_AE_CU flag is set to inform the user that +the change happened as a result of an update. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +iv) kernel->user to report event when replay threshold or a timeout +is exceeded. +In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout +happened) is set to inform the user what happened. +Note the two flags are mutually exclusive. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +Exceptions to threshold settings +-------------------------------- + +If you have an SA that is getting hit by traffic in bursts such that +there is a period where the timer threshold expires with no packets +seen, then an odd behavior is seen as follows: +The first packet arrival after a timer expiry will trigger a timeout +event; i.e we don't wait for a timeout period or a packet threshold +to be reached. This is done for simplicity and efficiency reasons. + +-JHS diff --git a/Documentation/networking/xfrm_sysctl.txt b/Documentation/networking/xfrm_sysctl.txt new file mode 100644 index 000000000..5bbd16792 --- /dev/null +++ b/Documentation/networking/xfrm_sysctl.txt @@ -0,0 +1,4 @@ +/proc/sys/net/core/xfrm_* Variables: + +xfrm_acq_expires - INTEGER + default 30 - hard timeout in seconds for acquire requests diff --git a/Documentation/networking/z8530book.rst b/Documentation/networking/z8530book.rst new file mode 100644 index 000000000..fea2c40e7 --- /dev/null +++ b/Documentation/networking/z8530book.rst @@ -0,0 +1,256 @@ +======================= +Z8530 Programming Guide +======================= + +:Author: Alan Cox + +Introduction +============ + +The Z85x30 family synchronous/asynchronous controller chips are used on +a large number of cheap network interface cards. The kernel provides a +core interface layer that is designed to make it easy to provide WAN +services using this chip. + +The current driver only support synchronous operation. Merging the +asynchronous driver support into this code to allow any Z85x30 device to +be used as both a tty interface and as a synchronous controller is a +project for Linux post the 2.4 release + +Driver Modes +============ + +The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices in +three different modes. Each mode can be applied to an individual channel +on the chip (each chip has two channels). + +The PIO synchronous mode supports the most common Z8530 wiring. Here the +chip is interface to the I/O and interrupt facilities of the host +machine but not to the DMA subsystem. When running PIO the Z8530 has +extremely tight timing requirements. Doing high speeds, even with a +Z85230 will be tricky. Typically you should expect to achieve at best +9600 baud with a Z8C530 and 64Kbits with a Z85230. + +The DMA mode supports the chip when it is configured to use dual DMA +channels on an ISA bus. The better cards tend to support this mode of +operation for a single channel. With DMA running the Z85230 tops out +when it starts to hit ISA DMA constraints at about 512Kbits. It is worth +noting here that many PC machines hang or crash when the chip is driven +fast enough to hold the ISA bus solid. + +Transmit DMA mode uses a single DMA channel. The DMA channel is used for +transmission as the transmit FIFO is smaller than the receive FIFO. it +gives better performance than pure PIO mode but is nowhere near as ideal +as pure DMA mode. + +Using the Z85230 driver +======================= + +The Z85230 driver provides the back end interface to your board. To +configure a Z8530 interface you need to detect the board and to identify +its ports and interrupt resources. It is also your problem to verify the +resources are available. + +Having identified the chip you need to fill in a struct z8530_dev, +which describes each chip. This object must exist until you finally +shutdown the board. Firstly zero the active field. This ensures nothing +goes off without you intending it. The irq field should be set to the +interrupt number of the chip. (Each chip has a single interrupt source +rather than each channel). You are responsible for allocating the +interrupt line. The interrupt handler should be set to +:c:func:`z8530_interrupt()`. The device id should be set to the +z8530_dev structure pointer. Whether the interrupt can be shared or not +is board dependent, and up to you to initialise. + +The structure holds two channel structures. Initialise chanA.ctrlio and +chanA.dataio with the address of the control and data ports. You can or +this with Z8530_PORT_SLEEP to indicate your interface needs the 5uS +delay for chip settling done in software. The PORT_SLEEP option is +architecture specific. Other flags may become available on future +platforms, eg for MMIO. Initialise the chanA.irqs to &z8530_nop to +start the chip up as disabled and discarding interrupt events. This +ensures that stray interrupts will be mopped up and not hang the bus. +Set chanA.dev to point to the device structure itself. The private and +name field you may use as you wish. The private field is unused by the +Z85230 layer. The name is used for error reporting and it may thus make +sense to make it match the network name. + +Repeat the same operation with the B channel if your chip has both +channels wired to something useful. This isn't always the case. If it is +not wired then the I/O values do not matter, but you must initialise +chanB.dev. + +If your board has DMA facilities then initialise the txdma and rxdma +fields for the relevant channels. You must also allocate the ISA DMA +channels and do any necessary board level initialisation to configure +them. The low level driver will do the Z8530 and DMA controller +programming but not board specific magic. + +Having initialised the device you can then call +:c:func:`z8530_init()`. This will probe the chip and reset it into +a known state. An identification sequence is then run to identify the +chip type. If the checks fail to pass the function returns a non zero +error code. Typically this indicates that the port given is not valid. +After this call the type field of the z8530_dev structure is +initialised to either Z8530, Z85C30 or Z85230 according to the chip +found. + +Once you have called z8530_init you can also make use of the utility +function :c:func:`z8530_describe()`. This provides a consistent +reporting format for the Z8530 devices, and allows all the drivers to +provide consistent reporting. + +Attaching Network Interfaces +============================ + +If you wish to use the network interface facilities of the driver, then +you need to attach a network device to each channel that is present and +in use. In addition to use the generic HDLC you need to follow some +additional plumbing rules. They may seem complex but a look at the +example hostess_sv11 driver should reassure you. + +The network device used for each channel should be pointed to by the +netdevice field of each channel. The hdlc-> priv field of the network +device points to your private data - you will need to be able to find +your private data from this. + +The way most drivers approach this particular problem is to create a +structure holding the Z8530 device definition and put that into the +private field of the network device. The network device fields of the +channels then point back to the network devices. + +If you wish to use the generic HDLC then you need to register the HDLC +device. + +Before you register your network device you will also need to provide +suitable handlers for most of the network device callbacks. See the +network device documentation for more details on this. + +Configuring And Activating The Port +=================================== + +The Z85230 driver provides helper functions and tables to load the port +registers on the Z8530 chips. When programming the register settings for +a channel be aware that the documentation recommends initialisation +orders. Strange things happen when these are not followed. + +:c:func:`z8530_channel_load()` takes an array of pairs of +initialisation values in an array of u8 type. The first value is the +Z8530 register number. Add 16 to indicate the alternate register bank on +the later chips. The array is terminated by a 255. + +The driver provides a pair of public tables. The z8530_hdlc_kilostream +table is for the UK 'Kilostream' service and also happens to cover most +other end host configurations. The z8530_hdlc_kilostream_85230 table +is the same configuration using the enhancements of the 85230 chip. The +configuration loaded is standard NRZ encoded synchronous data with HDLC +bitstuffing. All of the timing is taken from the other end of the link. + +When writing your own tables be aware that the driver internally tracks +register values. It may need to reload values. You should therefore be +sure to set registers 1-7, 9-11, 14 and 15 in all configurations. Where +the register settings depend on DMA selection the driver will update the +bits itself when you open or close. Loading a new table with the +interface open is not recommended. + +There are three standard configurations supported by the core code. In +PIO mode the interface is programmed up to use interrupt driven PIO. +This places high demands on the host processor to avoid latency. The +driver is written to take account of latency issues but it cannot avoid +latencies caused by other drivers, notably IDE in PIO mode. Because the +drivers allocate buffers you must also prevent MTU changes while the +port is open. + +Once the port is open it will call the rx_function of each channel +whenever a completed packet arrived. This is invoked from interrupt +context and passes you the channel and a network buffer (struct +sk_buff) holding the data. The data includes the CRC bytes so most +users will want to trim the last two bytes before processing the data. +This function is very timing critical. When you wish to simply discard +data the support code provides the function +:c:func:`z8530_null_rx()` to discard the data. + +To active PIO mode sending and receiving the ``z8530_sync_open`` is called. +This expects to be passed the network device and the channel. Typically +this is called from your network device open callback. On a failure a +non zero error status is returned. +The :c:func:`z8530_sync_close()` function shuts down a PIO +channel. This must be done before the channel is opened again and before +the driver shuts down and unloads. + +The ideal mode of operation is dual channel DMA mode. Here the kernel +driver will configure the board for DMA in both directions. The driver +also handles ISA DMA issues such as controller programming and the +memory range limit for you. This mode is activated by calling the +:c:func:`z8530_sync_dma_open()` function. On failure a non zero +error value is returned. Once this mode is activated it can be shut down +by calling the :c:func:`z8530_sync_dma_close()`. You must call +the close function matching the open mode you used. + +The final supported mode uses a single DMA channel to drive the transmit +side. As the Z85C30 has a larger FIFO on the receive channel this tends +to increase the maximum speed a little. This is activated by calling the +``z8530_sync_txdma_open``. This returns a non zero error code on failure. The +:c:func:`z8530_sync_txdma_close()` function closes down the Z8530 +interface from this mode. + +Network Layer Functions +======================= + +The Z8530 layer provides functions to queue packets for transmission. +The driver internally buffers the frame currently being transmitted and +one further frame (in order to keep back to back transmission running). +Any further buffering is up to the caller. + +The function :c:func:`z8530_queue_xmit()` takes a network buffer +in sk_buff format and queues it for transmission. The caller must +provide the entire packet with the exception of the bitstuffing and CRC. +This is normally done by the caller via the generic HDLC interface +layer. It returns 0 if the buffer has been queued and non zero values +for queue full. If the function accepts the buffer it becomes property +of the Z8530 layer and the caller should not free it. + +The function :c:func:`z8530_get_stats()` returns a pointer to an +internally maintained per interface statistics block. This provides most +of the interface code needed to implement the network layer get_stats +callback. + +Porting The Z8530 Driver +======================== + +The Z8530 driver is written to be portable. In DMA mode it makes +assumptions about the use of ISA DMA. These are probably warranted in +most cases as the Z85230 in particular was designed to glue to PC type +machines. The PIO mode makes no real assumptions. + +Should you need to retarget the Z8530 driver to another architecture the +only code that should need changing are the port I/O functions. At the +moment these assume PC I/O port accesses. This may not be appropriate +for all platforms. Replacing :c:func:`z8530_read_port()` and +``z8530_write_port`` is intended to be all that is required to port +this driver layer. + +Known Bugs And Assumptions +========================== + +Interrupt Locking + The locking in the driver is done via the global cli/sti lock. This + makes for relatively poor SMP performance. Switching this to use a + per device spin lock would probably materially improve performance. + +Occasional Failures + We have reports of occasional failures when run for very long + periods of time and the driver starts to receive junk frames. At the + moment the cause of this is not clear. + +Public Functions Provided +========================= + +.. kernel-doc:: drivers/net/wan/z85230.c + :export: + +Internal Functions +================== + +.. kernel-doc:: drivers/net/wan/z85230.c + :internal: diff --git a/Documentation/networking/z8530drv.txt b/Documentation/networking/z8530drv.txt new file mode 100644 index 000000000..2206abbc3 --- /dev/null +++ b/Documentation/networking/z8530drv.txt @@ -0,0 +1,657 @@ +This is a subset of the documentation. To use this driver you MUST have the +full package from: + +Internet: +========= + +1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz + +2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz + +Please note that the information in this document may be hopelessly outdated. +A new version of the documentation, along with links to other important +Linux Kernel AX.25 documentation and programs, is available on +http://yaina.de/jreuter + +----------------------------------------------------------------------------- + + + SCC.C - Linux driver for Z8530 based HDLC cards for AX.25 + + ******************************************************************** + + (c) 1993,2000 by Joerg Reuter DL1BKE <jreuter@yaina.de> + + portions (c) 1993 Guido ten Dolle PE1NNZ + + for the complete copyright notice see >> Copying.Z8530DRV << + + ******************************************************************** + + +1. Initialization of the driver +=============================== + +To use the driver, 3 steps must be performed: + + 1. if compiled as module: loading the module + 2. Setup of hardware, MODEM and KISS parameters with sccinit + 3. Attach each channel to the Linux kernel AX.25 with "ifconfig" + +Unlike the versions below 2.4 this driver is a real network device +driver. If you want to run xNOS instead of our fine kernel AX.25 +use a 2.x version (available from above sites) or read the +AX.25-HOWTO on how to emulate a KISS TNC on network device drivers. + + +1.1 Loading the module +====================== + +(If you're going to compile the driver as a part of the kernel image, + skip this chapter and continue with 1.2) + +Before you can use a module, you'll have to load it with + + insmod scc.o + +please read 'man insmod' that comes with module-init-tools. + +You should include the insmod in one of the /etc/rc.d/rc.* files, +and don't forget to insert a call of sccinit after that. It +will read your /etc/z8530drv.conf. + +1.2. /etc/z8530drv.conf +======================= + +To setup all parameters you must run /sbin/sccinit from one +of your rc.*-files. This has to be done BEFORE you can +"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf +and sets the hardware, MODEM and KISS parameters. A sample file is +delivered with this package. Change it to your needs. + +The file itself consists of two main sections. + +1.2.1 configuration of hardware parameters +========================================== + +The hardware setup section defines the following parameters for each +Z8530: + +chip 1 +data_a 0x300 # data port A +ctrl_a 0x304 # control port A +data_b 0x301 # data port B +ctrl_b 0x305 # control port B +irq 5 # IRQ No. 5 +pclock 4915200 # clock +board BAYCOM # hardware type +escc no # enhanced SCC chip? (8580/85180/85280) +vector 0 # latch for interrupt vector +special no # address of special function register +option 0 # option to set via sfr + + +chip - this is just a delimiter to make sccinit a bit simpler to + program. A parameter has no effect. + +data_a - the address of the data port A of this Z8530 (needed) +ctrl_a - the address of the control port A (needed) +data_b - the address of the data port B (needed) +ctrl_b - the address of the control port B (needed) + +irq - the used IRQ for this chip. Different chips can use different + IRQs or the same. If they share an interrupt, it needs to be + specified within one chip-definition only. + +pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is + default), measured in Hertz + +board - the "type" of the board: + + SCC type value + --------------------------------- + PA0HZP SCC card PA0HZP + EAGLE card EAGLE + PC100 card PC100 + PRIMUS-PC (DG9BL) card PRIMUS + BayCom (U)SCC card BAYCOM + +escc - if you want support for ESCC chips (8580, 85180, 85280), set + this to "yes" (option, defaults to "no") + +vector - address of the vector latch (aka "intack port") for PA0HZP + cards. There can be only one vector latch for all chips! + (option, defaults to 0) + +special - address of the special function register on several cards. + (option, defaults to 0) + +option - The value you write into that register (option, default is 0) + +You can specify up to four chips (8 channels). If this is not enough, +just change + + #define MAXSCC 4 + +to a higher value. + +Example for the BAYCOM USCC: +---------------------------- + +chip 1 +data_a 0x300 # data port A +ctrl_a 0x304 # control port A +data_b 0x301 # data port B +ctrl_b 0x305 # control port B +irq 5 # IRQ No. 5 (#) +board BAYCOM # hardware type (*) +# +# SCC chip 2 +# +chip 2 +data_a 0x302 +ctrl_a 0x306 +data_b 0x303 +ctrl_b 0x307 +board BAYCOM + +An example for a PA0HZP card: +----------------------------- + +chip 1 +data_a 0x153 +data_b 0x151 +ctrl_a 0x152 +ctrl_b 0x150 +irq 9 +pclock 4915200 +board PA0HZP +vector 0x168 +escc no +# +# +# +chip 2 +data_a 0x157 +data_b 0x155 +ctrl_a 0x156 +ctrl_b 0x154 +irq 9 +pclock 4915200 +board PA0HZP +vector 0x168 +escc no + +A DRSI would should probably work with this: +-------------------------------------------- +(actually: two DRSI cards...) + +chip 1 +data_a 0x303 +data_b 0x301 +ctrl_a 0x302 +ctrl_b 0x300 +irq 7 +pclock 4915200 +board DRSI +escc no +# +# +# +chip 2 +data_a 0x313 +data_b 0x311 +ctrl_a 0x312 +ctrl_b 0x310 +irq 7 +pclock 4915200 +board DRSI +escc no + +Note that you cannot use the on-board baudrate generator off DRSI +cards. Use "mode dpll" for clock source (see below). + +This is based on information provided by Mike Bilow (and verified +by Paul Helay) + +The utility "gencfg" +-------------------- + +If you only know the parameters for the PE1CHL driver for DOS, +run gencfg. It will generate the correct port addresses (I hope). +Its parameters are exactly the same as the ones you use with +the "attach scc" command in net, except that the string "init" must +not appear. Example: + +gencfg 2 0x150 4 2 0 1 0x168 9 4915200 + +will print a skeleton z8530drv.conf for the OptoSCC to stdout. + +gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10 + +does the same for the BAYCOM USCC card. In my opinion it is much easier +to edit scc_config.h... + + +1.2.2 channel configuration +=========================== + +The channel definition is divided into three sub sections for each +channel: + +An example for scc0: + +# DEVICE + +device scc0 # the device for the following params + +# MODEM / BUFFERS + +speed 1200 # the default baudrate +clock dpll # clock source: + # dpll = normal half duplex operation + # external = MODEM provides own Rx/Tx clock + # divider = use full duplex divider if + # installed (1) +mode nrzi # HDLC encoding mode + # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM + # nrz = DF9IC 9k6 MODEM + # +bufsize 384 # size of buffers. Note that this must include + # the AX.25 header, not only the data field! + # (optional, defaults to 384) + +# KISS (Layer 1) + +txdelay 36 # (see chapter 1.4) +persist 64 +slot 8 +tail 8 +fulldup 0 +wait 12 +min 3 +maxkey 7 +idle 3 +maxdef 120 +group 0 +txoff off +softdcd on +slip off + +The order WITHIN these sections is unimportant. The order OF these +sections IS important. The MODEM parameters are set with the first +recognized KISS parameter... + +Please note that you can initialize the board only once after boot +(or insmod). You can change all parameters but "mode" and "clock" +later with the Sccparam program or through KISS. Just to avoid +security holes... + +(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not + present at all (BayCom). It feeds back the output of the DPLL + (digital pll) as transmit clock. Using this mode without a divider + installed will normally result in keying the transceiver until + maxkey expires --- of course without sending anything (useful). + +2. Attachment of a channel by your AX.25 software +================================================= + +2.1 Kernel AX.25 +================ + +To set up an AX.25 device you can simply type: + + ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7 + +This will create a network interface with the IP number 44.128.20.107 +and the callsign "dl0tha". If you do not have any IP number (yet) you +can use any of the 44.128.0.0 network. Note that you do not need +axattach. The purpose of axattach (like slattach) is to create a KISS +network device linked to a TTY. Please read the documentation of the +ax25-utils and the AX.25-HOWTO to learn how to set the parameters of +the kernel AX.25. + +2.2 NOS, NET and TFKISS +======================= + +Since the TTY driver (aka KISS TNC emulation) is gone you need +to emulate the old behaviour. The cost of using these programs is +that you probably need to compile the kernel AX.25, regardless of whether +you actually use it or not. First setup your /etc/ax25/axports, +for example: + + 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3) + axlink dl0tha-15 38400 255 4 Link to NOS + +Now "ifconfig" the scc device: + + ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9 + +You can now axattach a pseudo-TTY: + + axattach /dev/ptys0 axlink + +and start your NOS and attach /dev/ptys0 there. The problem is that +NOS is reachable only via digipeating through the kernel AX.25 +(disastrous on a DAMA controlled channel). To solve this problem, +configure "rxecho" to echo the incoming frames from "9k6" to "axlink" +and outgoing frames from "axlink" to "9k6" and start: + + rxecho + +Or simply use "kissbridge" coming with z8530drv-utils: + + ifconfig scc3 hw ax25 dl0tha-9 + kissbridge scc3 /dev/ptys0 + + +3. Adjustment and Display of parameters +======================================= + +3.1 Displaying SCC Parameters: +============================== + +Once a SCC channel has been attached, the parameter settings and +some statistic information can be shown using the param program: + +dl1bke-u:~$ sccstat scc0 + +Parameters: + +speed : 1200 baud +txdelay : 36 +persist : 255 +slottime : 0 +txtail : 8 +fulldup : 1 +waittime : 12 +mintime : 3 sec +maxkeyup : 7 sec +idletime : 3 sec +maxdefer : 120 sec +group : 0x00 +txoff : off +softdcd : on +SLIP : off + +Status: + +HDLC Z8530 Interrupts Buffers +----------------------------------------------------------------------- +Sent : 273 RxOver : 0 RxInts : 125074 Size : 384 +Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0 +RxErrors : 1591 ExInts : 11776 +TxErrors : 0 SpInts : 1503 +Tx State : idle + + +The status info shown is: + +Sent - number of frames transmitted +Received - number of frames received +RxErrors - number of receive errors (CRC, ABORT) +TxErrors - number of discarded Tx frames (due to various reasons) +Tx State - status of the Tx interrupt handler: idle/busy/active/tail (2) +RxOver - number of receiver overruns +TxUnder - number of transmitter underruns +RxInts - number of receiver interrupts +TxInts - number of transmitter interrupts +EpInts - number of receiver special condition interrupts +SpInts - number of external/status interrupts +Size - maximum size of an AX.25 frame (*with* AX.25 headers!) +NoSpace - number of times a buffer could not get allocated + +An overrun is abnormal. If lots of these occur, the product of +baudrate and number of interfaces is too high for the processing +power of your computer. NoSpace errors are unlikely to be caused by the +driver or the kernel AX.25. + + +3.2 Setting Parameters +====================== + + +The setting of parameters of the emulated KISS TNC is done in the +same way in the SCC driver. You can change parameters by using +the kissparms program from the ax25-utils package or use the program +"sccparam": + + sccparam <device> <paramname> <decimal-|hexadecimal value> + +You can change the following parameters: + +param : value +------------------------ +speed : 1200 +txdelay : 36 +persist : 255 +slottime : 0 +txtail : 8 +fulldup : 1 +waittime : 12 +mintime : 3 +maxkeyup : 7 +idletime : 3 +maxdefer : 120 +group : 0x00 +txoff : off +softdcd : on +SLIP : off + + +The parameters have the following meaning: + +speed: + The baudrate on this channel in bits/sec + + Example: sccparam /dev/scc3 speed 9600 + +txdelay: + The delay (in units of 10 ms) after keying of the + transmitter, until the first byte is sent. This is usually + called "TXDELAY" in a TNC. When 0 is specified, the driver + will just wait until the CTS signal is asserted. This + assumes the presence of a timer or other circuitry in the + MODEM and/or transmitter, that asserts CTS when the + transmitter is ready for data. + A normal value of this parameter is 30-36. + + Example: sccparam /dev/scc0 txd 20 + +persist: + This is the probability that the transmitter will be keyed + when the channel is found to be free. It is a value from 0 + to 255, and the probability is (value+1)/256. The value + should be somewhere near 50-60, and should be lowered when + the channel is used more heavily. + + Example: sccparam /dev/scc2 persist 20 + +slottime: + This is the time between samples of the channel. It is + expressed in units of 10 ms. About 200-300 ms (value 20-30) + seems to be a good value. + + Example: sccparam /dev/scc0 slot 20 + +tail: + The time the transmitter will remain keyed after the last + byte of a packet has been transferred to the SCC. This is + necessary because the CRC and a flag still have to leave the + SCC before the transmitter is keyed down. The value depends + on the baudrate selected. A few character times should be + sufficient, e.g. 40ms at 1200 baud. (value 4) + The value of this parameter is in 10 ms units. + + Example: sccparam /dev/scc2 4 + +full: + The full-duplex mode switch. This can be one of the following + values: + + 0: The interface will operate in CSMA mode (the normal + half-duplex packet radio operation) + 1: Fullduplex mode, i.e. the transmitter will be keyed at + any time, without checking the received carrier. It + will be unkeyed when there are no packets to be sent. + 2: Like 1, but the transmitter will remain keyed, also + when there are no packets to be sent. Flags will be + sent in that case, until a timeout (parameter 10) + occurs. + + Example: sccparam /dev/scc0 fulldup off + +wait: + The initial waittime before any transmit attempt, after the + frame has been queue for transmit. This is the length of + the first slot in CSMA mode. In full duplex modes it is + set to 0 for maximum performance. + The value of this parameter is in 10 ms units. + + Example: sccparam /dev/scc1 wait 4 + +maxkey: + The maximal time the transmitter will be keyed to send + packets, in seconds. This can be useful on busy CSMA + channels, to avoid "getting a bad reputation" when you are + generating a lot of traffic. After the specified time has + elapsed, no new frame will be started. Instead, the trans- + mitter will be switched off for a specified time (parameter + min), and then the selected algorithm for keyup will be + started again. + The value 0 as well as "off" will disable this feature, + and allow infinite transmission time. + + Example: sccparam /dev/scc0 maxk 20 + +min: + This is the time the transmitter will be switched off when + the maximum transmission time is exceeded. + + Example: sccparam /dev/scc3 min 10 + +idle + This parameter specifies the maximum idle time in full duplex + 2 mode, in seconds. When no frames have been sent for this + time, the transmitter will be keyed down. A value of 0 is + has same result as the fullduplex mode 1. This parameter + can be disabled. + + Example: sccparam /dev/scc2 idle off # transmit forever + +maxdefer + This is the maximum time (in seconds) to wait for a free channel + to send. When this timer expires the transmitter will be keyed + IMMEDIATELY. If you love to get trouble with other users you + should set this to a very low value ;-) + + Example: sccparam /dev/scc0 maxdefer 240 # 2 minutes + + +txoff: + When this parameter has the value 0, the transmission of packets + is enable. Otherwise it is disabled. + + Example: sccparam /dev/scc2 txoff on + +group: + It is possible to build special radio equipment to use more than + one frequency on the same band, e.g. using several receivers and + only one transmitter that can be switched between frequencies. + Also, you can connect several radios that are active on the same + band. In these cases, it is not possible, or not a good idea, to + transmit on more than one frequency. The SCC driver provides a + method to lock transmitters on different interfaces, using the + "param <interface> group <x>" command. This will only work when + you are using CSMA mode (parameter full = 0). + The number <x> must be 0 if you want no group restrictions, and + can be computed as follows to create restricted groups: + <x> is the sum of some OCTAL numbers: + + 200 This transmitter will only be keyed when all other + transmitters in the group are off. + 100 This transmitter will only be keyed when the carrier + detect of all other interfaces in the group is off. + 0xx A byte that can be used to define different groups. + Interfaces are in the same group, when the logical AND + between their xx values is nonzero. + + Examples: + When 2 interfaces use group 201, their transmitters will never be + keyed at the same time. + When 2 interfaces use group 101, the transmitters will only key + when both channels are clear at the same time. When group 301, + the transmitters will not be keyed at the same time. + + Don't forget to convert the octal numbers into decimal before + you set the parameter. + + Example: (to be written) + +softdcd: + use a software dcd instead of the real one... Useful for a very + slow squelch. + + Example: sccparam /dev/scc0 soft on + + +4. Problems +=========== + +If you have tx-problems with your BayCom USCC card please check +the manufacturer of the 8530. SGS chips have a slightly +different timing. Try Zilog... A solution is to write to register 8 +instead to the data port, but this won't work with the ESCC chips. +*SIGH!* + +A very common problem is that the PTT locks until the maxkeyup timer +expires, although interrupts and clock source are correct. In most +cases compiling the driver with CONFIG_SCC_DELAY (set with +make config) solves the problems. For more hints read the (pseudo) FAQ +and the documentation coming with z8530drv-utils. + +I got reports that the driver has problems on some 386-based systems. +(i.e. Amstrad) Those systems have a bogus AT bus timing which will +lead to delayed answers on interrupts. You can recognize these +problems by looking at the output of Sccstat for the suspected +port. If it shows under- and overruns you own such a system. + +Delayed processing of received data: This depends on + +- the kernel version + +- kernel profiling compiled or not + +- a high interrupt load + +- a high load of the machine --- running X, Xmorph, XV and Povray, + while compiling the kernel... hmm ... even with 32 MB RAM ... ;-) + Or running a named for the whole .ampr.org domain on an 8 MB + box... + +- using information from rxecho or kissbridge. + +Kernel panics: please read /linux/README and find out if it +really occurred within the scc driver. + +If you cannot solve a problem, send me + +- a description of the problem, +- information on your hardware (computer system, scc board, modem) +- your kernel version +- the output of cat /proc/net/z8530 + +4. Thor RLC100 +============== + +Mysteriously this board seems not to work with the driver. Anyone +got it up-and-running? + + +Many thanks to Linus Torvalds and Alan Cox for including the driver +in the Linux standard distribution and their support. + +Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org + AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU + Internet: jreuter@yaina.de + WWW : http://yaina.de/jreuter |