diff options
Diffstat (limited to 'man7/packet.7')
-rw-r--r-- | man7/packet.7 | 694 |
1 files changed, 694 insertions, 0 deletions
diff --git a/man7/packet.7 b/man7/packet.7 new file mode 100644 index 0000000..b2a264c --- /dev/null +++ b/man7/packet.7 @@ -0,0 +1,694 @@ +.\" SPDX-License-Identifier: Linux-man-pages-1-para +.\" +.\" This man page is Copyright (C) 1999 Andi Kleen <ak@muc.de>. +.\" +.\" $Id: packet.7,v 1.13 2000/08/14 08:03:45 ak Exp $ +.\" +.TH packet 7 2023-07-15 "Linux man-pages 6.05.01" +.SH NAME +packet \- packet interface on device level +.SH SYNOPSIS +.nf +.B #include <sys/socket.h> +.B #include <linux/if_packet.h> +.B #include <net/ethernet.h> /* the L2 protocols */ +.PP +.BI "packet_socket = socket(AF_PACKET, int " socket_type ", int "protocol ); +.fi +.SH DESCRIPTION +Packet sockets are used to receive or send raw packets at the device driver +(OSI Layer 2) level. +They allow the user to implement protocol modules in user space +on top of the physical layer. +.PP +The +.I socket_type +is either +.B SOCK_RAW +for raw packets including the link-level header or +.B SOCK_DGRAM +for cooked packets with the link-level header removed. +The link-level header information is available in a common format in a +.I sockaddr_ll +structure. +.I protocol +is the IEEE 802.3 protocol number in network byte order. +See the +.I <linux/if_ether.h> +include file for a list of allowed protocols. +When protocol +is set to +.BR htons(ETH_P_ALL) , +then all protocols are received. +All incoming packets of that protocol type will be passed to the packet +socket before they are passed to the protocols implemented in the kernel. +If +.I protocol +is set to zero, +no packets are received. +.BR bind (2) +can optionally be called with a nonzero +.I sll_protocol +to start receiving packets for the protocols specified. +.PP +In order to create a packet socket, a process must have the +.B CAP_NET_RAW +capability in the user namespace that governs its network namespace. +.PP +.B SOCK_RAW +packets are passed to and from the device driver without any changes in +the packet data. +When receiving a packet, the address is still parsed and +passed in a standard +.I sockaddr_ll +address structure. +When transmitting a packet, the user-supplied buffer +should contain the physical-layer header. +That packet is then +queued unmodified to the network driver of the interface defined by the +destination address. +Some device drivers always add other headers. +.B SOCK_RAW +is similar to but not compatible with the obsolete +.B AF_INET/SOCK_PACKET +of Linux 2.0. +.PP +.B SOCK_DGRAM +operates on a slightly higher level. +The physical header is removed before the packet is passed to the user. +Packets sent through a +.B SOCK_DGRAM +packet socket get a suitable physical-layer header based on the +information in the +.I sockaddr_ll +destination address before they are queued. +.PP +By default, all packets of the specified protocol type +are passed to a packet socket. +To get packets only from a specific interface use +.BR bind (2) +specifying an address in a +.I struct sockaddr_ll +to bind the packet socket to an interface. +Fields used for binding are +.I sll_family +(should be +.BR AF_PACKET ), +.IR sll_protocol , +and +.IR sll_ifindex . +.PP +The +.BR connect (2) +operation is not supported on packet sockets. +.PP +When the +.B MSG_TRUNC +flag is passed to +.BR recvmsg (2), +.BR recv (2), +or +.BR recvfrom (2), +the real length of the packet on the wire is always returned, +even when it is longer than the buffer. +.SS Address types +The +.I sockaddr_ll +structure is a device-independent physical-layer address. +.PP +.in +4n +.EX +struct sockaddr_ll { + unsigned short sll_family; /* Always AF_PACKET */ + unsigned short sll_protocol; /* Physical\-layer protocol */ + int sll_ifindex; /* Interface number */ + unsigned short sll_hatype; /* ARP hardware type */ + unsigned char sll_pkttype; /* Packet type */ + unsigned char sll_halen; /* Length of address */ + unsigned char sll_addr[8]; /* Physical\-layer address */ +}; +.EE +.in +.PP +The fields of this structure are as follows: +.TP +.I sll_protocol +is the standard ethernet protocol type in network byte order as defined +in the +.I <linux/if_ether.h> +include file. +It defaults to the socket's protocol. +.TP +.I sll_ifindex +is the interface index of the interface +(see +.BR netdevice (7)); +0 matches any interface (only permitted for binding). +.I sll_hatype +is an ARP type as defined in the +.I <linux/if_arp.h> +include file. +.TP +.I sll_pkttype +contains the packet type. +Valid types are +.B PACKET_HOST +for a packet addressed to the local host, +.B PACKET_BROADCAST +for a physical-layer broadcast packet, +.B PACKET_MULTICAST +for a packet sent to a physical-layer multicast address, +.B PACKET_OTHERHOST +for a packet to some other host that has been caught by a device driver +in promiscuous mode, and +.B PACKET_OUTGOING +for a packet originating from the local host that is looped back to a packet +socket. +These types make sense only for receiving. +.TP +.I sll_addr +.TQ +.I sll_halen +contain the physical-layer (e.g., IEEE 802.3) address and its length. +The exact interpretation depends on the device. +.PP +When you send packets, it is enough to specify +.IR sll_family , +.IR sll_addr , +.IR sll_halen , +.IR sll_ifindex , +and +.IR sll_protocol . +The other fields should be 0. +.I sll_hatype +and +.I sll_pkttype +are set on received packets for your information. +.SS Socket options +Packet socket options are configured by calling +.BR setsockopt (2) +with level +.BR SOL_PACKET . +.TP +.B PACKET_ADD_MEMBERSHIP +.PD 0 +.TP +.B PACKET_DROP_MEMBERSHIP +.PD +Packet sockets can be used to configure physical-layer multicasting +and promiscuous mode. +.B PACKET_ADD_MEMBERSHIP +adds a binding and +.B PACKET_DROP_MEMBERSHIP +drops it. +They both expect a +.I packet_mreq +structure as argument: +.IP +.in +4n +.EX +struct packet_mreq { + int mr_ifindex; /* interface index */ + unsigned short mr_type; /* action */ + unsigned short mr_alen; /* address length */ + unsigned char mr_address[8]; /* physical\-layer address */ +}; +.EE +.in +.IP +.I mr_ifindex +contains the interface index for the interface whose status +should be changed. +The +.I mr_type +field specifies which action to perform. +.B PACKET_MR_PROMISC +enables receiving all packets on a shared medium (often known as +"promiscuous mode"), +.B PACKET_MR_MULTICAST +binds the socket to the physical-layer multicast group specified in +.I mr_address +and +.IR mr_alen , +and +.B PACKET_MR_ALLMULTI +sets the socket up to receive all multicast packets arriving at +the interface. +.IP +In addition, the traditional ioctls +.BR SIOCSIFFLAGS , +.BR SIOCADDMULTI , +.B SIOCDELMULTI +can be used for the same purpose. +.TP +.BR PACKET_AUXDATA " (since Linux 2.6.21)" +.\" commit 8dc4194474159660d7f37c495e3fc3f10d0db8cc +If this binary option is enabled, the packet socket passes a metadata +structure along with each packet in the +.BR recvmsg (2) +control field. +The structure can be read with +.BR cmsg (3). +It is defined as +.IP +.in +4n +.EX +struct tpacket_auxdata { + __u32 tp_status; + __u32 tp_len; /* packet length */ + __u32 tp_snaplen; /* captured length */ + __u16 tp_mac; + __u16 tp_net; + __u16 tp_vlan_tci; + __u16 tp_vlan_tpid; /* Since Linux 3.14; earlier, these + were unused padding bytes */ +.\" commit a0cdfcf39362410d5ea983f4daf67b38de129408 added tp_vlan_tpid +}; +.EE +.in +.TP +.BR PACKET_FANOUT " (since Linux 3.1)" +.\" commit dc99f600698dcac69b8f56dda9a8a00d645c5ffc +To scale processing across threads, packet sockets can form a fanout +group. +In this mode, each matching packet is enqueued onto only one +socket in the group. +A socket joins a fanout group by calling +.BR setsockopt (2) +with level +.B SOL_PACKET +and option +.BR PACKET_FANOUT . +Each network namespace can have up to 65536 independent groups. +A socket selects a group by encoding the ID in the first 16 bits of +the integer option value. +The first packet socket to join a group implicitly creates it. +To successfully join an existing group, subsequent packet sockets +must have the same protocol, device settings, fanout mode, and +flags (see below). +Packet sockets can leave a fanout group only by closing the socket. +The group is deleted when the last socket is closed. +.IP +Fanout supports multiple algorithms to spread traffic between sockets, +as follows: +.RS +.IP \[bu] 3 +The default mode, +.BR PACKET_FANOUT_HASH , +sends packets from the same flow to the same socket to maintain +per-flow ordering. +For each packet, it chooses a socket by taking the packet flow hash +modulo the number of sockets in the group, where a flow hash is a hash +over network-layer address and optional transport-layer port fields. +.IP \[bu] +The load-balance mode +.B PACKET_FANOUT_LB +implements a round-robin algorithm. +.IP \[bu] +.B PACKET_FANOUT_CPU +selects the socket based on the CPU that the packet arrived on. +.IP \[bu] +.B PACKET_FANOUT_ROLLOVER +processes all data on a single socket, moving to the next when one +becomes backlogged. +.IP \[bu] +.B PACKET_FANOUT_RND +selects the socket using a pseudo-random number generator. +.IP \[bu] +.B PACKET_FANOUT_QM +.\" commit 2d36097d26b5991d71a2cf4a20c1a158f0f1bfcd +(available since Linux 3.14) +selects the socket using the recorded queue_mapping of the received skb. +.RE +.IP +Fanout modes can take additional options. +IP fragmentation causes packets from the same flow to have different +flow hashes. +The flag +.BR PACKET_FANOUT_FLAG_DEFRAG , +if set, causes packets to be defragmented before fanout is applied, to +preserve order even in this case. +Fanout mode and options are communicated in the second 16 bits of the +integer option value. +The flag +.B PACKET_FANOUT_FLAG_ROLLOVER +enables the roll over mechanism as a backup strategy: if the +original fanout algorithm selects a backlogged socket, the packet +rolls over to the next available one. +.TP +.BR PACKET_LOSS " (with " PACKET_TX_RING ) +When a malformed packet is encountered on a transmit ring, +the default is to reset its +.I tp_status +to +.B TP_STATUS_WRONG_FORMAT +and abort the transmission immediately. +The malformed packet blocks itself and subsequently enqueued packets from +being sent. +The format error must be fixed, the associated +.I tp_status +reset to +.BR TP_STATUS_SEND_REQUEST , +and the transmission process restarted via +.BR send (2). +However, if +.B PACKET_LOSS +is set, any malformed packet will be skipped, its +.I tp_status +reset to +.BR TP_STATUS_AVAILABLE , +and the transmission process continued. +.TP +.BR PACKET_RESERVE " (with " PACKET_RX_RING ) +By default, a packet receive ring writes packets immediately following the +metadata structure and alignment padding. +This integer option reserves additional headroom. +.TP +.B PACKET_RX_RING +Create a memory-mapped ring buffer for asynchronous packet reception. +The packet socket reserves a contiguous region of application address +space, lays it out into an array of packet slots and copies packets +(up to +.IR tp_snaplen ) +into subsequent slots. +Each packet is preceded by a metadata structure similar to +.IR tpacket_auxdata . +The protocol fields encode the offset to the data +from the start of the metadata header. +.I tp_net +stores the offset to the network layer. +If the packet socket is of type +.BR SOCK_DGRAM , +then +.I tp_mac +is the same. +If it is of type +.BR SOCK_RAW , +then that field stores the offset to the link-layer frame. +Packet socket and application communicate the head and tail of the ring +through the +.I tp_status +field. +The packet socket owns all slots with +.I tp_status +equal to +.BR TP_STATUS_KERNEL . +After filling a slot, it changes the status of the slot to transfer +ownership to the application. +During normal operation, the new +.I tp_status +value has at least the +.B TP_STATUS_USER +bit set to signal that a received packet has been stored. +When the application has finished processing a packet, it transfers +ownership of the slot back to the socket by setting +.I tp_status +equal to +.BR TP_STATUS_KERNEL . +.IP +Packet sockets implement multiple variants of the packet ring. +The implementation details are described in +.I Documentation/networking/packet_mmap.rst +in the Linux kernel source tree. +.TP +.B PACKET_STATISTICS +Retrieve packet socket statistics in the form of a structure +.IP +.in +4n +.EX +struct tpacket_stats { + unsigned int tp_packets; /* Total packet count */ + unsigned int tp_drops; /* Dropped packet count */ +}; +.EE +.in +.IP +Receiving statistics resets the internal counters. +The statistics structure differs when using a ring of variant +.BR TPACKET_V3 . +.TP +.BR PACKET_TIMESTAMP " (with " PACKET_RX_RING "; since Linux 2.6.36)" +.\" commit 614f60fa9d73a9e8fdff3df83381907fea7c5649 +The packet receive ring always stores a timestamp in the metadata header. +By default, this is a software generated timestamp generated when the +packet is copied into the ring. +This integer option selects the type of timestamp. +Besides the default, it support the two hardware formats described in +.I Documentation/networking/timestamping.rst +in the Linux kernel source tree. +.TP +.BR PACKET_TX_RING " (since Linux 2.6.31)" +.\" commit 69e3c75f4d541a6eb151b3ef91f34033cb3ad6e1 +Create a memory-mapped ring buffer for packet transmission. +This option is similar to +.B PACKET_RX_RING +and takes the same arguments. +The application writes packets into slots with +.I tp_status +equal to +.B TP_STATUS_AVAILABLE +and schedules them for transmission by changing +.I tp_status +to +.BR TP_STATUS_SEND_REQUEST . +When packets are ready to be transmitted, the application calls +.BR send (2) +or a variant thereof. +The +.I buf +and +.I len +fields of this call are ignored. +If an address is passed using +.BR sendto (2) +or +.BR sendmsg (2), +then that overrides the socket default. +On successful transmission, the socket resets +.I tp_status +to +.BR TP_STATUS_AVAILABLE . +It immediately aborts the transmission on error unless +.B PACKET_LOSS +is set. +.TP +.BR PACKET_VERSION " (with " PACKET_RX_RING "; since Linux 2.6.27)" +.\" commit bbd6ef87c544d88c30e4b762b1b61ef267a7d279 +By default, +.B PACKET_RX_RING +creates a packet receive ring of variant +.BR TPACKET_V1 . +To create another variant, configure the desired variant by setting this +integer option before creating the ring. +.TP +.BR PACKET_QDISC_BYPASS " (since Linux 3.14)" +.\" commit d346a3fae3ff1d99f5d0c819bf86edf9094a26a1 +By default, packets sent through packet sockets pass through the kernel's +qdisc (traffic control) layer, which is fine for the vast majority of use +cases. +For traffic generator appliances using packet sockets +that intend to brute-force flood the network\[em]for example, +to test devices under load in a similar +fashion to pktgen\[em]this layer can be bypassed by setting +this integer option to 1. +A side effect is that packet buffering in the qdisc layer is avoided, +which will lead to increased drops when network +device transmit queues are busy; +therefore, use at your own risk. +.SS Ioctls +.B SIOCGSTAMP +can be used to receive the timestamp of the last received packet. +Argument is a +.I struct timeval +variable. +.\" FIXME Document SIOCGSTAMPNS +.PP +In addition, all standard ioctls defined in +.BR netdevice (7) +and +.BR socket (7) +are valid on packet sockets. +.SS Error handling +Packet sockets do no error handling other than errors occurred +while passing the packet to the device driver. +They don't have the concept of a pending error. +.SH ERRORS +.TP +.B EADDRNOTAVAIL +Unknown multicast group address passed. +.TP +.B EFAULT +User passed invalid memory address. +.TP +.B EINVAL +Invalid argument. +.TP +.B EMSGSIZE +Packet is bigger than interface MTU. +.TP +.B ENETDOWN +Interface is not up. +.TP +.B ENOBUFS +Not enough memory to allocate the packet. +.TP +.B ENODEV +Unknown device name or interface index specified in interface address. +.TP +.B ENOENT +No packet received. +.TP +.B ENOTCONN +No interface address passed. +.TP +.B ENXIO +Interface address contained an invalid interface index. +.TP +.B EPERM +User has insufficient privileges to carry out this operation. +.PP +In addition, other errors may be generated by the low-level driver. +.SH VERSIONS +.B AF_PACKET +is a new feature in Linux 2.2. +Earlier Linux versions supported only +.BR SOCK_PACKET . +.SH NOTES +For portable programs it is suggested to use +.B AF_PACKET +via +.BR pcap (3); +although this covers only a subset of the +.B AF_PACKET +features. +.PP +The +.B SOCK_DGRAM +packet sockets make no attempt to create or parse the IEEE 802.2 LLC +header for a IEEE 802.3 frame. +When +.B ETH_P_802_3 +is specified as protocol for sending the kernel creates the +802.3 frame and fills out the length field; the user has to supply the LLC +header to get a fully conforming packet. +Incoming 802.3 packets are not multiplexed on the DSAP/SSAP protocol +fields; instead they are supplied to the user as protocol +.B ETH_P_802_2 +with the LLC header prefixed. +It is thus not possible to bind to +.BR ETH_P_802_3 ; +bind to +.B ETH_P_802_2 +instead and do the protocol multiplex yourself. +The default for sending is the standard Ethernet DIX +encapsulation with the protocol filled in. +.PP +Packet sockets are not subject to the input or output firewall chains. +.SS Compatibility +In Linux 2.0, the only way to get a packet socket was with the call: +.PP +.in +4n +.EX +socket(AF_INET, SOCK_PACKET, protocol) +.EE +.in +.PP +This is still supported, but deprecated and strongly discouraged. +The main difference between the two methods is that +.B SOCK_PACKET +uses the old +.I struct sockaddr_pkt +to specify an interface, which doesn't provide physical-layer +independence. +.PP +.in +4n +.EX +struct sockaddr_pkt { + unsigned short spkt_family; + unsigned char spkt_device[14]; + unsigned short spkt_protocol; +}; +.EE +.in +.PP +.I spkt_family +contains +the device type, +.I spkt_protocol +is the IEEE 802.3 protocol type as defined in +.I <sys/if_ether.h> +and +.I spkt_device +is the device name as a null-terminated string, for example, eth0. +.PP +This structure is obsolete and should not be used in new code. +.SH BUGS +.SS LLC header handling +The IEEE 802.2/803.3 LLC handling could be considered as a bug. +.SS MSG_TRUNC issues +The +.B MSG_TRUNC +.BR recvmsg (2) +extension is an ugly hack and should be replaced by a control message. +There is currently no way to get the original destination address of +packets via +.BR SOCK_DGRAM . +.SS spkt_device device name truncation +The +.I spkt_device +field of +.I sockaddr_pkt +has a size of 14 bytes, +which is less than the constant +.B IFNAMSIZ +defined in +.I <net/if.h> +which is 16 bytes and describes the system limit for a network interface name. +This means the names of network devices longer than 14 bytes +will be truncated to fit into +.IR spkt_device . +All these lengths include the terminating null byte (\[aq]\e0\[aq])). +.PP +Issues from this with old code typically show up with +very long interface names used by the +.B Predictable Network Interface Names +feature enabled by default in many modern Linux distributions. +.PP +The preferred solution is to rewrite code to avoid +.BR SOCK_PACKET . +Possible user solutions are to disable +.B Predictable Network Interface Names +or to rename the interface to a name of at most 13 bytes, +for example using the +.BR ip (8) +tool. +.SS Documentation issues +Socket filters are not documented. +.\" .SH CREDITS +.\" This man page was written by Andi Kleen with help from Matthew Wilcox. +.\" AF_PACKET in Linux 2.2 was implemented +.\" by Alexey Kuznetsov, based on code by Alan Cox and others. +.SH SEE ALSO +.BR socket (2), +.BR pcap (3), +.BR capabilities (7), +.BR ip (7), +.BR raw (7), +.BR socket (7), +.BR ip (8), +.PP +RFC\ 894 for the standard IP Ethernet encapsulation. +RFC\ 1700 for the IEEE 802.3 IP encapsulation. +.PP +The +.I <linux/if_ether.h> +include file for physical-layer protocols. +.PP +The Linux kernel source tree. +.I Documentation/networking/filter.rst +describes how to apply Berkeley Packet Filters to packet sockets. +.I tools/testing/selftests/net/psock_tpacket.c +contains example source code for all available versions of +.B PACKET_RX_RING +and +.BR PACKET_TX_RING . |