Diffstat (limited to 'man7/socket.7')
-rw-r--r-- | man7/socket.7 | 1266 |
1 files changed, 1266 insertions, 0 deletions
diff --git a/man7/socket.7 b/man7/socket.7 new file mode 100644 index 0000000..2cc24d9 --- /dev/null +++ b/man7/socket.7 @@ -0,0 +1,1266 @@ +'\" t +.\" SPDX-License-Identifier: Linux-man-pages-1-para +.\" +.\" This man page is Copyright (C) 1999 Andi Kleen <ak@muc.de>. +.\" and copyright (c) 1999 Matthew Wilcox. +.\" +.\" 2002-10-30, Michael Kerrisk, <mtk.manpages@gmail.com> +.\" Added description of SO_ACCEPTCONN +.\" 2004-05-20, aeb, added SO_RCVTIMEO/SO_SNDTIMEO text. +.\" Modified, 27 May 2004, Michael Kerrisk <mtk.manpages@gmail.com> +.\" Added notes on capability requirements +.\" A few small grammar fixes +.\" 2010-06-13 Jan Engelhardt <jengelh@medozas.de> +.\" Documented SO_DOMAIN and SO_PROTOCOL. +.\" +.\" FIXME +.\" The following are not yet documented: +.\" +.\" SO_PEERNAME (2.4?) +.\" get only +.\" Seems to do something similar to getpeername(), but then +.\" why is it necessary / how does it differ? +.\" +.\" SO_TIMESTAMPING (2.6.30) +.\" Documentation/networking/timestamping.txt +.\" commit cb9eff097831007afb30d64373f29d99825d0068 +.\" Author: Patrick Ohly <patrick.ohly@intel.com> +.\" +.\" SO_WIFI_STATUS (3.3) +.\" commit 6e3e939f3b1bf8534b32ad09ff199d88800835a0 +.\" Author: Johannes Berg <johannes.berg@intel.com> +.\" Also: SCM_WIFI_STATUS +.\" +.\" SO_NOFCS (3.4) +.\" commit 3bdc0eba0b8b47797f4a76e377dd8360f317450f +.\" Author: Ben Greear <greearb@candelatech.com> +.\" +.\" SO_GET_FILTER (3.8) +.\" commit a8fc92778080c845eaadc369a0ecf5699a03bef0 +.\" Author: Pavel Emelyanov <xemul@parallels.com> +.\" +.\" SO_MAX_PACING_RATE (3.13) +.\" commit 62748f32d501f5d3712a7c372bbb92abc7c62bc7 +.\" Author: Eric Dumazet <edumazet@google.com> +.\" +.\" SO_BPF_EXTENSIONS (3.14) +.\" commit ea02f9411d9faa3553ed09ce0ec9f00ceae9885e +.\" Author: Michal Sekletar <msekleta@redhat.com> +.\" +.TH socket 7 2023-07-15 "Linux man-pages 6.05.01" +.SH NAME +socket \- Linux socket interface +.SH SYNOPSIS +.nf +.B #include <sys/socket.h> +.PP +.IB sockfd " = socket(int " 
socket_family ", int " socket_type ", int " protocol ); +.fi +.SH DESCRIPTION +This manual page describes the Linux networking socket layer user +interface. +The BSD compatible sockets +are the uniform interface +between the user process and the network protocol stacks in the kernel. +The protocol modules are grouped into +.I protocol families +such as +.BR AF_INET ", " AF_IPX ", and " AF_PACKET , +and +.I socket types +such as +.B SOCK_STREAM +or +.BR SOCK_DGRAM . +See +.BR socket (2) +for more information on families and types. +.SS Socket-layer functions +These functions are used by the user process to send or receive packets +and to do other socket operations. +For more information, see their respective manual pages. +.PP +.BR socket (2) +creates a socket, +.BR connect (2) +connects a socket to a remote socket address, +the +.BR bind (2) +function binds a socket to a local socket address, +.BR listen (2) +tells the socket that new connections shall be accepted, and +.BR accept (2) +is used to get a new socket with a new incoming connection. +.BR socketpair (2) +returns two connected anonymous sockets (implemented only for a few +local families like +.BR AF_UNIX ) +.PP +.BR send (2), +.BR sendto (2), +and +.BR sendmsg (2) +send data over a socket, and +.BR recv (2), +.BR recvfrom (2), +.BR recvmsg (2) +receive data from a socket. +.BR poll (2) +and +.BR select (2) +wait for arriving data or a readiness to send data. +In addition, the standard I/O operations like +.BR write (2), +.BR writev (2), +.BR sendfile (2), +.BR read (2), +and +.BR readv (2) +can be used to read and write data. +.PP +.BR getsockname (2) +returns the local socket address and +.BR getpeername (2) +returns the remote socket address. +.BR getsockopt (2) +and +.BR setsockopt (2) +are used to set or get socket layer or protocol options. +.BR ioctl (2) +can be used to set or read some other options. +.PP +.BR close (2) +is used to close a socket. 
+.BR shutdown (2) +closes parts of a full-duplex socket connection. +.PP +Seeking, or calling +.BR pread (2) +or +.BR pwrite (2) +with a nonzero position is not supported on sockets. +.PP +It is possible to do nonblocking I/O on sockets by setting the +.B O_NONBLOCK +flag on a socket file descriptor using +.BR fcntl (2). +Then all operations that would block will (usually) +return with +.B EAGAIN +(operation should be retried later); +.BR connect (2) +will return +.B EINPROGRESS +error. +The user can then wait for various events via +.BR poll (2) +or +.BR select (2). +.TS +tab(:) allbox; +c s s +l l lx. +I/O events +Event:Poll flag:Occurrence +Read:POLLIN:T{ +New data arrived. +T} +Read:POLLIN:T{ +A connection setup has been completed +(for connection-oriented sockets) +T} +Read:POLLHUP:T{ +A disconnection request has been initiated by the other end. +T} +Read:POLLHUP:T{ +A connection is broken (only for connection-oriented protocols). +When the socket is written +.B SIGPIPE +is also sent. +T} +Write:POLLOUT:T{ +Socket has enough send buffer space for writing new data. +T} +Read/Write:T{ +POLLIN | +.br +POLLOUT +T}:T{ +An outgoing +.BR connect (2) +finished. +T} +Read/Write:POLLERR:T{ +An asynchronous error occurred. +T} +Read/Write:POLLHUP:T{ +The other end has shut down one direction. +T} +Exception:POLLPRI:T{ +Urgent data arrived. +.B SIGURG +is sent then. +T} +.\" FIXME . The following is not true currently: +.\" It is no I/O event when the connection +.\" is broken from the local end using +.\" .BR shutdown (2) +.\" or +.\" .BR close (2). +.TE +.PP +An alternative to +.BR poll (2) +and +.BR select (2) +is to let the kernel inform the application about events +via a +.B SIGIO +signal. +For that the +.B O_ASYNC +flag must be set on a socket file descriptor via +.BR fcntl (2) +and a valid signal handler for +.B SIGIO +must be installed via +.BR sigaction (2). +See the +.I Signals +discussion below. 
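The nonblocking pattern described above (set O_NONBLOCK with fcntl(2), treat EAGAIN as "retry later", then wait with poll(2)) can be sketched as follows. This is a minimal illustration, not part of the man page; the helper names set_nonblocking, try_recv, and wait_readable, and the -2 "would block" return value, are assumptions made for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Set O_NONBLOCK on fd, preserving the other file status flags.
   Returns 0 on success, -1 on error (errno set by fcntl). */
int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Receive on a nonblocking socket.  Returns the byte count, 0 on
   end-of-file, -1 on error, or -2 (a convention of this sketch)
   if the call would have blocked. */
ssize_t
try_recv(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}

/* Wait up to timeout_ms milliseconds for fd to become readable.
   Returns 1 if POLLIN is reported, 0 on timeout or if only other
   events (such as POLLHUP) are reported, -1 on error. */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n == -1)
        return -1;
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A caller would typically loop: call try_recv(), and when it reports that the call would block, call wait_readable() before trying again.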
+.SS Socket address structures +Each socket domain has its own format for socket addresses, +with a domain-specific address structure. +Each of these structures begins with an +integer "family" field (typed as +.IR sa_family_t ) +that indicates the type of the address structure. +This allows +the various system calls (e.g., +.BR connect (2), +.BR bind (2), +.BR accept (2), +.BR getsockname (2), +.BR getpeername (2)), +which are generic to all socket domains, +to determine the domain of a particular socket address. +.PP +To allow any type of socket address to be passed to +interfaces in the sockets API, +the type +.I struct sockaddr +is defined. +The purpose of this type is purely to allow casting of +domain-specific socket address types to a "generic" type, +so as to avoid compiler warnings about type mismatches in +calls to the sockets API. +.PP +In addition, the sockets API provides the data type +.IR "struct sockaddr_storage". +This type +is suitable to accommodate all supported domain-specific socket +address structures; it is large enough and is aligned properly. +(In particular, it is large enough to hold +IPv6 socket addresses.) +The structure includes the following field, which can be used to identify +the type of socket address actually stored in the structure: +.PP +.in +4n +.EX + sa_family_t ss_family; +.EE +.in +.PP +The +.I sockaddr_storage +structure is useful in programs that must handle socket addresses +in a generic way +(e.g., programs that must deal with both IPv4 and IPv6 socket addresses). +.SS Socket options +The socket options listed below can be set by using +.BR setsockopt (2) +and read with +.BR getsockopt (2) +with the socket level set to +.B SOL_SOCKET +for all sockets. +Unless otherwise noted, +.I optval +is a pointer to an +.IR int . +.\" FIXME . 
+.\" In the list below, the text used to describe argument types +.\" for each socket option should be more consistent +.\" +.\" SO_ACCEPTCONN is in POSIX.1-2001, and its origin is explained in +.\" W R Stevens, UNPv1 +.TP +.B SO_ACCEPTCONN +Returns a value indicating whether or not this socket has been marked +to accept connections with +.BR listen (2). +The value 0 indicates that this is not a listening socket, +the value 1 indicates that this is a listening socket. +This socket option is read-only. +.TP +.BR SO_ATTACH_FILTER " (since Linux 2.2), " SO_ATTACH_BPF " (since Linux 3.19)" +Attach a classic BPF +.RB ( SO_ATTACH_FILTER ) +or an extended BPF +.RB ( SO_ATTACH_BPF ) +program to the socket for use as a filter of incoming packets. +A packet will be dropped if the filter program returns zero. +If the filter program returns a +nonzero value which is less than the packet's data length, +the packet will be truncated to the length returned. +If the value returned by the filter is greater than or equal to the +packet's data length, the packet is allowed to proceed unmodified. +.IP +The argument for +.B SO_ATTACH_FILTER +is a +.I sock_fprog +structure, defined in +.IR <linux/filter.h> : +.IP +.in +4n +.EX +struct sock_fprog { + unsigned short len; + struct sock_filter *filter; +}; +.EE +.in +.IP +The argument for +.B SO_ATTACH_BPF +is a file descriptor returned by the +.BR bpf (2) +system call and must refer to a program of type +.BR BPF_PROG_TYPE_SOCKET_FILTER . +.IP +These options may be set multiple times for a given socket, +each time replacing the previous filter program. +The classic and extended versions may be called on the same socket, +but the previous filter will always be replaced such that a socket +never has more than one filter defined. 
+.IP +Both classic and extended BPF are explained in the kernel source file +.I Documentation/networking/filter.txt +.TP +.BR SO_ATTACH_REUSEPORT_CBPF ", " SO_ATTACH_REUSEPORT_EBPF +For use with the +.B SO_REUSEPORT +option, these options allow the user to set a classic BPF +.RB ( SO_ATTACH_REUSEPORT_CBPF ) +or an extended BPF +.RB ( SO_ATTACH_REUSEPORT_EBPF ) +program which defines how packets are assigned to +the sockets in the reuseport group (that is, all sockets which have +.B SO_REUSEPORT +set and are using the same local address to receive packets). +.IP +The BPF program must return an index between 0 and N\-1 representing +the socket which should receive the packet +(where N is the number of sockets in the group). +If the BPF program returns an invalid index, +socket selection will fall back to the plain +.B SO_REUSEPORT +mechanism. +.IP +Sockets are numbered in the order in which they are added to the group +(that is, the order of +.BR bind (2) +calls for UDP sockets or the order of +.BR listen (2) +calls for TCP sockets). +New sockets added to a reuseport group will inherit the BPF program. +When a socket is removed from a reuseport group (via +.BR close (2)), +the last socket in the group will be moved into the closed socket's +position. +.IP +These options may be set repeatedly at any time on any socket in the group +to replace the current BPF program used by all sockets in the group. +.IP +.B SO_ATTACH_REUSEPORT_CBPF +takes the same argument type as +.B SO_ATTACH_FILTER +and +.B SO_ATTACH_REUSEPORT_EBPF +takes the same argument type as +.BR SO_ATTACH_BPF . +.IP +UDP support for this feature is available since Linux 4.5; +TCP support is available since Linux 4.6. +.TP +.B SO_BINDTODEVICE +Bind this socket to a particular device like \[lq]eth0\[rq], +as specified in the passed interface name. +If the +name is an empty string or the option length is zero, the socket device +binding is removed. 
+The passed option is a variable-length null-terminated +interface name string with the maximum size of +.BR IFNAMSIZ . +If a socket is bound to an interface, +only packets received from that particular interface are processed by the +socket. +Note that this works only for some socket types, particularly +.B AF_INET +sockets. +It is not supported for packet sockets (use normal +.BR bind (2) +there). +.IP +Before Linux 3.8, +this socket option could be set, but could not be retrieved with +.BR getsockopt (2). +Since Linux 3.8, it is readable. +The +.I optlen +argument should contain the buffer size available +to receive the device name and is recommended to be +.B IFNAMSIZ +bytes. +The real device name length is reported back in the +.I optlen +argument. +.TP +.B SO_BROADCAST +Set or get the broadcast flag. +When enabled, datagram sockets are allowed to send +packets to a broadcast address. +This option has no effect on stream-oriented sockets. +.TP +.B SO_BSDCOMPAT +Enable BSD bug-to-bug compatibility. +This is used by the UDP protocol module in Linux 2.0 and 2.2. +If enabled, ICMP errors received for a UDP socket will not be passed +to the user program. +In later kernel versions, support for this option has been phased out: +Linux 2.4 silently ignores it, and Linux 2.6 generates a kernel warning +(printk()) if a program uses this option. +Linux 2.0 also enabled BSD bug-to-bug compatibility +options (random header changing, skipping of the broadcast flag) for raw +sockets with this option, but that was removed in Linux 2.2. +.TP +.B SO_DEBUG +Enable socket debugging. +Allowed only for processes with the +.B CAP_NET_ADMIN +capability or an effective user ID of 0. +.TP +.BR SO_DETACH_FILTER " (since Linux 2.2), " SO_DETACH_BPF " (since Linux 3.19)" +These two options, which are synonyms, +may be used to remove the classic or extended BPF +program attached to a socket with either +.B SO_ATTACH_FILTER +or +.BR SO_ATTACH_BPF . +The option value is ignored. 
+.TP +.BR SO_DOMAIN " (since Linux 2.6.32)" +Retrieves the socket domain as an integer, returning a value such as +.BR AF_INET6 . +See +.BR socket (2) +for details. +This socket option is read-only. +.TP +.B SO_ERROR +Get and clear the pending socket error. +This socket option is read-only. +Expects an integer. +.TP +.B SO_DONTROUTE +Don't send via a gateway, send only to directly connected hosts. +The same effect can be achieved by setting the +.B MSG_DONTROUTE +flag on a socket +.BR send (2) +operation. +Expects an integer boolean flag. +.TP +.BR SO_INCOMING_CPU " (gettable since Linux 3.19, settable since Linux 4.4)" +.\" getsockopt 2c8c56e15df3d4c2af3d656e44feb18789f75837 +.\" setsockopt 70da268b569d32a9fddeea85dc18043de9d89f89 +Sets or gets the CPU affinity of a socket. +Expects an integer flag. +.IP +.in +4n +.EX +int cpu = 1; +setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, + sizeof(cpu)); +.EE +.in +.IP +Because all of the packets for a single stream +(i.e., all packets for the same 4-tuple) +arrive on the single RX queue that is associated with a particular CPU, +the typical use case is to employ one listening process per RX queue, +with the incoming flow being handled by a listener +on the same CPU that is handling the RX queue. +This provides optimal NUMA behavior and keeps CPU caches hot. +.\" +.\" From an email conversation with Eric Dumazet: +.\" >> Note that setting the option is not supported if SO_REUSEPORT is used. +.\" > +.\" > Please define "not supported". Does this yield an API diagnostic? +.\" > If so, what is it? +.\" > +.\" >> Socket will be selected from an array, either by a hash or BPF program +.\" >> that has no access to this information. +.\" > +.\" > Sorry -- I'm lost here. How does this comment relate to the proposed +.\" > man page text above? 
+.\" +.\" Simply that : +.\" +.\" If an application uses both SO_INCOMING_CPU and SO_REUSEPORT, then +.\" SO_REUSEPORT logic, selecting the socket to receive the packet, ignores +.\" SO_INCOMING_CPU setting. +.TP +.BR SO_INCOMING_NAPI_ID " (gettable since Linux 4.12)" +.\" getsockopt 6d4339028b350efbf87c61e6d9e113e5373545c9 +Returns a system-level unique ID called NAPI ID that is associated +with a RX queue on which the last packet associated with that +socket is received. +.IP +This can be used by an application to split the incoming flows among worker +threads based on the RX queue on which the packets associated with the +flows are received. +It allows each worker thread to be associated with +a NIC HW receive queue and service all the connection +requests received on that RX queue. +This mapping between an app thread and +a HW NIC queue streamlines the +flow of data from the NIC to the application. +.TP +.B SO_KEEPALIVE +Enable sending of keep-alive messages on connection-oriented sockets. +Expects an integer boolean flag. +.TP +.B SO_LINGER +Sets or gets the +.B SO_LINGER +option. +The argument is a +.I linger +structure. +.IP +.in +4n +.EX +struct linger { + int l_onoff; /* linger active */ + int l_linger; /* how many seconds to linger for */ +}; +.EE +.in +.IP +When enabled, a +.BR close (2) +or +.BR shutdown (2) +will not return until all queued messages for the socket have been +successfully sent or the linger timeout has been reached. +Otherwise, +the call returns immediately and the closing is done in the background. +When the socket is closed as part of +.BR exit (2), +it always lingers in the background. +.TP +.B SO_LOCK_FILTER +.\" commit d59577b6ffd313d0ab3be39cb1ab47e29bdc9182 +When set, this option will prevent +changing the filters associated with the socket. +These filters include any set using the socket options +.BR SO_ATTACH_FILTER , +.BR SO_ATTACH_BPF , +.BR SO_ATTACH_REUSEPORT_CBPF , +and +.BR SO_ATTACH_REUSEPORT_EBPF . 
+.IP +The typical use case is for a privileged process to set up a raw socket +(an operation that requires the +.B CAP_NET_RAW +capability), apply a restrictive filter, set the +.B SO_LOCK_FILTER +option, +and then either drop its privileges or pass the socket file descriptor +to an unprivileged process via a UNIX domain socket. +.IP +Once the +.B SO_LOCK_FILTER +option has been enabled, attempts to change or remove the filter +attached to a socket, or to disable the +.B SO_LOCK_FILTER +option will fail with the error +.BR EPERM . +.TP +.BR SO_MARK " (since Linux 2.6.25)" +.\" commit 4a19ec5800fc3bb64e2d87c4d9fdd9e636086fe0 +.\" and 914a9ab386a288d0f22252fc268ecbc048cdcbd5 +Set the mark for each packet sent through this socket +(similar to the netfilter MARK target but socket-based). +Changing the mark can be used for mark-based +routing without netfilter or for packet filtering. +Setting this option requires the +.B CAP_NET_ADMIN +capability. +.TP +.B SO_OOBINLINE +If this option is enabled, +out-of-band data is directly placed into the receive data stream. +Otherwise, out-of-band data is passed only when the +.B MSG_OOB +flag is set during receiving. +.\" don't document it because it can do too much harm. +.\".B SO_NO_CHECK +.\" The kernel has support for the SO_NO_CHECK socket +.\" option (boolean: 0 == default, calculate checksum on xmit, +.\" 1 == do not calculate checksum on xmit). +.\" Additional note from Andi Kleen on SO_NO_CHECK (2010-08-30) +.\" On Linux UDP checksums are essentially free and there's no reason +.\" to turn them off and it would disable another safety line. +.\" That is why I didn't document the option. +.TP +.B SO_PASSCRED +Enable or disable the receiving of the +.B SCM_CREDENTIALS +control message. +For more information, see +.BR unix (7). +.TP +.B SO_PASSSEC +Enable or disable the receiving of the +.B SCM_SECURITY +control message. +For more information, see +.BR unix (7). 
+.TP +.BR SO_PEEK_OFF " (since Linux 3.4)" +.\" commit ef64a54f6e558155b4f149bb10666b9e914b6c54 +This option, which is currently supported only for +.BR unix (7) +sockets, sets the value of the "peek offset" for the +.BR recv (2) +system call when used with +.B MSG_PEEK +flag. +.IP +When this option is set to a negative value +(it is set to \-1 for all new sockets), +traditional behavior is provided: +.BR recv (2) +with the +.B MSG_PEEK +flag will peek data from the front of the queue. +.IP +When the option is set to a value greater than or equal to zero, +then the next peek at data queued in the socket will occur at +the byte offset specified by the option value. +At the same time, the "peek offset" will be +incremented by the number of bytes that were peeked from the queue, +so that a subsequent peek will return the next data in the queue. +.IP +If data is removed from the front of the queue via a call to +.BR recv (2) +(or similar) without the +.B MSG_PEEK +flag, the "peek offset" will be decreased by the number of bytes removed. +In other words, receiving data without the +.B MSG_PEEK +flag will cause the "peek offset" to be adjusted to maintain +the correct relative position in the queued data, +so that a subsequent peek will retrieve the data that would have been +retrieved had the data not been removed. +.IP +For datagram sockets, if the "peek offset" points to the middle of a packet, +the data returned will be marked with the +.B MSG_TRUNC +flag. +.IP +The following example serves to illustrate the use of +.BR SO_PEEK_OFF . 
+Suppose a stream socket has the following queued input data: +.IP +.in +4n +.EX +aabbccddeeff +.EE +.in +.IP +The following sequence of +.BR recv (2) +calls would have the effect noted in the comments: +.IP +.in +4n +.EX +int ov = 4; // Set peek offset to 4 +setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &ov, sizeof(ov)); +\& +recv(fd, buf, 2, MSG_PEEK); // Peeks "cc"; offset set to 6 +recv(fd, buf, 2, MSG_PEEK); // Peeks "dd"; offset set to 8 +recv(fd, buf, 2, 0); // Reads "aa"; offset set to 6 +recv(fd, buf, 2, MSG_PEEK); // Peeks "ee"; offset set to 8 +.EE +.in +.TP +.B SO_PEERCRED +Return the credentials of the peer process connected to this socket. +For further details, see +.BR unix (7). +.TP +.BR SO_PEERSEC " (since Linux 2.6.2)" +Return the security context of the peer socket connected to this socket. +For further details, see +.BR unix (7) +and +.BR ip (7). +.TP +.B SO_PRIORITY +Set the protocol-defined priority for all packets to be sent on +this socket. +Linux uses this value to order the networking queues: +packets with a higher priority may be processed first depending +on the selected device queueing discipline. +.\" For +.\" .BR ip (7), +.\" this also sets the IP type-of-service (TOS) field for outgoing packets. +Setting a priority outside the range 0 to 6 requires the +.B CAP_NET_ADMIN +capability. +.TP +.BR SO_PROTOCOL " (since Linux 2.6.32)" +Retrieves the socket protocol as an integer, returning a value such as +.BR IPPROTO_SCTP . +See +.BR socket (2) +for details. +This socket option is read-only. +.TP +.B SO_RCVBUF +Sets or gets the maximum socket receive buffer in bytes. +The kernel doubles this value (to allow space for bookkeeping overhead) +when it is set using +.\" Most (all?) other implementations do not do this -- MTK, Dec 05 +.BR setsockopt (2), +and this doubled value is returned by +.BR getsockopt (2). 
+.\" The following thread on LMKL is quite informative: +.\" getsockopt/setsockopt with SO_RCVBUF and SO_SNDBUF "non-standard" behavior +.\" 17 July 2012 +.\" http://thread.gmane.org/gmane.linux.kernel/1328935 +The default value is set by the +.I /proc/sys/net/core/rmem_default +file, and the maximum allowed value is set by the +.I /proc/sys/net/core/rmem_max +file. +The minimum (doubled) value for this option is 256. +.TP +.BR SO_RCVBUFFORCE " (since Linux 2.6.14)" +Using this socket option, a privileged +.RB ( CAP_NET_ADMIN ) +process can perform the same task as +.BR SO_RCVBUF , +but the +.I rmem_max +limit can be overridden. +.TP +.BR SO_RCVLOWAT " and " SO_SNDLOWAT +Specify the minimum number of bytes in the buffer until the socket layer +will pass the data to the protocol +.RB ( SO_SNDLOWAT ) +or the user on receiving +.RB ( SO_RCVLOWAT ). +These two values are initialized to 1. +.B SO_SNDLOWAT +is not changeable on Linux +.RB ( setsockopt (2) +fails with the error +.BR ENOPROTOOPT ). +.B SO_RCVLOWAT +is changeable +only since Linux 2.4. +.IP +Before Linux 2.6.28 +.\" Tested on kernel 2.6.14 -- mtk, 30 Nov 05 +.BR select (2), +.BR poll (2), +and +.BR epoll (7) +did not respect the +.B SO_RCVLOWAT +setting on Linux, +and indicated a socket as readable when even a single byte of data +was available. +A subsequent read from the socket would then block until +.B SO_RCVLOWAT +bytes are available. +Since Linux 2.6.28, +.\" commit c7004482e8dcb7c3c72666395cfa98a216a4fb70 +.BR select (2), +.BR poll (2), +and +.BR epoll (7) +indicate a socket as readable only if at least +.B SO_RCVLOWAT +bytes are available. +.TP +.BR SO_RCVTIMEO " and " SO_SNDTIMEO +.\" Not implemented in Linux 2.0. +.\" Implemented in Linux 2.1.11 for getsockopt: always return a zero struct. +.\" Implemented in Linux 2.3.41 for setsockopt, and actually used. +Specify the receiving or sending timeouts until reporting an error. +The argument is a +.IR "struct timeval" . 
+If an input or output function blocks for this period of time, and +data has been sent or received, the return value of that function +will be the amount of data transferred; if no data has been transferred +and the timeout has been reached, then \-1 is returned with +.I errno +set to +.B EAGAIN +or +.BR EWOULDBLOCK , +.\" in fact to EAGAIN +or +.B EINPROGRESS +(for +.BR connect (2)) +just as if the socket was specified to be nonblocking. +If the timeout is set to zero (the default), +then the operation will never time out. +Timeouts only have effect for system calls that perform socket I/O (e.g., +.BR accept (2), +.BR connect (2), +.BR read (2), +.BR recvmsg (2), +.BR send (2), +.BR sendmsg (2)); +timeouts have no effect for +.BR select (2), +.BR poll (2), +.BR epoll_wait (2), +and so on. +.TP +.B SO_REUSEADDR +.\" commit c617f398edd4db2b8567a28e899a88f8f574798d +.\" https://lwn.net/Articles/542629/ +Indicates that the rules used in validating addresses supplied in a +.BR bind (2) +call should allow reuse of local addresses. +For +.B AF_INET +sockets this +means that a socket may bind, except when there +is an active listening socket bound to the address. +When the listening socket is bound to +.B INADDR_ANY +with a specific port then it is not possible +to bind to this port for any local address. +Argument is an integer boolean flag. +.TP +.BR SO_REUSEPORT " (since Linux 3.9)" +Permits multiple +.B AF_INET +or +.B AF_INET6 +sockets to be bound to an identical socket address. +This option must be set on each socket (including the first socket) +prior to calling +.BR bind (2) +on the socket. +To prevent port hijacking, +all of the processes binding to the same address must have the same +effective UID. +This option can be employed with both TCP and UDP sockets. +.IP +For TCP sockets, this option allows +.BR accept (2) +load distribution in a multi-threaded server to be improved by +using a distinct listener socket for each thread. 
+This provides improved load distribution as compared +to traditional techniques such as using a single +.BR accept (2)ing +thread that distributes connections, +or having multiple threads that compete to +.BR accept (2) +from the same socket. +.IP +For UDP sockets, +the use of this option can provide better distribution +of incoming datagrams to multiple processes (or threads) as compared +to the traditional technique of having multiple processes +compete to receive datagrams on the same socket. +.TP +.BR SO_RXQ_OVFL " (since Linux 2.6.33)" +.\" commit 3b885787ea4112eaa80945999ea0901bf742707f +Indicates that an unsigned 32-bit value ancillary message (cmsg) +should be attached to received skbs indicating +the number of packets dropped by the socket since its creation. +.TP +.BR SO_SELECT_ERR_QUEUE " (since Linux 3.10)" +.\" commit 7d4c04fc170087119727119074e72445f2bb192b +.\" Author: Keller, Jacob E <jacob.e.keller@intel.com> +When this option is set on a socket, +an error condition on a socket causes notification also via the +.I exceptfds +set of +.BR select (2). +Similarly, +.BR poll (2) +also returns a +.B POLLPRI +whenever a +.B POLLERR +event is returned. +.\" It does not affect wake up. +.IP +Background: this option was added when waking up on an error condition +occurred only via the +.I readfds +and +.I writefds +sets of +.BR select (2). +The option was added to allow monitoring for error conditions via the +.I exceptfds +argument without simultaneously having to receive notifications (via +.IR readfds ) +for regular data that can be read from the socket. +After changes in Linux 4.16, +.\" commit 6e5d58fdc9bedd0255a8 +.\" ("skbuff: Fix not waking applications when errors are enqueued") +the use of this flag to achieve the desired notifications +is no longer necessary. +This option is nevertheless retained for backwards compatibility. +.TP +.B SO_SNDBUF +Sets or gets the maximum socket send buffer in bytes. 
+The kernel doubles this value (to allow space for bookkeeping overhead) +when it is set using +.\" Most (all?) other implementations do not do this -- MTK, Dec 05 +.\" See also the comment to SO_RCVBUF (17 Jul 2012 LKML mail) +.BR setsockopt (2), +and this doubled value is returned by +.BR getsockopt (2). +The default value is set by the +.I /proc/sys/net/core/wmem_default +file and the maximum allowed value is set by the +.I /proc/sys/net/core/wmem_max +file. +The minimum (doubled) value for this option is 2048. +.TP +.BR SO_SNDBUFFORCE " (since Linux 2.6.14)" +Using this socket option, a privileged +.RB ( CAP_NET_ADMIN ) +process can perform the same task as +.BR SO_SNDBUF , +but the +.I wmem_max +limit can be overridden. +.TP +.B SO_TIMESTAMP +Enable or disable the receiving of the +.B SO_TIMESTAMP +control message. +The timestamp control message is sent with level +.B SOL_SOCKET +and a +.I cmsg_type +of +.BR SCM_TIMESTAMP . +The +.I cmsg_data +field is a +.I "struct timeval" +indicating the +reception time of the last packet passed to the user in this call. +See +.BR cmsg (3) +for details on control messages. +.TP +.BR SO_TIMESTAMPNS " (since Linux 2.6.22)" +.\" commit 92f37fd2ee805aa77925c1e64fd56088b46094fc +Enable or disable the receiving of the +.B SO_TIMESTAMPNS +control message. +The timestamp control message is sent with level +.B SOL_SOCKET +and a +.I cmsg_type +of +.BR SCM_TIMESTAMPNS . +The +.I cmsg_data +field is a +.I "struct timespec" +indicating the +reception time of the last packet passed to the user in this call. +The clock used for the timestamp is +.BR CLOCK_REALTIME . +See +.BR cmsg (3) +for details on control messages. +.IP +A socket cannot mix +.B SO_TIMESTAMP +and +.BR SO_TIMESTAMPNS : +the two modes are mutually exclusive. +.TP +.B SO_TYPE +Gets the socket type as an integer (e.g., +.BR SOCK_STREAM ). +This socket option is read-only. 
+.TP +.BR SO_BUSY_POLL " (since Linux 3.11)" +Sets the approximate time in microseconds to busy poll on a blocking receive +when there is no data. +Increasing this value requires +.BR CAP_NET_ADMIN . +The default for this option is controlled by the +.I /proc/sys/net/core/busy_read +file. +.IP +The value in the +.I /proc/sys/net/core/busy_poll +file determines how long +.BR select (2) +and +.BR poll (2) +will busy poll when they operate on sockets with +.B SO_BUSY_POLL +set and no events to report are found. +.IP +In both cases, +busy polling will only be done when the socket last received data +from a network device that supports this option. +.IP +While busy polling may improve latency of some applications, +care must be taken when using it since this will increase +both CPU utilization and power usage. +.SS Signals +When writing onto a connection-oriented socket that has been shut down +(by the local or the remote end) +.B SIGPIPE +is sent to the writing process and +.B EPIPE +is returned. +The signal is not sent when the write call +specified the +.B MSG_NOSIGNAL +flag. +.PP +When requested with the +.B FIOSETOWN +.BR fcntl (2) +or +.B SIOCSPGRP +.BR ioctl (2), +.B SIGIO +is sent when an I/O event occurs. +It is possible to use +.BR poll (2) +or +.BR select (2) +in the signal handler to find out which socket the event occurred on. +An alternative (in Linux 2.2) is to set a real-time signal using the +.B F_SETSIG +.BR fcntl (2); +the handler of the real time signal will be called with +the file descriptor in the +.I si_fd +field of its +.IR siginfo_t . +See +.BR fcntl (2) +for more information. +.PP +Under some circumstances (e.g., multiple processes accessing a +single socket), the condition that caused the +.B SIGIO +may have already disappeared when the process reacts to the signal. +If this happens, the process should wait again because Linux +will resend the signal later. 
+.\" .SS Ancillary messages +.SS /proc interfaces +The core socket networking parameters can be accessed +via files in the directory +.IR /proc/sys/net/core/ . +.TP +.I rmem_default +contains the default setting in bytes of the socket receive buffer. +.TP +.I rmem_max +contains the maximum socket receive buffer size in bytes which a user may +set by using the +.B SO_RCVBUF +socket option. +.TP +.I wmem_default +contains the default setting in bytes of the socket send buffer. +.TP +.I wmem_max +contains the maximum socket send buffer size in bytes which a user may +set by using the +.B SO_SNDBUF +socket option. +.TP +.IR message_cost " and " message_burst +configure the token bucket filter used to rate-limit warning messages +caused by external network events. +.TP +.I netdev_max_backlog +Maximum number of packets in the global input queue. +.TP +.I optmem_max +Maximum length of ancillary data and user control data like the iovecs +per socket. +.\" netdev_fastroute is not documented because it is experimental +.SS Ioctls +These operations can be accessed using +.BR ioctl (2): +.PP +.in +4n +.EX +.IB error " = ioctl(" ip_socket ", " ioctl_type ", " &value_result ");" +.EE +.in +.TP +.B SIOCGSTAMP +Return a +.I struct timeval +with the receive timestamp of the last packet passed to the user. +This is useful for accurate round-trip time measurements. +See +.BR setitimer (2) +for a description of +.IR "struct timeval" . +.\" +This ioctl should be used only if the socket options +.B SO_TIMESTAMP +and +.B SO_TIMESTAMPNS +are not set on the socket. +Otherwise, it returns the timestamp of the +last packet that was received while +.B SO_TIMESTAMP +and +.B SO_TIMESTAMPNS +were not set, or it fails if no such packet has been received +(i.e., +.BR ioctl (2) +returns \-1 with +.I errno +set to +.BR ENOENT ). +.TP +.B SIOCSPGRP +Set the process or process group that is to receive +.B SIGIO +or +.B SIGURG +signals when I/O becomes possible or urgent data is available.
+The argument is a pointer to a +.IR pid_t . +For further details, see the description of +.B F_SETOWN +in +.BR fcntl (2). +.TP +.B FIOASYNC +Change the +.B O_ASYNC +flag to enable or disable asynchronous I/O mode of the socket. +Asynchronous I/O mode means that the +.B SIGIO +signal or the signal set with +.B F_SETSIG +is raised when a new I/O event occurs. +.IP +Argument is an integer boolean flag. +(This operation is synonymous with the use of +.BR fcntl (2) +to set the +.B O_ASYNC +flag.) +.\" +.TP +.B SIOCGPGRP +Get the current process or process group that receives +.B SIGIO +or +.B SIGURG +signals, +or 0 +when none is set. +.PP +Valid +.BR fcntl (2) +operations: +.TP +.B FIOGETOWN +The same as the +.B SIOCGPGRP +.BR ioctl (2). +.TP +.B FIOSETOWN +The same as the +.B SIOCSPGRP +.BR ioctl (2). +.SH VERSIONS +.B SO_BINDTODEVICE +was introduced in Linux 2.0.30. +.B SO_PASSCRED +is new in Linux 2.2. +The +.I /proc +interfaces were introduced in Linux 2.2. +.B SO_RCVTIMEO +and +.B SO_SNDTIMEO +are supported since Linux 2.3.41. +Earlier, timeouts were fixed to +a protocol-specific setting, and could not be read or written. +.SH NOTES +Linux assumes that half of the send/receive buffer is used for internal +kernel structures; thus the values in the corresponding +.I /proc +files are twice what can be observed on the wire. +.PP +Linux will allow port reuse only with the +.B SO_REUSEADDR +option +when this option was set both in the previous program that performed a +.BR bind (2) +to the port and in the program that wants to reuse the port. +This differs from some implementations (e.g., FreeBSD) +where only the later program needs to set the +.B SO_REUSEADDR +option. +Typically this difference is invisible, since, for example, a server +program is designed to always set this option. +.\" .SH AUTHORS +.\" This man page was written by Andi Kleen. 
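The SO_REUSEADDR rule in the note above can be observed with a small C sketch. It is a simplified illustration, not the usual server-restart scenario: it binds several TCP sockets (none of them listening) to the same loopback port within one process. The helper names bind_loopback() and bound_port() are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, set SO_REUSEADDR if
   'reuse' is nonzero, and bind it to 127.0.0.1:port (host byte order;
   0 lets the kernel pick). Returns the descriptor, or -1 on failure. */
int bind_loopback(unsigned short port, int reuse)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    /* The option must be set before bind(2) to have any effect. */
    if (reuse && setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
                            &reuse, sizeof(reuse)) == -1) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: return the local port of a bound socket in
   host byte order, or 0 on failure. */
unsigned short bound_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1)
        return 0;
    return ntohs(addr.sin_port);
}
```

With these helpers, a second bind_loopback(port, 1) is expected to succeed on Linux only if the first socket on that port was also created with reuse set; if the first socket was bound without the option, the second bind fails with EADDRINUSE even though it sets SO_REUSEADDR itself.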
+.SH SEE ALSO +.BR wireshark (1), +.BR bpf (2), +.BR connect (2), +.BR getsockopt (2), +.BR setsockopt (2), +.BR socket (2), +.BR pcap (3), +.BR address_families (7), +.BR capabilities (7), +.BR ddp (7), +.BR ip (7), +.BR ipv6 (7), +.BR packet (7), +.BR tcp (7), +.BR udp (7), +.BR unix (7), +.BR tcpdump (8)