summaryrefslogtreecommitdiffstats
path: root/man7/epoll.7
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--man7/epoll.7610
1 files changed, 610 insertions, 0 deletions
diff --git a/man7/epoll.7 b/man7/epoll.7
new file mode 100644
index 0000000..02a53e9
--- /dev/null
+++ b/man7/epoll.7
@@ -0,0 +1,610 @@
+.\" Copyright (C) 2003 Davide Libenzi
+.\"
+.\" SPDX-License-Identifier: GPL-2.0-or-later
+.\"
+.\" Davide Libenzi <davidel@xmailserver.org>
+.\"
+.TH epoll 7 2023-05-03 "Linux man-pages 6.05.01"
+.SH NAME
+epoll \- I/O event notification facility
+.SH SYNOPSIS
+.nf
+.B #include <sys/epoll.h>
+.fi
+.SH DESCRIPTION
+The
+.B epoll
+API performs a similar task to
+.BR poll (2):
+monitoring multiple file descriptors to see if I/O is possible on any of them.
+The
+.B epoll
+API can be used either as an edge-triggered or a level-triggered
+interface and scales well to large numbers of watched file descriptors.
+.PP
+The central concept of the
+.B epoll
+API is the
+.B epoll
+.IR instance ,
+an in-kernel data structure which, from a user-space perspective,
+can be considered as a container for two lists:
+.IP \[bu] 3
+The
+.I interest
+list (sometimes also called the
+.B epoll
+set): the set of file descriptors that the process has registered
+an interest in monitoring.
+.IP \[bu]
+The
+.I ready
+list: the set of file descriptors that are "ready" for I/O.
+The ready list is a subset of
+(or, more precisely, a set of references to)
+the file descriptors in the interest list.
+The ready list is dynamically populated
+by the kernel as a result of I/O activity on those file descriptors.
+.PP
+The following system calls are provided to
+create and manage an
+.B epoll
+instance:
+.IP \[bu] 3
+.BR epoll_create (2)
+creates a new
+.B epoll
+instance and returns a file descriptor referring to that instance.
+(The more recent
+.BR epoll_create1 (2)
+extends the functionality of
+.BR epoll_create (2).)
+.IP \[bu]
+Interest in particular file descriptors is then registered via
+.BR epoll_ctl (2),
+which adds items to the interest list of the
+.B epoll
+instance.
+.IP \[bu]
+.BR epoll_wait (2)
+waits for I/O events,
+blocking the calling thread if no events are currently available.
+(This system call can be thought of as fetching items from
+the ready list of the
+.B epoll
+instance.)
+.\"
+.SS Level-triggered and edge-triggered
+The
+.B epoll
+event distribution interface is able to behave both as edge-triggered
+(ET) and as level-triggered (LT).
+The difference between the two mechanisms
+can be described as follows.
+Suppose that
+this scenario happens:
+.IP (1) 5
+The file descriptor that represents the read side of a pipe
+.RI ( rfd )
+is registered on the
+.B epoll
+instance.
+.IP (2)
+A pipe writer writes 2\ kB of data on the write side of the pipe.
+.IP (3)
+A call to
+.BR epoll_wait (2)
+is done that will return
+.I rfd
+as a ready file descriptor.
+.IP (4)
+The pipe reader reads 1\ kB of data from
+.IR rfd .
+.IP (5)
+A call to
+.BR epoll_wait (2)
+is done.
+.PP
+If the
+.I rfd
+file descriptor has been added to the
+.B epoll
+interface using the
+.B EPOLLET
+(edge-triggered)
+flag, the call to
+.BR epoll_wait (2)
+done in step
+.B 5
+will probably hang despite the available data still present in the file
+input buffer;
+meanwhile the remote peer might be expecting a response based on the
+data it already sent.
+The reason for this is that edge-triggered mode
+delivers events only when changes occur on the monitored file descriptor.
+So, in step
+.B 5
+the caller might end up waiting for some data that is already present inside
+the input buffer.
+In the above example, an event on
+.I rfd
+will be generated because of the write done in
+.B 2
+and the event is consumed in
+.BR 3 .
+Since the read operation done in
+.B 4
+does not consume the whole buffer data, the call to
+.BR epoll_wait (2)
+done in step
+.B 5
+might block indefinitely.
+.PP
+An application that employs the
+.B EPOLLET
+flag should use nonblocking file descriptors to avoid having a blocking
+read or write starve a task that is handling multiple file descriptors.
+The suggested way to use
+.B epoll
+as an edge-triggered
+.RB ( EPOLLET )
+interface is as follows:
+.IP (1) 5
+with nonblocking file descriptors; and
+.IP (2)
+by waiting for an event only after
+.BR read (2)
+or
+.BR write (2)
+return
+.BR EAGAIN .
+.PP
+By contrast, when used as a level-triggered interface
+(the default, when
+.B EPOLLET
+is not specified),
+.B epoll
+is simply a faster
+.BR poll (2),
+and can be used wherever the latter is used since it shares the
+same semantics.
+.PP
+Since even with edge-triggered
+.BR epoll ,
+multiple events can be generated upon receipt of multiple chunks of data,
+the caller has the option to specify the
+.B EPOLLONESHOT
+flag, to tell
+.B epoll
+to disable the associated file descriptor after the receipt of an event with
+.BR epoll_wait (2).
+When the
+.B EPOLLONESHOT
+flag is specified,
+it is the caller's responsibility to rearm the file descriptor using
+.BR epoll_ctl (2)
+with
+.BR EPOLL_CTL_MOD .
+.PP
+If multiple threads
+(or processes, if child processes have inherited the
+.B epoll
+file descriptor across
+.BR fork (2))
+are blocked in
+.BR epoll_wait (2)
+waiting on the same epoll file descriptor and a file descriptor
+in the interest list that is marked for edge-triggered
+.RB ( EPOLLET )
+notification becomes ready,
+just one of the threads (or processes) is awoken from
+.BR epoll_wait (2).
+This provides a useful optimization for avoiding "thundering herd" wake-ups
+in some scenarios.
+.\"
+.SS Interaction with autosleep
+If the system is in
+.B autosleep
+mode via
+.I /sys/power/autosleep
+and an event happens which wakes the device from sleep, the device
+driver will keep the device awake only until that event is queued.
+To keep the device awake until the event has been processed,
+it is necessary to use the
+.BR epoll_ctl (2)
+.B EPOLLWAKEUP
+flag.
+.PP
+When the
+.B EPOLLWAKEUP
+flag is set in the
+.B events
+field for a
+.IR "struct epoll_event" ,
+the system will be kept awake from the moment the event is queued,
+through the
+.BR epoll_wait (2)
+call which returns the event until the subsequent
+.BR epoll_wait (2)
+call.
+If the event should keep the system awake beyond that time,
+then a separate
+.I wake_lock
+should be taken before the second
+.BR epoll_wait (2)
+call.
+.SS /proc interfaces
+The following interfaces can be used to limit the amount of
+kernel memory consumed by epoll:
+.\" Following was added in Linux 2.6.28, but them removed in Linux 2.6.29
+.\" .TP
+.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
+.\" This specifies an upper limit on the number of epoll instances
+.\" that can be created per real user ID.
+.TP
+.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
+This specifies a limit on the total number of
+file descriptors that a user can register across
+all epoll instances on the system.
+The limit is per real user ID.
+Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
+and roughly 160 bytes on a 64-bit kernel.
+Currently,
+.\" Linux 2.6.29 (in Linux 2.6.28, the default was 1/32 of lowmem)
+the default value for
+.I max_user_watches
+is 1/25 (4%) of the available low memory,
+divided by the registration cost in bytes.
+.SS Example for suggested usage
+While the usage of
+.B epoll
+when employed as a level-triggered interface does have the same
+semantics as
+.BR poll (2),
+the edge-triggered usage requires more clarification to avoid stalls
+in the application event loop.
+In this example, listener is a
+nonblocking socket on which
+.BR listen (2)
+has been called.
+The function
+.I do_use_fd()
+uses the new ready file descriptor until
+.B EAGAIN
+is returned by either
+.BR read (2)
+or
+.BR write (2).
+An event-driven state machine application should, after having received
+.BR EAGAIN ,
+record its current state so that at the next call to
+.I do_use_fd()
+it will continue to
+.BR read (2)
+or
+.BR write (2)
+from where it stopped before.
+.PP
+.in +4n
+.EX
+#define MAX_EVENTS 10
+struct epoll_event ev, events[MAX_EVENTS];
+int listen_sock, conn_sock, nfds, epollfd;
+\&
+/* Code to set up listening socket, \[aq]listen_sock\[aq],
+ (socket(), bind(), listen()) omitted. */
+\&
+epollfd = epoll_create1(0);
+if (epollfd == \-1) {
+ perror("epoll_create1");
+ exit(EXIT_FAILURE);
+}
+\&
+ev.events = EPOLLIN;
+ev.data.fd = listen_sock;
+if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
+ perror("epoll_ctl: listen_sock");
+ exit(EXIT_FAILURE);
+}
+\&
+for (;;) {
+ nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
+ if (nfds == \-1) {
+ perror("epoll_wait");
+ exit(EXIT_FAILURE);
+ }
+\&
+ for (n = 0; n < nfds; ++n) {
+ if (events[n].data.fd == listen_sock) {
+ conn_sock = accept(listen_sock,
+ (struct sockaddr *) &addr, &addrlen);
+ if (conn_sock == \-1) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+ setnonblocking(conn_sock);
+ ev.events = EPOLLIN | EPOLLET;
+ ev.data.fd = conn_sock;
+ if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
+ &ev) == \-1) {
+ perror("epoll_ctl: conn_sock");
+ exit(EXIT_FAILURE);
+ }
+ } else {
+ do_use_fd(events[n].data.fd);
+ }
+ }
+}
+.EE
+.in
+.PP
+When used as an edge-triggered interface, for performance reasons, it is
+possible to add the file descriptor inside the
+.B epoll
+interface
+.RB ( EPOLL_CTL_ADD )
+once by specifying
+.RB ( EPOLLIN | EPOLLOUT ).
+This allows you to avoid
+continuously switching between
+.B EPOLLIN
+and
+.B EPOLLOUT
+calling
+.BR epoll_ctl (2)
+with
+.BR EPOLL_CTL_MOD .
+.SS Questions and answers
+.IP \[bu] 3
+What is the key used to distinguish the file descriptors registered in an
+interest list?
+.IP
+The key is the combination of the file descriptor number and
+the open file description
+(also known as an "open file handle",
+the kernel's internal representation of an open file).
+.IP \[bu]
+What happens if you register the same file descriptor on an
+.B epoll
+instance twice?
+.IP
+You will probably get
+.BR EEXIST .
+However, it is possible to add a duplicate
+.RB ( dup (2),
+.BR dup2 (2),
+.BR fcntl (2)
+.BR F_DUPFD )
+file descriptor to the same
+.B epoll
+instance.
+.\" But a file descriptor duplicated by fork(2) can't be added to the
+.\" set, because the [file *, fd] pair is already in the epoll set.
+.\" That is a somewhat ugly inconsistency. On the one hand, a child process
+.\" cannot add the duplicate file descriptor to the epoll set. (In every
+.\" other case that I can think of, file descriptors duplicated by fork have
+.\" similar semantics to file descriptors duplicated by dup() and friends.) On
+.\" the other hand, the very fact that the child has a duplicate of the
+.\" file descriptor means that even if the parent closes its file descriptor,
+.\" then epoll_wait() in the parent will continue to receive notifications for
+.\" that file descriptor because of the duplicated file descriptor in the child.
+.\"
+.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
+.\" "epoll design problems with common fork/exec patterns"
+.\"
+.\" mtk, Feb 2008
+This can be a useful technique for filtering events,
+if the duplicate file descriptors are registered with different
+.I events
+masks.
+.IP \[bu]
+Can two
+.B epoll
+instances wait for the same file descriptor?
+If so, are events reported to both
+.B epoll
+file descriptors?
+.IP
+Yes, and events would be reported to both.
+However, careful programming may be needed to do this correctly.
+.IP \[bu]
+Is the
+.B epoll
+file descriptor itself poll/epoll/selectable?
+.IP
+Yes.
+If an
+.B epoll
+file descriptor has events waiting, then it will
+indicate as being readable.
+.IP \[bu]
+What happens if one attempts to put an
+.B epoll
+file descriptor into its own file descriptor set?
+.IP
+The
+.BR epoll_ctl (2)
+call fails
+.RB ( EINVAL ).
+However, you can add an
+.B epoll
+file descriptor inside another
+.B epoll
+file descriptor set.
+.IP \[bu]
+Can I send an
+.B epoll
+file descriptor over a UNIX domain socket to another process?
+.IP
+Yes, but it does not make sense to do this, since the receiving process
+would not have copies of the file descriptors in the interest list.
+.IP \[bu]
+Will closing a file descriptor cause it to be removed from all
+.B epoll
+interest lists?
+.IP
+Yes, but be aware of the following point.
+A file descriptor is a reference to an open file description (see
+.BR open (2)).
+Whenever a file descriptor is duplicated via
+.BR dup (2),
+.BR dup2 (2),
+.BR fcntl (2)
+.BR F_DUPFD ,
+or
+.BR fork (2),
+a new file descriptor referring to the same open file description is
+created.
+An open file description continues to exist until all
+file descriptors referring to it have been closed.
+.IP
+A file descriptor is removed from an
+interest list only after all the file descriptors referring to the underlying
+open file description have been closed.
+This means that even after a file descriptor that is part of an
+interest list has been closed,
+events may be reported for that file descriptor if other file
+descriptors referring to the same underlying file description remain open.
+To prevent this happening,
+the file descriptor must be explicitly removed from the interest list (using
+.BR epoll_ctl (2)
+.BR EPOLL_CTL_DEL )
+before it is duplicated.
+Alternatively,
+the application must ensure that all file descriptors are closed
+(which may be difficult if file descriptors were duplicated
+behind the scenes by library functions that used
+.BR dup (2)
+or
+.BR fork (2)).
+.IP \[bu]
+If more than one event occurs between
+.BR epoll_wait (2)
+calls, are they combined or reported separately?
+.IP
+They will be combined.
+.IP \[bu]
+Does an operation on a file descriptor affect the
+already collected but not yet reported events?
+.IP
+You can do two operations on an existing file descriptor.
+Remove would be meaningless for
+this case.
+Modify will reread available I/O.
+.IP \[bu]
+Do I need to continuously read/write a file descriptor
+until
+.B EAGAIN
+when using the
+.B EPOLLET
+flag (edge-triggered behavior)?
+.IP
+Receiving an event from
+.BR epoll_wait (2)
+should suggest to you that such
+file descriptor is ready for the requested I/O operation.
+You must consider it ready until the next (nonblocking)
+read/write yields
+.BR EAGAIN .
+When and how you will use the file descriptor is entirely up to you.
+.IP
+For packet/token-oriented files (e.g., datagram socket,
+terminal in canonical mode),
+the only way to detect the end of the read/write I/O space
+is to continue to read/write until
+.BR EAGAIN .
+.IP
+For stream-oriented files (e.g., pipe, FIFO, stream socket), the
+condition that the read/write I/O space is exhausted can also be detected by
+checking the amount of data read from / written to the target file
+descriptor.
+For example, if you call
+.BR read (2)
+by asking to read a certain amount of data and
+.BR read (2)
+returns a lower number of bytes, you
+can be sure of having exhausted the read I/O space for the file
+descriptor.
+The same is true when writing using
+.BR write (2).
+(Avoid this latter technique if you cannot guarantee that
+the monitored file descriptor always refers to a stream-oriented file.)
+.SS Possible pitfalls and ways to avoid them
+.IP \[bu] 3
+.B Starvation (edge-triggered)
+.IP
+If there is a large amount of I/O space,
+it is possible that by trying to drain
+it the other files will not get processed causing starvation.
+(This problem is not specific to
+.BR epoll .)
+.IP
+The solution is to maintain a ready list
+and mark the file descriptor as ready
+in its associated data structure, thereby allowing the application to
+remember which files need to be processed but still round robin amongst
+all the ready files.
+This also supports ignoring subsequent events you
+receive for file descriptors that are already ready.
+.IP \[bu]
+.B If using an event cache...
+.IP
+If you use an event cache or store all the file descriptors returned from
+.BR epoll_wait (2),
+then make sure to provide a way to mark
+its closure dynamically (i.e., caused by
+a previous event's processing).
+Suppose you receive 100 events from
+.BR epoll_wait (2),
+and in event #47 a condition causes event #13 to be closed.
+If you remove the structure and
+.BR close (2)
+the file descriptor for event #13, then your
+event cache might still say there are events waiting for that
+file descriptor causing confusion.
+.IP
+One solution for this is to call, during the processing of event 47,
+.BR epoll_ctl ( EPOLL_CTL_DEL )
+to delete file descriptor 13 and
+.BR close (2),
+then mark its associated
+data structure as removed and link it to a cleanup list.
+If you find another
+event for file descriptor 13 in your batch processing,
+you will discover the file descriptor had been
+previously removed and there will be no confusion.
+.SH VERSIONS
+Some other systems provide similar mechanisms;
+for example,
+FreeBSD has
+.IR kqueue ,
+and Solaris has
+.IR /dev/poll .
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 2.5.44.
+.\" Its interface should be finalized in Linux 2.5.66.
+glibc 2.3.2.
+.SH NOTES
+The set of file descriptors that is being monitored via
+an epoll file descriptor can be viewed via the entry for
+the epoll file descriptor in the process's
+.IR /proc/ pid /fdinfo
+directory.
+See
+.BR proc (5)
+for further details.
+.PP
+The
+.BR kcmp (2)
+.B KCMP_EPOLL_TFD
+operation can be used to test whether a file descriptor
+is present in an epoll instance.
+.SH SEE ALSO
+.BR epoll_create (2),
+.BR epoll_create1 (2),
+.BR epoll_ctl (2),
+.BR epoll_wait (2),
+.BR poll (2),
+.BR select (2)