diff options
Diffstat (limited to '')
-rw-r--r-- | man7/epoll.7 | 610 |
1 files changed, 610 insertions, 0 deletions
diff --git a/man7/epoll.7 b/man7/epoll.7 new file mode 100644 index 0000000..02a53e9 --- /dev/null +++ b/man7/epoll.7 @@ -0,0 +1,610 @@ +.\" Copyright (C) 2003 Davide Libenzi +.\" +.\" SPDX-License-Identifier: GPL-2.0-or-later +.\" +.\" Davide Libenzi <davidel@xmailserver.org> +.\" +.TH epoll 7 2023-05-03 "Linux man-pages 6.05.01" +.SH NAME +epoll \- I/O event notification facility +.SH SYNOPSIS +.nf +.B #include <sys/epoll.h> +.fi +.SH DESCRIPTION +The +.B epoll +API performs a similar task to +.BR poll (2): +monitoring multiple file descriptors to see if I/O is possible on any of them. +The +.B epoll +API can be used either as an edge-triggered or a level-triggered +interface and scales well to large numbers of watched file descriptors. +.PP +The central concept of the +.B epoll +API is the +.B epoll +.IR instance , +an in-kernel data structure which, from a user-space perspective, +can be considered as a container for two lists: +.IP \[bu] 3 +The +.I interest +list (sometimes also called the +.B epoll +set): the set of file descriptors that the process has registered +an interest in monitoring. +.IP \[bu] +The +.I ready +list: the set of file descriptors that are "ready" for I/O. +The ready list is a subset of +(or, more precisely, a set of references to) +the file descriptors in the interest list. +The ready list is dynamically populated +by the kernel as a result of I/O activity on those file descriptors. +.PP +The following system calls are provided to +create and manage an +.B epoll +instance: +.IP \[bu] 3 +.BR epoll_create (2) +creates a new +.B epoll +instance and returns a file descriptor referring to that instance. +(The more recent +.BR epoll_create1 (2) +extends the functionality of +.BR epoll_create (2).) +.IP \[bu] +Interest in particular file descriptors is then registered via +.BR epoll_ctl (2), +which adds items to the interest list of the +.B epoll +instance. +.IP \[bu] +.BR epoll_wait (2) +waits for I/O events, +blocking the calling thread if no events are currently available. +(This system call can be thought of as fetching items from +the ready list of the +.B epoll +instance.) +.\" +.SS Level-triggered and edge-triggered +The +.B epoll +event distribution interface is able to behave both as edge-triggered +(ET) and as level-triggered (LT). +The difference between the two mechanisms +can be described as follows. +Suppose that +this scenario happens: +.IP (1) 5 +The file descriptor that represents the read side of a pipe +.RI ( rfd ) +is registered on the +.B epoll +instance. +.IP (2) +A pipe writer writes 2\ kB of data on the write side of the pipe. +.IP (3) +A call to +.BR epoll_wait (2) +is done that will return +.I rfd +as a ready file descriptor. +.IP (4) +The pipe reader reads 1\ kB of data from +.IR rfd . +.IP (5) +A call to +.BR epoll_wait (2) +is done. +.PP +If the +.I rfd +file descriptor has been added to the +.B epoll +interface using the +.B EPOLLET +(edge-triggered) +flag, the call to +.BR epoll_wait (2) +done in step +.B 5 +will probably hang despite the available data still present in the file +input buffer; +meanwhile the remote peer might be expecting a response based on the +data it already sent. +The reason for this is that edge-triggered mode +delivers events only when changes occur on the monitored file descriptor. +So, in step +.B 5 +the caller might end up waiting for some data that is already present inside +the input buffer. +In the above example, an event on +.I rfd +will be generated because of the write done in +.B 2 +and the event is consumed in +.BR 3 . +Since the read operation done in +.B 4 +does not consume the whole buffer data, the call to +.BR epoll_wait (2) +done in step +.B 5 +might block indefinitely. +.PP +An application that employs the +.B EPOLLET +flag should use nonblocking file descriptors to avoid having a blocking +read or write starve a task that is handling multiple file descriptors. +The suggested way to use +.B epoll +as an edge-triggered +.RB ( EPOLLET ) +interface is as follows: +.IP (1) 5 +with nonblocking file descriptors; and +.IP (2) +by waiting for an event only after +.BR read (2) +or +.BR write (2) +return +.BR EAGAIN . +.PP +By contrast, when used as a level-triggered interface +(the default, when +.B EPOLLET +is not specified), +.B epoll +is simply a faster +.BR poll (2), +and can be used wherever the latter is used since it shares the +same semantics. +.PP +Since even with edge-triggered +.BR epoll , +multiple events can be generated upon receipt of multiple chunks of data, +the caller has the option to specify the +.B EPOLLONESHOT +flag, to tell +.B epoll +to disable the associated file descriptor after the receipt of an event with +.BR epoll_wait (2). +When the +.B EPOLLONESHOT +flag is specified, +it is the caller's responsibility to rearm the file descriptor using +.BR epoll_ctl (2) +with +.BR EPOLL_CTL_MOD . +.PP +If multiple threads +(or processes, if child processes have inherited the +.B epoll +file descriptor across +.BR fork (2)) +are blocked in +.BR epoll_wait (2) +waiting on the same epoll file descriptor and a file descriptor +in the interest list that is marked for edge-triggered +.RB ( EPOLLET ) +notification becomes ready, +just one of the threads (or processes) is awoken from +.BR epoll_wait (2). +This provides a useful optimization for avoiding "thundering herd" wake-ups +in some scenarios. +.\" +.SS Interaction with autosleep +If the system is in +.B autosleep +mode via +.I /sys/power/autosleep +and an event happens which wakes the device from sleep, the device +driver will keep the device awake only until that event is queued. +To keep the device awake until the event has been processed, +it is necessary to use the +.BR epoll_ctl (2) +.B EPOLLWAKEUP +flag. +.PP +When the +.B EPOLLWAKEUP +flag is set in the +.B events +field for a +.IR "struct epoll_event" , +the system will be kept awake from the moment the event is queued, +through the +.BR epoll_wait (2) +call which returns the event until the subsequent +.BR epoll_wait (2) +call. +If the event should keep the system awake beyond that time, +then a separate +.I wake_lock +should be taken before the second +.BR epoll_wait (2) +call. +.SS /proc interfaces +The following interfaces can be used to limit the amount of +kernel memory consumed by epoll: +.\" Following was added in Linux 2.6.28, but them removed in Linux 2.6.29 +.\" .TP +.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)" +.\" This specifies an upper limit on the number of epoll instances +.\" that can be created per real user ID. +.TP +.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)" +This specifies a limit on the total number of +file descriptors that a user can register across +all epoll instances on the system. +The limit is per real user ID. +Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, +and roughly 160 bytes on a 64-bit kernel. +Currently, +.\" Linux 2.6.29 (in Linux 2.6.28, the default was 1/32 of lowmem) +the default value for +.I max_user_watches +is 1/25 (4%) of the available low memory, +divided by the registration cost in bytes. +.SS Example for suggested usage +While the usage of +.B epoll +when employed as a level-triggered interface does have the same +semantics as +.BR poll (2), +the edge-triggered usage requires more clarification to avoid stalls +in the application event loop. +In this example, listener is a +nonblocking socket on which +.BR listen (2) +has been called. +The function +.I do_use_fd() +uses the new ready file descriptor until +.B EAGAIN +is returned by either +.BR read (2) +or +.BR write (2). +An event-driven state machine application should, after having received +.BR EAGAIN , +record its current state so that at the next call to +.I do_use_fd() +it will continue to +.BR read (2) +or +.BR write (2) +from where it stopped before. +.PP +.in +4n +.EX +#define MAX_EVENTS 10 +struct epoll_event ev, events[MAX_EVENTS]; +int listen_sock, conn_sock, nfds, epollfd; +\& +/* Code to set up listening socket, \[aq]listen_sock\[aq], + (socket(), bind(), listen()) omitted. */ +\& +epollfd = epoll_create1(0); +if (epollfd == \-1) { + perror("epoll_create1"); + exit(EXIT_FAILURE); +} +\& +ev.events = EPOLLIN; +ev.data.fd = listen_sock; +if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) { + perror("epoll_ctl: listen_sock"); + exit(EXIT_FAILURE); +} +\& +for (;;) { + nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1); + if (nfds == \-1) { + perror("epoll_wait"); + exit(EXIT_FAILURE); + } +\& + for (n = 0; n < nfds; ++n) { + if (events[n].data.fd == listen_sock) { + conn_sock = accept(listen_sock, + (struct sockaddr *) &addr, &addrlen); + if (conn_sock == \-1) { + perror("accept"); + exit(EXIT_FAILURE); + } + setnonblocking(conn_sock); + ev.events = EPOLLIN | EPOLLET; + ev.data.fd = conn_sock; + if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, + &ev) == \-1) { + perror("epoll_ctl: conn_sock"); + exit(EXIT_FAILURE); + } + } else { + do_use_fd(events[n].data.fd); + } + } +} +.EE +.in +.PP +When used as an edge-triggered interface, for performance reasons, it is +possible to add the file descriptor inside the +.B epoll +interface +.RB ( EPOLL_CTL_ADD ) +once by specifying +.RB ( EPOLLIN | EPOLLOUT ). +This allows you to avoid +continuously switching between +.B EPOLLIN +and +.B EPOLLOUT +calling +.BR epoll_ctl (2) +with +.BR EPOLL_CTL_MOD . +.SS Questions and answers +.IP \[bu] 3 +What is the key used to distinguish the file descriptors registered in an +interest list? +.IP +The key is the combination of the file descriptor number and +the open file description +(also known as an "open file handle", +the kernel's internal representation of an open file). +.IP \[bu] +What happens if you register the same file descriptor on an +.B epoll +instance twice? +.IP +You will probably get +.BR EEXIST . +However, it is possible to add a duplicate +.RB ( dup (2), +.BR dup2 (2), +.BR fcntl (2) +.BR F_DUPFD ) +file descriptor to the same +.B epoll +instance. +.\" But a file descriptor duplicated by fork(2) can't be added to the +.\" set, because the [file *, fd] pair is already in the epoll set. +.\" That is a somewhat ugly inconsistency. On the one hand, a child process +.\" cannot add the duplicate file descriptor to the epoll set. (In every +.\" other case that I can think of, file descriptors duplicated by fork have +.\" similar semantics to file descriptors duplicated by dup() and friends.) On +.\" the other hand, the very fact that the child has a duplicate of the +.\" file descriptor means that even if the parent closes its file descriptor, +.\" then epoll_wait() in the parent will continue to receive notifications for +.\" that file descriptor because of the duplicated file descriptor in the child. +.\" +.\" See http://thread.gmane.org/gmane.linux.kernel/596462/ +.\" "epoll design problems with common fork/exec patterns" +.\" +.\" mtk, Feb 2008 +This can be a useful technique for filtering events, +if the duplicate file descriptors are registered with different +.I events +masks. +.IP \[bu] +Can two +.B epoll +instances wait for the same file descriptor? +If so, are events reported to both +.B epoll +file descriptors? +.IP +Yes, and events would be reported to both. +However, careful programming may be needed to do this correctly. +.IP \[bu] +Is the +.B epoll +file descriptor itself poll/epoll/selectable? +.IP +Yes. +If an +.B epoll +file descriptor has events waiting, then it will +indicate as being readable. +.IP \[bu] +What happens if one attempts to put an +.B epoll +file descriptor into its own file descriptor set? +.IP +The +.BR epoll_ctl (2) +call fails +.RB ( EINVAL ). +However, you can add an +.B epoll +file descriptor inside another +.B epoll +file descriptor set. +.IP \[bu] +Can I send an +.B epoll +file descriptor over a UNIX domain socket to another process? +.IP +Yes, but it does not make sense to do this, since the receiving process +would not have copies of the file descriptors in the interest list. +.IP \[bu] +Will closing a file descriptor cause it to be removed from all +.B epoll +interest lists? +.IP +Yes, but be aware of the following point. +A file descriptor is a reference to an open file description (see +.BR open (2)). +Whenever a file descriptor is duplicated via +.BR dup (2), +.BR dup2 (2), +.BR fcntl (2) +.BR F_DUPFD , +or +.BR fork (2), +a new file descriptor referring to the same open file description is +created. +An open file description continues to exist until all +file descriptors referring to it have been closed. +.IP +A file descriptor is removed from an +interest list only after all the file descriptors referring to the underlying +open file description have been closed. +This means that even after a file descriptor that is part of an +interest list has been closed, +events may be reported for that file descriptor if other file +descriptors referring to the same underlying file description remain open. +To prevent this happening, +the file descriptor must be explicitly removed from the interest list (using +.BR epoll_ctl (2) +.BR EPOLL_CTL_DEL ) +before it is duplicated. +Alternatively, +the application must ensure that all file descriptors are closed +(which may be difficult if file descriptors were duplicated +behind the scenes by library functions that used +.BR dup (2) +or +.BR fork (2)). +.IP \[bu] +If more than one event occurs between +.BR epoll_wait (2) +calls, are they combined or reported separately? +.IP +They will be combined. +.IP \[bu] +Does an operation on a file descriptor affect the +already collected but not yet reported events? +.IP +You can do two operations on an existing file descriptor. +Remove would be meaningless for +this case. +Modify will reread available I/O. +.IP \[bu] +Do I need to continuously read/write a file descriptor +until +.B EAGAIN +when using the +.B EPOLLET +flag (edge-triggered behavior)? +.IP +Receiving an event from +.BR epoll_wait (2) +should suggest to you that such +file descriptor is ready for the requested I/O operation. +You must consider it ready until the next (nonblocking) +read/write yields +.BR EAGAIN . +When and how you will use the file descriptor is entirely up to you. +.IP +For packet/token-oriented files (e.g., datagram socket, +terminal in canonical mode), +the only way to detect the end of the read/write I/O space +is to continue to read/write until +.BR EAGAIN . +.IP +For stream-oriented files (e.g., pipe, FIFO, stream socket), the +condition that the read/write I/O space is exhausted can also be detected by +checking the amount of data read from / written to the target file +descriptor. +For example, if you call +.BR read (2) +by asking to read a certain amount of data and +.BR read (2) +returns a lower number of bytes, you +can be sure of having exhausted the read I/O space for the file +descriptor. +The same is true when writing using +.BR write (2). +(Avoid this latter technique if you cannot guarantee that +the monitored file descriptor always refers to a stream-oriented file.) +.SS Possible pitfalls and ways to avoid them +.IP \[bu] 3 +.B Starvation (edge-triggered) +.IP +If there is a large amount of I/O space, +it is possible that by trying to drain +it the other files will not get processed causing starvation. +(This problem is not specific to +.BR epoll .) +.IP +The solution is to maintain a ready list +and mark the file descriptor as ready +in its associated data structure, thereby allowing the application to +remember which files need to be processed but still round robin amongst +all the ready files. +This also supports ignoring subsequent events you +receive for file descriptors that are already ready. +.IP \[bu] +.B If using an event cache... +.IP +If you use an event cache or store all the file descriptors returned from +.BR epoll_wait (2), +then make sure to provide a way to mark +its closure dynamically (i.e., caused by +a previous event's processing). +Suppose you receive 100 events from +.BR epoll_wait (2), +and in event #47 a condition causes event #13 to be closed. +If you remove the structure and +.BR close (2) +the file descriptor for event #13, then your +event cache might still say there are events waiting for that +file descriptor causing confusion. +.IP +One solution for this is to call, during the processing of event 47, +.BR epoll_ctl ( EPOLL_CTL_DEL ) +to delete file descriptor 13 and +.BR close (2), +then mark its associated +data structure as removed and link it to a cleanup list. +If you find another +event for file descriptor 13 in your batch processing, +you will discover the file descriptor had been +previously removed and there will be no confusion. +.SH VERSIONS +Some other systems provide similar mechanisms; +for example, +FreeBSD has +.IR kqueue , +and Solaris has +.IR /dev/poll . +.SH STANDARDS +Linux. +.SH HISTORY +Linux 2.5.44. +.\" Its interface should be finalized in Linux 2.5.66. +glibc 2.3.2. +.SH NOTES +The set of file descriptors that is being monitored via +an epoll file descriptor can be viewed via the entry for +the epoll file descriptor in the process's +.IR /proc/ pid /fdinfo +directory. +See +.BR proc (5) +for further details. +.PP +The +.BR kcmp (2) +.B KCMP_EPOLL_TFD +operation can be used to test whether a file descriptor +is present in an epoll instance. +.SH SEE ALSO +.BR epoll_create (2), +.BR epoll_create1 (2), +.BR epoll_ctl (2), +.BR epoll_wait (2), +.BR poll (2), +.BR select (2) |