http://www.win.tue.nl/~aeb/linux/lk/lk-12.html
12. Handling of asynchronous events
One wants to be notified of various events, like data that has become available, files that have changed, and signals that have been raised. FreeBSD has the nice kqueue API. Let us discuss the Unix/Linux situation.
It is easy to wait for a single event. Usually one does a (blocking) read(), and that is it.
Many mechanisms exist to wait for any of a set of events, or just to test whether anything interesting happened.
12.1 O_NONBLOCK
If the open() call that opened a file includes the O_NONBLOCK flag, the file is opened in non-blocking mode. Neither the open() nor any subsequent operations on the returned file descriptor will cause the calling process to wait.
A nonblocking open is useful (i) in order to obtain a file descriptor for subsequent use when no I/O is planned, e.g. for ioctl() calls to get or set properties of a device; especially on device files, an ordinary open might have unwanted side effects, such as a tape rewind; (ii) when reading from a pipe: the read will return immediately when no data is available; when writing to a pipe: the write will return immediately (without writing anything) when there are no readers.
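The read side of case (ii) can be tried out with a small sketch. A pipe made by pipe() is not opened with open(), so here the same flag is set afterwards with fcntl(F_SETFL). The helper names set_nonblocking and read_empty_pipe_errno are made up for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch an already open descriptor to non-blocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: read from an empty non-blocking pipe.
   Returns the errno value of the failing call, or 0 if nothing failed. */
int read_empty_pipe_errno(void)
{
    int p[2], err = 0;
    char c;

    if (pipe(p) < 0)
        return errno;
    /* no data was written, so a blocking read would wait here forever */
    if (set_nonblocking(p[0]) < 0 || read(p[0], &c, 1) < 0)
        err = errno;
    close(p[0]);
    close(p[1]);
    return err;
}
```

On Linux the read fails at once with EAGAIN, so read_empty_pipe_errno() is expected to return that value.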
O_NOACCESS
An obscure Linux feature is that one can open a file with the O_NOACCESS flag (defined as 3, where O_RDONLY is 0, O_WRONLY is 1 and O_RDWR is 2). In order to open a file with this mode, one needs both read and write permission. This had the same purpose: announce that no reading or writing was going to be done, and only a file descriptor for ioctl use was needed. (Used in LILO, fdformat, and a few similar utilities.)
People would love to have this facility also for directories, so that one could do a fd = open(".", O_NOACCESS), go elsewhere, and return by fchdir(fd). But an O_NOACCESS open fails on directories.
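For comparison, here is a sketch of the same go-elsewhere-and-return pattern with an ordinary O_RDONLY open. This is a substitute for the O_NOACCESS idea, not the facility wished for above: it needs read permission on the current directory. The helper name visit_and_return is an assumption of this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: change to another directory and come back.
   Returns 0 on success, -1 on failure. */
int visit_and_return(const char *elsewhere)
{
    int fd, r;

    fd = open(".", O_RDONLY);       /* requires read permission on "." */
    if (fd < 0)
        return -1;
    if (chdir(elsewhere) < 0) {
        close(fd);
        return -1;
    }
    /* ... do the work in the other directory ... */
    r = fchdir(fd);                 /* return to where we started */
    close(fd);
    return r;
}
```

After a successful visit_and_return("/") the current working directory should be what it was before the call.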
12.2 select
The select() mechanism was introduced in 4.2BSD. The prototype of this system call is

    int select(int nfds, fd_set *restrict readfds,
               fd_set *restrict writefds, fd_set *restrict errorfds,
               struct timeval *restrict timeout);

It allows one to specify three sets of file descriptors (as bit masks) and a timeout. The call returns when the timeout expires or when one of the file descriptors in readfds has data available for reading, one of those in writefds has buffer space available for writing, or an exceptional condition (such as out-of-band data) is pending for one of those in errorfds. Upon return, the file descriptor sets and the timeout are rewritten to indicate which file descriptors have the stated condition, and how much time from the timeout is left. (Note that other Unix-type systems do not rewrite the timeout.)
There are two select system calls. The old one uses a parameter block, the new one uses five parameters. Otherwise they are equivalent.
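A minimal sketch of a single select() call on one descriptor may make this concrete. The helper name wait_readable is invented for the example. Since the sets and (on Linux) the timeout are rewritten by the call, both are set up afresh on every invocation.

```c
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait at most `seconds` for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, long seconds)
{
    fd_set readfds;
    struct timeval tv;
    int r;

    /* select() overwrites its arguments, so initialize them each time */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = seconds;
    tv.tv_usec = 0;

    /* the first argument is the highest descriptor number plus one */
    r = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (r < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

Called on the read end of an empty pipe with a zero timeout this should return 0, and after a byte has been written to the pipe it should return 1.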
12.3 pselect
The pselect system call was added in Linux 2.6.16 (and was present earlier elsewhere). With only select() it is difficult, almost impossible, to handle signals correctly. A signal handler itself cannot do very much: the main program is in some unknown state when the signal is delivered. The usual solution is to only raise a flag in the signal handler, and test that flag in the main program.
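
    #include <signal.h>

    /* written in the handler, read in main: must be volatile sig_atomic_t */
    volatile sig_atomic_t gotsignal = 0;

    void sighand(int x) {
            gotsignal = 1;
    }

    int main() {
            ...
            signal(SIGINT, sighand);
            while (1) {
                    if (gotsignal)
                            ...
                    select();
                    ...
            }
    }
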
Now if one wants to wait for either a signal or some event on a file descriptor, then testing the flag and, if it is not set, calling select() has a race: maybe the signal arrived just after the flag was tested and just before select was called, and the program may hang in select() without reacting to the signal.
The call pselect() is designed to solve this problem. This function is just like select() but has prototype

    int pselect(int nfds, fd_set *restrict readfds,
                fd_set *restrict writefds, fd_set *restrict errorfds,
                const struct timespec *restrict timeout,
                const sigset_t *restrict sigmask);

with a sixth parameter sigmask, and it does the equivalent of

    sigset_t origmask;

    sigprocmask(SIG_SETMASK, sigmask, &origmask);
    ready = select(nfds, readfds, writefds, errorfds, timeout);
    sigprocmask(SIG_SETMASK, &origmask, NULL);

as an atomic action. Now one can block the signals of interest until the call of pselect() and have a sigmask that unblocks them. If a signal occurs, the call will return with errno set to EINTR.
This function uses a struct timespec (with nanoseconds) instead of a struct timeval (with microseconds), and does not update its value on return.
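The recipe can be sketched as follows: block the signal, test the flag, and let pselect() unblock the signal only for the duration of the wait. The helper wait_signal and its raise_first parameter (which sends the signal while it is still blocked, to imitate a signal arriving in the race window) are inventions of this example, not part of any API.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t gotsignal = 0;

static void sighand(int sig)
{
    (void) sig;
    gotsignal = 1;
}

/* Hypothetical helper: wait for signal signo or a timeout (no descriptors).
   Returns 1 if the signal was seen, 0 on timeout, -1 on error. */
int wait_signal(int signo, long seconds, int raise_first)
{
    struct sigaction act;
    sigset_t blocked, waitmask, origmask;
    struct timespec ts;
    int r = 0, err = 0, seen;

    act.sa_handler = sighand;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (sigaction(signo, &act, NULL) < 0)
        return -1;

    /* block the signal, so that it cannot be delivered between
       the test of the flag and the start of the wait */
    sigemptyset(&blocked);
    sigaddset(&blocked, signo);
    if (sigprocmask(SIG_BLOCK, &blocked, &origmask) < 0)
        return -1;

    if (raise_first)
        raise(signo);           /* remains pending, since it is blocked */

    /* mask in effect during the wait: the original one, signo unblocked */
    waitmask = origmask;
    sigdelset(&waitmask, signo);

    ts.tv_sec = seconds;
    ts.tv_nsec = 0;
    if (!gotsignal) {
        r = pselect(0, NULL, NULL, NULL, &ts, &waitmask);
        err = errno;
    }
    seen = gotsignal;
    gotsignal = 0;
    sigprocmask(SIG_SETMASK, &origmask, NULL);

    if (r < 0 && err != EINTR)
        return -1;
    return seen ? 1 : 0;
}
```

With raise_first set, the pending signal is delivered as soon as pselect() installs waitmask, and the call returns with EINTR instead of sleeping for the full timeout.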
The self-pipe trick
Before the introduction of pselect() people resorted to obscure tricks to obtain the same effect. Famous is Daniel Bernstein's self-pipe trick: create a non-blocking pipe, and add a file descriptor for reading from this pipe to the readfds argument of select(). In the signal handler, write a byte to the pipe. This works.
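A minimal sketch of the trick follows. The names selfpipe, selfpipe_init and selfpipe_wait are made up for this example, and writing the signal number as the byte is a choice of the sketch, not part of the trick itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

/* selfpipe[0] is watched by select(), selfpipe[1] is written by the handler */
static int selfpipe[2];

static void sighand(int sig)
{
    int saved = errno;
    char c = (char) sig;

    /* write() is async-signal-safe; the pipe is non-blocking, so this
       cannot hang. If the pipe is full, a wakeup is pending anyway. */
    if (write(selfpipe[1], &c, 1) < 0) {
        /* nothing to do */
    }
    errno = saved;
}

/* Hypothetical helper: create the non-blocking pipe, install the handler. */
int selfpipe_init(int signo)
{
    struct sigaction act;
    int i, flags;

    if (pipe(selfpipe) < 0)
        return -1;
    for (i = 0; i < 2; i++) {
        flags = fcntl(selfpipe[i], F_GETFL);
        if (flags < 0 || fcntl(selfpipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    act.sa_handler = sighand;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    return sigaction(signo, &act, NULL);
}

/* Hypothetical helper: wait at most `seconds` for a signal.
   Returns the signal number, 0 on timeout, -1 on error. */
int selfpipe_wait(long seconds)
{
    fd_set readfds;
    struct timeval tv;
    char c;
    int r;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    do {
        FD_ZERO(&readfds);
        FD_SET(selfpipe[0], &readfds);
        /* a real program would add its other descriptors here */
        r = select(selfpipe[0] + 1, &readfds, NULL, NULL, &tv);
    } while (r < 0 && errno == EINTR);  /* interrupted: the byte is in the pipe */
    if (r <= 0)
        return r;
    return (read(selfpipe[0], &c, 1) == 1) ? c : -1;
}
```

A signal that arrives just before select() is no longer lost: its byte is already in the pipe, so select() returns at once.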
The system call
The pselect system call has a 7-parameter prototype (the 7th parameter being the size of the 6th sigmask parameter), but most architectures cannot handle 7-parameter system calls, so there is also a 6-parameter version where the 6th parameter is a pointer to a struct that has the last two parameters. Unlike the POSIX library routine, the system call does return the leftover part of the timeout.
This system call starts by changing the signal mask, and ends by restoring it. However, if it was interrupted by a signal, this signal should be delivered, while the restored signal mask might block it. This is solved by the recent TIF_RESTORE_SIGMASK mechanism in the kernel. When the pselect system call returns after being interrupted by a signal, it does not immediately restore the original signal mask, but first runs the user's signal handler, and only upon return from the handler is the original signal mask restored.
12.4 poll
The poll() system call is rather similar to select(). The prototype is

    struct pollfd {
            int   fd;         /* file descriptor */
            short events;     /* requested events */
            short revents;    /* returned events */
    };

    int poll(struct pollfd *fds, nfds_t nfds, int timeout);

where the fields events and revents are bitmasks indicating for what events fd should be watched, and what conditions actually occurred. The timeout is in milliseconds; a negative number means an infinite timeout.
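As an illustration, a sketch of the poll() counterpart of waiting for input on one descriptor. The helper name poll_readable is an assumption of this example.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait at most timeout_ms milliseconds for input on fd.
   Returns the revents bitmask (nonzero) if something happened,
   0 on timeout, -1 on error. */
int poll_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int r;

    pfd.fd = fd;
    pfd.events = POLLIN;        /* what we are interested in */
    pfd.revents = 0;            /* filled in by the kernel */

    r = poll(&pfd, 1, timeout_ms);
    if (r < 0)
        return -1;
    if (r == 0)
        return 0;
    return pfd.revents;         /* may also contain POLLERR, POLLHUP */
}
```

Unlike with select(), the input field events is not overwritten, so the same array can be passed again without being set up anew.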
ppoll
Just like pselect is a version of select that allows safe handling of signals, ppoll is such a version of poll. The prototype is

    int ppoll(struct pollfd *fds, nfds_t nfds,
              const struct timespec *timeout, const sigset_t *sigmask);

12.5 epoll
When the number of file descriptors becomes very large, the select() and poll() mechanisms become inefficient. With N descriptors, O(N) information must be copied from user space to kernel and vice versa, and loops of length O(N) are needed to test the conditions.
Solaris introduced the /dev/poll mechanism (see poll(7d) on Solaris), where the idea is that one does the copy from user space to kernel only once (by writing an array of struct pollfd's to /dev/poll) and gets only interesting information back (via an ioctl on this device that copies the interesting struct pollfd back to userspace).
Linux tries something similar using the three system calls epoll_create, epoll_ctl, epoll_wait (added in 2.5.44, see epoll(7)). Benchmarks seem to indicate that the performance is comparable to that of select and poll until one has thousands of descriptors, only a small fraction of which is ready. (And then epoll is clearly better.) In most tests, the FreeBSD kqueue wins.
For a discussion of these and several other mechanisms, especially for the context of web servers, see the C10K site.
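A minimal sketch of how the three calls fit together: the interest set is handed to the kernel once with epoll_ctl, and each epoll_wait returns only the ready descriptors. The helper names watch_fd and next_ready are invented for this example, and next_ready asks for a single event at a time only to keep the sketch short.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register fd for "readable" events on the epoll
   instance epfd (obtained from epoll_create). Returns 0 or -1. */
int watch_fd(int epfd, int fd)
{
    struct epoll_event ev = { 0 };

    ev.events = EPOLLIN;
    ev.data.fd = fd;            /* user data, handed back by epoll_wait */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: wait at most timeout_ms milliseconds.
   Returns one ready descriptor, or -1 on timeout or error. */
int next_ready(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n;

    /* the registered set lives in the kernel; nothing is copied in here */
    n = epoll_wait(epfd, &ev, 1, timeout_ms);
    return (n == 1) ? ev.data.fd : -1;
}
```

One would create the instance with epfd = epoll_create(1) (the size argument is only a hint), call watch_fd once per descriptor, and then call next_ready in a loop.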
epoll_pwait
Just like pselect and ppoll are signal-safe versions of select and poll, there is (since 2.6.19) an epoll_pwait version of epoll_wait that includes a signal mask.
12.6 dnotify
The above was about notification about file descriptors that become ready for I/O. A different type of notification is that about file system events. In 2.4.0-test9 the dnotify feature was introduced. Today it is obsoleted by inotify (see below). See Documentation/dnotify.txt and fs/dnotify.c.
The idea was that one could register interest in changes in a directory dir using fd = open(dir, O_RDONLY) followed by fcntl(fd, F_NOTIFY, ...). Notification occurs via delivery of a signal.

    /* dnotify demo, basically from Documentation/dnotify.txt */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile int dir_fd;

    /* A very weak interface: we report that something changed,
       but the only info is in which directory, but not what
       the change is. */
    static void handler(int sig, siginfo_t *si, void *data) {
            dir_fd = si->si_fd;
    }

    int main(void) {
            struct sigaction act;
            int fd;

            act.sa_sigaction = handler;
            sigemptyset(&act.sa_mask);
            act.sa_flags = SA_SIGINFO;
            sigaction(SIGRTMIN + 1, &act, NULL);

            fd = open(".", O_RDONLY);
            fcntl(fd, F_SETSIG, SIGRTMIN + 1);
            fcntl(fd, F_NOTIFY,
                  DN_MODIFY|DN_CREATE|DN_DELETE|DN_RENAME|DN_MULTISHOT);

            while (1) {
                    pause();
                    printf("Got some event on fd=%d\n", dir_fd);
            }
    }

There are many problems with this interface. It can only watch directories. If one wants to watch many directories, it takes many file descriptors. Moreover, the open file pins the filesystem so that it cannot be unmounted. When something happens it is unknown what, and a stat() on all files of interest is needed. The communication mechanism, signals, is unfortunate. Dnotify is obsolete now.
12.7 inotify
(Since 2.6.13.) Inotify is implemented using three new system calls and the usual read(), poll(), close() calls:

    int inotify_init(void);
    int inotify_add_watch(int fd, const char *pathname, uint32_t mask);
    int inotify_rm_watch(int fd, int wd);

The first returns a file descriptor: fd = inotify_init(). The second tells what to watch, and what to watch for, and returns a watch descriptor: wd = inotify_add_watch(fd, "/home/aeb", IN_CREATE | IN_DELETE). The file descriptor fd can be used in a read() call, and then returns an array of struct inotify_event's. One can use select() and poll() on it. A watch is removed by inotify_rm_watch(fd,wd). The inotify instance is closed by close(fd).
An inotify_event is defined by

    struct inotify_event {
            int      wd;       /* Watch descriptor */
            uint32_t mask;     /* Mask of events */
            uint32_t cookie;   /* Unique cookie associating related
                                  events (for rename(2)) */
            uint32_t len;      /* Size of 'name' field */
            char     name[];   /* Optional null-terminated name */
    };

The name field defines the file involved, when one is watching a directory.
There is a /proc interface with settable limits:

    % ls /proc/sys/fs/inotify/
    max_queued_events  max_user_instances  max_user_watches
    % cat $_/*
    16384
    128
    8192

Applications that use inotify include inotify-tools, gamin and Beagle.

    /* inotify demo, mimicking the above dnotify one */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/inotify.h>

    #define BUFSZ   16384

    static void errexit(const char *s) {
            fprintf(stderr, "%s\n", s);
            exit(1);
    }

    int main(void) {
            int ifd, wd, i, n;
            /* aligned, since the buffer is accessed as struct inotify_event */
            char buf[BUFSZ]
                    __attribute__((aligned(__alignof__(struct inotify_event))));

            ifd = inotify_init();
            if (ifd < 0)
                    errexit("cannot obtain an inotify instance");

            wd = inotify_add_watch(ifd, ".", IN_MODIFY|IN_CREATE|IN_DELETE);
            if (wd < 0)
                    errexit("cannot add inotify watch");

            while (1) {
                    /* a single read may return several events */
                    n = read(ifd, buf, sizeof(buf));
                    if (n <= 0)
                            errexit("read problem");

                    i = 0;
                    while (i < n) {
                            struct inotify_event *ev;

                            ev = (struct inotify_event *) &buf[i];
                            if (ev->len)
                                    printf("file %s %s\n", ev->name,
                                           (ev->mask & IN_CREATE) ? "created" :
                                           (ev->mask & IN_DELETE) ? "deleted" :
                                           "modified");
                            else
                                    printf("unexpected event - wd=%d mask=%u\n",
                                           ev->wd, ev->mask);
                            /* events have variable size: header plus name */
                            i += sizeof(struct inotify_event) + ev->len;
                    }
                    printf("---\n");
            }
    }