Effectively, Java doesn't really use select to implement its selectors, and therefore I feel like I have to add this extra post for completeness. Let's quickly see what it actually uses by stracing the code from the post Java vs C Network Programming. Select and Selectors.
$ strace -f java TCPServerNIOSelector
(...)
[pid 10567] epoll_wait(7, <unfinished ...>
[pid 10582] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 10582] futex(0x7f84e0025d18, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10582] futex(0x7f84e0025d68, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49997828}) = -1 ETIMEDOUT (Connection timed out)
[pid 10582] futex(0x7f84e0025d18, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10582] futex(0x7f84e0025d68, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49997319} <unfinished ...>
[pid 10567] <... epoll_wait resumed> [], 1024, 100) = 0
CTRL+C
We need to use the -f flag to ask strace to also trace child processes. Run it without the flag and you will see that the output ends after a clone system call. Anyway, the relevant part is that the Java application is using epoll_wait instead of select.
But before we introduce epoll we will first introduce poll, and then we will be able to discuss the differences between all three ways of writing asynchronous applications.
The poll system call

The poll system call works almost like select but with a few differences:
- The file descriptor list is maintained in an array of struct pollfd items.
- We can use more events than the read, write and exception file descriptor sets of select. You can check the man page for poll to get the complete list.
- As poll doesn't modify the list of file descriptors, we do not have to re-initialise the list in each loop cycle.
- select can only support a limited number of file descriptors, defined by the FD_SETSIZE constant, which for Linux is usually set to 1024 (file descriptors are actually encoded as bits inside longs). poll has no such hard limit.
- The select timeout has a precision of microseconds while the poll timeout is specified in milliseconds... However, the precision of the select timeout is arguable when not using a real-time kernel, as the default system latency will be higher than microseconds.
Having said all this, let's rewrite our example server using poll. Here I have removed all the channels and buffers code so we can better see how to use the bare system call. It is an interesting exercise to rework the example using the objects introduced in previous instalments.
As usual, I will split the code into blocks so it is easier to follow. You will find the complete source code after the explanation.
int main () {
int s;
struct pollfd *fdlist = NULL ; // So first realloc is a malloc
int nfd, n = 0;
/* Create and initialise a TCP server socket*/
if ((s = socket (AF_INET, SOCK_STREAM, 0)) < 0) FAIL ("socket:");
/* Server socket initialisation code omitted */
(...)
// Add server socket to poll list
add_fd (s, &fdlist, &n);
The main function just creates a server socket. I have omitted the code to bind and listen on the socket to keep the explanation simpler; we have already talked about all that in previous instalments.
Then we call a function that we have named add_fd. This is the function that allows us to add a file descriptor to the list that poll will check.
The add_fd Function
The add_fd function, as already said, adds a file descriptor to the poll list of file descriptors. This list is stored in an array of elements of type struct pollfd, which is defined as follows:
struct pollfd {
int fd; /* file descriptor */
short events; /* requested events */
short revents; /* returned events */
};
Where:
- fd is the file descriptor to monitor. This field can be -1, and in that case the entry will be ignored by poll. We will be using this feature.
- events is a bit mask indicating the events that we want to check on the file descriptor fd.
- revents is a bit mask indicating which events have been detected. This field is updated by poll and is the one we have to check in order to know if some event was fired on the associated file descriptor.
With all this information we can check the code of add_fd:
int add_fd (int fd, struct pollfd **fdlist, int *len) {
  // First check if there is an unused poll structure
  int i, n = *len;
  struct pollfd *_fdlist = *fdlist;
  for (i = 0; i < n; i++)
    if (_fdlist[i].fd == -1) break;
  if (i == n) // If no hole was found... reallocate memory
    {
      n++;
      if (!(_fdlist = realloc (*fdlist, sizeof(struct pollfd) * n)))
        FAIL ("realloc:");
    }
  // i contains the right value in any case
  _fdlist[i].events = POLLIN;
  _fdlist[i].revents = 0; // Fresh entries must not carry stale events
  _fdlist[i].fd = fd;
  *fdlist = _fdlist;
  *len = n;
  printf ("%d descriptor added. %d descriptors monitored\n", fd, n);
  return 0;
}
The function works as follows. It first looks for an entry with its fd field set to -1. That means that the file descriptor was used but the associated connection was closed, so the entry is available for reuse. In case there is no entry available for reuse, add_fd will resize the buffer to add one more entry for the new file descriptor.
For the selected entry, the function sets the fd field to the file descriptor we want to add and the events field to POLLIN to monitor incoming data. This is basically the equivalent of the read file descriptor set of select, and for our echo server it is all we need. You can check the man page of poll for a complete list of possible events.
Note that, because of the reallocation of the file descriptor buffer, we have to pass the array by reference, as realloc may change the start address of the memory block it reallocates.
Note: realloc usually doesn't change the pointer, so it is very likely that the code would just work without passing the pointer by reference... but then it may fail in the future, for instance under heavy memory load on the machine, and you won't know what is going on. Always read the man pages.
The poll main loop
Time to look at the main loop:
while (1)
{
if ((nfd = poll (fdlist, n, 100)) < 0) perror ("poll:");
else
{
if (fdlist[0].revents & POLLIN) // Accept connection
{
int s1;
if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0)
FAIL ("accept:");
printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
add_fd (s1, &fdlist, &n);
}
}
The poll syscall accepts 3 parameters: the array with the file descriptors and events to monitor, the number of file descriptors to monitor, and a timeout in milliseconds.
The second parameter indicates the number of items in the array to monitor. In this case I'm passing the size of the array... which doesn't need to be the same as the number of active file descriptors. For this example it gives me simpler code so we can focus on how poll works, but note that we may be asking poll to go through many entries in fdlist that don't actually need to be checked.
After a successful poll, the next thing we do is check whether we got a return event on the file descriptor at index 0... that is, the server socket, the very first one we added to the array. If so, we accept the connection and add the new file descriptor to the list, the same way we did with select.
After that we just need to check what happened on the rest of the file descriptors:
for (i = 1; i < n; i++)
  {
    if (fdlist[i].fd == -1 || fdlist[i].revents == 0) continue; // Nothing happened here
    if ((fdlist[i].revents & POLLIN))
      {
        // Do the echo thing
        continue;
      }
    printf ("Connection closed\n");
    close (fdlist[i].fd);
    fdlist[i].fd = -1; // Ignore this entry for next poll
  }
If we receive data on any of the other file descriptors we do our echo thingy. When an entry reports some other event instead, we close the connection and set the fd field to -1 so poll will ignore that entry in the next loop. poll always reports the POLLHUP event, even when it was not requested in events. This event is generated when the connection is closed... and that is why the code above works.
Also note that I'm scanning the whole array. You should actually use the number returned by poll, so that when all the affected file descriptors are in the lower entries the loop can finish early.
This is the complete code of the example on how to use poll.
#include <stdio.h>
#include <string.h>
#include <stdlib.h> // exit
#include <poll.h> // poll
#include <unistd.h> // read/write
#include <sys/types.h> // Socket
#include <sys/socket.h>
#include <netinet/in.h> // inet_addr
#include <arpa/inet.h> // hton
#define FAIL(s) do {perror (s); exit (EXIT_FAILURE);} while (0)
#define ERROR(s) do {fprintf (stderr, s); exit (EXIT_FAILURE);} while (0)
#define BUF_SIZE 1024
/* Adds a new file descriptor to the array reallocating it if needed */
int add_fd (int fd, struct pollfd **fdlist, int *len) {
  // First check if there is an unused poll structure
  int i, n = *len;
  struct pollfd *_fdlist = *fdlist;
  for (i = 0; i < n; i++)
    if (_fdlist[i].fd == -1) break;
  if (i == n) // If no hole was found... reallocate memory
    {
      n++;
      if (!(_fdlist = realloc (*fdlist, sizeof(struct pollfd) * n)))
        FAIL ("realloc:");
    }
  // i contains the right value in any case
  _fdlist[i].events = POLLIN;
  _fdlist[i].revents = 0; // Fresh entries must not carry stale events
  _fdlist[i].fd = fd;
  *fdlist = _fdlist;
  *len = n;
  printf ("%d descriptor added. %d descriptors monitored\n", fd, n);
  return 0;
}
int main () {
int s;
unsigned char buf[BUF_SIZE];
int i, ops=1;
char *msg;
struct pollfd *fdlist = NULL ; // So first realloc is a malloc
int nfd,n = 0;
struct sockaddr_in client;
socklen_t sa_len = sizeof (struct sockaddr_in);
struct sockaddr_in addr;
/* Create and initialise a TCP server socket*/
if ((s = socket (AF_INET, SOCK_STREAM, 0)) < 0) FAIL ("socket:");
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons (1234); // Default port
setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &ops, sizeof(ops));
if ((bind (s, (struct sockaddr *) &addr, sa_len)) < 0) FAIL ("bind:");
if ((listen (s, 1)) < 0) FAIL("listen:");
// Add server socket to poll list
add_fd (s, &fdlist, &n);
while (1)
{
if ((nfd = poll (fdlist, n, 100)) < 0)
perror ("poll:");
else
{
if (fdlist[0].revents & POLLIN) // Accept connection
{
int s1;
if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0)
FAIL ("accept:");
printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
add_fd (s1, &fdlist, &n);
}
for (i = 1; i < n; i++)
  {
    if ((fdlist[i].revents & POLLIN))
      {
        int con = fdlist[i].fd;
        memset (buf, 0, BUF_SIZE);
        int len = read (con, buf, BUF_SIZE - 1); // Leave room for the null terminator
        if (len > 0)
          {
            len += 10; // Room for the "ECHO : " prefix and the terminator
            if (!(msg = malloc (len))) FAIL ("malloc:");
            memset (msg, 0, len);
            snprintf (msg, len, "ECHO : %s", buf);
            printf ("RECV (%d) : %s", len, buf);
            write (con, msg, strlen (msg));
            free (msg);
            msg = NULL;
            continue;
          }
        printf ("Connection closed\n");
        close (fdlist[i].fd);
        fdlist[i].fd = -1; // Ignore this entry for next poll
      }
}
}
}
close (s);
return 0;
}
The Event Poll epoll
The poll system call solves some of the issues of select and is a better option unless you really need to support unusual or old systems. However, poll still requires that we scan the whole array of file descriptors in order to find out which ones got events.
This may be a problem if our network server is expected to deal with thousands (or more) of simultaneous connections. In that case, even when only the last entry in the file descriptor array has changed, we will have to go through hundreds of entries checking the revents field until we find the one that we need to process.
To avoid this, GNU/Linux introduced the Event Poll, or epoll. Note that this is GNU/Linux specific and is not portable. In case you need to write a portable application that has to deal with a huge number of simultaneous clients, your best option is to use something like libevent. This library provides a standard API independent of the underlying OS interface. In simple words, Solaris, FreeBSD and even Windows have their own APIs to solve this problem, and all of them are different; libevent is a wrapper around those APIs.
So, how does epoll solve this problem? Well, it decouples the events from the array of file descriptors. What this actually means is that when you call the right system call (epoll_wait), you get back an array with only the file descriptors that need attention, so you do not need to scan the whole list checking all of them.
Creating the epoll handler and adding the server socket
The first step to use the epoll interface is to create an epoll handler. This can be done with the epoll_create1 system call. Then, as we did with the poll version, we add the server socket so we are ready to accept connections; this is done with the epoll_ctl system call, using EPOLL_CTL_ADD as the second parameter.
int epollfd;
struct epoll_event ev, events[MAX_EVENTS];
(...)
// Create server socket on variable s
if ((epollfd = epoll_create1 (0)) < 0) FAIL ("epoll_create1:");
// Add server channel to poll list
memset (&ev, 0, sizeof(struct epoll_event));
ev.events = EPOLLIN; // epoll_wait always waits for EPOLLHUP no need to add here
ev.data.fd = s;
if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s, &ev)) FAIL ("epoll_ctl(SERVER):");
The epoll_create1 parameter is a flag that can only be 0 or EPOLL_CLOEXEC, which sets the close-on-exec flag on the epoll file descriptor. This works the same as the O_CLOEXEC flag of open.
The epoll_ctl syscall allows us to add, delete and modify the list of file descriptors to monitor. In this case we are using the EPOLL_CTL_ADD constant to indicate that we want to add a new file descriptor to the list. The struct epoll_event passed as the fourth parameter indicates the events we want to monitor for this file descriptor. You can refer to the epoll_ctl man page for a list of all possible events.
The data field is a kind of user data holder in the structure that allows us to store information that we will likely need when the file descriptor gets an event. The structures are defined like this:
typedef union epoll_data {
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
struct epoll_event {
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
The struct epoll_event contains the events themselves (note that it is a 32-bit value now) and the epoll_data_t field where, in addition to the file descriptor (which has its own field), we can store a pointer, a 32-bit value or a 64-bit value (it is a union, so only one of them at a time). Whatever value we set there will be available in the struct epoll_event returned when an event happens.
What is this useful for? Well, think about our original select implementation where we were using those channel objects. The Java implementation has a method to get the associated channel directly out of the selector, but in C we had to maintain our own list of channels. Now you can store the channel pointer in the ptr field of the data union, and whenever an event happens on that file descriptor you get, together with it, the pointer to your channel object. In this example, as I have removed all the channel and buffer functions, I'm not using these fields, but if you plan to rewrite the program using the channel objects this is one possible way to go.
The epoll main loop
The main loop is almost the same as the one we wrote with poll, but now we only get the list of events that we have to process, so we do not need to check all the file descriptors we are monitoring. This is why epoll is really useful when you need to deal with a huge number of clients.
Let's see what the main loop looks like:
while (1)
{
if ((nfd = epoll_wait (epollfd, events, MAX_EVENTS, 100)) < 0) perror ("epoll:");
else
{
for (i = 0; i < nfd; i++)
{
// Process events.
}
events is an array of struct epoll_event that will be filled in by epoll_wait with all the events detected. In each call to epoll_wait we can receive as many as MAX_EVENTS events simultaneously; the actual number of events is the return value of the function. As with poll, the last parameter is a timeout in milliseconds.
Processing events
So, inside the for loop that goes through the array of events we will find this code:
for (i = 0; i < nfd; i++)
  {
    if (events[i].data.fd == s)
      {
        int s1;
        if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0) FAIL ("accept:");
        printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
        ev.events = EPOLLIN; // epoll_wait always reports EPOLLHUP, no need to add it here
        ev.data.fd = s1;
        if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s1, &ev)) FAIL ("epoll_ctl(ADD):");
      }
    else
      {
        int con = events[i].data.fd;
        // The echo thingy goes here; when data was echoed we
        // 'continue' with the next event. If it wasn't, the
        // connection was closed:
        printf ("Connection closed\n");
        if ((epoll_ctl (epollfd, EPOLL_CTL_DEL, con, NULL))) FAIL ("epoll_ctl(DEL):");
        close (con);
      }
  } // end for
Now, as we get just the list of events, we cannot assume that index 0 contains our first file descriptor (the server socket), so in the loop we need to check whether the file descriptor is the server socket in order to accept the connection. We could have used one of the values in epoll_data to store a flag indicating whether the file descriptor is a server or a client socket... but we would be doing an integer comparison anyway, so it won't make a big difference.
In this example, if the file descriptor is not the server socket then it is a client socket and we just do the echo thing.
The last part of the code is only executed when the connection is closed. As with poll, epoll_wait will always check for the EPOLLHUP event, and it is not necessary to set it when adding the file descriptor. In our example, if the event is not EPOLLIN then it must be EPOLLHUP, so we close the connection and remove the file descriptor from the list using epoll_ctl and the constant EPOLL_CTL_DEL. Note that the last parameter (the struct epoll_event pointer) can be NULL when deleting, as the events are no longer relevant.
This is the complete code of the epoll echo server:
#include <stdio.h>
#include <string.h>
#include <stdlib.h> // exit
#include <stdarg.h>
#include <sys/epoll.h>
#include <unistd.h> // read/write
#include <time.h>
#include <sys/types.h> // Socket
#include <sys/socket.h>
#include <netinet/in.h> // inet_addr
#include <arpa/inet.h> // hton
#define FAIL(s) do {perror (s); exit (EXIT_FAILURE);} while (0)
#define ERROR(s) do {fprintf (stderr, s); exit (EXIT_FAILURE);} while (0)
#define BUF_SIZE 1024
#define MAX_EVENTS 64
int main () {
int s;
unsigned char buf[BUF_SIZE];
int i, ops=1;
char *msg;
int nfd,n = 0;
struct sockaddr_in client;
socklen_t sa_len = sizeof (struct sockaddr_in);
struct sockaddr_in addr;
int epollfd;
struct epoll_event ev, events[MAX_EVENTS];
if ((s = socket (AF_INET, SOCK_STREAM, 0)) < 0) FAIL ("socket:");
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons (1234); // Default port
setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &ops, sizeof(ops));
if ((bind (s, (struct sockaddr *) &addr, sa_len)) < 0) FAIL ("bind:");
if ((listen (s, 1)) < 0) FAIL("listen:");
if ((epollfd = epoll_create1 (0)) < 0) FAIL ("epoll_create1:");
// Add server channel to poll list
memset (&ev, 0, sizeof(struct epoll_event));
ev.events = EPOLLIN; // epoll_wait always waits for EPOLLHUP no need to add here
ev.data.fd = s;
if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s, &ev)) FAIL ("epoll_ctl(SERVER):");
while (1)
{
if ((nfd = epoll_wait (epollfd, events, MAX_EVENTS, 100)) < 0) perror ("epoll:");
else
{
for (i = 0; i < nfd; i++)
{
if (events[i].data.fd == s)
{
int s1;
if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0) FAIL ("accept:");
printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
ev.events = EPOLLIN; // epoll_wait always waits for EPOLLHUP no need to add here
ev.data.fd = s1;
if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s1, &ev)) FAIL ("epoll_ctl(ADD):");
}
else
{
int con = events[i].data.fd;
memset (buf, 0, BUF_SIZE);
int len = read (con, buf, BUF_SIZE - 1); // Leave room for the null terminator
if (len > 0)
  {
    len += 10; // Room for the "ECHO : " prefix and the terminator
    if (!(msg = malloc (len))) FAIL ("malloc:");
    memset (msg, 0, len);
    snprintf (msg, len, "ECHO : %s", buf);
    printf ("RECV (%d) : %s", len, buf);
    write (con, msg, strlen (msg));
    free (msg);
    msg = NULL;
    continue;
  }
printf ("Connection closed\n");
if ((epoll_ctl (epollfd, EPOLL_CTL_DEL, con, NULL))) FAIL ("epoll_ctl(DEL):");
close (con);
}
}
}
}
close (s);
return 0;
}
Conclusions
In this instalment we have introduced the poll and epoll syscalls/interfaces to implement asynchronous network applications, as alternatives to select. As a quick summary: use select for maximum portability when you expect to work with a small number of clients. If your program doesn't target old UNIX systems, just use poll. If your program needs to deal with a huge number of clients (above 1024), use the event interface provided by your OS, which for GNU/Linux is epoll.
■