Java vs C Network Programming. Revisiting Selectors (poll & epoll)
 
2021-06-29 | by David "DeMO" Martínez Oliveira

I have titled this article Java vs C, but it won't contain much Java. However, as we will see in a second, Java does not really use select to implement selectors, so I feel I have to add this extra post for completeness.

Indeed, Java doesn't really use select to implement its selectors... Let's quickly see what it uses by running strace on the code from the post Java vs C Network Programming. Select and Selectors.

$ strace -f java TCPServerNIOSelector
(...)

[pid 10567] epoll_wait(7,  <unfinished ...>
[pid 10582] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 10582] futex(0x7f84e0025d18, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10582] futex(0x7f84e0025d68, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49997828}) = -1 ETIMEDOUT (Connection timed out)
[pid 10582] futex(0x7f84e0025d18, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10582] futex(0x7f84e0025d68, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49997319} <unfinished ...>
[pid 10567] <... epoll_wait resumed> [], 1024, 100) = 0
CTRL+C

We need the -f flag to ask strace to also trace child processes. Run it without the flag and you will see that the output ends after a clone system call. Anyway, the relevant part is that the Java application is using epoll_wait instead of select.

But before introducing epoll we will first look at poll; then we will be able to discuss the differences between all three ways of writing asynchronous applications.

The poll system call

The poll system call works almost like select but with a few differences:

  • The file descriptor list is kept in an array of struct pollfd items.
  • We can use more events than the read, write and exception file descriptor sets of select. You can check the man page for poll to get the complete list.
  • As poll reports events in a separate revents field instead of modifying the set we pass in, we do not have to re-initialise the list on each loop cycle.
  • select can only support a limited number of file descriptors, defined by the FD_SETSIZE constant, which on Linux is usually set to 1024 (file descriptors are encoded as bits inside longs). poll has no such hard limit.
  • The select timeout has a precision of microseconds while the poll timeout is specified in milliseconds... However, the precision of the select timeout is arguable when not using a real-time kernel, as the default system latency will be higher than microseconds.

With all this said, let's rewrite our example server using poll. Here I have removed all the channel and buffer code so we can better see how to use the bare system call. It is an interesting exercise to rework the example using the objects introduced in previous instalments.

As usual, I will split the code into blocks so it is easier to follow. You will find the complete source code after the explanation.

int main () {
  int                  s; 
  struct pollfd        *fdlist = NULL ; // So first realloc is a malloc
  int                  nfd, n = 0;
  
  /* Create and initialise a TCP server socket*/
  if ((s = socket (AF_INET, SOCK_STREAM, 0)) < 0) FAIL ("socket:");
  /* Server socket initialisation code omitted */
  (...)

  // Add server socket to poll list
  add_fd (s, &fdlist, &n);

The main function just creates a server socket. I have omitted the code to bind and listen on the socket to keep the explanation simpler. We have already talked about all that in previous instalments.

Then we call a function that we have named add_fd. This is the function that allows us to add a file descriptor to the list that poll will check.

add_fd Function

The add_fd function, as already said, adds a file descriptor to the poll list of file descriptors. This list is stored in an array of elements of type struct pollfd which is defined as follows:

           struct pollfd {
               int   fd;         /* file descriptor */
               short events;     /* requested events */
               short revents;    /* returned events */
           };

Where

  • fd is the file descriptor to monitor. This field can be -1, in which case the entry is ignored by poll. We will be using this feature.
  • events is a bit mask indicating the events that we want to check on the file descriptor fd.
  • revents is a bit mask indicating which events have been detected. This field is updated by poll and is the one we have to check to know whether some event was fired on the associated file descriptor.

With all this information we can check the code of add_fd:

int add_fd (int fd, struct pollfd **fdlist, int *len) {
  // First check if there is an unused poll structure
  int i, n = *len;
  struct pollfd *_fdlist = *fdlist;
  
  for (i = 0; i < n; i++)
    if (_fdlist[i].fd == -1) break;

  if (i == n) // If no hole was found... reallocate memory
    {
      n++;
      _fdlist = realloc (*fdlist, sizeof(struct pollfd) * n);
      if (_fdlist == NULL) return -1; // Original block is still valid on failure
    }
  // i contains the right value in any case
  _fdlist[i].events = POLLIN;
  _fdlist[i].fd = fd;
  *fdlist = _fdlist;
  *len = n;
  printf ("%d descriptor added. %d descriptors monitored\n", fd, n);
  return 0;
}

The function works as follows. It first looks for an entry with the fd field set to -1. That means the file descriptor was used but the associated connection was closed, so the entry is available for reuse. In case there is no entry available for reuse, add_fd resizes the buffer to make room for the new file descriptor.

For the selected entry, the function sets the fd field to the file descriptor we want to add, and events to POLLIN to monitor data coming IN. This is basically the equivalent of the read file descriptor set of select. For our echo server this is all we need. You can check the man page of poll for a complete list of possible events.

Note that, because of the reallocation of the file descriptor buffer, we have to pass the array by reference, as realloc may change the start address of the memory block it reallocates.

Note: realloc usually doesn't change the pointer, so it is very likely that the code would just work without passing the array by reference... but then it may fail in the future, for instance under heavy memory load on the machine, and you won't know what is going on. Always read the man pages.

The poll main loop

Time to look at the main loop:

  while (1)
    {
      if ((nfd = poll (fdlist, n, 100)) < 0) perror ("poll:");
      else
        {
          if (fdlist[0].revents & POLLIN) // Accept connection
            {
              int s1;
              if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0)
                FAIL ("accept:");
              printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
              add_fd (s1, &fdlist, &n);
            }
        }

The poll syscall accepts 3 parameters: the array with the file descriptors and events to monitor, the number of entries in that array, and a timeout in milliseconds.

The second parameter indicates the number of items in the array to monitor. In this case I'm passing the size of the array... which doesn't need to be the same as the number of active file descriptors. For this example that gives me simpler code to focus on how poll works, but note that we may be asking poll to go through many entries in fdlist that don't actually need to be checked.

After a successful poll, the next thing we do is check whether we got a return event on the file descriptor at index 0... that is, the server socket, the very first one we added to the array. If so, we accept the connection and add the new file descriptor to the list, the same way we did with select.

After that we just need to check what happened on the rest of the file descriptors:

      for (i = 1; i < n; i++)
        {
          // Skip unused entries and entries without events
          if (fdlist[i].fd == -1 || fdlist[i].revents == 0) continue;

          if ((fdlist[i].revents & POLLIN))
            {
              // Do the echo thing
              continue;
            }

          printf ("Connection closed\n");
          close (fdlist[i].fd);
          fdlist[i].fd = -1; // Ignore this entry for next poll
        }

In case we receive data on any of the other file descriptors, we do our echo thingy. Otherwise we close the connection and set the fd field to -1 so poll will not consider that entry in the next loop. Note that poll always monitors the POLLHUP event, even when it is not requested. This event is generated when the connection is closed... that is why the code above works.

Also note that I'm scanning the whole array. You should actually use the count returned by poll, so that when all the affected file descriptors are in the lower entries the loop can finish early.

This is the complete code of the poll example.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>    // exit

#include <poll.h>      // poll
#include <unistd.h>    // read/write
#include <sys/types.h> // Socket
#include <sys/socket.h>
#include <netinet/in.h> // inet_addr
#include <arpa/inet.h> // hton

#define FAIL(s) do {perror (s); exit (EXIT_FAILURE);} while (0)
#define ERROR(s) do {fprintf (stderr, s); exit (EXIT_FAILURE);} while (0)

#define BUF_SIZE 1024

/* Adds a new file descriptor to the array reallocating it if needed */
int add_fd (int fd, struct pollfd **fdlist, int *len) {
  // First check if there is an unused poll structure
  int i, n = *len;
  struct pollfd *_fdlist = *fdlist;
  
  for (i = 0; i < n; i++)
    if (_fdlist[i].fd == -1) break;

  if (i == n) // If no hole was found... reallocate memory
    {
      n++;
      _fdlist = realloc (*fdlist, sizeof(struct pollfd) * n);
      if (_fdlist == NULL) return -1; // Original block is still valid on failure
    }
  // i contains the right value in any case
  _fdlist[i].events = POLLIN;
  _fdlist[i].fd = fd;
  *fdlist = _fdlist;
  *len = n;
  printf ("%d descriptor added. %d descriptors monitored\n", fd, n);
  return 0;
}

int main () {
  int                  s; 
  unsigned char        buf[BUF_SIZE];
  int                  i, ops=1;
  char                 *msg;
  struct pollfd        *fdlist = NULL ; // So first realloc is a malloc
  int                  nfd,n = 0;
  struct sockaddr_in   client;
  socklen_t            sa_len = sizeof (struct sockaddr_in);    
  struct sockaddr_in   addr;
  
  /* Create and initialise a TCP server socket*/
  if ((s = socket (AF_INET, SOCK_STREAM, 0)) < 0) FAIL ("socket:");
  
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons (1234); // Default port
  setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &ops, sizeof(ops));
  if ((bind (s, (struct sockaddr *) &addr, sa_len)) < 0)  FAIL ("bind:");
  if ((listen (s, 1)) < 0) FAIL("listen:");

  // Add server socket to poll list
  add_fd (s, &fdlist, &n);
  while (1)
    { 
      if ((nfd = poll (fdlist, n, 100)) < 0)
        perror ("poll:");
      else
        {
          if (fdlist[0].revents & POLLIN) // Accept connection
            {
              int s1;
              if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0)
                FAIL ("accept:");
              printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
              add_fd (s1, &fdlist, &n);
            }
          for (i = 1; i < n; i++)
            {
              if ((fdlist[i].revents & POLLIN))
                {
                  int con = fdlist[i].fd;
                  memset (buf, 0, BUF_SIZE);
                  int len = read (con, buf, BUF_SIZE);
                  if (len > 0)
                    {
                      len += 10; // Room for the "ECHO : " prefix
                      msg = malloc (len);
                      memset (msg, 0, len);
                      snprintf (msg, len, "ECHO : %s", buf);
                      printf ("RECV (%d) : %s", len, buf);
                      write (con, msg, len);
                      free (msg);
                      msg = NULL;
                      continue;
                    }

                  printf ("Connection closed\n");
                  close (fdlist[i].fd);
                  fdlist[i].fd = -1; // Ignore this entry for next poll
                }
            }
        }
    }
  close (s);
  return 0;
}

The Event Poll epoll

The poll system call solves some of the issues of select and is a better option unless we really need to support unusual/old systems. However, poll still requires that we scan the whole array of file descriptors to find out which ones got events.

This may be a problem if our network server is expected to deal with thousands (or more) of simultaneous connections. In that case, even when only the last entry in the file descriptor array has changed, we have to go through all the preceding entries checking the revents field until we find the one we need to process.

To avoid this, GNU/Linux introduced the Event Poll, or epoll. Note that this is GNU/Linux specific and not portable. If you need to write a portable application that deals with a huge number of simultaneous clients, your best option is to use something like libevent. This library provides a standard API independent of the underlying OS interface. In simple words, Solaris, FreeBSD and even Windows have their own APIs to solve this problem, and all of them are different. libevent is a wrapper around those APIs.

So, how does epoll solve this problem? Well, it decouples the events from the array of file descriptors. What this actually means is that when you call the right system call (epoll_wait), you get back an array with just the file descriptors that need attention, so you do not need to scan the whole list checking all of them.

Creating the epoll handler and adding the server socket

The first step to use the epoll interface is to create an epoll handler. This is done with the epoll_create1 system call. Then, as we did with the poll version, we add the server socket so we are ready to accept connections; this is done with the epoll_ctl system call, using EPOLL_CTL_ADD as its second parameter.

  int                  epollfd;
  struct epoll_event   ev, events[MAX_EVENTS];
  (...)
  // Create server socket on variable s
  
  if ((epollfd = epoll_create1 (0)) < 0) FAIL ("epoll_create1:");
  
  // Add server channel to poll list
  memset (&ev, 0, sizeof(struct epoll_event));
  ev.events = EPOLLIN;  // epoll_wait always waits for EPOLLHUP no need to add here
  ev.data.fd = s;
  if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s, &ev)) FAIL ("epoll_ctl(SERVER):");

The epoll_create1 parameter is a flag that can only be 0 or EPOLL_CLOEXEC, which sets the close-on-exec flag on the epoll file descriptor. This works the same as the O_CLOEXEC flag of open.

The epoll_ctl syscall allows us to add, delete and modify entries in the list of file descriptors to monitor. In this case we are using the EPOLL_CTL_ADD constant to indicate that we want to add a new file descriptor to the list. The struct epoll_event passed as fourth parameter indicates the events we want to monitor for this file descriptor. You can refer to the epoll_ctl man page for the list of all possible events.

The data field is a kind of user data slot in the structure that allows us to store information that we will likely need when the file descriptor gets an event. The structures are defined like this:

           typedef union epoll_data {
               void        *ptr;
               int          fd;
               uint32_t     u32;
               uint64_t     u64;
           } epoll_data_t;

           struct epoll_event {
               uint32_t     events;      /* Epoll events */
               epoll_data_t data;        /* User data variable */
           };

The struct epoll_event contains the events themselves (note that it is a 32-bit value now) and the epoll_data_t field where, in addition to the file descriptor (which has its own field), we can store a pointer, a 32-bit value or a 64-bit value. Since epoll_data is a union, only one of these fields can be used at a time. Whatever value we set there will be available in the struct epoll_event returned when an event happens.

What is this useful for?... Well, think about our original select implementation where we were using those channel objects. The Java implementation had a method to get the associated channel directly from the selector, but in C we had to maintain our own list of channels. Now you can store the channel pointer in the ptr field of the data union, and whenever an event happens on that file descriptor you get, together with the event, the pointer to your channel object. In this example, as I have removed all the channel and buffer functions, I'm not using these fields, but if you plan to rewrite the program using the channel objects, this is one possible way to go.

The epoll main loop

The main loop is almost the same as the one we wrote with poll, but now we only get the list of events that we have to process, so we do not need to check all the file descriptors we are monitoring. This is why epoll is really useful when you need to deal with a huge number of clients.

Let's see what the main loop looks like:

  while (1)
    {
      if ((nfd = epoll_wait (epollfd, events, MAX_EVENTS, 100)) < 0) perror ("epoll:");
      else
        {
          for (i = 0; i < nfd; i++)
            {
              // Process events.
            }

events is an array of struct epoll_event that will be filled by epoll_wait with all the events detected. In each call to epoll_wait we can receive up to MAX_EVENTS events simultaneously; the actual number of events is the return value of the function. As with poll, the last parameter is a timeout in milliseconds.

Processing events

So, inside the for loop that goes through the array of events we find this code:

for (i = 0; i < nfd; i++) {
    if (events[i].data.fd == s)
      {
        int s1;

        if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0)  FAIL ("accept:");

        printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
        ev.events = EPOLLIN;  // epoll_wait always waits for EPOLLHUP no need to add here
        ev.data.fd = s1;
        if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s1, &ev)) FAIL ("epoll_ctl(ADD):");
        continue;
      }

    int con = events[i].data.fd;
    // The echo thingy; on data received, continue

    printf ("Connection closed\n");
    if ((epoll_ctl (epollfd, EPOLL_CTL_DEL, con, NULL))) FAIL ("epoll_ctl(DEL):");
    close (con);
} // end for

Now, as we get the list of events, we cannot assume that index 0 contains our first file descriptor (the server socket), so in the loop we need to check whether the file descriptor is the server socket in order to accept the connection. We could have used one of the values in epoll_data to store a flag indicating whether the file descriptor is a server or a client socket... but we would be doing an integer comparison anyway, so it won't make a big difference.

In this example, if the file descriptor is not the server socket then it is a client socket, and we just do the echo thing.

The last part of the code is only executed when the connection is closed. As with poll, epoll_wait always checks for the EPOLLHUP event; it is not necessary to set it when adding the file descriptor. In our example, if the event is not EPOLLIN then it must be EPOLLHUP, so we close the connection and remove the file descriptor from the list using epoll_ctl with the constant EPOLL_CTL_DEL. Note that the last parameter (the struct epoll_event pointer) can be NULL when deleting, as the events are no longer relevant.

This is the complete code of the epoll echo server:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>    // exit
#include <stdarg.h>

#include <sys/epoll.h>
#include <unistd.h>    // read/write
#include <time.h>
#include <sys/types.h> // Socket
#include <sys/socket.h>
#include <netinet/in.h> // inet_addr
#include <arpa/inet.h> // hton

#define FAIL(s) do {perror (s); exit (EXIT_FAILURE);} while (0)
#define ERROR(s) do {fprintf (stderr, s); exit (EXIT_FAILURE);} while (0)

#define BUF_SIZE 1024

#define MAX_EVENTS 64

int main () {
  int                  s; 
  unsigned char        buf[BUF_SIZE];
  int                  i, ops=1;
  char                 *msg;  
  int                  nfd,n = 0;
  struct sockaddr_in   client;
  socklen_t            sa_len = sizeof (struct sockaddr_in);    
  struct sockaddr_in   addr;
  int                  epollfd;
  struct epoll_event   ev, events[MAX_EVENTS];
  
  if ((s = socket (AF_INET, SOCK_STREAM, 0)) < 0) FAIL ("socket:");
 
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons (1234); // Default port
  setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &ops, sizeof(ops));
  if ((bind (s, (struct sockaddr *) &addr, sa_len)) < 0)  FAIL ("bind:");
  if ((listen (s, 1)) < 0) FAIL("listen:");

  if ((epollfd = epoll_create1 (0)) < 0) FAIL ("epoll_create1:");
  
  // Add server channel to poll list
  memset (&ev, 0, sizeof(struct epoll_event));
  ev.events = EPOLLIN;  // epoll_wait always waits for EPOLLHUP no need to add here
  ev.data.fd = s;
  if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s, &ev)) FAIL ("epoll_ctl(SERVER):");
  
  while (1)
    {
      if ((nfd = epoll_wait (epollfd, events, MAX_EVENTS, 100)) < 0) perror ("epoll:");
      else
        {
          for (i = 0; i < nfd; i++)
            {
              if (events[i].data.fd == s)
                {
                  int s1;
                  if ((s1 = accept (s, (struct sockaddr*) &client, &sa_len)) < 0)  FAIL ("accept:");
                  printf ("+ Connection from : %s (fd:%d)\n", inet_ntoa (client.sin_addr), s1);
                  ev.events = EPOLLIN;  // epoll_wait always waits for EPOLLHUP no need to add here
                  ev.data.fd = s1;
                  if (epoll_ctl (epollfd, EPOLL_CTL_ADD, s1, &ev)) FAIL ("epoll_ctl(ADD):");
                }
              else
                {
                  int con = events[i].data.fd;
                  memset (buf, 0, BUF_SIZE);
                  int len = read (con, buf, BUF_SIZE);
                  if (len > 0)
                    {
                      len += 10; // Room for the "ECHO : " prefix
                      msg = malloc (len);
                      memset (msg, 0, len);
                      snprintf (msg, len, "ECHO : %s", buf);
                      printf ("RECV (%d) : %s", len, buf);
                      write (con, msg, len);
                      free (msg);
                      msg = NULL;
                      continue;
                    }

                  printf ("Connection closed\n");
                  if ((epoll_ctl (epollfd, EPOLL_CTL_DEL, con, NULL))) FAIL ("epoll_ctl(DEL):");
                  close (con);
                }
            }
        }
    }
  close (s);
  return 0;   
}

Conclusions

In this instalment we have introduced the poll and epoll syscalls/interfaces to implement asynchronous network applications as alternatives to select. As a quick summary: use select for maximum portability when you expect to work with a small number of clients. If your program doesn't target old UNIX systems, just use poll. If your program needs to deal with a huge number of clients (above 1024), use the event interface provided by your OS, which for GNU/Linux is epoll.


 