Perform I/O Operations in Parallel

The POSIX.1b standard defines a new set of I/O operations which can significantly reduce the time an application spends waiting for I/O. The new functions allow a program to initiate one or more I/O operations and then immediately resume normal work while the I/O operations are executed in parallel. This functionality is available if the unistd.h file defines the symbol _POSIX_ASYNCHRONOUS_IO.

These functions are part of the library with realtime functions named librt. They are not actually part of the libc binary. The implementation of these functions can be done using support in the kernel (if available) or using an implementation based on threads at userlevel. In the latter case it might be necessary to link applications with the thread library libpthread in addition to librt.

All AIO operations operate on files which were opened previously. There might be arbitrarily many operations running for one file. The asynchronous I/O operations are controlled using a data structure named struct aiocb (AIO control block). It is defined in aio.h as follows.

struct aiocb

The POSIX.1b standard mandates that the struct aiocb structure contains at least the members described in the following table. There might be more elements which are used by the implementation, but depending upon these elements is not portable and is highly deprecated.

int aio_fildes

This element specifies the file descriptor to be used for the operation. It must be a legal descriptor, otherwise the operation will fail.

The device on which the file is opened must allow the seek operation. I.e., it is not possible to use any of the AIO operations on devices like terminals where an lseek call would lead to an error.

off_t aio_offset

This element specifies the offset in the file at which the operation (input or output) is performed. Since the operations are carried out in arbitrary order and more than one operation for one file descriptor can be started, one cannot expect a current read/write position of the file descriptor.

volatile void *aio_buf

This is a pointer to the buffer with the data to be written or the place where the read data is stored.

size_t aio_nbytes

This element specifies the length of the buffer pointed to by aio_buf.

int aio_reqprio

If the platform has defined _POSIX_PRIORITIZED_IO and _POSIX_PRIORITY_SCHEDULING, the AIO requests are processed based on the current scheduling priority. The aio_reqprio element can then be used to lower the priority of the AIO operation.

struct sigevent aio_sigevent

This element specifies how the calling process is notified once the operation terminates. If the sigev_notify element is SIGEV_NONE, no notification is sent. If it is SIGEV_SIGNAL, the signal determined by sigev_signo is sent. Otherwise, sigev_notify must be SIGEV_THREAD. In this case, a thread is created which starts executing the function pointed to by sigev_notify_function.

int aio_lio_opcode

This element is only used by the lio_listio and lio_listio64 functions. Since these functions allow an arbitrary number of operations to start at once, and each operation can be input or output (or nothing), the information must be stored in the control block. The possible values are:

LIO_READ

Start a read operation. Read from the file at position aio_offset and store the next aio_nbytes bytes in the buffer pointed to by aio_buf.

LIO_WRITE

Start a write operation. Write aio_nbytes bytes starting at aio_buf into the file starting at position aio_offset.

LIO_NOP

Do nothing for this control block. This value is useful sometimes when an array of struct aiocb values contains holes, i.e., some of the values must not be handled although the whole array is presented to the lio_listio function.

When the sources are compiled using _FILE_OFFSET_BITS == 64 on a 32 bit machine, this type is in fact struct aiocb64, since the LFS interface transparently replaces the struct aiocb definition.
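As a minimal sketch of how the members described above fit together, the following helper fills in a control block for a single transfer. The function name prepare_aiocb and its parameters are illustrative assumptions, not part of the AIO interface; only the member names and constants come from aio.h.

```c
#include <aio.h>
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: prepare CB for a transfer of NBYTES bytes between
   BUF and descriptor FD at absolute file offset OFFSET.  No completion
   notification is requested.  */
void
prepare_aiocb (struct aiocb *cb, int fd, void *buf, size_t nbytes,
               off_t offset)
{
  assert (cb != NULL && buf != NULL);

  /* Zero everything first so that any implementation-private members
     start out in a defined state.  */
  memset (cb, 0, sizeof *cb);

  cb->aio_fildes = fd;            /* must refer to a seekable file */
  cb->aio_offset = offset;        /* absolute position, not the current one */
  cb->aio_buf = buf;
  cb->aio_nbytes = nbytes;
  cb->aio_reqprio = 0;            /* do not lower the priority */
  cb->aio_sigevent.sigev_notify = SIGEV_NONE;
  cb->aio_lio_opcode = LIO_NOP;   /* only consulted by lio_listio */
}
```

Zeroing the structure before setting the documented members is a defensive choice made for this sketch, since an implementation may carry additional private members.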

For use with the AIO functions defined in the LFS, there is a similar type defined which replaces the types of the appropriate members with larger types but otherwise is equivalent to struct aiocb. Particularly, all member names are the same.

struct aiocb64

int aio_fildes

This element specifies the file descriptor to be used for the operation. It must be a legal descriptor, otherwise the operation will fail.

The device on which the file is opened must allow the seek operation. I.e., it is not possible to use any of the AIO operations on devices like terminals where an lseek call would lead to an error.

off64_t aio_offset

This element specifies the offset in the file at which the operation (input or output) is performed. Since the operations are carried out in arbitrary order and more than one operation for one file descriptor can be started, one cannot expect a current read/write position of the file descriptor.

volatile void *aio_buf

This is a pointer to the buffer with the data to be written or the place where the read data is stored.

size_t aio_nbytes

This element specifies the length of the buffer pointed to by aio_buf.

int aio_reqprio

If the platform has defined _POSIX_PRIORITIZED_IO and _POSIX_PRIORITY_SCHEDULING, the AIO requests are processed based on the current scheduling priority. The aio_reqprio element can then be used to lower the priority of the AIO operation.

struct sigevent aio_sigevent

This element specifies how the calling process is notified once the operation terminates. If the sigev_notify element is SIGEV_NONE, no notification is sent. If it is SIGEV_SIGNAL, the signal determined by sigev_signo is sent. Otherwise, sigev_notify must be SIGEV_THREAD, in which case a thread is created which starts executing the function pointed to by sigev_notify_function.

int aio_lio_opcode

This element is only used by the lio_listio and lio_listio64 functions. Since these functions allow an arbitrary number of operations to start at once, and since each operation can be input or output (or nothing), the information must be stored in the control block. See the description of struct aiocb for a description of the possible values.

When the sources are compiled using _FILE_OFFSET_BITS == 64 on a 32 bit machine, this type is available under the name struct aiocb, since the LFS transparently replaces the old interface.
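The effect of the transparent replacement can be observed from the width of the aio_offset member. The following is a small sketch, assuming a system that honors _FILE_OFFSET_BITS; the helper name aio_offset_bits is made up for this illustration.

```c
/* Must be defined before the first system header is included.  */
#ifndef _FILE_OFFSET_BITS
# define _FILE_OFFSET_BITS 64
#endif

#include <aio.h>
#include <assert.h>
#include <limits.h>
#include <sys/types.h>

/* Hypothetical helper: report the width in bits of the aio_offset member
   of struct aiocb as seen by this translation unit.  */
int
aio_offset_bits (void)
{
  struct aiocb cb;

  /* The member always has the same type as off_t in this unit.  */
  assert (sizeof cb.aio_offset == sizeof (off_t));
  return (int) (sizeof cb.aio_offset * CHAR_BIT);
}
```

With the macro set to 64, the plain names struct aiocb, aio_read, and so on refer to the large file variants, so the rest of the program does not need to mention the 64 suffix.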

Asynchronous Read and Write Operations

int aio_read (struct aiocb *aiocbp)

This function initiates an asynchronous read operation. It returns immediately after the operation was enqueued or when an error was encountered.

The first aiocbp->aio_nbytes bytes of the file for which aiocbp->aio_fildes is a descriptor are written to the buffer starting at aiocbp->aio_buf. Reading starts at the absolute position aiocbp->aio_offset in the file.

If prioritized I/O is supported by the platform, the aiocbp->aio_reqprio value is used to adjust the priority before the request is actually enqueued.

The calling process is notified about the termination of the read request according to the aiocbp->aio_sigevent value.

When aio_read returns, the return value is zero if no error occurred that can be found before the request is enqueued. If such an early error is found, the function returns -1 and sets errno to one of the following values:

EAGAIN

The request was not enqueued due to (temporarily) exceeded resource limitations.

ENOSYS

The aio_read function is not implemented.

EBADF

The aiocbp->aio_fildes descriptor is not valid. This condition need not be recognized before enqueueing the request and so this error might also be signaled asynchronously.

EINVAL

The aiocbp->aio_offset or aiocbp->aio_reqprio value is invalid. This condition need not be recognized before enqueueing the request and so this error might also be signaled asynchronously.

If aio_read returns zero, the current status of the request can be queried using the aio_error and aio_return functions. As long as the value returned by aio_error is EINPROGRESS, the operation has not yet completed. If aio_error returns zero, the operation terminated successfully; otherwise the value is to be interpreted as an error code. Once the operation has terminated, its result can be obtained using a call to aio_return. The returned value is the same as an equivalent call to read would have returned. Possible error codes returned by aio_error are:

EBADF

The aiocbp->aio_fildes descriptor is not valid.

ECANCELED

The operation was canceled before it was finished (see the section called “Cancellation of AIO Operations”).

EINVAL

The aiocbp->aio_offset value is invalid.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is in fact aio_read64 since the LFS interface transparently replaces the normal implementation.
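The sequence described above (enqueue with aio_read, watch aio_error until it stops returning EINPROGRESS, then collect the result with aio_return) can be sketched as follows. The function name read_slice_async is an assumption made for this example, and the polling loop stands in for whatever useful work a real program would do in the meantime.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to NBYTES bytes from the file PATH,
   starting at OFFSET, into BUF.  Returns the byte count like read,
   or -1 with errno set.  */
ssize_t
read_slice_async (const char *path, void *buf, size_t nbytes, off_t offset)
{
  struct aiocb cb;
  ssize_t result;
  int err;
  int fd;

  assert (path != NULL && buf != NULL);
  fd = open (path, O_RDONLY);
  if (fd == -1)
    return -1;

  memset (&cb, 0, sizeof cb);
  cb.aio_fildes = fd;
  cb.aio_buf = buf;
  cb.aio_nbytes = nbytes;
  cb.aio_offset = offset;
  cb.aio_sigevent.sigev_notify = SIGEV_NONE;

  if (aio_read (&cb) == -1)
    {
      /* Early error: nothing was enqueued.  */
      err = errno;
      close (fd);
      errno = err;
      return -1;
    }

  /* A real program would do other work here instead of only polling.  */
  while ((err = aio_error (&cb)) == EINPROGRESS)
    sched_yield ();

  /* aio_return may be called exactly once per completed request.  */
  result = aio_return (&cb);
  close (fd);
  if (err != 0)
    errno = err;
  return result;
}
```

On older GNU systems a program using this sketch may need to be linked with librt.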

int aio_read64 (struct aiocb64 *aiocbp)

This function is similar to the aio_read function. The only difference is that on 32 bit machines, the file descriptor should be opened in the large file mode. Internally, aio_read64 uses functionality equivalent to lseek64 (see the section called “Setting the File Position of a Descriptor”) to position the file descriptor correctly for the reading, as opposed to the lseek functionality used in aio_read.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is available under the name aio_read and so transparently replaces the interface for small files on 32 bit machines.

To write data asynchronously to a file, there exists an equivalent pair of functions with a very similar interface.

int aio_write (struct aiocb *aiocbp)

This function initiates an asynchronous write operation. The function call returns immediately after the operation was enqueued or if an error was encountered before this happens.

The first aiocbp->aio_nbytes bytes from the buffer starting at aiocbp->aio_buf are written to the file for which aiocbp->aio_fildes is a descriptor, starting at the absolute position aiocbp->aio_offset in the file.

If prioritized I/O is supported by the platform, the aiocbp->aio_reqprio value is used to adjust the priority before the request is actually enqueued.

The calling process is notified about the termination of the write request according to the aiocbp->aio_sigevent value.

When aio_write returns, the return value is zero if no error occurred that can be found before the request is enqueued. If such an early error is found, the function returns -1 and sets errno to one of the following values:

EAGAIN

The request was not enqueued due to (temporarily) exceeded resource limitations.

ENOSYS

The aio_write function is not implemented.

EBADF

The aiocbp->aio_fildes descriptor is not valid. This condition need not be recognized before enqueueing the request, and so this error might also be signaled asynchronously.

EINVAL

The aiocbp->aio_offset or aiocbp->aio_reqprio value is invalid. This condition need not be recognized before enqueueing the request, and so this error might also be signaled asynchronously.

If aio_write returns zero, the current status of the request can be queried using the aio_error and aio_return functions. As long as the value returned by aio_error is EINPROGRESS, the operation has not yet completed. If aio_error returns zero, the operation terminated successfully; otherwise the value is to be interpreted as an error code. Once the operation has terminated, its result can be obtained using a call to aio_return. The returned value is the same as an equivalent call to write would have returned. Possible error codes returned by aio_error are:

EBADF

The aiocbp->aio_fildes descriptor is not valid.

ECANCELED

The operation was canceled before it was finished (see the section called “Cancellation of AIO Operations”).

EINVAL

The aiocbp->aio_offset value is invalid.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is in fact aio_write64 since the LFS interface transparently replaces the normal implementation.
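The write side follows the same pattern. The sketch below, assuming a hypothetical helper named write_at_async, places a block of data at an absolute offset. Note that the file is deliberately not opened with O_APPEND, so that the aio_offset value decides where the data lands.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write NBYTES bytes from DATA to the file PATH at
   absolute position OFFSET, creating the file if necessary.  Returns the
   byte count like write, or -1 with errno set.  */
ssize_t
write_at_async (const char *path, const void *data, size_t nbytes,
                off_t offset)
{
  struct aiocb cb;
  ssize_t result;
  int err;
  int fd;

  assert (path != NULL && data != NULL);
  fd = open (path, O_WRONLY | O_CREAT, 0600);
  if (fd == -1)
    return -1;

  memset (&cb, 0, sizeof cb);
  cb.aio_fildes = fd;
  cb.aio_buf = (void *) data;   /* aio_buf is not const-qualified */
  cb.aio_nbytes = nbytes;
  cb.aio_offset = offset;
  cb.aio_sigevent.sigev_notify = SIGEV_NONE;

  if (aio_write (&cb) == -1)
    {
      err = errno;
      close (fd);
      errno = err;
      return -1;
    }

  /* The buffer must stay valid and unchanged until the request is done.  */
  while ((err = aio_error (&cb)) == EINPROGRESS)
    sched_yield ();

  result = aio_return (&cb);
  close (fd);
  if (err != 0)
    errno = err;
  return result;
}
```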

int aio_write64 (struct aiocb64 *aiocbp)

This function is similar to the aio_write function. The only difference is that on 32 bit machines, the file descriptor should be opened in the large file mode. Internally, aio_write64 uses functionality equivalent to lseek64 (see the section called “Setting the File Position of a Descriptor”) to position the file descriptor correctly for the writing, as opposed to the lseek functionality used in aio_write.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is available under the name aio_write and so transparently replaces the interface for small files on 32 bit machines.

Besides these functions with the more or less traditional interface, POSIX.1b also defines a function which can initiate more than one operation at a time, and which can handle freely mixed read and write operations. It is therefore similar to a combination of readv and writev.

int lio_listio (int mode, struct aiocb *const list[], int nent, struct sigevent *sig)

The lio_listio function can be used to enqueue an arbitrary number of read and write requests at one time. The requests can all be meant for the same file, all for different files, or any combination in between.

lio_listio gets the nent requests from the array pointed to by list. The operation to be performed is determined by the aio_lio_opcode member in each element of list. If this field is LIO_READ, a read operation is enqueued, similar to a call of aio_read for this element of the array (except that the way the termination is signalled is different, as we will see below). If the aio_lio_opcode member is LIO_WRITE, a write operation is enqueued. Otherwise the aio_lio_opcode must be LIO_NOP, in which case this element of list is simply ignored. This "operation" is useful in situations where one has a fixed array of struct aiocb elements from which only a few need to be handled at a time. Another situation is where the lio_listio call was canceled before all requests were processed (see the section called “Cancellation of AIO Operations”) and the remaining requests have to be reissued.

The other members of each element of the array pointed to by list must have values suitable for the operation as described in the documentation for aio_read and aio_write above.

The mode argument determines how lio_listio behaves after having enqueued all the requests. If mode is LIO_WAIT, it waits until all requests have terminated. Otherwise mode must be LIO_NOWAIT, and in this case the function returns immediately after having enqueued all the requests. The caller then gets a notification of the termination of all requests according to the sig parameter. If sig is NULL, no notification is sent. Otherwise a signal is sent or a thread is started, just as described in the description for aio_read or aio_write.

If mode is LIO_WAIT, the return value of lio_listio is 0 when all requests completed successfully. Otherwise the function returns -1 and errno is set accordingly. To find out which request or requests failed, one has to use the aio_error function on all the elements of the array list.

In case mode is LIO_NOWAIT, the function returns 0 if all requests were enqueued correctly. The current state of the requests can be found using aio_error and aio_return as described above. If lio_listio returns -1 in this mode, the global variable errno is set accordingly. If a request did not yet terminate, a call to aio_error returns EINPROGRESS. If the value is different, the request is finished and the error value (or 0) is returned and the result of the operation can be retrieved using aio_return.

Possible values for errno are:

EAGAIN

The resources necessary to queue all the requests are not available at the moment. The error status for each element of list must be checked to determine which request failed.

Another reason could be that the system wide limit of AIO requests is exceeded. This cannot be the case for the implementation on GNU systems since no arbitrary limits exist.

EINVAL

The mode parameter is invalid or nent is larger than AIO_LISTIO_MAX.

EIO

One or more of the requests' I/O operations failed. The error status of each request should be checked to determine which one failed.

ENOSYS

The lio_listio function is not supported.

If the mode parameter is LIO_NOWAIT and the caller cancels a request, the error status for this request returned by aio_error is ECANCELED.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is in fact lio_listio64 since the LFS interface transparently replaces the normal implementation.
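A short sketch of a list request in LIO_WAIT mode follows. It submits two reads with an LIO_NOP hole between them and then inspects every active element with aio_error and aio_return, as described above. The function name read_pair and the choice of three list elements are assumptions made for this illustration.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read NBYTES bytes at OFF_A into A and NBYTES bytes
   at OFF_B into B with one lio_listio call.  Returns 0 only if both reads
   delivered exactly NBYTES bytes, -1 otherwise.  */
int
read_pair (const char *path, void *a, off_t off_a, void *b, off_t off_b,
           size_t nbytes)
{
  struct aiocb cbs[3];
  struct aiocb *list[3] = { &cbs[0], &cbs[1], &cbs[2] };
  int i, ok;
  int fd;

  assert (path != NULL && a != NULL && b != NULL);
  fd = open (path, O_RDONLY);
  if (fd == -1)
    return -1;

  memset (cbs, 0, sizeof cbs);
  for (i = 0; i < 3; i++)
    {
      cbs[i].aio_fildes = fd;
      cbs[i].aio_nbytes = nbytes;
      cbs[i].aio_lio_opcode = LIO_READ;
    }
  cbs[0].aio_buf = a;
  cbs[0].aio_offset = off_a;
  cbs[1].aio_lio_opcode = LIO_NOP;   /* a hole: this element is skipped */
  cbs[2].aio_buf = b;
  cbs[2].aio_offset = off_b;

  /* LIO_WAIT: the call returns only after all requests have terminated.  */
  ok = lio_listio (LIO_WAIT, list, 3, NULL) == 0;

  /* Check each active request individually; skip the LIO_NOP element.  */
  for (i = 0; i < 3; i += 2)
    {
      int err = aio_error (&cbs[i]);
      ssize_t n = aio_return (&cbs[i]);
      if (err != 0 || n != (ssize_t) nbytes)
        ok = 0;
    }

  close (fd);
  return ok ? 0 : -1;
}
```

Treating a short read as a failure is a policy of this sketch only; lio_listio itself reports success for a read that reaches the end of the file.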

int lio_listio64 (int mode, struct aiocb64 *const list[], int nent, struct sigevent *sig)

This function is similar to the lio_listio function. The only difference is that on 32 bit machines, the file descriptor should be opened in the large file mode. Internally, lio_listio64 uses functionality equivalent to lseek64 (see the section called “Setting the File Position of a Descriptor”) to position the file descriptor correctly for the reading or writing, as opposed to the lseek functionality used in lio_listio.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is available under the name lio_listio and so transparently replaces the interface for small files on 32 bit machines.

Getting the Status of AIO Operations

As already described in the documentation of the functions in the last section, it must be possible to get information about the status of an I/O request. When the operation is performed truly asynchronously (as with aio_read and aio_write and with lio_listio when the mode is LIO_NOWAIT), one sometimes needs to know whether a specific request already terminated and if so, what the result was. The following two functions allow you to get this kind of information.

int aio_error (const struct aiocb *aiocbp)

This function determines the error state of the request described by the struct aiocb variable pointed to by aiocbp. If the request has not yet terminated, the value returned is always EINPROGRESS. Once the request has terminated, aio_error returns either 0 if the request completed successfully or the value which would be stored in the errno variable if the request had been done using read, write, or fsync.

The function can return ENOSYS if it is not implemented. It could also return EINVAL if the aiocbp parameter does not refer to an asynchronous operation whose return status is not yet known.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is in fact aio_error64 since the LFS interface transparently replaces the normal implementation.

int aio_error64 (const struct aiocb64 *aiocbp)

This function is similar to aio_error with the only difference that the argument is a reference to a variable of type struct aiocb64.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is available under the name aio_error and so transparently replaces the interface for small files on 32 bit machines.

ssize_t aio_return (struct aiocb *aiocbp)

This function can be used to retrieve the return status of the operation carried out by the request described in the variable pointed to by aiocbp. As long as the error status of this request as returned by aio_error is EINPROGRESS, the return value of this function is undefined.

Once the request is finished this function can be used exactly once to retrieve the return value. Following calls might lead to undefined behavior. The return value itself is the value which would have been returned by the read, write, or fsync call.

The function can return ENOSYS if it is not implemented. It could also return EINVAL if the aiocbp parameter does not refer to an asynchronous operation whose return status is not yet known.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is in fact aio_return64 since the LFS interface transparently replaces the normal implementation.

ssize_t aio_return64 (struct aiocb64 *aiocbp)

This function is similar to aio_return with the only difference that the argument is a reference to a variable of type struct aiocb64.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is available under the name aio_return and so transparently replaces the interface for small files on 32 bit machines.

Getting into a Consistent State

When dealing with asynchronous operations it is sometimes necessary to get into a consistent state. This would mean for AIO that one wants to know whether a certain request or a group of requests were processed. This could be done by waiting for the notification sent by the system after the operation terminated, but this sometimes would mean wasting resources (mainly computation time). Instead POSIX.1b defines two functions which will help with most kinds of consistency.

The aio_fsync and aio_fsync64 functions are only available if the symbol _POSIX_SYNCHRONIZED_IO is defined in unistd.h.

int aio_fsync (int op, struct aiocb *aiocbp)

Calling this function forces all I/O operations queued at the time of the function call and operating on the file descriptor aiocbp->aio_fildes into the synchronized I/O completion state (see the section called “Synchronizing I/O operations”). The aio_fsync function returns immediately, but the notification through the method described in aiocbp->aio_sigevent will happen only after all requests for this file descriptor have terminated and the file is synchronized. This also means that requests for this very same file descriptor which are queued after the synchronization request are not affected.

If op is O_DSYNC the synchronization happens as with a call to fdatasync. Otherwise op should be O_SYNC and the synchronization happens as with fsync.

As long as the synchronization has not happened, a call to aio_error with the reference to the object pointed to by aiocbp returns EINPROGRESS. Once the synchronization is done, aio_error returns 0 if the synchronization was successful. Otherwise the value returned is the value to which the fsync or fdatasync function would have set the errno variable. In this case nothing can be assumed about the consistency of the data written to this file descriptor.

The return value of this function is 0 if the request was successfully enqueued. Otherwise the return value is -1 and errno is set to one of the following values:

EAGAIN

The request could not be enqueued due to temporary lack of resources.

EBADF

The file descriptor aiocbp->aio_fildes is not valid or not open for writing.

EINVAL

The implementation does not support I/O synchronization or the op parameter is neither O_DSYNC nor O_SYNC.

ENOSYS

This function is not implemented.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is in fact aio_fsync64 since the LFS interface transparently replaces the normal implementation.

int aio_fsync64 (int op, struct aiocb64 *aiocbp)

This function is similar to aio_fsync with the only difference that the argument is a reference to a variable of type struct aiocb64.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is available under the name aio_fsync and so transparently replaces the interface for small files on 32 bit machines.
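The following sketch shows one way a write and a synchronization request can be combined: the write is enqueued first, so the aio_fsync request that follows covers it. The helper name write_and_sync is hypothetical, and polling with sched_yield is used only to keep the example short.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write NBYTES bytes from DATA to the start of PATH
   and force them into the synchronized I/O completion state.  Returns 0
   on success, -1 otherwise.  */
int
write_and_sync (const char *path, const void *data, size_t nbytes)
{
  struct aiocb wr, fs;
  ssize_t n;
  int err, ok, sync_queued;
  int fd;

  assert (path != NULL && data != NULL);
  fd = open (path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd == -1)
    return -1;

  memset (&wr, 0, sizeof wr);
  wr.aio_fildes = fd;
  wr.aio_buf = (void *) data;
  wr.aio_nbytes = nbytes;
  wr.aio_sigevent.sigev_notify = SIGEV_NONE;

  /* The synchronization request only needs the descriptor and the
     notification method.  */
  memset (&fs, 0, sizeof fs);
  fs.aio_fildes = fd;
  fs.aio_sigevent.sigev_notify = SIGEV_NONE;

  if (aio_write (&wr) == -1)
    {
      close (fd);
      return -1;
    }
  /* Covers the write above because that request is already queued.  */
  sync_queued = aio_fsync (O_SYNC, &fs) == 0;

  while ((err = aio_error (&wr)) == EINPROGRESS)
    sched_yield ();
  n = aio_return (&wr);
  ok = err == 0 && n == (ssize_t) nbytes;

  if (sync_queued)
    {
      while ((err = aio_error (&fs)) == EINPROGRESS)
        sched_yield ();
      (void) aio_return (&fs);
      if (err != 0)
        ok = 0;   /* nothing can be assumed about data consistency */
    }
  else
    ok = 0;

  close (fd);
  return ok ? 0 : -1;
}
```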

Another method of synchronization is to wait until one or more requests of a specific set have terminated. This could be achieved by having the aio_* functions notify the initiating process about the termination, but in some situations this is not the ideal solution. In a program which constantly updates clients somehow connected to the server, it is not always the best solution to go round robin since some connections might be slow. On the other hand, letting the aio_* functions notify the caller might also not be the best solution: while the process is preparing data for one client, it makes no sense to be interrupted by a notification, since the new client will not be handled before the current client is served. For situations like this aio_suspend should be used.

int aio_suspend (const struct aiocb *const list[], int nent, const struct timespec *timeout)

When calling this function, the calling thread is suspended until at least one of the requests pointed to by the nent elements of the array list has completed. If any of the requests has already completed at the time aio_suspend is called, the function returns immediately. Whether a request has terminated or not is determined by comparing the error status of the request with EINPROGRESS. If an element of list is NULL, the entry is simply ignored.

If no request has finished, the calling process is suspended. If timeout is NULL, the process is not woken until a request has finished. If timeout is not NULL, the process remains suspended at most as long as specified in timeout. If that time passes without any request finishing, aio_suspend returns with an error.

The return value of the function is 0 if one or more requests from the list have terminated. Otherwise the function returns -1 and errno is set to one of the following values:

EAGAIN

None of the requests from the list completed in the time specified by timeout.

EINTR

A signal interrupted the aio_suspend function. This signal might also be sent by the AIO implementation while signalling the termination of one of the requests.

ENOSYS

The aio_suspend function is not implemented.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is in fact aio_suspend64 since the LFS interface transparently replaces the normal implementation.

int aio_suspend64 (const struct aiocb64 *const list[], int nent, const struct timespec *timeout)

This function is similar to aio_suspend with the only difference that the elements of list are references to variables of type struct aiocb64.

When the sources are compiled with _FILE_OFFSET_BITS == 64 this function is available under the name aio_suspend and so transparently replaces the interface for small files on 32 bit machines.
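A possible use of aio_suspend is sketched below: several reads are started at once, and the caller sleeps until at least one of them is done, collects every finished request, and replaces its list entry with NULL so that later calls ignore it. The names read_chunks and MAX_CHUNKS are assumptions of this example, and no timeout is used.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_CHUNKS 8   /* arbitrary bound chosen for this sketch */

/* Hypothetical helper: read NCHUNKS consecutive chunks of CHUNK bytes from
   the start of PATH into BUF, all requests running concurrently.  Returns
   the total number of bytes read, or -1 if any request failed.  */
ssize_t
read_chunks (const char *path, char *buf, size_t chunk, int nchunks)
{
  struct aiocb cbs[MAX_CHUNKS];
  const struct aiocb *pending[MAX_CHUNKS];
  ssize_t total = 0;
  int i, left = 0, failed = 0;
  int fd;

  assert (buf != NULL && nchunks > 0 && nchunks <= MAX_CHUNKS);
  fd = open (path, O_RDONLY);
  if (fd == -1)
    return -1;

  memset (cbs, 0, sizeof cbs);
  for (i = 0; i < nchunks; i++)
    {
      cbs[i].aio_fildes = fd;
      cbs[i].aio_buf = buf + (size_t) i * chunk;
      cbs[i].aio_nbytes = chunk;
      cbs[i].aio_offset = (off_t) i * (off_t) chunk;
      cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
      if (aio_read (&cbs[i]) == 0)
        {
          pending[i] = &cbs[i];
          left++;
        }
      else
        {
          pending[i] = NULL;
          failed = 1;
        }
    }

  while (left > 0)
    {
      /* Sleep until at least one pending request has completed.  The
         return value is ignored: whatever woke us up (completion or a
         signal), the loop below rechecks every request.  */
      (void) aio_suspend (pending, nchunks, NULL);

      for (i = 0; i < nchunks; i++)
        {
          int err;
          ssize_t n;

          if (pending[i] == NULL
              || (err = aio_error (pending[i])) == EINPROGRESS)
            continue;
          n = aio_return (&cbs[i]);
          if (err != 0 || n < 0)
            failed = 1;
          else
            total += n;
          pending[i] = NULL;   /* NULL entries are ignored by aio_suspend */
          left--;
        }
    }

  close (fd);
  return failed ? -1 : total;
}
```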

Cancellation of AIO Operations

When one or more requests are processed asynchronously, it might be useful in some situations to cancel a selected operation, e.g., if it becomes obvious that the written data is no longer accurate and would have to be overwritten soon. As an example, assume an application which writes data to files, and new incoming data would have to be written to a file that is about to be updated by an already enqueued request. The POSIX AIO interface provides a function for this purpose, but it cannot force the cancellation of a request. It is up to the implementation to decide whether it is possible to cancel the operation or not. Therefore using this function is merely a hint.

int aio_cancel (int fildes, struct aiocb *aiocbp)

The aio_cancel function can be used to cancel one or more outstanding requests. If the aiocbp parameter is NULL, the function tries to cancel all of the outstanding requests which would process the file descriptor fildes (i.e., whose aio_fildes member is fildes). If aiocbp is not NULL, aio_cancel attempts to cancel the specific request pointed to by aiocbp.

For requests which were successfully canceled, the normal notification about the termination of the request should take place. I.e., depending on the struct sigevent object which controls this, nothing happens, a signal is sent or a thread is started. If the request cannot be canceled, it terminates the usual way after performing the operation.

After a request is successfully canceled, a call to aio_error with a reference to this request as the parameter will return ECANCELED and a call to aio_return will return -1. If the request wasn't canceled and is still running the error status is still EINPROGRESS.

The return value of the function is AIO_CANCELED if there were requests which hadn't terminated and which were successfully canceled. If there are one or more requests left which couldn't be canceled, the return value is AIO_NOTCANCELED. In this case aio_error must be used to find out which of the, perhaps multiple, requests (if aiocbp is NULL) weren't successfully canceled. If all requests had already terminated at the time aio_cancel is called, the return value is AIO_ALLDONE.

If an error occurred during the execution of aio_cancel the function returns -1 and sets errno to one of the following values.

EBADF

The file descriptor fildes is not valid.

ENOSYS

aio_cancel is not implemented.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is in fact aio_cancel64 since the LFS interface transparently replaces the normal implementation.

int aio_cancel64 (int fildes, struct aiocb64 *aiocbp)

This function is similar to aio_cancel with the only difference that the argument is a reference to a variable of type struct aiocb64.

When the sources are compiled with _FILE_OFFSET_BITS == 64, this function is available under the name aio_cancel and so transparently replaces the interface for small files on 32 bit machines.
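Because cancellation is only a hint, a caller has to be prepared for any mix of canceled and completed requests. The sketch below enqueues a few writes, asks for all of them to be canceled, and then classifies each request with aio_error. The names submit_then_cancel and NREQ_MAX are made up for this example, and the number of requests that actually get canceled depends on timing and on the implementation.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NREQ_MAX 16   /* arbitrary bound chosen for this sketch */

/* Hypothetical helper: enqueue NREQ writes to PATH, then try to cancel
   everything outstanding on the descriptor.  Returns how many requests
   ended up canceled (0 to NREQ), or -1 on error.  Not reentrant, since
   the data block is static.  */
int
submit_then_cancel (const char *path, int nreq)
{
  static char block[4096];
  struct aiocb cbs[NREQ_MAX];
  int i, rc, submitted = 0, canceled = 0;
  int fd;

  assert (path != NULL && nreq > 0 && nreq <= NREQ_MAX);
  fd = open (path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd == -1)
    return -1;

  memset (block, 'x', sizeof block);
  memset (cbs, 0, sizeof cbs);
  for (i = 0; i < nreq; i++)
    {
      cbs[i].aio_fildes = fd;
      cbs[i].aio_buf = block;
      cbs[i].aio_nbytes = sizeof block;
      cbs[i].aio_offset = (off_t) i * (off_t) sizeof block;
      cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
      if (aio_write (&cbs[i]) == -1)
        break;
      submitted++;
    }

  /* NULL as second argument: try to cancel every request on FD.  The
     result is AIO_CANCELED, AIO_NOTCANCELED, AIO_ALLDONE, or -1.  */
  rc = aio_cancel (fd, NULL);

  for (i = 0; i < submitted; i++)
    {
      int err;

      /* Requests that could not be canceled run to completion.  */
      while ((err = aio_error (&cbs[i])) == EINPROGRESS)
        sched_yield ();
      if (err == ECANCELED)
        canceled++;
      (void) aio_return (&cbs[i]);   /* -1 for canceled requests */
    }

  close (fd);
  return rc == -1 ? -1 : canceled;
}
```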

How to optimize the AIO implementation

The POSIX standard does not specify how the AIO functions are implemented. They could be system calls, but it is also possible to emulate them at userlevel.

At the time of this writing, the available implementation is a userlevel implementation which uses threads for handling the enqueued requests. While this implementation requires making some decisions about limitations, hard limitations are something which is best avoided in the GNU C library. Therefore, the GNU C library provides a means for tuning the AIO implementation according to the individual use.

struct aioinit

This data type is used to pass the configuration or tunable parameters to the implementation. The program has to initialize the members of this struct and pass it to the implementation using the aio_init function.

int aio_threads

This member specifies the maximal number of threads which may be used at any one time.

int aio_num

This number provides an estimate on the maximal number of simultaneously enqueued requests.

int aio_locks

Unused.

int aio_usedba

Unused.

int aio_debug

Unused.

int aio_numusers

Unused.

int aio_reserved[2]

Unused.

void aio_init (const struct aioinit *init)

If it is used, this function must be called before any other AIO function. Calling it is completely voluntary, as it is only meant to help the AIO implementation perform better.

Before calling the aio_init function, the members of a variable of type struct aioinit must be initialized. Then a reference to this variable is passed as the parameter to aio_init, which itself may or may not pay attention to the hints.

The function has no return value and no error cases are defined. It is an extension which follows a proposal from the SGI implementation in Irix 6. It is not covered by POSIX.1b or Unix98.
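A minimal sketch of such a call, assuming the GNU C library (the declaration is only visible when _GNU_SOURCE is defined), might look like this. The helper name tune_aio and the values used with it are illustrative; the unused members are simply left at zero.

```c
/* aio_init and struct aioinit are GNU extensions.  */
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1
#endif

#include <aio.h>
#include <assert.h>
#include <string.h>

/* Hypothetical helper: hand tuning hints to the AIO implementation and
   return the structure that was passed, for inspection by the caller.
   Intended to be called once, before any other AIO function.  */
struct aioinit
tune_aio (int max_threads, int expected_requests)
{
  struct aioinit init;

  assert (max_threads > 0 && expected_requests > 0);

  /* Clear the unused members (aio_locks, aio_usedba, and so on).  */
  memset (&init, 0, sizeof init);
  init.aio_threads = max_threads;    /* upper bound on worker threads */
  init.aio_num = expected_requests;  /* estimate of simultaneous requests */

  /* No return value and no error reporting: these are hints only.  */
  aio_init (&init);
  return init;
}
```

A program expecting, say, a few dozen outstanding requests could call tune_aio (4, 32) at startup; whether the implementation honors these values is up to it.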