IO_uring Fixed Buffer Versus Non-Fixed Buffer Performance Comparison on NVMe

Background

For classes of storage systems — such as those used in high performance computing — I/O throughput, memory bandwidth utilization, and efficient use of memory buffers are of utmost importance. After doing some initial investigation into io_uring, I was intrigued by the possibility of using fixed buffers as a means for increasing I/O throughput. However, io_uring only supports the fixed option for requests using a single memory buffer — vectored I/O functions (readv(), writev(), etc.) are not supported. The work here compares fixed and non-fixed performance to provide insight into the trade-offs of using vectored vs. non-vectored buffers.

If you’re brand new to io_uring, I suggest that you take a look at https://unixism.net/loti/ for documentation and examples. Note that this document is based on liburing (https://github.com/axboe/liburing) rather than the “low-level” interface.

TL;DR

Overview of IO_uring Read Functions and Fixed Buffer Registration

Here are the liburing function prototypes for the standard, fixed, and vectored read calls:

void io_uring_prep_read(struct io_uring_sqe *sqe, int fd,
                        void *buf, unsigned nbytes, off_t offset);
void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd,
                              void *buf, unsigned nbytes,
                              off_t offset, int buf_index);
void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd,
                         const struct iovec *iovecs,
                         unsigned nr_vecs, off_t offset);

Right away we see that, other than the *sqe parameter, io_uring_prep_read() and io_uring_prep_readv() look like their non-io_uring counterparts, read(2) and readv(2). The fixed read function, io_uring_prep_read_fixed(), mimics io_uring_prep_read() except for the final parameter, buf_index.

So what is buf_index? To fully answer, we should first dig a little bit into io_uring buffer registration. To use buffers in fixed mode, they must first be registered into io_uring. Below is the liburing buffer registration function:

int io_uring_register_buffers(struct io_uring *ring,
                              const struct iovec *iovecs,
                              unsigned nr_iovecs)
{
    int ret;

    ret = __sys_io_uring_register(ring->ring_fd,
                                  IORING_REGISTER_BUFFERS,
                                  iovecs, nr_iovecs);
    if (ret < 0)
        return -errno;
    return 0;
}

We see the function allows the user to submit an array of iovecs to io_uring’s kernel component. The kernel side will perform a handful of preparations on each buffer so that subsequent I/O’s (using these buffers) will have reduced overhead. I/O’s which do not use fixed buffers must undergo these steps on each operation. Some of the registration preparations include:

Currently (Feb. 2021), the use of fixed buffers may not occur without the user providing the buf_index. In other words, io_uring does not determine a buffer’s registration status based on its address alone and for this reason, io_uring_prep_read_fixed() and io_uring_prep_write_fixed() require an additional parameter, buf_index. Therefore, it’s the user’s responsibility to provide the array position from the io_uring_register_buffers() call to which the buffer belongs.
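
Since the array position has to come from the application, one way to recover it is a small lookup over the same iovec array that was handed to io_uring_register_buffers(). The sketch below is only an illustration: find_buf_index() is a hypothetical helper, not part of liburing, and it assumes the application keeps its own copy of the registered array. It returns the position of the registered buffer that fully contains the requested range, or -1 if none does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of liburing): return the position in the
 * registered iovec array whose buffer fully contains [buf, buf + len),
 * or -1 if the range is not covered by any single registered buffer.
 */
static int find_buf_index(const struct iovec *regs, unsigned nr_regs,
                          const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;

    assert(regs != NULL || nr_regs == 0);

    for (unsigned i = 0; i < nr_regs; i++) {
        uintptr_t base = (uintptr_t)regs[i].iov_base;

        /* The whole request must lie inside one registered buffer. */
        if (start >= base && len <= regs[i].iov_len &&
            start - base <= regs[i].iov_len - len)
            return (int)i;
    }
    return -1;
}
```

A non-negative result would be passed as the buf_index argument of io_uring_prep_read_fixed() or io_uring_prep_write_fixed(); a result of -1 means the I/O has to fall back to the non-fixed call. The linear scan is adequate for a handful of buffers, which is the common case for registered sets.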

Test Parameters

Here are some characteristics of the test:

Results

IOPs —Single Process, Single Core, Fixed vs. Non-Fixed: 4k, 32k, 128k

Only the 4k IOPs test shows a significant IOPs increase for fixed over non-fixed and primarily at higher queue depths. The larger I/O sizes, 32k and 128k, show effectively equivalent performance.

CPU Utilization

The 4k test is where the notable performance difference occurred. It appears that the kernel overhead in the non-fixed case is higher due to the inlined registration work needed on each request. There is less user CPU remaining for the application to enqueue requests and process completion events and, as a result, we observe lower IOPs.

The 128k test results show that the device can be saturated without using the entire CPU. While the user CPU is the same for both, the non-fixed case uses between 3% and 8% more sys CPU to provide the same IOPs.

Kernel Profile — Where are the System CPU Cycles Spent?

Kernel Profile for 4k Fixed Buffer (QueueDepth=128)
Kernel Profile for 4k Non-Fixed Buffer (QueueDepth=128)

Here are kernel profile pie charts taken during the execution of the most demanding 4k IOPs test (queue depth == 128).

The profile charts for the fixed and non-fixed 4k IOPs tests both show __x86_indirect_thunk_rax() and read_tsc() as the two most frequently sampled routines. However, the non-fixed profile contains two functions which do not appear in the fixed case: copy_user_generic_str() and internal_get_user_page(). The overhead of these calls is likely responsible for the performance reduction.

IOPs — Two Processes, Two Cores: 4k

The single-core 4k IOPs test left me wondering whether the NVMe device had any IOPs left to give and, if so, how fixed and non-fixed performance would compare once a second core was added.

From an IOPs standpoint, the results are good — the system was able to gain another 300k IOPs by adding an additional core. Both the fixed and non-fixed cases saw effectively equivalent throughput.

We can see above that idle CPU is available at all queue depths. As expected, however, the fixed case shows lower CPU utilization.

IOPs — Single Process, Single Core, Fixed vs. Vectored: 128k

Until now, the tests have compared fixed and non-fixed using contiguous buffers in the non-fixed cases. This section shows the impact of vectored buffers. The vectored case is set up by supplying 32 iovs, each with a 4k buffer (128k in total), to io_uring_prep_readv(). Since io_uring does not support “fixed” operation using vectors, using vectors implies that the I/O is non-fixed. In addition to the non-fixed preparation activities, vectored I/O has additional overhead associated with processing of the vector array.
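
The vectored setup described above can be sketched as follows. This is a minimal illustration rather than the actual test code: build_vectored_request() and the two constants are hypothetical names, and the 4k alignment of each buffer is an assumption (the original test's allocation method is not stated).

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>

#define NR_VECS   32     /* vectors per request, as in the test */
#define VEC_BYTES 4096   /* 4k per vector */

/*
 * Fill iovs[0..NR_VECS) with separately allocated buffers.
 * Assumption: each buffer is 4k-aligned.
 * Returns the total request size in bytes, or 0 on allocation failure
 * (buffers allocated before the failure are freed).
 */
static size_t build_vectored_request(struct iovec iovs[NR_VECS])
{
    size_t total = 0;

    assert(iovs != NULL);

    for (int i = 0; i < NR_VECS; i++) {
        void *buf = aligned_alloc(VEC_BYTES, VEC_BYTES);

        if (buf == NULL) {
            while (i-- > 0)
                free(iovs[i].iov_base);
            return 0;
        }
        iovs[i].iov_base = buf;
        iovs[i].iov_len = VEC_BYTES;
        total += VEC_BYTES;
    }
    return total;
}
```

The filled array and NR_VECS would then be handed to io_uring_prep_readv() as the iovecs and nr_vecs arguments. The array must stay valid until the request has been submitted, so it should not live on a stack frame that returns earlier.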

IOPs performance is nearly equivalent at the point where device saturation occurs (queue depth = 4). Until then, the vectored case shows visibly lower performance due to the added processing overhead. It appears that vectoring imposes a higher penalty than the lack of registration, as indicated by the CPU utilization chart below.

Conclusion and Recommendations

My primary objective was to measure how much performance, in terms of throughput and CPU utilization, would be lost if my application were to employ vectored buffers instead of contiguous ones. The results seem obvious and largely unsurprising but hopefully provide useful information and time savings to those building io_uring-based applications.

Use Fixed Buffers if Possible

IO_uring’s fixed buffers provide clear CPU usage reduction and therefore should be used when possible. Note that using fixed buffers in your io_uring application will likely require consideration in the initial design since the set of fixed buffers must be declared in one shot — there’s no dynamic addition or removal of individual fixed buffers to or from the registered set.
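
Because the whole set is declared at once, one design that fits this constraint is to allocate a single region at startup and describe it as an array of equally sized buffers. The sketch below assumes that approach; carve_fixed_buffers() is a hypothetical helper, and the 4096-byte alignment is an assumption rather than a documented requirement.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>

/*
 * Carve one contiguous allocation into nr equally sized buffers and
 * describe each one in iovs[], ready to be registered in a single call.
 * Assumption: buf_bytes is a non-zero multiple of 4096.
 * Returns the base pointer (free() it after the ring is torn down),
 * or NULL on allocation failure.
 */
static void *carve_fixed_buffers(struct iovec *iovs, unsigned nr,
                                 size_t buf_bytes)
{
    char *base;

    assert(iovs != NULL && nr > 0);
    assert(buf_bytes > 0 && buf_bytes % 4096 == 0);

    base = aligned_alloc(4096, (size_t)nr * buf_bytes);
    if (base == NULL)
        return NULL;

    for (unsigned i = 0; i < nr; i++) {
        iovs[i].iov_base = base + (size_t)i * buf_bytes;
        iovs[i].iov_len = buf_bytes;
    }
    return base;
}
```

The resulting iovs array and nr are what would be passed to io_uring_register_buffers(); position i in the array is then the buf_index for buffer i.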

Non-Fixed Buffers are OK Too

Using io_uring with any heap allocated buffer still gives great performance. The results have not indicated any additional bulk buffer memory copies or inordinately expensive CPU costs.

Vectored I/O

Vectored buffers, in my view, provide a lot more allocation flexibility since they allow for dynamic composition, and I prefer to use them unless their overhead is untenably high. The cost of using vectored I/O in io_uring is two-fold: fixed buffers cannot be used, and the kernel must do additional processing of the iov array. Fortunately, some of the cost may be recovered if I/Os that use a single vector are issued with io_uring_prep_read_fixed() or io_uring_prep_write_fixed() instead. This aligns with the fact that non-fixed small I/Os incur the most noticeable throughput reduction. Another technique to reduce vectored overhead could be to employ a very small number of large, contiguous, registered buffers for large I/Os. While this is a bit wasteful in terms of memory allocation, NVMe devices are so fast that only a few 128 KiB buffers are needed to saturate a device (using random reads)! This dedication of additional memory may increase overall system performance by reducing CPU cycles spent servicing I/O.
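
As a rough illustration of the "few large registered buffers" idea, the sketch below tracks which of a small number of slots are in flight, using the slot number as the registered buffer index. Everything here is hypothetical (the fixed_pool type, its functions, and the choice of 8 slots), and it assumes a single submitting thread, so there is no locking.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SLOTS 8   /* hypothetical: a handful of 128 KiB buffers */

/* Bit i set means slot i (and registered buffer index i) is in use. */
struct fixed_pool {
    uint32_t in_use;
};

/* Claim a free slot; the return value doubles as buf_index. -1 if full. */
static int fixed_pool_get(struct fixed_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!(p->in_use & (UINT32_C(1) << i))) {
            p->in_use |= UINT32_C(1) << i;
            return i;
        }
    }
    return -1;
}

/* Release a slot once the completion for its I/O has been reaped. */
static void fixed_pool_put(struct fixed_pool *p, int idx)
{
    assert(idx >= 0 && idx < POOL_SLOTS);
    assert(p->in_use & (UINT32_C(1) << idx));
    p->in_use &= ~(UINT32_C(1) << idx);
}
```

When fixed_pool_get() returns -1, the application could either wait for a completion or fall back to a non-fixed or vectored request for that I/O. Which is better depends on the workload.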

Distributed Storage Systems Programmer w/ focus on distributed erasure coding, parallel log structuring, and hierarchical storage.
