IO_uring Fixed Buffer Versus Non-Fixed Buffer Performance Comparison on NVMe

Background

For some classes of storage systems, such as those used in high performance computing, I/O throughput, memory bandwidth utilization, and efficient use of memory buffers are of utmost importance. After some initial investigation into io_uring, I was intrigued by the possibility of using fixed buffers to increase I/O throughput. However, io_uring only supports the fixed option for requests that use a single memory buffer; the vectored I/O functions (readv(), writev(), etc.) are not supported. The work here compares fixed and non-fixed performance to provide insight into the trade-offs of using vectored vs. non-vectored buffers.

TL;DR

  • Non-fixed buffers provide near-equivalent performance to fixed buffers at the cost of additional CPU cycles. The results here do not indicate a significantly higher memory bandwidth demand than with fixed buffers.
  • Fixed buffer IOPs outpace non-fixed as the amount of idle CPU approaches zero.
  • 4k I/Os see the largest IOPs improvement: close to 10% at higher queue depths.
  • Larger I/O sizes using fixed buffers see little or no throughput gain but do show reduced CPU utilization.
  • Vectored I/O overhead (as compared to contiguous) is considerable — regardless of buffer registration.
  • Applications seeking to employ a pool of uniformly sized buffers (e.g. pools of 4k buffers) may employ fixed buffers for small requests, reaping modest gains, while using vectored I/O for larger requests without taking a throughput hit but at the cost of more CPU cycles.

Overview of IO_uring Read Functions and Fixed Buffer Registration

Here are the liburing function prototypes for the standard, fixed, and vectored read calls:

    void io_uring_prep_read(struct io_uring_sqe *sqe, int fd,
                            void *buf, unsigned nbytes, off_t offset);

    void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd,
                                  void *buf, unsigned nbytes,
                                  off_t offset, int buf_index);

    void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd,
                             const struct iovec *iovecs,
                             unsigned nr_vecs, off_t offset);

Fixed buffers must first be registered with the ring. In liburing, io_uring_register_buffers() is a thin wrapper around the io_uring_register system call:

    int io_uring_register_buffers(struct io_uring *ring,
                                  const struct iovec *iovecs,
                                  unsigned nr_iovecs)
    {
        int ret;

        ret = __sys_io_uring_register(ring->ring_fd,
                                      IORING_REGISTER_BUFFERS,
                                      iovecs, nr_iovecs);
        if (ret < 0)
            return -errno;

        return 0;
    }
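To make the relationship between the registered iovec array and the buf_index argument concrete, here is a minimal sketch of a buffer pool that could back such a registration. The struct buf_pool type and the buf_pool_* helpers are hypothetical names introduced for illustration, and the 4096-byte alignment is an assumption chosen to suit O_DIRECT. The sketch only builds the array and the index lookup; it does not call into liburing.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_memalign() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical pool: equally sized buffers carved from one aligned region. */
struct buf_pool {
    void         *base;      /* start of the aligned region */
    size_t        buf_size;  /* size of each buffer in bytes */
    unsigned      count;     /* number of buffers */
    struct iovec *iovs;      /* one entry per buffer, in index order */
};

/* Allocate the region and describe each buffer with one iovec.
 * Returns 0 on success, -1 on allocation failure. */
static int buf_pool_init(struct buf_pool *p, size_t buf_size, unsigned count)
{
    assert(buf_size > 0 && count > 0);

    /* Assumption: 4096-byte alignment is sufficient for O_DIRECT here. */
    if (posix_memalign(&p->base, 4096, buf_size * count) != 0)
        return -1;

    p->iovs = calloc(count, sizeof(*p->iovs));
    if (p->iovs == NULL) {
        free(p->base);
        return -1;
    }

    for (unsigned i = 0; i < count; i++) {
        p->iovs[i].iov_base = (char *)p->base + (size_t)i * buf_size;
        p->iovs[i].iov_len  = buf_size;
    }
    p->buf_size = buf_size;
    p->count    = count;
    return 0;
}

/* Map a buffer address back to its position in the iovec array, which is
 * the value a fixed read would pass as buf_index. Returns -1 if the
 * address does not belong to the pool. */
static int buf_pool_index_of(const struct buf_pool *p, const void *buf)
{
    uintptr_t b  = (uintptr_t)buf;
    uintptr_t lo = (uintptr_t)p->base;

    if (b < lo || b >= lo + p->buf_size * p->count)
        return -1;
    return (int)((b - lo) / p->buf_size);
}

static void buf_pool_destroy(struct buf_pool *p)
{
    free(p->iovs);
    free(p->base);
    p->iovs = NULL;
    p->base = NULL;
}
```

With a pool like this, p.iovs and p.count would be the iovecs and nr_iovecs arguments to io_uring_register_buffers(), and buf_pool_index_of() would supply the buf_index for io_uring_prep_read_fixed().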

Test Parameters

Here are some characteristics of the test:

  • Random reads over a set of queue depths and block sizes
  • Compare fixed and non-fixed io_uring buffers
  • Comparison of 128k vectored (32x4k iovs) vs. 128k fixed, i.e. read() vs. readv()
  • O_DIRECT
  • One test process per core — bound by /usr/bin/taskset
  • io_uring_queue_init() flags == 0
  • Measurements were taken from the system (using dstat)

Results

IOPs — Single Process, Single Core, Fixed vs. Non-Fixed: 4k, 32k, 128k

Only the 4k IOPs test shows a significant IOPs increase for fixed over non-fixed and primarily at higher queue depths. The larger I/O sizes, 32k and 128k, show effectively equivalent performance.

Kernel Profile for 4k Fixed Buffer (QueueDepth=128)
Kernel Profile for 4k Non-Fixed Buffer (QueueDepth=128)

IOPs — Two Processes, Two Cores: 4k

The single-core 4k IOPs test left me wondering whether the NVMe device had any IOPs headroom left and, if so, how fixed and non-fixed buffers would compare once a second process and core were added.

IOPs — Single Process, Single Core, Fixed vs. Vectored: 128k

Until now, the tests have compared fixed and non-fixed using contiguous buffers in the non-fixed cases. This section shows the impact of vectored buffers. The vectored case is set up by supplying 32 iovs, each with a 4k buffer, to io_uring_prep_readv(). Since io_uring does not support “fixed” operation using vectors, using vectors implies that the I/O is non-fixed. On top of the non-fixed preparation activities, vectored I/O carries the extra overhead of processing the vector array.
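The vectored setup described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the tests: compose_iov() and POOL_BUF_SIZE are hypothetical names, and the buffers are assumed to be separately allocated 4k pool buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Assumed pool buffer size, matching the 4k buffers used in the test. */
#define POOL_BUF_SIZE 4096u

/*
 * Fill `iov` so that it covers `nbytes` using the POOL_BUF_SIZE buffers
 * listed in `bufs`. `max_vecs` is the capacity of both arrays.
 * Returns the number of iovec entries used, or 0 if nbytes is zero or
 * does not fit in max_vecs buffers.
 *
 * Note: with O_DIRECT, nbytes should itself be a multiple of the
 * device's logical block size, so the final entry is normally full.
 */
static unsigned compose_iov(void *const *bufs, unsigned max_vecs,
                            size_t nbytes, struct iovec *iov)
{
    size_t needed = (nbytes + POOL_BUF_SIZE - 1) / POOL_BUF_SIZE;

    if (needed == 0 || needed > max_vecs)
        return 0;

    for (size_t i = 0; i < needed; i++) {
        size_t remaining = nbytes - i * POOL_BUF_SIZE;

        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = remaining < POOL_BUF_SIZE ? remaining
                                                    : POOL_BUF_SIZE;
    }
    return (unsigned)needed;
}
```

For a 128k request the function returns 32, and the filled array together with that count would be the iovecs and nr_vecs arguments to io_uring_prep_readv().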

Conclusion and Recommendations

My primary objective was to measure how much performance, in terms of throughput and CPU utilization, would be lost if my application were to employ vectored buffers instead of contiguous ones. The results seem obvious and largely unsurprising but hopefully provide useful information and time savings to those building io_uring-based applications.

Use Fixed Buffers if Possible

IO_uring’s fixed buffers provide clear CPU usage reduction and therefore should be used when possible. Note that using fixed buffers in your io_uring application will likely require consideration in the initial design since the set of fixed buffers must be declared in one shot — there’s no dynamic addition or removal of individual fixed buffers to or from the registered set.

Non-Fixed Buffers are OK Too

Using io_uring with any heap-allocated buffer still gives great performance. The results did not indicate any additional bulk buffer memory copies or inordinately expensive CPU costs.

Vectored I/O

Vectored buffers, in my view, provide much more allocation flexibility since they allow for dynamic composition, and I prefer to use them unless their overhead is untenably high. The cost of using vectored I/O in io_uring is two-fold: fixed buffers cannot be used, and the kernel must additionally process the iov array. Fortunately, some of the cost may be recovered if I/Os that use a single vector employ read_fixed() or write_fixed(). This aligns with the fact that non-fixed small I/Os incur the most noticeable throughput reduction. Another technique to reduce vectored overhead could be to employ a very small number of large, contiguous, registered buffers for large I/Os. While this is a bit wasteful in terms of memory allocation, NVMe devices are so fast that only a few 128 KiB buffers are needed to saturate a device (using random reads). This dedication of additional memory may increase overall system performance by reducing the CPU cycles spent servicing I/O.
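The first suggestion, routing single-buffer requests to the fixed path and everything larger to the vectored path, amounts to a small dispatch decision. A minimal sketch, assuming a pool of uniformly sized registered buffers; the enum and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Which submission path a read request would take. */
enum read_op {
    READ_OP_FIXED,     /* one registered buffer: read_fixed() */
    READ_OP_VECTORED   /* several pool buffers: readv(), non-fixed */
};

/* A request that fits in a single pool buffer can use the fixed path;
 * anything larger has to be composed from several buffers. */
static enum read_op choose_read_op(size_t nbytes, size_t pool_buf_size)
{
    assert(pool_buf_size > 0);
    return nbytes <= pool_buf_size ? READ_OP_FIXED : READ_OP_VECTORED;
}
```

With 4k pool buffers, a 4k read would take the fixed path while a 128k read would be issued as a 32-entry vectored read.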

Distributed Storage Systems Programmer w/ focus on distributed erasure coding, parallel log structuring, and hierarchical storage.