As a network performance enthusiast, I've worked with multiple HTTP web frameworks that use C#'s System.Net.Sockets as the interface between the framework and the OS. Working mainly on Linux, one of the aspects that always frustrated me was the non-existent support for io_uring in C# (Socket uses epoll), so I figured it was time to do it myself.
uRocket (micro ring socket) is a single-acceptor, multi-reactor architecture with async/await support; this means that as a user I can await reads from the wire and write to it as I please. The acceptor and reactors are fully customizable and rely on a C-written shim: uRocket (C#) interops with liburingshim (compiled from uringshim.c), which is an interface between C# and liburing.
What is the reactor pattern and why it matters
The reactor pattern decouples I/O operations from application threads by using event notification to multiplex thousands of connections across a small thread pool. In uRocket, a single acceptor thread handles incoming connections via io_uring's multishot accept, distributing clients across reactor threads, each owning its own io_uring instance, buffer ring, and connection table. This architecture eliminates thread-per-connection overhead while avoiding cross-thread contention on the hot path. With io_uring, reactors become remarkably efficient: submissions and completions flow through shared-memory rings requiring zero syscalls in steady state, the kernel selects receive buffers directly from pre-registered buffer rings (avoiding per-call buffer allocation), and multishot operations can fire hundreds of completions from a single submission. Application code can still spawn one task per connection for familiar async/await patterns: the reactor handles I/O multiplexing underneath while your code remains sequential and readable. This combination of reactor-pattern efficiency with idiomatic C# async/await is what makes uRocket unique.
io_uring
io_uring is a modern Linux kernel interface (introduced in 2019, Linux 5.1) that rethinks asynchronous I/O by replacing the traditional "syscall-per-operation" model with shared-memory ring buffers. Instead of calling read(), write(), or accept() individually, each triggering an expensive kernel transition, applications submit I/O requests by writing entries to a Submission Queue (SQ) that lives in memory shared between userspace and the kernel. The kernel processes these asynchronously and reports completions via a Completion Queue (CQ), also in shared memory. Once initialized, most operations require zero syscalls: applications write SQEs (Submission Queue Entries), the kernel polls for work (especially with SQPOLL mode), processes requests, and writes CQEs (Completion Queue Entries), all without crossing the userspace/kernel boundary. io_uring introduces powerful features beyond older APIs like epoll: multishot operations allow a single submission to produce hundreds of completions (one accept submission handles all incoming connections), buffer rings let the kernel pick from pre-registered buffers and return just a 16-bit buffer ID instead of requiring the application to supply a buffer per call, and batching enables processing thousands of events per iteration. This design attacks the fundamental bottlenecks of traditional I/O: syscall overhead, per-call buffer management, and resubmission costs.
Benchmarking uRocket vs System.Net.Socket
Since I do not own multiple server machines or top-of-the-line network interface cards, there is of course some level of noise in these benchmarks. The load is generated using wrk; the source code for each:
uRocket
System.Net.Socket
A few notes:
- Non-pipelined requests
- No HTTP parsing; this is not an HTTP framework benchmark.
- No TCP fragmentation, so each request maps to exactly one response
- Requests are sent over localhost and the load generator (wrk) runs on the same machine as the web servers (sadly, due to budget constraints), which causes some bottlenecking, as we will see.
- Both uRocket and System.Net.Sockets are built as Native AOT with the exact same flags:
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
  <TieredPGO>true</TieredPGO>
  <SelfContained>true</SelfContained>
</PropertyGroup>
<ItemGroup Condition="$(PublishAot) == 'true'">
  <RuntimeHostConfigurationOption Include="System.Threading.ThreadPool.HillClimbing.Disable" Value="true" />
</ItemGroup>
OS: Ubuntu Server 24.04, .NET 10
Processor: i9 14900K
RAM: 64GB 6000MHz
uRocket is still in an early development phase, so these results will likely change at least a little in the future; take them with a grain of salt, and of course, as the uRocket maintainer, I am biased. Maybe the System.Net.Sockets code could be better optimized in its Socket configuration, but I compared it against Microsoft's ASP.NET platform entry (which uses System.Net.Sockets) in the TechEmpower benchmarks and it performed significantly better, so it looks legitimate to me.
Results
During the benchmarking I configured uRocket with different numbers of reactors, always armed with multishot accept, with or without SQPOLL. The System.Net.Sockets configuration always remained the same, so this can be a point in favour of System.Net.Sockets.
In the results table we can find:
- The type (uRocket or Socket)
- Nº of Reactors (only applies for uRocket)
- Load (wrk command parameters)
- CPU usage (the i9-14900K has 32 threads, so 100% corresponds to one fully-used thread)
- RPS, requests per second (average of 10 runs each)
High Load -t>16 -c512
uRocket delivers much lower CPU usage even at higher RPS values. I noticed during the tests that my hardware bottlenecks uRocket when the number of reactors exceeds 12, because at higher RPS the load generator (wrk) also needs more CPU: for fewer than 12 reactors, CPU usage scales linearly with the number of reactors, and then it plateaus.
We can also see that the perceived efficiency (RPS/CPU) is inversely proportional to the number of reactors: the best case is 4 reactors (4377) and the worst 16 reactors (2503), while all Socket results are similar (~1700). This makes sense because, with no thread pinning, the OS schedules the reactors on the best CPU threads (the i9-14900K has only a few performance cores), and with fewer reactors the wrk load is also lower.
Lower Load -t<16
Under lower load, Socket somehow pulls a lot of CPU usage; the results are extremely favorable towards io_uring, with a 4x better RPS/CPU ratio in many cases. Again, the Socket implementation might not be fully optimized.
Conclusion
When I started this benchmarking I had one reference, an older Reddit post, stating 23% extra performance for io_uring, which can actually be seen at the maximum RPS: 3_354_231 vs 2_728_015 (~23% more performance). I also found other benchmarks online where io_uring solutions typically consume up to 50% less CPU, so that checks out too. I hope this was interesting for you; again, this is a simple benchmark and may have inaccuracies.
Future Work
uRocket still requires some features and polishing; after that, it will be integrated into frameworks such as Wired.IO and GenHTTP to test it in an actual HTTP framework.
Latency and startup-time tests are also planned, as uRocket boots extremely fast, especially with Native AOT.


Top comments (3)
These results are crazy, had never heard about io_uring.
Does this mean that ASP.NET could be a lot more efficient?
Possibly yes, but you would only notice it in very high-RPS services, which are the minority.
Well I can certainly see a small spike in CPU usage in ASP.NET Core apps even when serving static files (maybe 1-3% load). So io_uring might help to maintain lower C-States on mostly idle systems.