published on August 09, 2018

WireGuard and the Linux Networking Subsystem

Before this article truly begins, I’d like to explain what I’ve done in the last few months and what this article is really about. As some of you may know, I worked on WireGuard as part of the Google Summer of Code for the last 3 months. GSoC is basically contract work for students: if you are selected, you get a stipend while you work for an open source organization of your choice, in my case the Linux Foundation. This article will explain the work I’ve done and some other stuff related to it. While I am writing this article as part of the program, for which I need to showcase the work I did once the work period ended, I was not forced to make it a blog article, nor did I have anything particular to say apart from showcasing my work, hence I’ll organize it this way:
At the end of this post there is a crude list of the tasks I’ve accomplished this summer and a conclusion about GSoC et cetera. Everything before it is about WireGuard and the Linux kernel itself, larger in scope, along with my GRO research overall, with more explanation, so that this blog post is interesting even to people without in-depth knowledge of networks or WireGuard.

This Preface looked way too serious, nah? Let’s get back to my old silly style of writing for what’s next!

Intro to WireGuard

Yeah, I know, boring stuff but we gotta pass through it for those that don’t know.

Project got a fancy logo, so better make good use of it!

So WireGuard, for those that can’t read its website, is a layer 3 VPN protocol which aims to be secure, sneaky and simple. In a nutshell, you exchange keys à la SSH, you get yourself on the Allowed IPs list, and with that you are in the Cryptokey Routing Table, which allows you to connect to the servers.
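To make the Cryptokey Routing Table a bit less abstract, here is a tiny Python sketch of the idea: each peer's public key is tied to a list of allowed IP ranges, and a lookup picks the most specific range that matches. The key strings and addresses are made up for illustration, and the linear scan is only there to keep the sketch short (WireGuard itself keeps its allowed IPs in a trie inside the kernel).

```python
import ipaddress

# Hypothetical peers: placeholder public key -> allowed IP ranges.
PEERS = {
    "peer-A-pubkey": ["10.0.0.0/24", "192.168.7.0/24"],
    "peer-B-pubkey": ["10.0.0.42/32"],
}

def build_table(peers):
    """Flatten the config into (network, public_key) pairs."""
    table = []
    for key, ranges in peers.items():
        for cidr in ranges:
            table.append((ipaddress.ip_network(cidr), key))
    return table

def lookup(table, address):
    """Return the key of the peer with the most specific matching range, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for network, key in table:
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, key)
    return best[1] if best else None

TABLE = build_table(PEERS)
```

Roughly speaking, the same table is used in both directions: for an outgoing packet the destination address selects which peer key to encrypt to, and for an incoming packet the source address has to fall inside the sending peer's allowed ranges.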

What’s great with this, however, is that on top of being more efficient than older protocols, it is actually much smaller, too. Take for example this Lines of Code (LoC) count from the latest WireGuard git at the time of writing, with tests, compatibility hacks and crypto removed:

➜  src git:(grt/dql) ✗ cloc *               
      34 text files.
      33 unique files.                              
      21 files ignored.

github.com/AlDanial/cloc v 1.76  T=1.01 s (27.8 files/s, 5040.1 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               14            623            298           3294
C/C++ Header                    13            133             77            574
make                             1             20              3             63
-------------------------------------------------------------------------------
SUM:                            28            776            378           3931
-------------------------------------------------------------------------------

If that isn’t directly obvious to you: having an in-kernel VPN protocol that small gives it a much smaller attack surface, which contributes to security. In our case it helps performance too, but you cannot really derive performance from a LoC count.

So yay, got through my propaganda? Now for the real talk: while the project is fancy and all, I was contracted to work through several TODOs to speed up WireGuard’s acceptance into mainline Linux. As a matter of fact, it is currently on the Linux Kernel Mailing List and Linus Torvalds himself is pressing for its adoption, so I’d say we’re pretty good so far!

Since this was contract work and I spent most of my summer on ONE task, I’m going to talk about that one instead of going through the codebase and explaining what could or couldn’t be better, because:

it’s easier for me
I’m going to get yelled at for ranting like an old man about stuff that I think could be improved while missing the critical knowledge of Linux needed to understand why that is dumb.

Generic Receive Offload

So this big thing that I’ve wasted almost all my summer on is called GRO, short for Generic Receive Offload. To understand what it does let’s talk about internet protocols (and bear with me for this crazy story or just skip the paragraph if you know everything):

A long time ago, there lived a messenger, and angry people wanting to talk throughout the country. After a bit of inner thinking and pondering about the feasibility of teleportation, he settled that before anything else he needed to identify everyone and find them, and so he gave them addresses, which he called IPs. Those were arranged so that he could easily find a house without knowing any map of the country. Now that the messenger had given everyone an IP, he could finally help the villagers send their packages! So he got a bag in which he threw everyone’s packages, got several other messengers to deliver packages with him and split the big packages into packets. Then the messengers ran as fast as they could to the address specified to deliver those packets, which he called UDP (he got a kick out of weird names).
Unfortunately, UDP had issues: indeed the messengers were running so fast that sometimes they lost packets and they had no real delivery order, so packets could be received out of order, which was inadmissible for our needy citizens. Our messenger then had a brilliant idea, which was TCP: he labelled each package, to know which one to send first, and would make sure the recipient received the package, otherwise he would replicate it himself and deliver it again.
But, alas, there was yet another issue: what happened to your expensive, huge Chinese vase that was broken into pieces for easier delivery? Well, the deliveryman had yet another trick up his sleeve: a glue that can piece every packet back together, lowering the work needed by the recipient; this is GRO.

Understood that weird story? No? Fine then let’s continue!
To stop with the eccentric stories: GRO is very useful in our case since every small packet actually has a checksum to verify its integrity, so that we know the data is indeed correct. But that is expensive, computation-wise, and so GRO was born: we go from a theoretical complexity of \(O(n \times jitter)\) to \(O(n+jitter)\), where n is the number of packets and jitter is the cost of calling the checksum verification function.
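If the two complexities look a bit hand-wavy, here is a toy cost model in Python. It assumes arbitrary units, that handling or merging one packet costs one unit, that the fixed cost (the "jitter" above, called overhead here) is the same on every call, and that all packets belong to the same flow so they merge into a single batch. None of this is measured; it only shows the shape of the two formulas.

```python
def cost_without_gro(n_packets, overhead):
    """Every packet pays one unit of handling plus the fixed overhead."""
    return sum(1 + overhead for _ in range(n_packets))

def cost_with_gro(n_packets, overhead):
    """Packets are merged first (one unit each), then the overhead is paid once."""
    merge_cost = n_packets
    return merge_cost + overhead

# Example with made-up numbers: 64 packets, overhead of 10 units per call.
without = cost_without_gro(64, 10)  # grows like n * overhead
with_gro = cost_with_gro(64, 10)    # grows like n + overhead
```

With those made-up numbers the first function charges the overhead 64 times and the second only once, which is the whole point of merging before doing the expensive part.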

To go back to the original topic: for WireGuard, this would let us receive a bunch of packets encrypted with the same peer key in one go, which is where the performance improvement would come from. Unfortunately GRO, in its original TCP implementation, actually concatenates packets and gets rid of their headers, which would make us unable to identify the separate payloads to decrypt. Hence we need to dig deeper into…
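Here is a small Python sketch of why concatenation is a problem for us. It is a toy, not kernel code: the Packet class, the header strings and the payloads are all hypothetical, and each payload stands in for one independently encrypted message.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    header: bytes   # stands in for the per-packet header
    payload: bytes  # stands in for one independently encrypted message

def merge_concatenate(packets):
    """TCP-style merge in this toy: keep the first header, glue payloads together."""
    return Packet(packets[0].header, b"".join(p.payload for p in packets))

def merge_chain(packets):
    """Alternative: keep the packets as a list so each payload stays separate."""
    return list(packets)

packets = [Packet(b"H1", b"aaaa"), Packet(b"H2", b"bb"), Packet(b"H3", b"cccccc")]
glued = merge_concatenate(packets)
chained = merge_chain(packets)
```

After the concatenating merge there is one blob of 12 bytes and nothing left that says where one message ends and the next begins, which is fine for a byte stream but useless when each piece has to be decrypted on its own. The chained version keeps the three payloads apart, which is the kind of result we would want.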

The Linux Networking Subsystem

Before continuing further I’d like you to acknowledge that WireGuard transmits everything through UDP, so the TCP comment I made above doesn’t really stand for WireGuard, and as such more research on how GRO actually behaves inside the Linux kernel was needed. Which also meant dealing with a bunch of dragons, which is always fun. So we begin by looking at the offload definitions called during interrupts. Let’s begin with udp4_gro_receive for now. Hmmm, the most it does is call udp_gro_receive, so let’s go down the rabbit hole: this sets the same_flow var, aight aight, but what does this call_gro_receive_sk do and return? Oh, basically nothing, it just sets a value or returns a list, doesn’t seem to be helpful…
That might be in udp4_gro_complete then; after all gro_receive sets same_flow, so this should merge the packets! So it calls udp_gro_complete, I see, at least this is coherent and… wait… this only calls the module-side gro_complete… Oh yeah, because you see, I forgot to tell you, but gro_receive and gro_complete each have two definitions, one in the kernel and one in modules, so that you can define your own GRO behaviour; naming both the same is absolutely not confusing! Well, it seems our concatenation should happen on the module side of things then, great! So let’s look at vxlan for our use case. Wait. gro_receive doesn’t seem to do much apart from setting flags, and it uses call_gro_receive too, which we know does nothing, greaaaat, so that must be in gro_complete then! WAIT, another gro_complete call? eth_gro_complete? Sheesh, let’s look at it. OK, so now it calls gro_complete as a callback again, great, just what I wanted. That might be a bad driver then, let’s look at another! geneve, ah geneve my good friend, you must be doing something interesting, nope? Nope. Same deal. Argh. And what even is this NAPI on this struct caller, why is it not simply API? Net API maybe? Oh, it’s New API, of course, gotta wait for the newer API then. Let’s look at this then, what do you have about GRO, sweetie. AHAH, ANOTHER GRO_RECEIVE. To hell if I know why it wasn’t called, but it’s there! Hmm, yet another gro_receive? Gosh, how deep am I getting… Oh, it seems gro_complete is called if we go over the maximum length of the buffer, makes sense. Doesn’t seem to do much merging, but wait, does it link them? Hurray! Wait, is it linked there? And why is it not concatenated? Hold on, let me read up on the TCP GRO implementation… skb_gro_receive, yet another gro_receive only called by TCP, of course… Ah, here it is!
I guess there’s no concatenation for UDP because it usually doesn’t need it then, but then isn’t it unlikely to get down to this chaining, hmm… frag0, seems like another fun Linux term to learn, let’s read up about it: “and you want it to do the combining of data etc.. you can populate frags[] structure starting with the 2nd fragment till nth fragment” Eureka! This indeed seems to be where the listing happens! Now to move on to the next GRO step we should get the list from gro_complete I guess…?
And this is where our little story ends: you can’t. I’ll spare you the details but we realized that it was impossible to get a handle to the list in the module scope of gro_complete. Or so we thought! Right at the end of GSoC, and after asking the mailing list, we found an implementation that looked like what we wanted, but this was also the time I decided to move on and try not to waste my entire summer on network dragons.
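For readers who got lost in the maze, here is a toy Python model of the shape of the problem as I described it. It is emphatically not the kernel's code: ToyGro, receive, flush and module_complete are hypothetical names, and the only things it tries to mirror are the same_flow check on receive and the fact that the module-side completion callback is handed a single packet rather than the list of held ones.

```python
from collections import defaultdict

class ToyGro:
    """Toy receive-offload layer; not the kernel's actual data structures."""

    def __init__(self, module_complete):
        self.held = defaultdict(list)           # flow -> packets held back so far
        self.module_complete = module_complete  # stands in for a module's gro_complete

    def receive(self, flow, data):
        """Hold the packet; report whether it matched an already-held flow."""
        same_flow = flow in self.held
        self.held[flow].append(data)
        return same_flow

    def flush(self):
        """Hand each flow's first packet to the module callback, then reset."""
        results = [self.module_complete(pkts[0]) for pkts in self.held.values()]
        self.held.clear()
        return results

# The callback here just returns what it was given, so we can see what it saw.
gro = ToyGro(module_complete=lambda head: head)
flags = [gro.receive("flow-a", b"1"), gro.receive("flow-a", b"2"), gro.receive("flow-b", b"3")]
seen_by_module = gro.flush()
```

In this toy, the second packet of flow-a is held alongside the first, yet the callback never gets to see it. That is the wall we kept running into: the list exists, but not where the module can reach it.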

Conclusion

And we are done! Hope you liked this small article. I just wanted to show you a bit of the mess that work was (and to rant, let’s be honest) instead of doing a boring summary of my work; hopefully this will have entertained someone! As I noted at the beginning of this article, this GRO issue is far from the only thing I’ve done (even though it is the thing I’ll rant about for quite a bit), and so here is an exhaustive list of what I’ve done this summer:

A clang-format for WireGuard
Fixed a bitmath issue with allowedips’s trie before Jason (WireGuard’s lead developer) published his
Began working on an mpmc ring buffer before switching to GRO and leaving that to Thomas Gschwantner and Jonathan Neuschäfer (more info about that in their work reports).
Did a bunch of research on GRO before figuring out it was impossible to get a list of packets from gro_complete as needed for decryption
Ate udon with Jason and figured out GRO was mostly impossible
Asked the mailing list and found a patchset, released right at the end of GSoC, that seemed to make this doable
Looked at various things, from userspace dev (Martin’s former project) to Dynamic Queue Limits, not really leading to anything of interest
Made this article

And finally, let’s conclude with my thoughts on this entire program: it was amazing. It had ups and downs, I was a bit burnt out at the end and feel like I could have done more, but I worked with such amazing people (shoutouts to the entire WireGuard dev team, you know who you are!) on such an interesting project that I was psyched. I’m not a fan of everything Google (even though I still think it is the lesser evil when compared to Facebook, Apple and Microsoft) but I think I’d honestly recommend this program to any student who has spare time: working on something you like, with great people (in my case at least), publishing all your results publicly, putting Google on your résumé and being paid for it sounds like quite a fine deal to me! On a more personal note, I’m also quite happy to have worked on a project which Linus Torvalds called a “work of art”.

As a last note I’d like to encourage you to check out Jonathan Neuschäfer’s work summary and Thomas Gschwantner’s. Those two were students working on WireGuard just like me this summer, and their reports might give you additional insights on things which I mostly skipped over.
