Before this article truly begins I’d like to explain what I’ve done in the last
few months and what this article is actually about.
As some of you may know, I worked on WireGuard as a part of the Google Summer of
Code for the last 3 months. GSoC is
basically contract work for students: if you are selected, you get a stipend while you work for
an open source organization of your choice, in my case the Linux Foundation.
This article will explain the work I’ve done and some other stuff related to it.
I am writing this article as a part of the program, which requires me to
showcase the work I did once the work period has ended. That said, I was not forced to make it
a blog article, nor told to say anything in particular apart from showcasing my work, hence
I’ll organize it this way:
At the end of this post there is a crude list of the tasks I’ve
accomplished during this summer and a conclusion about the GSoC et
cetera. Everything before it is about WireGuard, the Linux kernel
itself and my GRO research overall. That part is larger in scope and explains more, so that this blog post is
interesting even to people without in-depth knowledge of networks
or WireGuard.
This Preface looked way too serious, nah? Let’s get back to my old silly style of writing for what’s next!
Intro to WireGuard
Yeah, I know, boring stuff but we gotta pass through it for those that don’t know.
So WireGuard, for those that can’t read its website, is a simple layer 3 VPN protocol which aims to be secure, sneaky and simple. In a nutshell you exchange public keys a la ssh, each peer gets a list of Allowed IPs, and together those form the Cryptokey Routing Table, which decides which peer a packet is sent to or accepted from.
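If the Cryptokey Routing Table sounds abstract, here is a toy model of the idea in Python. It is only a sketch: the real thing lives in C inside the kernel and uses a trie, and the class name, method names and key strings below are all made up for illustration.

```python
import ipaddress


class ToyCryptokeyRoutingTable:
    """Toy model: maps Allowed IP ranges to peer public keys (hypothetical API)."""

    def __init__(self):
        self._routes = []  # list of (network, public_key) pairs

    def allow(self, public_key, cidr):
        self._routes.append((ipaddress.ip_network(cidr), public_key))

    def lookup(self, address):
        """Return the key of the peer owning the most specific matching range."""
        addr = ipaddress.ip_address(address)
        best = None
        for network, key in self._routes:
            if addr.version == network.version and addr in network:
                if best is None or network.prefixlen > best[0].prefixlen:
                    best = (network, key)
        return None if best is None else best[1]

    def accepts(self, public_key, source_address):
        """An incoming packet is only accepted if its source belongs to that peer."""
        return self.lookup(source_address) == public_key


table = ToyCryptokeyRoutingTable()
table.allow("peer-A-pubkey", "10.0.0.0/24")   # placeholder key strings
table.allow("peer-B-pubkey", "10.0.0.42/32")

print(table.lookup("10.0.0.7"))    # falls in peer A's range
print(table.lookup("10.0.0.42"))   # the more specific /32 wins
```

The same table is used in both directions: outgoing packets pick a peer by destination address, incoming ones are checked against the source address.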
What’s great with this, however, is that on top of being more efficient than older protocols it is actually much smaller, too. Take for example this Lines of Code (LoC) count from the latest WireGuard git tree at the time of writing, with tests, compatibility hacks and crypto removed:
➜ src git:(grt/dql) ✗ cloc *
      34 text files.
      33 unique files.
      21 files ignored.
github.com/AlDanial/cloc v 1.76  T=1.01 s (27.8 files/s, 5040.1 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               14            623            298           3294
C/C++ Header                    13            133             77            574
make                             1             20              3             63
-------------------------------------------------------------------------------
SUM:                            28            776            378           3931
-------------------------------------------------------------------------------
If that isn’t directly obvious to you: having an in-kernel VPN protocol this small gives it a much smaller attack surface, which helps security. In our case it helps performance too, although you cannot really derive performance from a LoC count.
So yay, got through my propaganda? Now for the real talk: while the project is fancy and all, I was contracted to fix several TODOs to speed up WireGuard’s acceptance into mainline Linux. As a matter of fact, it is currently on the Linux Kernel Mailing List and Linus Torvalds himself is pressing for its adoption, so I’d say we’re pretty good so far!
Since this was contract work and I spent most of my summer on ONE task, I’m going to talk about that one instead of going through the codebase and explaining what could be better or what couldn’t, because:
- it’s easier for me
- I’m going to get yelled at for ranting like an old man about stuff that I think could be improved while missing the critical knowledge of Linux needed to understand why that is dumb.
Generic Receive Offload
So this big thing that I’ve wasted almost all my summer on is called GRO, short for Generic Receive Offload. To understand what it does let’s talk about internet protocols (and bear with me for this crazy story or just skip the paragraph if you know everything):
A long time ago, there lived a messenger, and angry people wanting to talk
throughout the country. After a bit of inner thinking and pondering about the
feasibility of teleportation, he decided that before anything else he needed to
identify everyone and find them, and so he gave them addresses, which he called
IPs. Those were arranged so that he could easily find a house without knowing
any map of the country. Now that the messenger had given everyone an IP, he
could finally help the villagers send their packages! So he got a bag in
which he threw everyone’s packages, got several other messengers to deliver
packages with him and split the big packages into packets. Then the messengers ran as
fast as they could to the specified addresses to deliver those packets, which he called UDP (the guy
had a thing for weird names).
Unfortunately, UDP had issues: the messengers were running so fast that
sometimes they lost packets, and they had no real delivery order, so packets
could be received out of order, which was unacceptable to our needy citizens.
Our messenger then had a brilliant idea, which was TCP: he labelled each
package, to know which one to send first, and would make sure the recipient
received the package, otherwise he would make a copy himself and deliver it
again.
But, alas, there was yet another issue: what happened to your expensive huge
Chinese vase that was broken into pieces for easier delivery? Well, the
deliveryman had yet another trick up his sleeve: a glue that can piece
every packet back together, lowering the work needed by the recipient; this is GRO.
Understood that weird story? No? Fine then let’s continue!
To stop with the eccentric stories: GRO is very useful in our case since
every small packet actually has a checksum to verify its integrity, so that we know the
data is indeed correct. But that is expensive, computation-wise, and so GRO was
born: we go from a theoretical complexity of \(O(n \times jitter)\) to \(O(n+jitter)\),
where \(n\) is the number of packets and jitter is the cost of calling the checksum verification function.
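To put numbers on that, here is a tiny back-of-the-envelope model in Python. Everything in it is an assumption for illustration: the function names, the idea that each call costs a fixed overhead plus one unit per byte, and the made-up sizes and costs.

```python
def cost_without_gro(packet_sizes, per_call_cost, per_byte_cost=1):
    """Every packet pays the fixed call overhead on its own."""
    return sum(per_call_cost + per_byte_cost * size for size in packet_sizes)


def cost_with_gro(packet_sizes, per_call_cost, per_byte_cost=1):
    """Packets are merged first, so the fixed overhead is paid only once."""
    if not packet_sizes:
        return 0
    return per_call_cost + per_byte_cost * sum(packet_sizes)


sizes = [1400] * 10  # ten made-up packets of 1400 bytes each
print(cost_without_gro(sizes, per_call_cost=500))  # 10 * (500 + 1400) = 19000
print(cost_with_gro(sizes, per_call_cost=500))     # 500 + 10 * 1400 = 14500
```

The bytes still have to be looked at either way; what shrinks is the number of times the fixed overhead is paid.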
To go back to the original topic: for WireGuard, this would hand us a bunch of packets encrypted with the same peer key in one go, which is where the performance improvement would come from. Unfortunately GRO, in its original TCP implementation, actually concatenates packets and gets rid of their headers, which would make us unable to identify the separate payloads to decrypt. Hence we need to dig deeper into…
The Linux Networking Subsystem
Before continuing further I’d like you to note that WireGuard transmits
everything over UDP, so the TCP comment I made above doesn’t really
apply to WireGuard, and more research on how GRO actually behaves inside
the Linux kernel was needed. Which also meant dealing with a bunch of
dragons, which is always fun.
So let’s look at the offload definition called during
interrupts,
starting with
udp4_gro_receive
for now, hmmm, the most it does is call
udp_gro_receive
so let’s go down the rabbit hole, this sets the same_flow var, aight aight, but
what does this
call_gro_receive_sk
do and return? Oh, basically nothing, it just sets a value or returns a list,
doesn’t seem to be helpful…
That might be in
udp4_gro_complete
then, after all gro_receive sets the same_flow, so this should merge the
packets! So calling
udp_gro_complete
I see, at least this is coherent and… wait… this only calls the module-side
gro_complete..
Oh yeah, because you see, I forgot to tell you: gro_receive and gro_complete each have two
definitions, one in the kernel core and one in modules, so that a module can define its own GRO
behaviour; naming both the same is absolutely not confusing!
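To make that layering a bit less confusing, here is a toy Python model of it. This is not kernel code: the classes, the port-based lookup and the signatures are simplifications I made up, and only the hook names gro_receive and gro_complete come from the kernel.

```python
class ToyTunnelModule:
    """Hypothetical stand-in for a module (think vxlan) with its own GRO hooks."""

    def __init__(self):
        self.calls = []

    def gro_receive(self, held, packet):
        self.calls.append("module gro_receive")
        return held + [packet]

    def gro_complete(self, held):
        self.calls.append("module gro_complete")
        return held


class ToyCoreOffload:
    """Hypothetical stand-in for the kernel-side functions of the same name."""

    def __init__(self):
        self.modules_by_port = {}

    def register(self, port, module):
        self.modules_by_port[port] = module

    def gro_receive(self, port, held, packet):
        module = self.modules_by_port.get(port)
        if module is None:
            return None  # nobody claimed this port, nothing is held back
        return module.gro_receive(held, packet)

    def gro_complete(self, port, held):
        module = self.modules_by_port.get(port)
        if module is None:
            return held
        return module.gro_complete(held)


core = ToyCoreOffload()
tunnel = ToyTunnelModule()
core.register(4242, tunnel)  # arbitrary port number for the example

held = []
for packet in (b"first", b"second"):
    held = core.gro_receive(4242, held, packet)
flushed = core.gro_complete(4242, held)
print(tunnel.calls)
```

The point of the sketch is only that the core side mostly forwards to whatever the module registered, which is why reading the core functions alone tells you so little.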
Well, it seems our concatenation should happen on the module side of things then,
great! So let’s look at vxlan for our use
case.
Wait. gro_receive doesn’t seem to do much apart from setting flags, and it uses
call_gro_receive too which we know is doing nothing, greaaaat, so that must be
in gro_complete then!
WAIT
another gro_complete call? eth_gro_complete? Sheesh, let’s look at
it.
Ok so now it calls gro_complete as a callback again, great, just what I wanted.
That might be a bad driver then, let’s look at another! geneve, ah geneve my
good
friend
you must be doing something interesting, nope? Nope. Same deal. Argh. And what
even is this NAPI on this struct caller, why is it not simply API? Net API
maybe? Oh, it’s New API, of course, gotta wait for the newer API
then. Let’s look at this then, what do
you have about gro, sweetie. AHAH, ANOTHER
GRO_RECEIVE.
To hell if I know why it wasn’t called, but it’s there! Hmm, yet another
gro_receive?
Gosh how deep am I getting… Oh it seems gro_complete is called if we go over
the maximum length of the
buffer,
makes sense. Doesn’t seem to do much merging but wait, does it link
them?
Hurray! Wait, is it linked there? And why is it not concatenated? Hold on, let
me read up on the TCP GRO implementation… skb_gro_receive, yet another gro_receive
only called by TCP, of course… Ah, here it
is!
I guess there’s no concatenation for UDP because it usually doesn’t need it
then, but then isn’t it unlikely to get down to this chaining, hmm… frag0,
seems like another fun Linux term to
learn,
let’s read up about it:
“and you want it to do the combining of data etc.. you can populate frags[]
structure starting with the 2nd fragment till nth
fragment”
Eureka! This indeed seems to be where the listing
happens!
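The difference between gluing payloads together and chaining packets is the whole reason this matters for WireGuard, so here is a toy Python illustration. The tuples, the dictionary layout and the function names are all invented for the example; the kernel works on skb structures in C, not on this.

```python
def merge_by_concatenation(packets):
    """TCP-style glue: headers are dropped, payloads become one byte string."""
    return b"".join(payload for _header, payload in packets)


def merge_by_chaining(packets):
    """List-style: one head packet, the others hang off it as fragments."""
    if not packets:
        return None
    head, *rest = packets
    return {"head": head, "frags": list(rest)}


def payloads_of(chained):
    """Walk the chain and get every payload back, boundaries intact."""
    if chained is None:
        return []
    return [chained["head"][1]] + [payload for _header, payload in chained["frags"]]


packets = [(b"hdr1", b"aaa"), (b"hdr2", b"bb")]  # (header, payload) pairs
print(merge_by_concatenation(packets))           # one blob, boundaries are gone
print(payloads_of(merge_by_chaining(packets)))   # each payload still separate
```

With the concatenated blob there is no way to tell where one encrypted payload ends and the next begins, while the chained version keeps each one available for decryption.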
Now to move on to the next GRO step we should get the list from gro_complete I
guess…?
And this is where our little story ends: you can’t. I’ll spare you the details
but we realized that it was impossible to get a handle on the list from within the module-side
gro_complete. Or so we thought! Right at the end of GSoC, and after asking the
mailing list, we found an implementation that looked like what we
wanted,
but this was also the time I decided to move on and try not to waste my entire
summer on network dragons.
Conclusion
And we are done! Hope you liked this small article. I just wanted to show you a bit of the mess that this work was (and to rant, let’s be honest) instead of doing a boring summary of my work; hopefully this will have entertained someone! As I noted at the beginning of this article, this GRO issue is far from the only thing I’ve done (even though it is the thing I’ll rant about for quite a bit), and so here is an exhaustive list of what I’ve done this summer:
- A clang-format for WireGuard
- Fixed a bitmath issue with allowedips’s trie before Jason (WireGuard’s lead developer) published his own fix
- Began working on an mpmc ring buffer before switching to GRO and leaving that to Thomas Gschwantner and Jonathan Neuschäfer (more info about that in their work reports).
- Did a bunch of research on GRO before figuring out it was impossible to get a list of packets from gro_complete as needed for decryption
- Ate udon with Jason and figured out GRO was mostly impossible
- Asked the mailing list and found a patchset that came out right at the end of GSoC and seemed to make this doable
- Looked at various things, from userspace dev (Martin’s former project) to Dynamic Queue Limits, not really leading to anything of interest
- Made this article
And finally, let’s conclude with my thoughts on this entire program: it was amazing. It had ups and downs, I was a bit burnt out at the end and I feel like I could have done more, but I worked with such amazing people (shoutouts to the entire WireGuard dev team, you know who you are!) on such an interesting project that I was psyched. I’m not a fan of everything Google (even though I still think it is the lesser evil when compared to Facebook, Apple and Microsoft) but I’d honestly recommend this program to any student who has spare time: working on something you like, with great people (in my case at least), publishing all your results publicly, putting Google on your résumé and being paid for it sounds like quite a fine deal to me! On a more personal note, I’m also quite happy to have worked on a project which Linus Torvalds called a “work of art”.
As a last note I’d like to encourage you to check out Jonathan Neuschäfer’s work summary and Thomas Gschwantner’s. Those two were students working on WireGuard just like me this summer, and their reports might give you additional insights on things which I mostly skipped over.