Copy Hunting
Jul 26, 2023
I have a super power. I want to share it in this blog post so that you, my dear reader, also have it!
This is the super power: I know that LLVM IR is text. Let me explain.
When we hack on something, usually it is enough to study the source code. Whether we are implementing a new feature, hunting down a bug, or optimizing performance, usually the mapping between the source code and the resulting machine code is transparent enough that we can work on the level of our language, like Rust or Zig.
Sometimes though, you need to look deeper. For example, if you are optimizing binary size, the relation between the amount of source code and the length of the resulting ELF file is much less clear. The natural first answer when looking into these sorts of problems is to study the resulting binary. Tools like `nm`, `objdump`, or even compiler explorer come to mind. But in practice, working with raw disassembly is not efficient—it is too remote from the original source code, too target specific, too AT&T syntax by default.
What if… the next time you think to yourself “oh dear, I have to look at the assembly to solve this problem”, you think “wow, I can look at the LLVM IR to tackle this!”?
LLVM IR is pretty low-level, so there’s a rather direct correspondence between IR instructions and generated assembly instructions. At the same time, LLVM IR is actually also high-level! It is somewhat target independent, it has better connection with the source code, and it is more human oriented—instruction names are usually obvious words, rather than cryptic mnemonics, variables are symbolic, rather than mapped to a finite number of registers, etc. And, the killer feature, LLVM IR has a standard, good, readable textual representation. You can read LLVM IR without studying the spec, and you can grep it. Recently, we’ve used a “grep llvm-ir” trick to make a dent in two different problems at TigerBeetle.
Needless memcpy
One performance bug that we have hit a few times in TigerBeetle is the problem of implicit memcopies. This is a problem that Rust and Zig both share—naive compilation of idiomatic code can introduce a bunch of local variable copies. Removing these copies requires some compiler smartness, and the work is underway in both Zig and Rust to implement relevant optimizations. While the smartness isn’t fully there yet, it’s on the programmer to write the code so that the compiled result is optimal with respect to copies.
How do we use our new super power to solve the problem? Solving real problems is hard, so let’s invent a fake, simple problem first. Once we’ve solved that, we can move onto the real thing.
One particular issue we fixed a couple of times in TigerBeetle is replacing by-value with by-pointer loops:
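The fix boils down to switching the loop’s capture from by-value to by-pointer. A minimal sketch in Zig (`items` and `process` are hypothetical names, not TigerBeetle code):

```zig
// Before: `item` is captured by value, so each iteration may copy
// the whole element just to read a part of it.
for (items) |item| {
    process(item);
}

// After: `item` is captured by pointer; no copy is made.
for (items) |*item| {
    process(item.*);
}
```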
When compiling with optimizations, even older Zig versions such as 0.9.1 often can eliminate the copy here, but this doesn’t always happen. We want to find cases where it doesn’t.
Let’s try to weaponize this example. To do that, we need to create a surrounding context for the `for` loop which is as small as possible, but still shows the problem (ideally, in an obvious form).
So let’s start with defining a gigantic struct with a small field:
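A sketch of such a struct (field names are illustrative; the sizes are chosen so that a copy is obvious in the IR):

```zig
const Big = struct {
    ballast: [4096]u8, // the gigantic part we never read
    small: u32, // the only field we actually use
};
```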
We can now plug it into a `for` loop, which only uses a small part of the struct:
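A possible shape for that loop (`xs` and `use_small` are hypothetical names):

```zig
var total: u32 = 0;
for (xs) |x| {
    // `x` is captured by value: a naive compilation copies the whole
    // `Big` struct just to read `x.small`.
    total +%= use_small(x.small);
}
```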
Note how I abstract a detail, the body of the `for` loop, into a function. To complete our example, we need to get the `xs` slice. We could manually initialize an array of some `Big` structs, but that’s cumbersome. When crafting minimal examples, we can use a standard trick to conjure anything out of thin air by defining a function that gets whatever is needed through parameters:
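Applying that trick, the loop can receive its slice as a parameter (a sketch; names are illustrative):

```zig
fn use_big(xs: []const Big) u32 {
    var total: u32 = 0;
    for (xs) |x| {
        total +%= use_small(x.small);
    }
    return total;
}
```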
We are getting somewhere! Now we need to compile this to an actual artifact to get our LLVM IR. Because we are only interested in our `use_big` function, we better compile a library. But to force Zig to actually compile anything, we need to mark our function as part of the library API, which means it must also follow the C ABI. So the complete minimal example can look like this:
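Putting the pieces together, the whole file might look like this (a sketch, not the post’s exact code; the C ABI has no slices, so the slice arrives as a pointer plus a length):

```zig
const Big = struct {
    ballast: [4096]u8,
    small: u32,
};

fn use_small(small: u32) u32 {
    return small;
}

export fn use_big(xs_ptr: [*]const Big, xs_len: usize) u32 {
    const xs = xs_ptr[0..xs_len];
    var total: u32 = 0;
    for (xs) |x| {
        total +%= use_small(x.small);
    }
    return total;
}
```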
How do we compile that to LLVM IR?
Ok, got it, it’s `-femit-llvm-ir`, let’s do it!
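Something like the following should work (the file name is assumed; check `zig build-lib --help` for your version’s exact flags):

```console
$ zig build-lib -O ReleaseFast -femit-llvm-ir example.zig
```

This should drop an `example.ll` file next to the source.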
Perfect! Let’s look at what is inside that .ll file!
Some gibberish! Let’s try to search for our `use_big` function:
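Plain UNIX tooling is enough here (file name assumed):

```console
$ grep -n -A 16 'define .*use_big' example.ll
```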
Hey, it’s still gibberish, but we were able to find our `use_big` function. And we even see `memcpy`! And that’s the thing I like about LLVM IR—I know very little about the x86_64 instruction set, and even less about LLVM IR, but I am able to muddle through `.ll` just because it is text.
Looking at the docs for this LLVM intrinsic, we see that the third argument is the length. And, indeed, `i64 4100` looks like a big number, corresponding to `@sizeOf(Big)`.
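The call in question looks roughly like this (names and exact pointer types vary with the LLVM version, so treat this as an illustration):

```llvm
call void @llvm.memcpy.p0.p0.i64(ptr align 4 %x, ptr align 4 %src, i64 4100, i1 false)
```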
So here’s the plan for our copy-finding tool—read the `.ll` file line by line, notice the current `define`, look for lines with `memcpy`, do some comma counting to find the third argument, and check it against a threshold.
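A sketch of that scanner in Zig (this is the idea, not the real copyhound.zig; the std APIs shown match recent Zig versions and may need adjusting for yours):

```zig
const std = @import("std");

pub fn main() !void {
    const threshold: u64 = 1024; // report copies bigger than this many bytes

    const stdin = std.io.getStdIn().reader();
    var line_buf: [1 << 16]u8 = undefined;
    var define_buf: [1024]u8 = undefined;
    var define_len: usize = 0;

    while (try stdin.readUntilDelimiterOrEof(&line_buf, '\n')) |line| {
        if (std.mem.startsWith(u8, line, "define ")) {
            // Remember which function we are currently inside of.
            define_len = @min(line.len, define_buf.len);
            @memcpy(define_buf[0..define_len], line[0..define_len]);
        } else if (std.mem.indexOf(u8, line, "@llvm.memcpy") != null) {
            // The length is the third argument: count commas to find it.
            var args = std.mem.splitScalar(u8, line, ',');
            _ = args.next(); // destination
            _ = args.next(); // source
            const len_arg = args.next() orelse continue; // e.g. " i64 4100"
            var words = std.mem.tokenizeScalar(u8, len_arg, ' ');
            _ = words.next(); // the "i64" type
            const digits = words.next() orelse continue;
            const size = std.fmt.parseInt(u64, digits, 10) catch continue;
            if (size > threshold) {
                std.debug.print("{s}: memcpy of {d} bytes\n", .{ define_buf[0..define_len], size });
            }
        }
    }
}
```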
At this point you might be wondering—why parse `.ll` line by line? Can we just, like, take an ll-file parser, parse that into a data structure, build a call-graph, and otherwise act like grown-up compiler engineers? I also wondered about it! I tried one `.ll` parsing library, but it required linking against LLVM and crashed on my `.ll` file, so I figured some text processing would be more robust for my purpose here.
Anyway, the relatively self-contained tool can be found here: copyhound.zig. Feel free to use it yourself, like scheibo is already doing!
Copies found
So, was this useful? A bit! We haven’t fully productionized this and put it into our CI, but some ad-hoc investigations uncovered a couple of curiosities. For example, a bunch of copies were traced back to the `std.meta.eql` function. This is an extremely cute Zig version of Rust’s `PartialEq`, which compares two values by comparing every field, roughly like this:
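A simplified sketch of the idea (the real `std.meta.eql` also handles arrays, optionals, unions, and more):

```zig
pub fn eql(a: anytype, b: @TypeOf(a)) bool {
    const T = @TypeOf(a);
    switch (@typeInfo(T)) {
        .Struct => |info| {
            // Recursively compare every field of the struct.
            inline for (info.fields) |field| {
                if (!eql(@field(a, field.name), @field(b, field.name)))
                    return false;
            }
            return true;
        },
        else => return a == b,
    }
}
```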
This is great for comparing “objects” with complex structure. But in TigerBeetle, very few things are such objects. Most of the things we work with are cache-line aligned bags-of-bytes without pointers or padding. And when we compare things, very often it is for assertions where we expect things to be equal most of the time.
Thinking about this use-case suggests a more elegant comparison algorithm—just compare the underlying bytes directly. Additionally, given that we carefully align all our data, the comparison routine can take advantage of that, and compare chunks of memory at the same time. And Zig is just perfect for implementing this kind of optimized comparison routine, because we can use comptime capabilities to directly express this optimally:
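A sketch of such a routine (assumptions: `T` has no padding or pointers, its size divides into words, and it is at least word-aligned; names are illustrative, not TigerBeetle’s exact code):

```zig
const std = @import("std");

pub fn equal_bytes(comptime T: type, a: *const T, b: *const T) bool {
    // Only valid for types whose every byte is significant:
    // no padding, no pointers.
    comptime std.debug.assert(@sizeOf(T) % @sizeOf(u64) == 0);
    comptime std.debug.assert(@alignOf(T) >= @alignOf(u64));

    // Reinterpret both values as slices of words and compare chunk-wise.
    const a_words = std.mem.bytesAsSlice(u64, std.mem.asBytes(a));
    const b_words = std.mem.bytesAsSlice(u64, std.mem.asBytes(b));
    return std.mem.eql(u64, a_words, b_words);
}
```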
For fun, I also tried running copyhound on the Zig compiler itself, and it found a curious issue!
`AstGen.numberLiteral`, a function which parses numeric tokens into numeric values (that is, `"92"` to `92`), uses ten kilobytes of stack.
The root cause was a slow path for parsing floating point numbers. Parsing floats is a surprisingly gnarly problem, because they are specified as base-10 in the source code, but the actual IEEE-754 float value is base-2. So, while most simple values can be dealt with efficiently, sometimes a slow path is needed, which in Zig requires a lot of stack space. And LLVM was happily inlining this slow path! Although the code for the slow path was rarely executed, the function’s frame size had to account for its memory every time. The fix was to mark the function in question as `@setCold(true)`.
Tracking Binary Size
After writing `copyhound`, I realized that it solves one other kind of copy problem as well!
At TigerBeetle, we also care about binary size. Well, we are not actively trying to shrink the binary just yet, but we are keeping an eye on size. And Zig, like Rust and C++, has a potential gotcha here—it’s very easy to write source code that, while small on its own, uses comptime parameters that cause a combinatorial explosion of binary size, when the same function gets repeatedly monomorphized with different compile-time parameters. A function like:
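For example (a made-up sketch; any `comptime`-parameterized function behaves this way):

```zig
fn f(comptime T: type, x: T) T {
    return x;
}
```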
is duplicated in the machine code for every value of `T` it is actually used with.
For the following example:
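Concretely, consider a sketch along these lines, where the same generic function is instantiated with two types:

```zig
fn f(comptime T: type, x: T) T {
    return x;
}

export fn g(x: u32, y: u64) u64 {
    // Two distinct instantiations: f(u32, ...) and f(u64, ...).
    return f(u32, x) + f(u64, y);
}
```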
We get the following IR:
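In the IR, the two instantiations show up as two separate defines, roughly like this (the mangled names are illustrative; real ones differ by compiler version):

```llvm
define internal fastcc i32 @example.f__u32(i32 %x) { ... }
define internal fastcc i64 @example.f__u64(i64 %x) { ... }
```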
As you can see, `f` is repeated twice. But because the repetition already exists at the LLVM IR level, we can look at the IR to find functions which contribute most to the combinatorial explosion. To do this, we need to adjust our line-by-line processing of `.ll` files as follows:
1. When we parse a `define` line, extract the polymorphic name of the function. This mostly amounts to removing the generic arguments between `()` from the function name.
2. Instead of looking for `memcpy` calls, count the total number of lines comprising the body of each function.
3. Group by the extracted name, summing up the total sizes.
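The grouping step can be sketched like this (hypothetical helper, using a hash map from extracted name to total line count; keys are assumed to outlive the map):

```zig
const std = @import("std");

fn track_function_size(
    sizes: *std.StringHashMap(u64),
    generic_name: []const u8,
    body_lines: u64,
) !void {
    // Group all monomorphized copies under one polymorphic name,
    // summing up their sizes.
    const entry = try sizes.getOrPut(generic_name);
    if (!entry.found_existing) entry.value_ptr.* = 0;
    entry.value_ptr.* += body_lines;
}
```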
This is the same idea that cargo-llvm-lines uses for Rust. That’s a theme—any trick you do with LLVM IR would work for any LLVM-based language.
Did this find any curiosities? You bet! Turns out, one of the most bloated functions in TigerBeetle was the code responsible for:
In verbose mode, this outputs compile-time configuration for TigerBeetle, using the following function:
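The function looked roughly like this (a simplified sketch, not the exact TigerBeetle code):

```zig
fn print_value(
    writer: anytype,
    comptime field: []const u8,
    comptime value: anytype,
) !void {
    // Monomorphized anew for every distinct (field, value) pair.
    try writer.print(field ++ "={any}\n", .{value});
}
```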
As you see, it is monomorphized for each field name and value. But there’s no need for that! The following works and shaves off 300 bytes from the release binary:
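The fix, sketched: make the function non-generic by passing plain runtime slices, doing any per-value formatting at the call site (for comptime-known values, `std.fmt.comptimePrint` can produce the string at compile time):

```zig
fn print_value(writer: anytype, field: []const u8, value: []const u8) !void {
    // A single copy in the binary: both arguments are runtime slices.
    try writer.print("{s}={s}\n", .{ field, value });
}
```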
Let’s recap the tricks from the post:
- LLVM IR in many cases can be a convenient substitute for assembly.
- When debugging something related to compilers, it’s important to first come up with a minimal program to experiment with.
- You don’t need to write a whole program with `main`; you can write a single function which accepts everything needed as an argument.
- To compile just a single function, you can compile a library (but don’t forget to make the function public).
- Any compiler which uses LLVM should have a way to produce a textual file with LLVM IR; look for an `--emit-llvm` argument.
- You can open the resulting `.ll` file in a text editor and `Ctrl+F` the function names you are interested in.
- You can also do UNIX-style ad-hoc text processing of the result, which might show interesting properties of your code.