Caches are fast, hashes are fast

23 Mar 2017 • Caches are fast, hashes are fast

Or, how to make a C/C++ build system in 2017

Here's a problem I've been having at work a lot lately:

Put a change set up for review
Someone asks me to split a commit into its own change set so we can merge it faster
Make a new branch, cherry-pick, and swap back
Keep working on my original changes. At some point I will want to build them
MSbuild/make see that lots of files have new timestamps because I checked out master, and builds all of them even though they haven't actually changed
C++ is stupid and my Windows build is in a VM so it takes fucking ages

Obviously the dream solution here would be to have good compilers(*) and/or a good language, but neither of those are going to happen any time soon.

*: as an aside that would solve one of the big pain points with C++. Everybody goes to these massive efforts splitting up the build so they can build incrementally which introduces all of its own headaches and tracking dependencies and making sure everything is up to date and etc and it's just awful. If compilers were fast we could just build everything at once every time and not have to deal with it.

Anyway since they aren't going to happen, the best we can do is throw computing power at the problem. Most of them build some kind of dependency graph, e.g. bin.exe depends on mod1.obj which depends on mod1.cpp, then looks at what has been modified and recompiles everything that depends on it. Specifically they look at the last modified times and if a file's dependencies are newer than it then you need to rebuild.

Maybe that was a good idea decades ago, but these days everything is in cache all of the time and CPUs are ungodly fast, so why not take advantage of that, and just actually check if a file's contents are different? I ran some experiments with this at work. We have 55MB of code (including all the 3rd party stuff we keep in the repo - btw the big offenders are like qt and the FBX SDK, it's not our codebase with the insane bloat), and catting it all takes 50ms. We have hashes that break 10GB/s (e.g. xxhash), which will only add like 5 ms on top of that. (and probably much closer to 0 if you parallelise it)

So 55ms. I'm pretty sure we don't have a single file in our codebase that builds in 55ms.

From this point it's pretty clear what to do: for each file you hash all of its inputs and munge them together, and if the final hash is different from last time you rebuild. Don't cache any of the intermediate results just do all the work every time, wasting 55ms per build is much less bad than getting it wrong and wasting minutes of my time. Btw inputs should also include things like command line flags, like how ninja does it.

The only slightly hard part is making sure hashes don't get out of sync with reality. Luckily I'm a genius and solved that too: you just put the hash in the object file/binary. With ELF files you can just add a new section called .build_input_hash or something and dump it in there, presumably you can do the same on Windows too (maybe IMAGE_SCN_LNK_INFO? I spent a few minutes googling and couldn't find an immediate answer).

For codegen stages you would either just run them all the time or do the timestamp trick I guess, since we are ignoring their timestamps and hopefully your codegen is not slow enough for it to matter very much.

Anyone want to work on this? I sure don't because my god it's boring, but I wish someone else would.

UPDATE: I've been told you can work around my specific example with git branch and git rebase --onto (of course), but this would still be nice to have.