mikejsavage.co.uk / blog

02 Dec 2017 / C++ tricks: macro to disable optimisations

This is pretty simple so here you go:

#if COMPILER_MSVC
#  define DISABLE_OPTIMISATIONS() __pragma( optimize( "", off ) )
#  define ENABLE_OPTIMISATIONS() __pragma( optimize( "", on ) )
#elif COMPILER_GCC
#  define DISABLE_OPTIMISATIONS() \
        _Pragma( "GCC push_options" ) \
        _Pragma( "GCC optimize (\"O0\")" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "GCC pop_options" )
#elif COMPILER_CLANG
#  define DISABLE_OPTIMISATIONS() _Pragma( "clang optimize off" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "clang optimize on" )
#else
#  error new compiler
#endif

I use it in ggformat to disable optimisations on the bits that use variadic templates. For some reason variadic templates generate completely ridiculous amounts of object code and making the compiler slog through that takes a while. Not a big deal though, printing is slow anyway so disabling optimisations isn't a huge issue.

It's also handy for debugging code that's too slow in debug mode. You put DISABLE_OPTIMISATIONS around the code you want to step through, and leave everything else optimised.

BTW if you don't have COMPILER_MSVC etc macros, they look like this:

#if defined( _MSC_VER )
#  define COMPILER_MSVC 1
#elif defined( __clang__ )
#  define COMPILER_CLANG 1
#elif defined( __GNUC__ )
#  define COMPILER_GCC 1
#else
#  error new compiler
#endif
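
Putting the two snippets together, usage looks roughly like this. It's a minimal sketch: the compiler detection is folded straight into the macro definitions, and sum_to and double_sum_to are made-up functions that just stand in for "code you want to step through" and "everything else".

```cpp
#if defined( _MSC_VER )
#  define DISABLE_OPTIMISATIONS() __pragma( optimize( "", off ) )
#  define ENABLE_OPTIMISATIONS() __pragma( optimize( "", on ) )
#elif defined( __clang__ )
#  define DISABLE_OPTIMISATIONS() _Pragma( "clang optimize off" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "clang optimize on" )
#elif defined( __GNUC__ )
#  define DISABLE_OPTIMISATIONS() \
        _Pragma( "GCC push_options" ) \
        _Pragma( "GCC optimize (\"O0\")" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "GCC pop_options" )
#else
#  error new compiler
#endif

DISABLE_OPTIMISATIONS()

// hypothetical function we want to step through in a debugger
int sum_to( int n ) {
        int total = 0;
        for( int i = 1; i <= n; i++ ) {
                total += i;
        }
        return total;
}

ENABLE_OPTIMISATIONS()

// everything from here on is optimised as normal
int double_sum_to( int n ) {
        return 2 * sum_to( n );
}
```

The macros go at file scope around whole functions, not inside a function body.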

30 Nov 2017 / Preprocessor madness 2

More preprocessor craziness:

#define A 1
// do not define B
#if A == B
#endif
int main() { return 0; }

Shows no warnings with -Wall -Wextra, you have to go -Weverything to get -Wundef to get "warning: "B" is not defined, evaluates to 0".

This doesn't seem like a big deal but we got bit by it. A colleague renamed the OSX platform define to MACOS, then presumably he went down all the compile errors and fixed them. Unfortunately we had something like:

#if WINDOWS
#include "windows_semaphore.h"
#elif OSX
#include "osx_semaphore.h"
#else
#include "posix_semaphore.h"
#endif

which doesn't throw any errors if you change the define to MACOS because OSX defines the POSIX semaphore interface. It is broken though, because all the sem_ functions return "not implemented" errors at runtime!
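
A minimal reproduction of that failure, as a sketch. The Backend enum and semaphore_backend are made up for illustration and stand in for the three headers, and WINDOWS is assumed to be the name of the Windows platform define.

```cpp
// The platform define after the rename. Nothing defines OSX any more.
#define MACOS 1

enum Backend { BACKEND_WINDOWS, BACKEND_OSX, BACKEND_POSIX };

#if WINDOWS
#  define SEMAPHORE_BACKEND BACKEND_WINDOWS
#elif OSX // stale name: undefined, so the preprocessor treats it as 0
#  define SEMAPHORE_BACKEND BACKEND_OSX
#else
#  define SEMAPHORE_BACKEND BACKEND_POSIX
#endif

// which implementation the preprocessor silently picked
Backend semaphore_backend() {
        return SEMAPHORE_BACKEND;
}
```

This compiles cleanly without -Wundef and semaphore_backend() returns BACKEND_POSIX, even though MACOS is set.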

It's very frustrating that the most basic building blocks we have in programming are full of shitty shit like this. I have no suggestions on how to improve things or any interesting commentary to add beyond that.

17 Nov 2017 / RSS feed

Someone asked for one, ole hyvä.

Email me if it doesn't work in your feed reader!

17 Nov 2017 / Deadlock

In a bounded lock-free multi-producer queue, pushing an element takes two steps. First you need to acquire a node for writing to avoid races with other producers, then you need to flag the node as fully written once you're done with it. Then to pop from the queue you check if the head node has been fully written, then try to acquire it if you're multi-consumer too.

(Look here or here for the details.)

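One issue to note is that dequeues can fail even though the queue is non-empty, in the sense that items have been successfully enqueued and not yet dequeued. Specifically this can happen:

  1. P1 acquires a node.
  2. P2 acquires the next node.
  3. P2 finishes writing its node and flags it as fully written.
  4. The consumer tries to pop, but the head node is P1's and it hasn't been fully written yet, so the pop fails.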

It makes sense to guard such a queue with a semaphore. An obvious way to do that is to make the semaphore count how many elements are in the queue. You can't do that with a queue like this! If you write something like:

// producer
if( q.push( x ) )
        sem_post( &sem );

// consumer
while( true ) {
        sem_wait( &sem );
        T y;
        if( q.pop( &y ) ) {
                // do things with y
        }
}

but that's incorrect because you decrement sem even when you haven't dequeued anything, and the item pushed by P2 gets lost. So instead you might try:
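
To make that concrete, here's a single-threaded model of the interleaving. It's only a sketch: ModelQueue is a toy stand-in I made up, not a real lock-free queue, and the semaphore is just an int that gets incremented for sem_post and decremented when sem_wait returns.

```cpp
#include <stddef.h>

// Toy model: producers acquire a slot first and mark it written later, and
// pop only succeeds once the head slot has been written.
struct ModelQueue {
        static const size_t CAPACITY = 4;
        int values[ CAPACITY ] = { };
        bool written[ CAPACITY ] = { };
        size_t head = 0; // next slot to pop
        size_t tail = 0; // next slot to hand out

        size_t acquire() {
                return tail++;
        }

        void commit( size_t slot, int x ) {
                values[ slot ] = x;
                written[ slot ] = true;
        }

        bool pop( int * out ) {
                if( head == tail || !written[ head ] )
                        return false;
                *out = values[ head ];
                head++;
                return true;
        }

        size_t size() const {
                return tail - head;
        }
};

struct Outcome {
        int sem;         // final semaphore count
        size_t stranded; // items still sitting in the queue
};

Outcome naive_consumer_interleaving() {
        ModelQueue q;
        int sem = 0; // stand-in for the semaphore count
        int y = 0;

        size_t p1 = q.acquire(); // P1 acquires the head slot
        size_t p2 = q.acquire(); // P2 acquires the next slot
        q.commit( p2, 2 );
        sem++;                   // P2 finishes first: sem_post

        sem--;                   // consumer: sem_wait returns
        q.pop( &y );             // fails, P1's slot isn't written yet

        q.commit( p1, 1 );
        sem++;                   // P1 finishes: sem_post

        sem--;                   // consumer: sem_wait returns
        q.pop( &y );             // pops P1's item

        // the consumer would now block in sem_wait with P2's item still queued
        return { sem, q.size() };
}
```

The semaphore ends up at 0 with one item stranded in the queue, so the consumer goes to sleep until someone pushes again.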

// producer
if( q.push( x ) )
        sem_post( &sem );

// consumer
while( true ) {
        T y;
        if( q.pop( &y ) ) {
                // do things with y
                sem_wait( &sem );
        }
}
but then you're spinning while you wait for P1 to finish.

We ran into a particularly nasty instance of the first case of this at work. We have a queue which accepts various commands, and one of them is a "flush" command where the sender goes to sleep and the consumer wakes them up again (with an Event) once the flush is done. So something like this happened:

The fix is to make queueSem count the (negative) number of consumer threads that are asleep. The change is very simple:

// consumer
while( true ) {
        T y;
        if( q.pop( &y ) )
                // do things with y
        else
                sem_wait( &sem );
}

Which avoids the deadlock thusly:

I'm not totally satisfied with that solution because I feel like I've misunderstood something more fundamental, something that will stop me running into similar problems in the future. Please email me if you know!

All in all writing an MPSC lock-free queue has been an enormous waste of time courtesy of shit like this. We need to enqueue things from a signal handler, which means locks and malloc are gone. Even so, since we're single-consumer and we only push in signal handlers, I believe we could have used a mutex to avoid races between producers and kept the consumer lock-free. I wasn't sure if you could get signalled while still inside a signal handler. FYI the answer is no for the signal being handled, unless you set SA_NODEFER (other signals can still interrupt you unless they're in sa_mask). I also wasn't sure if pthread_mutex_lock etc are signal-safe, and of course if you try to Google it you just get pages of trash about "you can deadlock if the thread holding the lock gets signalled!!!!!". POSIX doesn't list them as async-signal-safe, so I didn't want to risk it.
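
If you want to check the SA_NODEFER behaviour yourself, something like this should do it. It's a rough sketch assuming a POSIX system and a single-threaded program, and nesting_handler and max_handler_nesting are names I made up. The handler raises its own signal once from inside itself and records how deep it nested.

```cpp
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t depth = 0;
static volatile sig_atomic_t max_depth = 0;
static volatile sig_atomic_t reraised = 0;

static void nesting_handler( int sig ) {
        depth = depth + 1;
        if( depth > max_depth )
                max_depth = depth;

        // send ourselves the same signal once, from inside the handler
        if( !reraised ) {
                reraised = 1;
                raise( sig );
        }

        depth = depth - 1;
}

// Returns how deeply the SIGUSR1 handler nested with the given sa_flags.
int max_handler_nesting( int flags ) {
        struct sigaction sa = { };
        sa.sa_handler = nesting_handler;
        sigemptyset( &sa.sa_mask );
        sa.sa_flags = flags;
        sigaction( SIGUSR1, &sa, NULL );

        depth = 0;
        max_depth = 0;
        reraised = 0;
        raise( SIGUSR1 );

        // put the default disposition back
        signal( SIGUSR1, SIG_DFL );
        return max_depth;
}
```

With flags = 0 the second SIGUSR1 stays pending until the handler returns, so the handler runs twice but never nests. With SA_NODEFER it gets delivered straight away and the handler runs inside itself.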

14 Nov 2017 / Preprocessor madness

This code (or similar) compiles and runs with every compiler I tested but one:

#define A( x ) 1
int main() {
        return A(); // -> "return 1;"
}

You can even use x in the macro body and it's fine:

#define A( x ) x + 1
int main() {
        return A(); // -> "return + 1;"
}

The only compiler that does the right thing and rejects this (if the spec says this is ok then the spec is fucked) is of course the AMD shader compiler.

03 Nov 2017 / C++ tricks: sized array arguments

In C if you write a function void f( char x[ 4 ] ); then the compiler ignores the 4 and treats it as char * x. This has two problems: sizeof( x ) gives you sizeof( char * ) and not 4 * sizeof( char ) (GCC does warn about this), and the compiler doesn't complain if you pass in an array that's too small.

In C++ you can write void f( char ( &x )[ 4 ] ); instead and it works.

A code example:

#include <stdio.h>

void f( char x[ 4 ] ) {
        // warning: 'sizeof' on array function parameter 'x' will return size of 'char*'
        printf( "%zu\n", sizeof( x ) ); // prints 8
}

void g( char ( &x )[ 4 ] ) {
        printf( "%zu\n", sizeof( x ) ); // prints 4
}

int main() {
        char char3[ 3 ] = { };
        char char4[ 4 ] = { };
        char * charp = NULL;

        f( char3 ); // fine
        f( char4 ); // fine
        f( charp ); // fine

        g( char3 ); // error: invalid initialization of reference of type 'char (&)[4]' from expression of type 'char [3]'
        g( char4 ); // fine
        g( charp ); // error: invalid initialization of reference of type 'char (&)[4]' from expression of type 'char*'

        return 0;
}
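
The same trick works with the length as a template parameter, if you want to accept arrays of any size and still have sizeof do the right thing. A small sketch, with size_of_fixed and size_of_any as made-up names:

```cpp
#include <stddef.h>

// only accepts arrays of exactly 4 chars, and sizeof sees the whole array
size_t size_of_fixed( char ( &x )[ 4 ] ) {
        return sizeof( x );
}

// accepts a char array of any length, N is deduced from the argument
template< size_t N >
size_t size_of_any( char ( &x )[ N ] ) {
        return sizeof( x );
}
```

Passing a char * to either of these is still a compile error, which is the point.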

01 Nov 2017 / Linux vs BSD in a man page

man gethostname on Linux:

The returned name shall be null-terminated, except that if namelen is
an insufficient length to hold the host name, then the returned name
shall be truncated and it is unspecified whether the returned name is
null-terminated.


        Upon successful completion, 0 shall be returned; otherwise, −1 shall be returned.

        No errors are defined.

And on OpenBSD:

The returned name is always NUL terminated.


     The following errors may be returned by these calls:

     [EFAULT]           The name parameter gave an invalid address.

     [ENOMEM]           The namelen parameter was zero.

     [EPERM]            The caller tried to set the hostname and was not the
                        superuser.

31 Oct 2017 / Monocypher is excellent

Monocypher is by far the best C/C++ crypto library, probably by far the best crypto library full stop.

It's a single pair of .c/.h files. The interface has easy to use implementations of sensible primitives and algorithms. The manual is absolutely wonderful, with really clear descriptions of what each function does and what guarantees they provide.

ATM I'm using it in Medfall to sign updates. I sign a manifest that lists all the game files and their hashes, and the public key is hardcoded in the client. It's less than 100 lines of code for everything: the keygen and signing utilities, and the client side verification code.

If you rip out arc4random from the portable LibreSSL and pair it with monocypher you have everything you need to make an encrypted game networking protocol. Something like:

The client hasn't really proven its identity to the server because you have to ship the same private key with every client so it's easy to fake, but that's not a big deal. On the other hand the client does know it's talking to the correct server, so you don't have to worry about sending your login credentials to random hackers.

I don't think you need to care about replay attacks here. To impersonate the server, you would need to take a signed x25519 public key and crack the secret key and that should be impossible. But you can stick a (signed) timestamp in there if you want. (doesn't totally mitigate it but you can reduce the time they have to crack a key to like a few seconds)

29 Oct 2017 / GL_FRAMEBUFFER_SRGB sucks

I replaced GL_FRAMEBUFFER_SRGB with explicit linear-to-sRGB conversions in my shaders.

It's a little bit more code but having the extra control is worth it. The big wins are a UI that looks like it does in image editors, and being able to easily turn off sRGB for certain debug visualisations.

Some GLSL for myself to copy paste into future projects:

float linear_to_srgb( float linear ) {
        if( linear <= 0.0031308 )
                return 12.92 * linear;
        return 1.055 * pow( linear, 1.0 / 2.4 ) - 0.055;
}

vec3 linear_to_srgb( vec3 linear ) {
        return vec3( linear_to_srgb( linear.r ), linear_to_srgb( linear.g ), linear_to_srgb( linear.b ) );
}

vec4 linear_to_srgb( vec4 linear ) {
        return vec4( linear_to_srgb( linear.rgb ), linear.a );
}
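
For sanity checking values on the CPU, here's the same curve in C++ along with its inverse. This is a sketch and not something from the shaders above: srgb_to_linear is the standard sRGB decode function and isn't part of the GLSL.

```cpp
#include <math.h>

// same curve as the GLSL linear_to_srgb( float )
float linear_to_srgb( float linear ) {
        if( linear <= 0.0031308f )
                return 12.92f * linear;
        return 1.055f * powf( linear, 1.0f / 2.4f ) - 0.055f;
}

// inverse of the sRGB transfer function
float srgb_to_linear( float srgb ) {
        if( srgb <= 0.04045f )
                return srgb / 12.92f;
        return powf( ( srgb + 0.055f ) / 1.055f, 2.4f );
}
```

Linear 0.5 should come out at roughly 0.735, and a round trip through both functions should give you back what you put in.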

24 Oct 2017 / Roadblocks to releasing Medfall on macOS

Releasing software for OSX is annoying.

I kind of want to go down the "here is a totally unsupported and untested package" route, just because I can and it would probably be fine. I don't want to have to boot into OSX and actually test it as part of my release process because it's already annoying having to do both Windows and Linux and I can do both at the same time with my desktop/laptop.

But even that is a nightmare, because OSX is extremely hard to get working in a VM and I can't do it. Also it's illegal.

So you can cross compile. Clang supports cross compilation out of the box so it's less annoying than you might expect, but you still have to get all the headers/libs from an OSX machine, and then you have to install the LLVM linker which for some reason is packaged separately, and then that might be illegal too but I haven't read the EULA.

Building an installer package is more difficult. .pkgs use a stupid format called XAR (I don't know if it's actually bad, but WTF, even if it was good you have to use zip or tar because everything supports those and not xar) so you have to download some Github project (maybe 7z can do it but it gives a scary warning trying to list the .pkg I built on OSX).

Inside the .pkg there are three files. There's a file called PackageInfo, which is some XML and looks easy enough to generate. There's a file called Bom which is some binary manifest, and you need to download another Github project to generate that. The last file is called Payload, which is another stupid archive format (cpio, which 7z seems ok with) + gzip, and contains the folder you pointed pkgbuild at.

I feel like I could probably get cross compilation and the .pkg stuff working but it would take ages and be boring so I'm not going to.

ADDENDUM: apparently you can jailbreak Apple TVs. Would be pretty funny to run OSX builds (it's cross-arch but not cross-OS so it's probably simpler) on the Apple equivalent of a Raspberry Pi. Of course I haven't tried it and I'm not going to but it amuses me that it might be possible.

The other major annoyance is that OSX doesn't support GL past 4.1, which means you miss out on:

I'm actually only really upset about losing clip control, because you need that to do massive draw distances. The rest I can either live without (compute, BPTC, explicit uniforms) or are pretty simple to conditionally support, so it's not the end of the world but it is annoying. I also don't really understand why Apple wants to push Metal like this, if I ever write a second render backend it's obviously going to be D3D and not Metal.

24 Oct 2017 / Vim: peek definition

Visual Studio has this awesome thing called "Peek Definition", which lets you open a temporary window that shows you the definition of whatever you wanted to look at. That link has a screenshot so you can see what I mean.

I have something similar but crappier working in Vim. You need to go through the stupid ctags hoops, ATM I am using vim-gutentags which actually seems pretty good. Then you can do something like

map <C-]> :vsplit<CR>:execute "tag " . expand( "<cword>" )<CR>zz<C-w>p

and then press C-] to get an unfocused vsplit at the definition of whatever the cursor was over.

It's not as good because if it can't find the tag you get a vsplit of whatever you were looking at. It totally fails on member functions (like size) because ctags doesn't understand C++. With functions it asks you whether you want to jump to the body or the prototype, when you probably always just want to jump to the prototype. That last one might be fixable if I tweak the flags to ctags but meh.

18 Oct 2017 / OpenSMTPD is excellent

Whenever I have to interact with email software that isn't OpenSMTPD I'm just so appalled by how shitty it is. Except maybe rspamd. Email software just seems to follow the 1980s Unix philosophy of "do one thing and completely suck dick at it".

My entire config looks like this:

pki mikejsavage.co.uk certificate "/etc/ssl/mikejsavage.co.uk.fullchain.pem"
pki mikejsavage.co.uk key "/etc/ssl/private/mikejsavage.co.uk.key"

listen on lo
listen on lo port 10028 tag DKIM
listen on egress tls pki mikejsavage.co.uk
listen on egress port 587 tls-require pki mikejsavage.co.uk auth

accept from any for local virtual { "@" => mike } deliver to mda "rspamc --mime --ucl --exec /usr/local/bin/dovecot-lda-mike" as mike
accept from local tagged DKIM for any relay
accept from local for any relay via smtp://

That's 9 lines of config. DKIMProxy's config is 8 lines. Dovecot's config is 2453 lines split across 34 files. WTF how can you suck so much? DKIMProxy's only job is to add a header to outgoing emails. Dovecot is probably of similar complexity to smtpd but has two orders of magnitude more config. I have a bunch of spamd/bgpd garbage lying around too and I have no idea if it does anything. Nuts.

pop3d looks extremely good, like this is how Dovecot should be, but it's POP and POP is useless. God damn. It's made me pretty tempted to do an imapd though. I'd have to keep Dovecot around for the MDA until smtpd gets filters, but after that I could drop everything but smtpd/rspamd/imapd and be happy.

The gmail spam filter does an extremely bad job of dealing with modern spam. The spam of old times is pretty much solved, the spam I got on this domain before I set up rspamd was all obviously fake invoices and Russian dating websites, really simple to filter out.

Modern spam is also trivial to spot, but the people spamming buy ads from Google so they'll never block it. I mean crap like newsletters after you explicitly checked/unchecked the box that dis/allows them to send you junk, the biweekly terms and conditions updates from shit startups, etc. If you blocked any email that contains an unsubscribe link or the phrase "Terms and Conditions" you would catch 100% of it with nearly no false positives. It's so easy but they won't do it.

Amusingly the gmail spam filter does a perfect job on my work inbox:

Change Management Training - Change Management Training in Paris, France Helsinki
Mike - Invitation to discuss how ex-Google/McKinsey team is replacing HR with bots
PMP Certification Workshop - PMP Certification Training in Helsinki, Finland
Data for Breakfast - Join us in Stockholm
PMP Certification Training
Data warehousing: Let the past tell your future
Webinar: Making Today's Data Rapidly Consumable
One-day Agile & Scrum training - Agile & Scrum training in Helsinki, Finland

15 Oct 2017 / Not even not upgrading can save me

A bunch of shit in Firefox has been breaking for me lately. :open in Vimperator has entirely stopped working on my laptop. Not even restarting helps.

The NoScript buttons in the tab bar randomly disappear and I have to restart.

Firefox randomly can't open web pages unless I try again. I thought it might have been a router problem or something but no other piece of software on my PC has this problem.

FFS I don't upgrade software so I can avoid garbage like this, but apparently not even that helps.

Actually I did upgrade this server to 6.2 today. I noticed it had 186 days of uptime before I took it down, so it's been alive since I did the update to 6.1. None of the long-running software I use has crashed a single time during that period. Upgrading probably took less than five minutes and everything came up and worked first time. Why can't all software be OpenBSD?

Couple of updates a few days later: my RSS cronjob is hanging for some unknown reason and I needed to reinstall rspamd (I think I built it myself before). Also the Firefox failures have spread to my desktop.

15 Oct 2017 / Optimising vs expanding to fill all available resources

Parallelising code does not make it faster.

You actually run slightly slower, because you have to deal with the overhead of dispatch and context switches and expensive futex calls. But we do it anyway because it makes code run in less time. So you trade CPU time for wall clock time. Or throughput for latency.

In games, people use thread pools to go wide when they have lots of the same work that must be done to get a frame out. Things like culling, broad-phase collision detection, skinning, etc.

It's not immediately obvious that high framerate corresponds to low latency rather than high throughput. If you think of a frame as taking the inputs at the start of the frame, like the state of the world last frame and player/network inputs, and then producing the next frame as output, it kind of makes sense. You're reducing the latency between receiving the inputs and spewing the output.

It's also really surprising how little you benefit from using multiple threads. A typical desktop PC has 2 or 4 cores. The Steam hardware survey says that's 95% of the market (the gamer market even!), so you're looking at less than 4x speedup.

That's a bad habit I need to break. When something needs optimising one of the first things that comes to mind is "put it on the thread pool". On one hand it's easy (ADDENDUM: not gonna edit this out but of course threading is not easy), on the other it's junk speedup and other optimisation methods are not a huge amount harder. Parallelising my code should probably be the last optimisation I make!
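
To put a number on "less than 4x", here's Amdahl's law as a tiny function. It's a back-of-the-envelope sketch: max_speedup is a made-up name, and it ignores the dispatch and context switch overhead mentioned earlier, so real numbers will be worse.

```cpp
// Upper bound on speedup from Amdahl's law. parallel_fraction is the share
// of the single-threaded run time that can be spread across cores.
double max_speedup( int cores, double parallel_fraction ) {
        double serial_fraction = 1.0 - parallel_fraction;
        return 1.0 / ( serial_fraction + parallel_fraction / cores );
}
```

If you assume 90% of the frame parallelises perfectly (a number I picked out of the air), max_speedup( 4, 0.9 ) is about 3.08 and max_speedup( 2, 0.9 ) is about 1.82.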

Anyway I was thinking about this because of all the Firefox talk about having one thread per tab and GPU text rendering and GPU compositing and etc. Ok Firefox runs in less wall clock time because it has 4x more resources, but now my whole PC runs like trash. The reason multi-core CPUs were such a huge upgrade when they first came out was that shit apps didn't make your PC unusable anymore! But now the shit apps are becoming parallelised, we're going back to the bad old times.

The GPU stuff isn't in yet but I'm looking forward to the "we made our code 10x slower but put it on hardware that's 100x faster!!" post, swiftly followed by having to close my web browser whenever I want to play games.

09 Oct 2017 / Windows 10 post-install checklist

  1. Settings, Update and security, Windows Update, Check for updates.
  2. Launch IE and download another browser. While you're at it, go into Internet Options, Security, drag the security level all the way down, Custom level..., Launching applications and unsafe files, check Enabled (not secure), ok out of all of that.
  3. If you picked Firefox, install uBlock Origin, NoScript. It's very important to install those before you do anything else on the web.
  4. Install video drivers.
  5. Right click the start button, Control Panel, User Accounts, User Accounts (again), Change User Account Control settings, disable it.
  6. gpedit.msc, Computer Configuration, Administrative Templates, Windows Components, Windows Defender. Double click Turn off Windows Defender, check Enabled, click ok. Also go to Windows Components, OneDrive, Prevent the usage of OneDrive for file storage, Enabled, ok.
  7. services.msc, disable Superfetch, Windows Firewall, Windows Search, and whatever else you don't like the look of.
  8. secpol.msc, Local Policies, Security Options, UAC: Run all administrators in Admin Approval Mode, Disabled.
  9. Run this reclaim Windows 10 script.
  10. Right click the desktop, Display settings, Advanced display settings, ClearType text.
  11. Right click the desktop, Personalize, go through all of it. In particular go Themes, Advanced sound settings, Sound Scheme = No Sounds. Taskbar, Combine taskbar buttons, Never. Start, disable everything. Taskbar, Turn system icons on or off, disable Action Center.
  12. Control Panel, System and Security, System, Advanced system settings, Performance Settings..., disable almost everything under visual effects, Advanced, Change..., set the pagefile size to 800MB.
  13. Control Panel, System and Security, Security and Maintenance. Click all the "Turn off messages about x" links.
  14. Win+R, cmd.exe, powercfg -h off.
  15. Set Windows to use UTC time.
  16. Reboot to BIOS, put Linux Boot Manager back at the top of the boot list. Reboot back to Windows.
  17. Win+E, View, Options, View. Check Show hidden files, folders and drives. Uncheck Hide empty drives. Uncheck Hide extensions for known file types. Uncheck Hide protected operating system files. Go down to Navigation pane, check Expand to open folder.
  18. Install the MarkC mouse acceleration fix.
  19. Install the Take Ownership Registry Hack. Take ownership of everything in C:. Right click on the C drive, Security, Advanced, click your name, Edit, check Full Control, OK, check Replace all child object permissions, click OK. Click Continue several hundred times (jesus christ Microsoft!!). Maybe skip ahead and install AHK so you can use an autoclicker.
  20. Open Settings:
    1. System. Notifications & actions, disable them. Power & sleep, Never and Never.
    2. Devices. Typing, Off and Off. AutoPlay, Off.
    3. Network & Internet. Mobile hotspot, Off and Off.
    4. Time & language. Set your time zone. Change date and time formats, use 12h time formats (the ones with h instead of H).
    5. Privacy. General, disable everything. Location, disable everything. Notifications, Off.
    6. Update & security. Windows Update, Advanced Options, Choose how updates are delivered, don't let other people download updates from your PC and don't download updates from theirs. For developers, Developer Mode, disable Remote Desktop.
  21. Control Panel, Ease of Access, Change how your keyboard works, disable everything.
  22. Control Panel, Uninstall a program, uninstall OneDrive.
  23. More Firefox things:
    1. Go to Options. General. When Firefox starts, Show your windows and tabs from last time. Set up fonts, set minimum font size, uncheck Allow pages to choose their own fonts. Downloads, Always ask you where to save files. Applications, PDF, Always ask. Firefox Updates, Never check for updates, uncheck the others, uncheck Use smooth scrolling. Search, disable everything. Privacy & Security, custom history settings, never accept third-party cookies, remove all cookies you picked up so far, disable Firefox data collection.
    2. about:config. extensions.update.autoUpdateDefault = false, extensions.update.enabled = false. browser.tabs.closeWindowWithLastTab = false.
    3. Install Vimperator, Download Statusbar, HideScrollbars, and bug489729.
    4. Put the NoScript icons in the tab bar. Go into NoScript options. Whitelist, remove everything. Notifications, uncheck Show message about blocked scripts, uncheck Display the release notes on updates. Advanced, XSS, disable.
    5. Go into uBlock settings and enable whatever filter lists you like the look of.
  24. Download the Sysinternals Suite. Run autoruns, disable anything you don't like the look of. OneDrive, Windows Defender things, MozillaMaintenance, NVIDIA telemetry, etc. Run procexp and check nothing dumb is running just in case.
  25. Probably reboot again for good measure.
  26. Install 7-Zip. Go into settings and associate it with everything that isn't zip. Disable all the junk context menu items.
  27. Install Search Everything. Sort by descending run count, and close window on execute.
  28. Install Vim.
  29. Install AutoHotKey. Put startup.ahk in %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup.
  30. Install Start Killer.
  31. Create halt.bat somewhere containing shutdown /s /t 0. Create reboot.bat containing shutdown /r /t 0. Use Everything to run these.
  32. Install Clink.
  33. Install Dina font.
  34. Download psubst. psubst X: C:\Users\mike\Documents /P.
  35. Install Cygwin. Add OpenSSH, and whatever else you like. Add -w max to the quicklaunch shortcut.
  36. Launch Cygwin and run ssh-host-config, and follow the prompts. Then run chown cyg_server: /var/empty; chmod 700 /var/empty; net start sshd.
  37. Download Cmder.
  38. Install Visual Studio 2015 with Update 3. Make sure to pick custom install, check the C++ common tools box, and uncheck everything else.
  39. Install NSIS.
  40. Install Intel Architecture Code Analyzer.
  41. Install the Windows SDK. Make sure you check Windows Performance Toolkit (for GPUView), Debugging Tools for Windows (WinDBG), Windows SDK Signing Tools for Desktop Apps (SignTool), and probably the x86/amd64 SDKs.
  42. Install the DirectX SDK. You need it for XAudio 2.7, which you need if you want to ship software on Win7. If the installer fails just ignore it.
  43. Install VsVim. TODO config
  44. Optional graphics tools: Install Renderdoc. Download apitrace. Install GPU ShaderAnalyzer. Install GPA.
  45. Optional art tools: Install Color Cop. Install GIMP. Install Inkscape. Install Milton. Install Blender. Install Wings3D. Install MeshLab.
  46. Download Path Editor. Add VS compiler stuff, MSBuild, the Win10 Kit (with mt.exe), IACA, apitrace and NSIS to path.
  47. Install VLC.
  48. Install TODO music player.
  49. Write a blog post.

07 Oct 2017 / Code for my intro to raytracing talk

I gave a talk about the basics of raytracing for the Catz Computer Science Society a while ago. I was drawing on my wacom so there are no slides and nobody recorded it, but the code is on Github and I'm still quite pleased with how simple it ended up being.

Feel free to use it for whatever.

07 Oct 2017 / C++ tricks: autogdb

One of the nice things about developing on Windows is that if your code crashes in debug mode, you get a popup asking if you want to break into the debugger, even if you ran it normally.

With some crap hacks we can achieve something pretty similar for Linux:

#pragma once

#include <sys/ptrace.h>
#include <sys/wait.h>

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <errno.h>
#include <err.h>

static void pause_forever( int signal ) {
        while( true ) {
                pause();
        }
}

static void uninstall_debug_signal_handlers() {
        signal( SIGINT, SIG_IGN );
        signal( SIGILL, pause_forever );
        signal( SIGTRAP, SIG_IGN );
        signal( SIGABRT, pause_forever );
        signal( SIGSEGV, pause_forever );
}

static void reset_debug_signal_handlers() {
        signal( SIGINT, SIG_DFL );
        signal( SIGILL, SIG_DFL );
        signal( SIGTRAP, SIG_DFL );
        signal( SIGABRT, SIG_DFL );
        signal( SIGSEGV, SIG_DFL );
}

static void prompt_to_run_gdb( int signal ) {
        // don't re-enter this handler while we're prompting
        uninstall_debug_signal_handlers();

        const char * signal_names[ NSIG ];
        signal_names[ SIGINT ] = "SIGINT";
        signal_names[ SIGILL ] = "SIGILL";
        signal_names[ SIGTRAP ] = "SIGTRAP";
        signal_names[ SIGABRT ] = "SIGABRT";
        signal_names[ SIGSEGV ] = "SIGSEGV";

        char crashed_pid[ 16 ];
        snprintf( crashed_pid, sizeof( crashed_pid ), "%d", getpid() );
        fprintf( stderr, "\nPID %s received %s. Debug? (y/n)\n", crashed_pid, signal_names[ signal ] );

        char buf[ 2 ];
        read( STDIN_FILENO, &buf, sizeof( buf ) );
        if( buf[ 0 ] != 'y' ) {
                exit( 1 );
        }

        // fork off and run gdb
        pid_t child_pid = fork();
        if( child_pid == -1 ) {
                err( 1, "fork" );
        }

        if( child_pid == 0 ) {
                // ignored signals stay ignored across exec, so restore the defaults for gdb
                reset_debug_signal_handlers();
                execlp( "cgdb", "cgdb", "--", "-q", "-p", crashed_pid, ( char * ) 0 );
                execlp( "gdb", "gdb", "-q", "-p", crashed_pid, ( char * ) 0 );
                err( 1, "execlp" );
        }

        if( signal != SIGINT && signal != SIGTRAP ) {
                waitpid( child_pid, NULL, 0 );
                exit( 1 );
        }
}

static bool being_debugged() {
        pid_t parent_pid = getpid();
        pid_t child_pid = fork();
        if( child_pid == -1 ) {
                err( 1, "fork" );
        }

        if( child_pid == 0 ) {
                // if we can't ptrace the parent then gdb is already there
                if( ptrace( PTRACE_ATTACH, parent_pid, NULL, NULL ) != 0 ) {
                        if( errno == EPERM ) {
                                printf( "! echo 0 > /proc/sys/kernel/yama/ptrace_scope\n" );
                                printf( "! or\n" );
                                printf( "! sysctl kernel.yama.ptrace_scope=0\n" );
                        }
                        exit( 1 );
                }

                // ptrace automatically stops the process so wait for SIGSTOP and send PTRACE_CONT
                waitpid( parent_pid, NULL, 0 );
                ptrace( PTRACE_CONT, parent_pid, NULL, NULL );

                // detach
                ptrace( PTRACE_DETACH, parent_pid, NULL, NULL );
                exit( 0 );
        }

        int status;
        waitpid( child_pid, &status, 0 );
        if( !WIFEXITED( status ) ) {
                err( 1, "WIFEXITED" );
        }

        return WEXITSTATUS( status ) == 1;
}

static void install_debug_signal_handlers( bool debug_on_sigint ) {
        if( being_debugged() ) return;

        if( debug_on_sigint ) {
                signal( SIGINT, prompt_to_run_gdb );
        }
        signal( SIGILL, prompt_to_run_gdb );
        signal( SIGTRAP, prompt_to_run_gdb );
        signal( SIGABRT, prompt_to_run_gdb );
        signal( SIGSEGV, prompt_to_run_gdb );
}

Include that somewhere in your code and stuff #if PLATFORM_LINUX install_debug_signal_handlers( true ); #endif at the top of main. Then when your program crashes you will get a prompt like PID 19418 received SIGINT. Debug? (y/n).

GDB often crashes and if you break with ctrl+c you can get problems when you quit GDB, but when it does work it's nice and it's definitely better than nothing.

BTW I wrote this ages ago and I can't remember many of the details around signal handling so don't ask me.
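
For what it's worth, another way to do the "is a debugger already attached" check on Linux is to read the TracerPid field from /proc/self/status. This is a different technique to the fork + ptrace one above, and just a sketch: parse_tracer_pid and being_debugged_procfs are names I made up, and it only works where /proc is mounted.

```cpp
#include <stdio.h>
#include <string.h>

// Parses the TracerPid field out of the text of /proc/<pid>/status.
// Returns -1 if the field is missing.
int parse_tracer_pid( const char * status ) {
        const char * field = strstr( status, "TracerPid:" );
        if( field == NULL )
                return -1;
        int pid;
        if( sscanf( field, "TracerPid: %d", &pid ) != 1 )
                return -1;
        return pid;
}

// Linux only: true if some process (e.g. gdb) is ptracing us
bool being_debugged_procfs() {
        char buf[ 4096 ];
        FILE * f = fopen( "/proc/self/status", "r" );
        if( f == NULL )
                return false;
        size_t n = fread( buf, 1, sizeof( buf ) - 1, f );
        fclose( f );
        buf[ n ] = '\0';
        return parse_tracer_pid( buf ) > 0;
}
```

TracerPid is 0 when nobody is tracing the process, so anything above 0 means a debugger (or strace etc) is attached.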

03 Oct 2017 / More installer junk

This is a bit of a followup to my last post about installers.

Turns out the "append /$MultiUser.InstallMode to the UninstallString" trick doesn't really work.

The problem is that when the user navigates to the folder with the uninstaller and double clicks it, as opposed to going through control panel, it will not have the command line switch and therefore will try to uninstall whatever you set as the default mode.

Coming up with a fix for this has been quite frustrating. A part of me wants to say "if the user wants to shoot themselves in the foot I can't do anything about it", but running the uninstaller manually seems pretty innocent to me, and we are the ones that lose money when our product doesn't work.

It's difficult because we are installing a plugin for some third-party software, so we have two installation folders and both of them move depending on whether it's a machine-wide or single user installation. If we only had our folder then we could just nuke whatever folder the uninstaller is in, but we need to locate the second folder too.

One suggestion was to remove the option for machine-wide installations. This would allow us to simplify the installer config (which is already pretty simple but christ I really hate putting any logic at all in these shitty crippled non-languages) but some of our clients have slow IT and requiring them to install the plugin for all of their users separately is a no go.

Another idea I tried was to look at whether the uninstaller exe was located under Program Files. It's ok but I don't think we can totally rule out users moving the installation folder somewhere else, like to another drive or something.

So I finally settled on writing an installmode.txt next to the uninstaller. It's robust against moving the folder around and running the uninstaller directly, but the user can still go and delete the txt file if they really want to, or the installer can fail to write it, etc.

I still don't like my solution because I really hate writing code that can fail. It's a huge relief when you can write a bit of code, no matter how trivial, and know that it can never go wrong. In this case I don't really have a choice because Windows doesn't provide a robust way to install software. (installers are literally just self extracting zips that also write registry keys)

It's especially upsetting because this code is going to be shipped to our non-technical customers. I dread the day when someone comes in with an insane installation problem, all of our suggestions take weeks to test and are expensive for the customer because they have to go through their outsourced IT, and then they either burn out and give up or we simply can't figure it out. Huge huge waste of time for everyone involved, and we lose the sale.

Someone noted a funny issue with the installer. If you did both a machine-wide and a single user installation at the same time, the W10 settings app would merge them together into a single entry and you couldn't choose which one to remove. They had the same name in control panel at that point too but at least both entries were there. The fix for that is to give them different registry key names. So something like HKLM\...\Uninstall\OurSoftwareAllUsers\UninstallString and HKCU\...\Uninstall\OurSoftwareCurrentUser\UninstallString. Or SHCTX\...\Uninstall\OurSoftware$MultiUser.InstallMode. And I guess give them different DisplayNames too so you can distinguish them.

01 Oct 2017 / Really finishing the job

This is a followup to Rust performance: finishing the job.

Outsmarted by a crustacean

Over on lobste.rs, pbsd points out that my approach is bad and mentions a very neat trick. Subtracting 255 is equivalent to adding one when using u8, so you can keep 16 counters in a register and increment them by subtracting the mask returned by _mm_cmpeq_epi8. You have to stop every 255 chunks to make sure the counters don't overflow, but other than that it's quite simple. The hot loop becomes:

__m128i needles = _mm_set1_epi8( needle );
while( haystack < one_past_end - 16 ) {
        __m128i counts = _mm_setzero_si128();

        // at most 255 chunks per batch, otherwise a u8 counter could wrap
        for( int i = 0; i < 255; i++ ) {
                if( haystack >= one_past_end - 16 ) {
                        break;
                }

                __m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
                __m128i cmp = _mm_cmpeq_epi8( needles, chunk );
                counts = _mm_sub_epi8( counts, cmp );

                haystack += 16;
        }

        __m128i sums = _mm_sad_epu8( counts, _mm_setzero_si128() );
        u16 sums_[ 8 ];
        _mm_storeu_si128( ( __m128i * ) sums_, sums );
        c += sums_[ 0 ] + sums_[ 4 ];
}

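Another neat trick is that we can use _mm_sad_epu8 to add up the 16 counts at the end. It sums each group of 8 bytes separately, so we get two partial sums (sums_[ 0 ] and sums_[ 4 ]) that we add together. It's slightly faster than storing the counts to u8[ 16 ] and summing them normally.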
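If you want to play with the trick in isolation, here's a minimal self-contained sketch. It's not the code I benchmarked: the function name is made up, it uses unaligned loads (_mm_loadu_si128) so it doesn't need a peeling loop, and it stores the two partial sums as 64 bit values instead of u16s. Treat it as an illustration rather than the tuned version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Hypothetical name. Counts occurrences of needle in haystack[ 0 .. n ).
size_t count_sub_trick( const uint8_t * haystack, size_t n, uint8_t needle ) {
        size_t c = 0;
        size_t i = 0;
        const __m128i needles = _mm_set1_epi8( ( char ) needle );

        while( n - i >= 16 ) {
                // 16 u8 counters, one per byte lane
                __m128i counts = _mm_setzero_si128();

                // at most 255 chunks per batch so no lane can wrap past 255
                for( int chunk = 0; chunk < 255 && n - i >= 16; chunk++ ) {
                        // unaligned load, an assumption to keep the sketch short
                        __m128i bytes = _mm_loadu_si128( ( const __m128i * ) ( haystack + i ) );
                        // matching lanes become 0xff, which is -1 as a signed byte
                        __m128i cmp = _mm_cmpeq_epi8( needles, bytes );
                        // x - ( -1 ) == x + 1, so matching lanes get incremented
                        counts = _mm_sub_epi8( counts, cmp );
                        i += 16;
                }

                // summing against zero adds up each group of 8 bytes into a 64 bit lane
                __m128i sums = _mm_sad_epu8( counts, _mm_setzero_si128() );
                uint64_t halves[ 2 ];
                _mm_storeu_si128( ( __m128i * ) halves, sums );
                c += size_t( halves[ 0 ] + halves[ 1 ] );
        }

        // scalar remainder
        for( ; i < n; i++ ) {
                c += haystack[ i ] == needle ? 1 : 0;
        }

        return c;
}
```

The 255 limit is the important bit: each lane is a u8, so a 256th match in the same lane would wrap it back to zero.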

With the same test setup as last time, this runs in 2.01ms. Again it helps to unroll the loop manually. The inner loop is so simple now it actually helps to unroll 4x, and if we do that it runs in 1.92ms!

Branchless scalar code

The original code can be made branchless. The trick is you replace if( haystack[ i ] == needle ) c++; with c += haystack[ i ] == needle ? 1 : 0;, which can be computed with a CMP and SETZ.

GCC is smart enough to perform this optimisation already, even at -O2, so no benchmark for this one.
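For completeness, this is roughly what the branchless version looks like written out. The function name is mine, and whether the compiler actually emits CMP/SETZ (or vectorises the whole thing) is up to the compiler and flags, so take it as a sketch of the source-level change only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical name. The comparison result is added directly,
// so there is no if in the loop body.
size_t count_branchless( const uint8_t * haystack, size_t n, uint8_t needle ) {
        size_t c = 0;
        for( size_t i = 0; i < n; i++ ) {
                c += haystack[ i ] == needle ? 1 : 0;
        }
        return c;
}
```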


AVX2 has 32 byte wide versions of all the instructions we used in the SSE version. It didn't actually seem to be any quicker and the code is uninteresting (basically find and replace), so I'm not going to include it.


Here's the same table as last time but with the new results added:

Version          Time     % of -O2   % of -O3
Scalar -O2       21.3ms   100%       -
Scalar -O3       7.09ms   33%        100%
Old vectorised   2.74ms   13%        39%
Old unrolled     2.45ms   12%        35%
New vectorised   2.01ms   9%         28%
New unrolled     1.92ms   9%         27%

The new vectorised code is 10x faster than the original!

Full code

You can download the code I used to generate those results if you want to try it yourself. You'll also need the code from the last post if you want to compare before and after.

03 Sep 2017 / OpenGL uniforms and renderer design rambling

I recently did a bit of a renderer overhaul in my engine and I'm very pleased with how it turned out so now seems like a good time to blog about it. I don't think there's anything left in my renderer that's blatantly bad or unportable, yet there's still obvious improvements that can be made whenever I feel like working on that. (I like leaving things that are easy, fun and non-critical because if I ever get bored or stuck on something else I can go and work on them)

This post is going to roughly outline the evolution of setting OpenGL uniforms in my game engine. It's a simple sounding problem ("put some data on the GPU every frame") but OpenGL gives you several different ways to do it and it's not obvious which way is best. I assume it's common knowledge in the industry, but it took me a long time to figure it out and I don't recall ever seeing it written down anywhere.

glGetUniformLocation and glUniform

This is what everyone starts off with, it's what all the OpenGL tutorials describe, and it's what most free Github engines use. I'm not going to go over it in great detail because everyone else already has.

I will say though that the biggest problem by far with this method is that the book keeping becomes a pain in the ass once you move beyond anything trivial.

Uniform buffer objects

A step up from loose uniforms are UBOs. Basically you can stuff your uniforms in a buffer like you do with everything else, and use that like a struct from GLSL. The Learn OpenGL guy has a full explanation of how it works.

It's best to group uniforms by how frequently you update them. So like you have a view UBO with the view/projection matrices and camera position, a window UBO with window dimensions, a light view UBO with the light's VP matrix for shadow maps, a sky UBO with skybox parameters, etc. There's not actually very many different UBOs you need, so you can hardcode an enum of all the ones you use, and then bind the names to the enum with glUniformBlockBinding.

Of course you can have a "whatever" UBO that you just stuff everything in while prototyping too.

As a more concrete example you can do this:

// in a header somewhere
const u32 UNIFORMS_VIEW = 0;
const u32 UNIFORMS_LIGHT_VIEW = 1;
const u32 UNIFORMS_WINDOW = 2;
const u32 UNIFORMS_SKY = 3;

// when creating a new shader
const char * ubo_names[] = { "view", "light_view", "window", "sky" };
for( GLuint i = 0; i < ARRAY_COUNT( ubo_names ); i++ ) {
        GLuint idx = glGetUniformBlockIndex( program, ubo_names[ i ] );
        if( idx != GL_INVALID_INDEX ) {
                glUniformBlockBinding( program, idx, i );
        }
}

// rendering
GLuint ub_view;
glBindBuffer( GL_UNIFORM_BUFFER, ub_view );
glBufferData( GL_UNIFORM_BUFFER, ... );
// ...
glBindBufferBase( GL_UNIFORM_BUFFER, UNIFORMS_VIEW, ub_view );

I found it helpful to write a wrapper around glBufferData. UBOs have funny alignment requirements (stricter than C!) and it was annoying having to mirror structs between C and GLSL. So instead I wrote a variadic template that lets me write renderer_ub_easy( ub_view, V, P, camera_pos );, which copies its arguments to a buffer with the right alignment and uploads it. The implementation is kind of hairy but here you go:

template< typename T >
constexpr size_t renderer_ubo_alignment() {
        return min( align4( sizeof( T ) ), 4 * sizeof( float ) );
}

template<>
constexpr size_t renderer_ubo_alignment< v3 >() {
        return sizeof( float ) * 4;
}

template< typename T >
constexpr size_t renderer_ub_size( size_t size ) {
        return sizeof( T ) + align_power_of_2( size, renderer_ubo_alignment< T >() );
}

template< typename S, typename T, typename... Rest >
constexpr size_t renderer_ub_size( size_t size ) {
        return renderer_ub_size< T, Rest... >( sizeof( S ) + align_power_of_2( size, renderer_ubo_alignment< S >() ) );
}

inline void renderer_ub_easy_helper( char * buf, size_t len ) { }

template< typename T, typename... Rest >
inline void renderer_ub_easy_helper( char * buf, size_t len, const T & first, Rest... rest ) {
        len = align_power_of_2( len, renderer_ubo_alignment< T >() );
        memcpy( buf + len, &first, sizeof( first ) );
        renderer_ub_easy_helper( buf, len + sizeof( first ), rest... );
}

template< typename... Rest >
inline void renderer_ub_easy( GLuint ub, Rest... rest ) {
        constexpr size_t buf_size = renderer_ub_size< Rest... >( 0 );
        char buf[ buf_size ];
        memset( buf, 0, sizeof( buf ) );
        renderer_ub_easy_helper( buf, 0, rest... );
        glBindBuffer( GL_UNIFORM_BUFFER, ub );
        glBufferData( GL_UNIFORM_BUFFER, sizeof( buf ), buf, GL_STREAM_DRAW );
}

I'm not 100% sure I got the alignment stuff right but it works for everything I've thrown at it so far.
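If you want to sanity check the offsets by hand, here's a small sketch of the align-then-place step on its own. Std140Cursor and place are made-up names and this align_power_of_2 is my guess at what such a helper looks like, not the one from my engine. The alignments in the usage note are the std140 ones: 4 bytes for a float, 16 for a vec3 or a mat4 column.

```cpp
#include <cassert>
#include <cstddef>

// Round x up to the next multiple of alignment, which must be a power of two.
constexpr size_t align_power_of_2( size_t x, size_t alignment ) {
        return ( x + alignment - 1 ) & ~( alignment - 1 );
}

// Hypothetical helper that tracks where the next member of a block goes.
struct Std140Cursor {
        size_t size = 0;

        // returns the offset of a member with the given alignment and byte size
        size_t place( size_t alignment, size_t bytes ) {
                size_t offset = align_power_of_2( size, alignment );
                size = offset + bytes;
                return offset;
        }
};
```

So for a block with two mat4s and a vec3 you'd call place( 16, 64 ) twice and then place( 16, 12 ), and for float, vec3, float you'd call place( 4, 4 ), place( 16, 12 ), place( 4, 4 ). The second float lands straight after the vec3 because a vec3 is aligned like a vec4 but only takes up 12 bytes.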

In terms of book keeping it is better than loose uniforms, but you still need to allocate/deallocate/keep track of all your uniform buffers. It's less but still non-zero.

glMapBuffer and glBindBufferRange

For this next one you actually need to reorganise your renderer a little. Instead of submitting draw calls to the GPU immediately, you build a list of draw calls and submit them all at once at the end of the frame. More specifically you should build a list of render passes, each of which has a target framebuffer, some flags saying whether you should clear depth/colour at the start of the pass, and a list of draw calls.

People do talk about this on the internet, but they focus on the performance benefits.

Neither loose uniforms nor UBOs really work with this model anymore though. Maybe you can pack uniform uploads into the draw call list, but that's a pain and ugly.

The pro secret is quite simple: map a huge UBO at the start of the frame, copy the entire frame's uniforms into it, then make the offsets/lengths part of your pipeline state and bind them with glBindBufferRange.

There's no book keeping beyond telling your renderer when to start/end the frame/passes. You can use the variadic template from above with few modifications so setting uniforms is still a one-liner. It's like going from a retained mode API to an immediate mode API. If you don't upload a set of uniforms for a given frame, they just don't exist.

To make it totally clear what I mean the game code looks like this:


renderer_begin_frame();

UniformBinding light_view_uniforms = renderer_uniforms( lightP * lightV, light_pos );

// fill shadow map
{
        renderer_begin_pass( shadow_fb, RENDERER_CLEAR_COLOUR_DONT, RENDERER_CLEAR_DEPTH_DO );

        RenderState render_state;
        render_state.shader = get_shader( SHADER_WRITE_SHADOW_MAP );
        render_state.uniforms[ UNIFORM_LIGHT_VIEW ] = light_view_uniforms;

        draw_scene( render_state );
}

// draw world
{
        RenderState render_state;
        render_state.shader = get_shader( SHADER_SHADOWED_VERTEX_COLOURS );
        render_state.uniforms[ UNIFORM_VIEW ] = renderer_uniforms( V, P, game->pos );
        render_state.uniforms[ UNIFORM_LIGHT_VIEW ] = light_view_uniforms;
        render_state.textures[ 0 ] = shadow_fb.texture;

        draw_scene( render_state );
}

renderer_end_frame();



renderer_begin_frame clears the list of render passes and maps the big UBO, renderer_begin_pass records the target framebuffer and what needs clearing, draw_scene contains a bunch of draw calls which basically copy the RenderState and Meshes (VAOs) into the render pass's list of draw calls, and finally renderer_end_frame unmaps the big UBO and submits everything.

One pitfall is that glMapBuffer is probably going to return a pointer to write combining memory, so you should make sure to write the entire buffer, including all the padding you use to align things (just write zeroes). It's probably not required on modern CPUs, but it's good for peace of mind.
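Here's a stripped down sketch of the allocation side, with no OpenGL in it so you can run it anywhere. FrameUniforms and push are made-up names, a std::vector stands in for the pointer you'd get from glMapBuffer, and the alignment is whatever you queried for GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT at startup (I pass it in as a plain number). It's not my actual renderer code, just the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// The offset/size pair you would later hand to glBindBufferRange.
struct UniformBinding {
        size_t offset;
        size_t size;
};

// Hypothetical stand-in for the big per-frame UBO.
struct FrameUniforms {
        std::vector< char > memory; // stand-in for the mapped buffer
        size_t used = 0;
        size_t alignment; // assumed to come from GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT

        // fill with junk so it's obvious when padding wasn't written
        FrameUniforms( size_t capacity, size_t offset_alignment )
                : memory( capacity, 0x7f ), alignment( offset_alignment ) { }

        // start of frame: last frame's uniforms simply stop existing
        void begin_frame() { used = 0; }

        // copy one set of uniforms in and return the range to bind at draw time
        UniformBinding push( const void * data, size_t size ) {
                size_t offset = ( used + alignment - 1 ) / alignment * alignment;
                assert( offset + size <= memory.size() );
                // write zeroes over the padding so the whole range gets written
                memset( memory.data() + used, 0, offset - used );
                memcpy( memory.data() + offset, data, size );
                used = offset + size;
                return UniformBinding { offset, size };
        }
};
```

The returned UniformBinding is what goes in the RenderState, and at submit time it turns into a glBindBufferRange call with that offset and size.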

In case I haven't explained this well you should probably just look at my implementation in renderer.cc and renderer.h. Or look at Dolphin which does something similar.


For the sake of completeness, if you're using GL4 you get to use a persistent map which should be a tiny bit faster. But it's the same idea.

02 Sep 2017 / Detecting TCP server crashes

I was wondering what happens when an HTTP server gets killed in the middle of a request. The OS should close all the open sockets, but what happens on the client? Can it tell that the server was shut down forcefully, or does recv return 0 like it does for a normal shutdown?

I was wondering specifically because my HTTP client doesn't look at the response headers, and I wasn't sure if that would lead to problems down the line.

The Arch/OpenBSD man pages don't have anything to say about it. The junky and often outdated die.net man pages talk about ECONNRESET, and SO has a few questions that mention it, but nowhere else does. Grepping the OpenBSD kernel sources doesn't make it obvious.

So let's just test it. The client code is trivial and of no use to you because it's written against my libraries. On the server we can do nc -l 13337 and then pkill -9 nc from another terminal. The result:

connect(3, {sa_family=AF_INET, sin_port=htons(13337), sin_addr=inet_addr("")}, 16) = 0
recvfrom(3, "", 16, 0, NULL, NULL)      = 0

It looks just like a normal shutdown! And that's a bug in my HTTP client: I don't look at Content-Length so truncated responses are not considered an error.

I guess the bigger picture here is that TCP is not as high level as you might expect. Your protocol has to be able to distinguish between premature and intentional shutdowns. Somewhat related is that you can't tell when the other party has processed your TCP packet. ACK just means it's sitting in a buffer somewhere in the TCP stack, and there's no guarantee that the application has actually seen your data yet. So if your application needs acks, you have to add them to your protocol too!
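To make the Content-Length point concrete, here's a rough sketch of the check my client was missing. It's not my real client code: read_body and BodyResult are made-up names, and the read callback is a stand-in for recv (returns bytes read, 0 when the peer closed the connection, negative on error) so the logic can be tried without a socket.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

enum class BodyResult { Complete, Truncated, Error };

// Hypothetical helper. `read` stands in for recv on a connected socket.
BodyResult read_body( const std::function< long( char *, size_t ) > & read, size_t content_length ) {
        char buf[ 4096 ];
        size_t received = 0;

        while( received < content_length ) {
                size_t want = content_length - received;
                if( want > sizeof( buf ) ) want = sizeof( buf );

                long n = read( buf, want );
                if( n < 0 ) return BodyResult::Error;
                // a close looks the same whether the server finished or got killed,
                // so the only way to tell is that we haven't got the whole body yet
                if( n == 0 ) return BodyResult::Truncated;

                received += size_t( n );
        }

        return BodyResult::Complete;
}
```

The same idea applies to any protocol on top of TCP: either send a length up front or an explicit end marker, and treat a close before that point as an error.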

(The even bigger picture is that if you trust the other party at all, it will bite you)

24 Aug 2017 / ggformat

ggformat is the string formatting library I use in my game engine.

It's awesome because you can add your own types and you won't go grey waiting for it to compile. It's portable and doesn't allocate memory, so it's ideal for game engines. It's just printf, but better!

Get the real docs and the code from Github.

23 Aug 2017 / Saving scroll position when refreshing

I've noticed that when you refresh this blog (with Linux + Firefox) it always scrolls back up to the top. If you try to Google it you get a few results from people doing some insane Javascript infinite scrolling.

My blog doesn't do anything, so I've no idea what could be causing it.

23 Aug 2017 / Never update anything

I updated my Broadcom driver and my wifi stopped working. #archlinux made fun of me for running a two year old kernel, so I went ahead and upgraded that. It did fix my wifi but somehow it broke ASAN. WTF seriously

I caved to the stupid Whatsapp bug where it scrolls up to some random position whenever you open a conversation (which was introduced last time I updated) and the new version has some dumb crap merging photos together that is annoying and terribly implemented.

I always know in my head that literally nothing good can come from updating software but I still do it. It's a bad habit that I need to break. Of course on Linux that means I can never install new software because everything has to be dynamically linked against the latest versions and broken against everything else.

20 Aug 2017 / Ruoka Helsingissä

(heh that needed <meta charset="utf-8"> to render properly locally)

Some quick restaurant reviews for Helsinki, in no particular order:

16 Aug 2017 / SIGGRAPH 2017

I don't understand how anyone can say the food is so much better in America. I've been to NY and SF which are both supposed to be amazing for food, and just got back from LA.

Their reasoning is usually along the lines of "oh, if we Uber/subway for half an hour we can find a single restaurant that serves great food". Ok, but then if you try to pick somewhere at random it's very close to 100% chance that it's just awful. If you try to be smart, and only pick from places that have four or more stars online it's still very close to 100% chance that it's just awful.

It's fine if I'm planning a meal out somewhere and we can ask people what is actually good, but if I'm out and hungry and want to find somewhere quickly it's very annoying.

There's an absolutely ridiculous amount of signs in LA. Ads for everything, signs saying you must or must not do something plus some law reference code, all over the place. It's actually quite jarring going from Helsinki, which has less signage and even less that I can actually read, to the crazy visual pollution of LA.

Driving vehicles with unmuffled exhausts should be punishable with huge fines. Let's look at an extreme case: driving an unmuffled car through the middle of LA at night. Everyone within a block or two's radius will get woken up, and if you drive a few miles that's easily several thousand people. Now let's say that costs them $5 in lost productivity the next day. That's easily five digits of damage just because they wanted to be a shithead and do something which doesn't actually bring them any benefit.

The same logic goes for police sirens. If the guy you're chasing did less than $10k of damage, write the victim a cheque and be done with it, it's a net positive.

Don't share hotel rooms at conferences. I had plans to go see friends in SF after Siggraph, so I was very deliberately avoiding meeting people and shaking hands and going to bed on time so I could be healthy and fresh for fun times with my best friends. It very nearly backfired on me because one of the guys in my room got sick on the first day. I was lucky enough to stay good until I was waiting in SFO to go home. (but then the vacation days I booked to recover got wasted because I was sick and wouldn't have gone to work anyway)

America is big enough that NY -> LA takes almost as long as Helsinki -> NY. Fuck that! Helsinki -> SF is direct and I should hang out there both sides next time.

BTW the advances talks about cloud rendering in HZD and ocean rendering were awesome, as was the open problems one about using deep learning to learn TAA. The latter sounds like "oh god please no god", but the presenter was really very good at explaining everything and his videos were cool. Unfortunately I can't find any videos of the talks and only having slides is not so great.

The ocean rendering talk finally made clipmaps click for me. You upload a constant mesh with more triangles in the middle at init, then when rendering you snap it to integer coordinates and sample your heightmap texture in the vertex shader. Very simple! For performance I assume you can sample high mip levels at high lod levels to preserve locality when sampling. And if that's true you probably want to page out lower mip levels when you aren't using them to save memory.
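The snapping step is tiny. This is my own sketch of it and not code from the talk: snap_to_grid is a made-up name, and I've assumed you snap to a multiple of the grid spacing so it also works when the mesh cells aren't exactly one unit wide.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper. Snaps a camera coordinate down to a multiple of
// cell_size, so the mesh only ever moves in whole-cell steps and its
// vertices always land on the same heightmap samples.
float snap_to_grid( float camera, float cell_size ) {
        return std::floor( camera / cell_size ) * cell_size;
}
```

You'd do that for both horizontal axes and use the result as the mesh origin. Using floor rather than a cast matters for negative coordinates, because a cast rounds towards zero.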

14 Aug 2017 / Rust performance: finishing the job

Today I saw a story about profiling and optimising some Rust code. It's a nice little intro to perf, but the author stops early and leaves quite a lot on the table. The code he ends up with is:

pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
        let i = r / self.k;

        let mut count = 0;

        // count all the matching bytes b/t the closest checkpoint and our desired lookup
        for idx in (i * self.k) + 1 .. r + 1 {
                if bwt[idx] == a {
                        count += 1;
                }
        }

        // return the sampled checkpoint for this character + the manual count we just did
        self.occ[i][a as usize] + count
}

Let's factor out the hot loop to make it dead clear what's going on:

// BWTSlice is just [u8]
pub fn count(bwt: &[u8], a: u8) -> usize {
        let mut c = 0;
        for x in bwt {
                if *x == a {
                        c += 1;
                }
        }
        c
}

pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
        let i = r / self.k;
        self.occ[i][a as usize] + count(&bwt[(i * self.k) + 1 .. r + 1], a)
}

It's just counting the number of times a occurs in the array bwt. This code is totally reasonable and if it didn't show up in the profiler you could just leave it at that, but as we'll see it's not optimal.

BTW I want to be clear that I have no idea what context this code is used in or whether there are higher level code changes that would make a bigger difference, I just want to focus on optimising the snippet from the original post.


x86 has had instructions to perform the same basic operation on more than one piece of data for quite a while now. For example there are instructions that operate on four floats at a time, instructions that operate on a pair of doubles, instructions that operate on 16 8bit ints, etc. Generally, these are called Single Instruction Multiple Data instructions, and on x86 they fall under the MMX/SSE/AVX instruction sets. Since the loop we want to optimise is doing the same operation to every element in the array independently of one another, it seems like a good candidate for vectorisation. (which is what we call rewriting normal code to use SIMD instructions)
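If you've never seen SIMD code before, here's about the smallest example I can think of, using the "four floats at a time" case. add4 is a made-up name and it has nothing to do with the counting problem, it's only here to show what the intrinsics look like.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE

// Hypothetical helper. Adds four pairs of floats with one ADDPS
// instead of four scalar adds.
void add4( const float * a, const float * b, float * out ) {
        __m128 va = _mm_loadu_ps( a ); // load a[ 0 ] .. a[ 3 ]
        __m128 vb = _mm_loadu_ps( b ); // load b[ 0 ] .. b[ 3 ]
        _mm_storeu_ps( out, _mm_add_ps( va, vb ) );
}
```

The loadu/storeu variants work on unaligned pointers. The aligned versions (_mm_load_ps and friends) need 16 byte alignment, which is the same restriction that forces the peeling loop later in this post.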

Rewrite It In C++

I would have liked to have optimised the Rust code, and it is totally possible, but the benchmarking code for rust-bio does not compile with stable Rust, nor does the Rust SIMD library. There's not much I'd rather do less than spend ages dicking about downloading and installing other people's software to try and fix something that should really not be broken to begin with, so let's begin by rewriting the loop in C++. This is unfortunate because my timings aren't comparable to the numbers in the original blog post, and I'm not able to get numbers for the Rust version.

size_t count_slow( u8 * haystack, size_t n, u8 needle ) {
        size_t c = 0;
        for( size_t i = 0; i < n; i++ ) {
                if( haystack[ i ] == needle ) {
                        c++;
                }
        }
        return c;
}

As a fairly contrived benchmark, let's use this to count how many times the letter 'o' appears in a string containing 10000 Lorem Ipsums. To actually perform the tests I disabled CPU frequency scaling (this saves about 1.5ms!), wrapped the code in some timing boilerplate, ran the code a few times (by hand, so not that many), and recorded the fastest result. See the end of the post for the full code listing if you want to try it yourself.
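The timing boilerplate is nothing special. Something along these lines would do, though this is a sketch and not the exact harness from the full listing: fastest_ms is a made-up name and it loops in code where I re-ran the program by hand.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper. Runs f `runs` times and returns the fastest wall
// clock time in milliseconds.
template< typename F >
double fastest_ms( F f, int runs ) {
        assert( runs > 0 );
        double best = -1.0;
        for( int i = 0; i < runs; i++ ) {
                auto start = std::chrono::steady_clock::now();
                f();
                auto end = std::chrono::steady_clock::now();
                double ms = std::chrono::duration< double, std::milli >( end - start ).count();
                if( best < 0.0 || ms < best ) best = ms;
        }
        return best;
}
```

Taking the fastest run rather than the average is deliberate: anything that makes a run slower (scheduling, cache state, other processes) is noise, so the minimum is the closest you get to what the code itself costs. Make sure the result of the thing you're timing gets used somewhere, otherwise the optimiser is free to delete it.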

If we build with gcc -O2 -march=native (we really only need -mpopcnt. FWIW -march=native helps the scalar code more than it helps mine) the benchmark completes in 21.3ms. If we build with -O3 the autovectoriser kicks in and the benchmark completes in 7.09ms. Just to reiterate, it makes no sense to compare these with the numbers in the original article, but I expect if I was able to compile the Rust version it would be about the same as -O2.

Vectorising by hand

The algorithm we are going to use is as follows:

  1. The instructions we want to use only work on data that's aligned to a 16 byte boundary, so we need to run the slow loop a few times if haystack is not aligned (this is called "loop peeling")
  2. For each block of 16 bytes, we can compare all of them with needle at once to get a mask with the PCMPEQB instruction. Note that matches are set to 0xff (eight ones), rather than just a single one like C comparisons.
  3. We can count the number of ones in the mask, and divide it by eight to get the number of needles in that 16 byte block. x86 has POPCNT to count the number of ones in a 64 bit number, so we need to call that twice per 16 byte block.
  4. When we're down to less than 16 bytes remaining, fall back to the slow loop again.

(BTW see the followup post for a better approach)

Unsurprisingly the implementation is quite a bit trickier:

size_t count_fast( const u8 * haystack, size_t n, u8 needle ) {
        const u8 * one_past_end = haystack + n;
        size_t c = 0;

        // peel
        while( uintptr_t( haystack ) % 16 != 0 && haystack < one_past_end ) {
                if( *haystack == needle ) {
                        c++;
                }
                haystack++;
        }

        // haystack is now aligned to 16 bytes
        // loop as long as we have 16 bytes left in haystack
        __m128i needles = _mm_set1_epi8( needle );
        while( haystack < one_past_end - 16 ) {
                __m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
                __m128i cmp = _mm_cmpeq_epi8( needles, chunk );
                u64 pophi = popcnt64( _mm_cvtsi128_si64( _mm_unpackhi_epi64( cmp, cmp ) ) );
                u64 poplo = popcnt64( _mm_cvtsi128_si64( cmp ) );
                c += ( pophi + poplo ) / 8;
                haystack += 16;
        }

        // remainder
        while( haystack < one_past_end ) {
                if( *haystack == needle ) {
                        c++;
                }
                haystack++;
        }

        return c;
}

But it's totally worth it, because the new code runs in 2.74ms, which is 13% the time of -O2, and 39% the time of -O3!


Since the loop body is so short, evaluating the loop condition ends up consuming a non-negligible amount of time per iteration. The simplest fix for this is to check whether there are 32 bytes remaining instead, and run the loop body twice per iteration:

while( haystack < one_past_end - 32 ) {
        {
                __m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
                haystack += 16; // note I also moved this up. seems to save some microseconds
                __m128i cmp = _mm_cmpeq_epi8( needles, chunk );
                u64 pophi = popcnt64( _mm_cvtsi128_si64( _mm_unpackhi_epi64( cmp, cmp ) ) );
                u64 poplo = popcnt64( _mm_cvtsi128_si64( cmp ) );
                c += ( pophi + poplo ) / 8;
        }
        {
                __m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
                haystack += 16;
                __m128i cmp = _mm_cmpeq_epi8( needles, chunk );
                u64 pophi = popcnt64( _mm_cvtsi128_si64( _mm_unpackhi_epi64( cmp, cmp ) ) );
                u64 poplo = popcnt64( _mm_cvtsi128_si64( cmp ) );
                c += ( pophi + poplo ) / 8;
        }
}

It's just a little bit faster, completing the benchmark in 2.45ms, which is 89% of the vectorised loop, 12% of -O2, and 35% of -O3.


For reference, here's a little table of results:

Version       Time     % of -O2   % of -O3
Scalar -O2    21.3ms   100%       -
Scalar -O3    7.09ms   33%        100%
Vectorised    2.74ms   13%        39%
Unrolled      2.45ms   12%        35%

Hopefully this post has served as a decent introduction to vectorisation, and has shown you that not only can you beat the compiler, but you can really do a lot better than the compiler without too much difficulty.

I am no expert on this so it's very possible that there's an even better approach (sure enough, there is). I should really try writing the loop in assembly by hand just to check the compiler hasn't done anything derpy.

Full code

For reference if you want to try it yourself. It should compile and run anywhere x86, let me know if you have problems.

14 Jul 2017 / Fixing the Visual Studio forms designer

More to file under "things that are excruciatingly stupid so nobody smart writes about them".

One thing that causes the designer to shit the bed is if your form isn't at the top of the non-designer file. So a specific example:

namespace Things {
        class FuckingEverythingUp { }
        class MyForm : Form {
                // ...
        }
}

will not work (and will give you a useless error message about dragging and dropping from the components window). You need to move MyForm above FuckingEverythingUp. Btw Microsoft made $85 billion revenue last year and has over 100k employees.

The other thing that's not so obvious to work around (but still pretty obvious) is custom form components. In our case we have a few hacked components to enable text anti-aliasing (lol), but the designer can't handle them. I got sick of going into the designer file and replacing them all with normal labels whenever I wanted to change the UI, so I added methods like ConvertLabelToHackLabel and call them in the form constructor. All they do is make a new thing and copy all the properties over.

The only things that are non-trivial are copying events, which is copied and pasted from StackOverflow thusly:

using System.Reflection;

// ...

var eventsField = typeof(Component).GetField("events", BindingFlags.NonPublic | BindingFlags.Instance);
var eventHandlerList = eventsField.GetValue(originalButton);
eventsField.SetValue(hackedButton, eventHandlerList);

and making sure you update the form's AcceptButton to point at the hacked button as needed.


04 Jul 2017 / Vim

Pressing o in visual mode moves the cursor to the other end of the selection. So if you're selecting downwards it moves the cursor to the top and lets you select upwards.

30 Jun 2017 / Wat

Found this in the arch repos:

core/mkinitcpio-nfs-utils 0.3-5
    ipconfig and nfsmount tools for NFS root support in mkinitcpio

Seems awesome!

"My internet is flaking and now I can't turn my PC on"

"My internet is flaking and now I can't run any programs"

"My internet is slow so running cowsay took half an hour"

30 Jun 2017 / C++ tricks: least effort conditional breakpoints

Let's say you want to place a breakpoint deep in some leaf code, but only when the user presses a key.

For a more concrete example, my recent refactoring broke collision detection on some parts of the map. I want to be able to point the camera at a broken spot, press a key, and step through the collision routines to see what went wrong. My terrain collision routines use quadtrees and hence are recursive, and I'd like to be able to break fairly close to the leaf nodes to minimise the amount of stepping, but still before the leaf nodes in case anything interesting happens.

Debuggers have conditional breakpoints but I doubt they can express something so complex, and I don't want to learn another shitty meta language on top of the real programming language I already use which is inevitably different for each debugger I use.

Obviously a simple hack is to add a global variable, but this happens so often it would be nice to leave them in the entire time. In my case I added extern bool break1; extern bool break2; etc to one of my common headers, put bool break1 = false; bool break2 = false; etc in breakbools.cc, and added that to my list of common objects.

Then adding the breakpoint I want is very simple. High up in my frame loop I add break1 = input->keys[ KEY_T ], and in my collision routine I add something like if( break1 && node bounding box is sufficiently small ) asm( "int $3" ), and it does exactly what I want. (for MSVC you need __debugbreak(); instead of int 3)
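Put together it looks something like this. It's a sketch and not lifted from my engine: DEBUG_BREAK, want_break and the 4.0f size threshold are made up for the example, and the raise( SIGTRAP ) fallback for non-x86 targets is my assumption. The MSVC and int 3 branches are the ones mentioned above.

```cpp
#include <cassert>
#include <csignal>

#if defined( _MSC_VER )
#  define DEBUG_BREAK() __debugbreak()
#elif defined( __i386__ ) || defined( __x86_64__ )
#  define DEBUG_BREAK() asm volatile( "int $3" )
#else
#  define DEBUG_BREAK() raise( SIGTRAP ) // assumed fallback for other POSIX targets
#endif

// normally declared extern in a common header and defined in breakbools.cc
bool break1 = false;

// hypothetical condition: only stop once we are near the leaf nodes
bool want_break( float node_size ) {
        return break1 && node_size < 4.0f;
}

// hypothetical stand-in for the recursive collision routine
void visit_node( float node_size ) {
        if( want_break( node_size ) ) {
                DEBUG_BREAK();
        }
        // ... actual collision work goes here
}
```

The frame loop then sets break1 from the key state every frame, so the breakpoint is only live while the key is held down.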

29 Jun 2017 / Writing installers for Windows

Writing a software installer for Windows is apparently a slog of people with weird configs and requests asking for things that are impossible to implement nicely.

Everyone has to do this and it's conceptually so trivial (extract an archive) so it's baffling how this is so difficult to get right, and it's crazy to think about how much time is wasted on this shit. I'm not a fuckup, and in total I've probably wasted several days on this.

The biggest roadblock is that Google is just completely worthless. You try to search for something and the results are saturated with absolute shit that's totally unrelated because for whatever reason Google puts huge weight on popular/recent articles that are only very loosely related to what you want. "Oh he has Windows and uninstall in the query, let's return millions of forum posts asking how to uninstall software!" etc. Of course that means this blog post is excruciatingly dull to write with no benefit because nobody can find it.

The hardest part is getting a nice system-wide or single-user installation without running into UAC sadness.

I know this is an extremely boring topic, but that's exactly why I want to write about it. When I run into stuff this dull my brain switches off and it takes me 10x longer than it should. If someone had told me exactly how to deal with this up front and I could have just autopiloted through, it would have been a huge win.

The ideal way would be to only request admin rights if they want to do a system-wide installation, which requires you to re-exec the installer and ask for admin, then implement some hacks to jump to the right screen. Also cross your fingers that browsers don't delete the installer as soon as it exits if you click "run" instead of "save". Not sure if any of them actually do that but it's a huge amount of testing that nobody wants to do and very fragile against the instability of webshit.

So you give up on doing it properly and always present a UAC dialog when they run the installer. To save you some time Googling, the right way to do this is with MultiUser.nsh (*). The docs for it are ok, but they crucially don't cover how the uninstaller should identify what version it should remove. Ideally you should be able to install both system-wide and per-user at the same time, and be able to uninstall them both separately (not because this is a valuable thing to do but because it shows that your uninstaller can figure out the right thing to do). The answer is !define MULTIUSER_INSTALLMODE_COMMANDLINE and add /$MultiUser.InstallMode to the end of your UninstallString key (so the installer stores what mode control panel should run the uninstaller with). You DON'T need to do anything funny to make sure your uninstaller registry keys get written to the right place (HKLM for system-wide, HKCU for single-user), just use WriteRegStr SHCTX ... and it'll do the right thing.

(*): I came back and reread this post. By "the right way" I don't mean that MultiUser.nsh is good. It gets a lot of things wrong, but it's the least bad option. This other plugin looks a bit better but I didn't try it.

Btw have fun testing this. You have to log in and out after every one-line change (sloooooooow on Windows), and you'll never notice if you break something later on because surely you aren't going to leave UAC enabled.

Another topic that's annoying to get right: uninstaller signing. Make a stub installer that only writes the uninstaller and quits, sign the uninstaller, then add it with File, ...

25 Jun 2017 / C++ tricks: NO_INIT

This one is very simple and I'm surprised I've not written it down already.

Default initialisation is widely considered to be good, but if you're being a performance nut you might want to opt-out. In D you can do int x = void;. Rust apparently has mem::uninitialized(). You can do the same thing in C++ thusly:

enum class NoInit { DONT };
#define NO_INIT NoInit::DONT // if you want
struct v3 {
        float x, y, z;
        v3() { x = y = z = 0; }
        v3( NoInit ) { }
};

v3 a;
v3 b( NO_INIT );
// a = (0, 0, 0) b = (garbage)

On a similar note, in my code I prefer to design my structs so that all zeroes is the default state, and then my memory managers all memset( 0 ) when you allocate something. I find it easier than getting proper construction right, and I've heard that echoed by a few other people so I guess it's not a totally bunk idea.
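A rough sketch of what the zeroing allocator idea can look like. alloc_zeroed and Particle are hypothetical names for illustration, not taken from my actual memory managers, and for plain malloc you could of course just use calloc.

```cpp
#include <stdlib.h>
#include <string.h>

// Hypothetical allocator wrapper: every allocation comes back zeroed, so a
// struct whose default state is all zeroes needs no constructor.
template< typename T >
T * alloc_zeroed( size_t n = 1 ) {
        void * p = malloc( sizeof( T ) * n );
        if( p == NULL )
                return NULL;
        memset( p, 0, sizeof( T ) * n );
        return ( T * ) p;
}

// Designed so that all zeroes means "empty": sitting at the origin, no
// bounces yet, not alive.
struct Particle {
        float x, y, z;
        int num_bounces;
        bool alive;
};
```

Usage is Particle * ps = alloc_zeroed< Particle >( 64 ); and every element is immediately in its default state.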

01 May 2017 / Pwn3d

lol I got a computer virus.

I noticed a WmiDrvSSE.exe burning 75% of my CPU, and promptly killed it. It immediately came back and went back to thrashing, so I looked a little harder. Everything says it's in C:\Windows\debug\WmiDrvSSE.exe, right click and get properties, not signed by Microsoft. Ok let's go look, it's not there but there is a PASSWD.log (apparently this is a normal Windows file?). Turn on the show system files option and there's a bunch of fucking DLLs like curl and iconv and winpthreads, so I rename the exe and kill it again, which stops it coming back.

Check my process list for anything else, winl0gon.exe, kill this shit, it immediately respawns itself, rename that file too (same folder) and kill it again and it stays dead. There's also a RegisterService.exe, so I check my services and sure enough there's a Windows FirewalI entry pointing at winl0gon.

I immediately assumed that a virus burning my CPU (cleverly it left one core idle so most people wouldn't notice it) would be running crypto ops, but fortunately VirusTotal seems to think it was just a bitcoin miner. Checked the file creation date and they only got to mine for like 20 minutes.

The big question is of course, how did it get in? I was lucky enough to catch it within half an hour so I could remember what I was doing at its file creation time: I was browsing the web like normal.

This is one of the things security conscious people get wrong a lot. Package signing is useless. Reproducible builds are useless. I have no real reason to be worried about my government attacking me through insane channels. Nobody is going to bother, when my web browser doubles up as an unauthenticated shell server.

I asked in the firefox IRC channel if there were any known exploits in 51.0.1 and was immediately chastised for being a few versions out of date. Lol. So my choice is malware, or constantly broken UI and extensions. (UPDATE: Firefox 52 drops support for ALSA on Linux too)

Does anyone know how to run Firefox in a sandbox? Like a real sandbox, not the useless tab sandbox Firefox already has built in, I want UAC dialogs every time it tries to read or write anything outside its installation directory, every time it tries to create a file. I could run whatever the fuck version of Firefox I want to and not have to worry about this.

For Googler's sake, the full list of files was:


and all of them are system files so you have to go into folder options and turn those on to be able to see them.

22 Apr 2017 / Progress: libinput

UPDATE: rm /usr/share/X11/xorg.conf.d/40-libinput.conf and everything works again. Also probably don't read the rest of this because it sucks.

Some idiots have rewritten the X input layer and called it "libinput".

Unsurprisingly this breaks all of your existing configs and tools (xset and the old synaptics configs don't work) with no benefit.

But it gets better: you can no longer change your mouse sensitivity.

You're supposed to be able to install a separate program, xinput, which prints out this:

Virtual core pointer                          id=2    [master pointer  (3)]
  - Virtual core XTEST pointer                id=4    [slave  pointer  (2)]
  - Laview Technology Xornet II               id=9    [slave  pointer  (2)]
  - Laview Technology Xornet II               id=10   [slave  pointer  (2)]
  - Wacom Intuos PT S Pen stylus              id=11   [slave  pointer  (2)]
  - Wacom Intuos PT S Finger touch            id=12   [slave  pointer  (2)]
  - Wacom Intuos PT S Pad pad                 id=13   [slave  pointer  (2)]
  - Wacom Intuos PT S Pen eraser              id=18   [slave  pointer  (2)]
Virtual core keyboard                         id=3    [master keyboard (2)]
  - Virtual core XTEST keyboard               id=5    [slave  keyboard (3)]
  - Power Button                              id=6    [slave  keyboard (3)]
  - Power Button                              id=7    [slave  keyboard (3)]
  - Sleep Button                              id=8    [slave  keyboard (3)]
  - USB Keyboard                              id=14   [slave  keyboard (3)]
  - USB Keyboard                              id=15   [slave  keyboard (3)]
  - Eee PC WMI hotkeys                        id=16   [slave  keyboard (3)]
  - Laview Technology Xornet II               id=17   [slave  keyboard (3)]

and then you're supposed to run xinput --list-props XX where XX is all the ids, and it prints a huge pile of shit:

Device 'Laview Technology Xornet II':
        Device Enabled (152):   1
        Coordinate Transformation Matrix (154): 1.000000, 0.000000, 0.000000, 0.000000, 1.000000, 0.000000, 0.000000, 0.000000, 1.000000
        libinput Accel Speed (286):     0.000000
        libinput Accel Speed Default (287):     0.000000
        libinput Accel Profiles Available (288):        1, 1
        libinput Accel Profile Enabled (289):   1, 0
        libinput Accel Profile Enabled Default (290):   1, 0
        libinput Natural Scrolling Enabled (291):       0
        libinput Natural Scrolling Enabled Default (292):       0
        libinput Send Events Modes Available (271):     1, 0
        libinput Send Events Mode Enabled (272):        0, 0
        libinput Send Events Mode Enabled Default (273):        0, 0
        libinput Left Handed Enabled (293):     0
        libinput Left Handed Enabled Default (294):     0
        libinput Scroll Methods Available (295):        0, 0, 1
        libinput Scroll Method Enabled (296):   0, 0, 0
        libinput Scroll Method Enabled Default (297):   0, 0, 0
        libinput Button Scrolling Button (298): 2
        libinput Button Scrolling Button Default (299): 2
        libinput Middle Emulation Enabled (300):        0
        libinput Middle Emulation Enabled Default (301):        0
        Device Node (274):      "/dev/input/event23"
        Device Product ID (275):        9494, 43
        libinput Drag Lock Buttons (302):       <no items>
        libinput Horizontal Scroll Enabled (303):       1

Note that none of those are mouse sensitivity. Google says it's supposed to be called Device Accel Constant Deceleration (wtf), but that's not there.

I'm especially annoyed with the distro maintainers for this. When there's a breaking change coming, it's everyone's responsibility to push back against it. When you have tens of thousands of users, even small breaking changes add up to multiple man years of effort. You have to ask, "is this update worth millions of dollars of people's time?"

When the answer is no, as a developer you scrap it and do something better. As a middle man (distro maintainers etc) you tell them to go away and do better. As a user, wave goodbye to many hours of your time because nobody cares about the cost/benefit.

Jesus christ.

11 Apr 2017 / bug489729

bug489729 was an awesome Firefox extension that disabled the shit where dragging a tab (which happens all the time by accident) off the tab bar causes it to open in a new window (which takes like a full second on a 6700K and makes all my windows resize)

(and of course you can't turn it off without a fucking extension)

I'm rehosting it here, mostly for my own convenience: bug.xpi

01 Apr 2017 / C++ tricks: better casting

C style casts are not awesome. Their primary use is to shut up conversion warnings when you assign a float to an int etc. This is actually harmful and can mask actual errors down the line when you change the float to something else and it starts dropping values in the middle of your computation.

Some other nitpicks are that they are hard to grep for and can be hard to parse.

In typical C++ fashion, static_cast and friends solve the nitpicks but do absolutely nothing about the real problem. Fortunately, C++ gives us the machinery to solve the problem ourselves. This first one is copied from Charles Bloom:

template< typename To, typename From >
inline To checked_cast( const From & from ) {
        To result = To( from );
        ASSERT( From( result ) == from );
        return result;
}

If you're ever unsure about a cast, use checked_cast and it will assert if the cast starts eating values. Even if you are sure, use checked_cast anyway for peace of mind. It lets you change code freely without having to worry about introducing tricky bugs.

Another solution is to specify the type you're casting from as well as the type you're casting to. The code for this is a bit trickier:

template< typename S, typename T >
struct SameType {
        enum { value = false };
};
template< typename T >
struct SameType< T, T > {
        enum { value = true };
};

#define SAME_TYPE( S, T ) SameType< S, T >::value

template< typename From, typename To, typename Inferred >
To strict_cast( const Inferred & from ) {
        STATIC_ASSERT( SAME_TYPE( From, Inferred ) );
        return To( from );
}

You pass the first two template arguments and leave the last one to template deduction, like int a = strict_cast< float, int >( 1 ); (which explodes). I've not actually encountered a situation where this is useful yet, but it was a fun exercise.

Maybe it's good for casting pointers?

25 Mar 2017 / Least effort unit tests

I wanted a C++ unit testing library that isn't gigantic and impossible to understand, doesn't blow up compile times, doesn't need much boilerplate, doesn't put the testing code miles from the code being tested, and doesn't have its own silly build requirements that make it a huge pain in the ass to use. Unfortunately all the C++ testing libraries are either gigantic awful monoliths (e.g. googletest), or tiny C libraries that are a little too inconvenient to actually use (e.g. minunit).

Ideally it wouldn't give you awful compiler errors when you get it wrong but that's probably impossible.

(caveat: I didn't actually look very hard at existing work because it would take more time than just writing my own)


#pragma once

#if defined( UNITTESTS )

#include <stdio.h>

#define CONCAT_HELPER( a, b ) a##b
#define CONCAT( a, b ) CONCAT_HELPER( a, b )
#define COUNTER_NAME( x ) CONCAT( x, __COUNTER__ )

#define AT_STARTUP( code ) \
        namespace COUNTER_NAME( StartupCode ) { \
                static struct AtStartup { \
                        AtStartup() { code; } \
                } AtStartupInstance; \
        }

#define UNITTEST( name, body ) \
        namespace { \
                AT_STARTUP( \
                        int passed = 0; \
                        int failed = 0; \
                        puts( name ); \
                        body; \
                        printf( "%d passed, %d failed\n\n", passed, failed ); \
                ) \
        }

#define TEST( p ) \
        if( !( p ) ) { \
                failed++; \
                puts( "    FAIL: " #p ); \
        } \
        else { \
                passed++; \
        }

#define private public
#define protected public

#else

#define UNITTEST( name, body )
#define TEST( p )

#endif

It uses the nifty cinit constructor trick to run your tests before main, and you can dump UNITTESTs anywhere you like (at global scope). Example usage:

#include <stdio.h>
#include "ggunit.h"

int main() {
        printf( "main\n" );
        return 0;
}

UNITTEST( "testing some easy stuff", {
        TEST( 1 == 1 );
        TEST( 1 == 2 );
} );

UNITTEST( "testing some more easy stuff", {
        for( size_t i = 0; i <= 10; i++ ) {
                TEST( i < 10 );
        }
} );

which prints:

testing some easy stuff
    FAIL: 1 == 2
1 passed, 1 failed

testing some more easy stuff
    FAIL: i < 10
10 passed, 1 failed


It would be great if you could put UNITTESTs in the middle of classes etc to test private functionality, but you can't and the simplest workaround is #define private/protected public. Super cheesy but it works. It's not perfect but it works. It's ugly. It's not a Theoretically Awesome Injection Framework Blah Blah. It works and is two lines, you can't beat it.
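As for the constructor trick that runs the tests before main, here it is in isolation. This is a stripped-down sketch with made-up names (RunBeforeMain, startup_runs), not a piece of ggunit.h. It relies on the constructor of a static object at namespace scope running during static initialisation, which in practice happens before main is entered.

```cpp
#include <stdio.h>

// Plain zero initialisation, so this is ready before any constructor runs.
static int startup_runs = 0;

// Declaring the struct and an instance of it in one go means the
// constructor body gets executed once at program startup.
static struct RunBeforeMain {
        RunBeforeMain() {
                startup_runs++;
                puts( "hello from before main" );
        }
} run_before_main_instance;
```

AT_STARTUP does the same thing, except it wraps each instance in a uniquely named namespace so you can use it as many times as you like in one file.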

23 Mar 2017 / Caches are fast, hashes are fast

Or, how to make a C/C++ build system in 2017

Here's a problem I've been having at work a lot lately:

Obviously the dream solution here would be to have good compilers(*) and/or a good language, but neither of those are going to happen any time soon.

*: as an aside that would solve one of the big pain points with C++. Everybody goes to these massive efforts splitting up the build so they can build incrementally which introduces all of its own headaches and tracking dependencies and making sure everything is up to date and etc and it's just awful. If compilers were fast we could just build everything at once every time and not have to deal with it.

Anyway since they aren't going to happen, the best we can do is throw computing power at the problem. Most build systems build some kind of dependency graph, e.g. bin.exe depends on mod1.obj which depends on mod1.cpp, then look at what has been modified and recompile everything that depends on it. Specifically they look at the last modified times and if a file's dependencies are newer than it then you need to rebuild.

Maybe that was a good idea decades ago, but these days everything is in cache all of the time and CPUs are ungodly fast, so why not take advantage of that, and just actually check if a file's contents are different? I ran some experiments with this at work. We have 55MB of code (including all the 3rd party stuff we keep in the repo - btw the big offenders are like qt and the FBX SDK, it's not our codebase with the insane bloat), and catting it all takes 50ms. We have hashes that break 10GB/s (e.g. xxhash), which will only add like 5 ms on top of that. (and probably much closer to 0 if you parallelise it)

So 55ms. I'm pretty sure we don't have a single file in our codebase that builds in 55ms.

From this point it's pretty clear what to do: for each file you hash all of its inputs and munge them together, and if the final hash is different from last time you rebuild. Don't cache any of the intermediate results just do all the work every time, wasting 55ms per build is much less bad than getting it wrong and wasting minutes of my time. Btw inputs should also include things like command line flags, like how ninja does it.

The only slightly hard part is making sure hashes don't get out of sync with reality. Luckily I'm a genius and solved that too: you just put the hash in the object file/binary. With ELF files you can just add a new section called .build_input_hash or something and dump it in there, presumably you can do the same on Windows too (maybe IMAGE_SCN_LNK_INFO? I spent a few minutes googling and couldn't find an immediate answer).

For codegen stages you would either just run them all the time or do the timestamp trick I guess, since we are ignoring their timestamps and hopefully your codegen is not slow enough for it to matter very much.

Anyone want to work on this? I sure don't because my god it's boring, but I wish someone else would.

UPDATE: I've been told you can work around my specific example with git branch and git rebase --onto (of course), but this would still be nice to have.

02 Mar 2017 / C++ tricks: ZERO

After writing memset( &x, 0, sizeof( x ) ); for the millionth time, you might start to get lazy and decide it's a good idea to #define ZERO( p ) memset( p, 0, sizeof( *p ) );. This turns out to be very easy to misuse:

int x;
ZERO( &x ); // cool
int y[ 8 ];
ZERO( &y ); // cool
ZERO( y ); // y[ 0 ] = 0, no warnings
int * z = y;
ZERO( z ); // y[ 0 ] = 0
ZERO( &z ); // z = NULL

You can try things like making ZERO take the object itself instead of a pointer, but you still always end up with cases where the compiler won't tell you that you screwed up. The problem is that there's no way for ZERO to do the right thing to a pointer because it can't know how big the object being pointed at is. The simplest solution is to not allow that:

template< typename T > struct IsAPointer { enum { value = false }; };
template< typename T > struct IsAPointer< T * > { enum { value = true }; };

template< typename T >
void zero( T & x ) {
        static_assert( !IsAPointer< T >::value );
        memset( &x, 0, sizeof( x ) );
}

and as a bonus, we can use the same trick from last time to make it work on fixed-size arrays too:

template< typename T, size_t N >
void zero( T ( &x )[ N ] ) { // by reference, a plain T x[ N ] parameter decays to T * and N can't be deduced
        memset( x, 0, sizeof( T ) * N );
}

Neat! (maybe)

02 Mar 2017 / C++ tricks: safe ARRAY_COUNT

Lots of C/C++ codebases have a macro for finding the number of elements in a fixed size array. It's usually defined as #define ARRAY_COUNT( a ) ( sizeof( a ) / sizeof( ( a )[ 0 ] ) ), which is great:

int asdf[ 4 ]; // ARRAY_COUNT( asdf ) == 4

until someone comes along and decides that asdf needs to be dynamically sized and changes it to be a pointer instead:

int * asdf; // ARRAY_COUNT( asdf ) == sizeof( int * ) / sizeof( int ) != 4

Now every piece of code that uses ARRAY_COUNT( asdf ) is broken, which is annoying by itself, but that still looks totally fine to the compiler and it's not even going to warn you about it.

The fix is some appalling looking C++:

template< typename T, size_t N >
char ( &ArrayCountObj( const T ( & )[ N ] ) )[ N ];
#define ARRAY_COUNT( arr ) ( sizeof( ArrayCountObj( arr ) ) )

which correctly explodes when you pass it a pointer:

main.cc: In function "int main()":
main.cc:5:57: error: no matching function for call to "ArrayCountObj(int*&)"
 #define ARRAY_COUNT( arr ) ( sizeof( ArrayCountObj( arr ) ) )
main.cc:9:9: note: in expansion of macro "ARRAY_COUNT"
  return ARRAY_COUNT( a );
main.cc:4:9: note: candidate: template<class T, long unsigned int N> char (& ArrayCountObj(const T (&)[N]))[N]
 char ( &ArrayCountObj( const T ( & )[ N ] ) )[ N ];
main.cc:4:9: note:   template argument deduction/substitution failed:
main.cc:5:57: note:   mismatched types "const T [N]" and "int*"
 #define ARRAY_COUNT( arr ) ( sizeof( ArrayCountObj( arr ) ) )
main.cc:9:9: note: in expansion of macro "ARRAY_COUNT"
  return ARRAY_COUNT( a );

31 Jan 2017 / Dumping a git repository to an encrypted zip file

ADDENDUM: This is trash, use gitolite to give your work PC read only access instead.

I want to be able to access my dotfiles repository from anywhere without actually giving people public access. I don't want to fuck about making a new user account with restricted shell/setting up a massive web server/etc.

The simplest solution I can think of is making a post-receive hook that dumps the repository to an encrypted zip and copying that to the (static) web root, which is done like this:

#! /bin/sh


rm "$OUT"
git archive master | 7z a -sidotfiles.tar -ppassword -mhe=on "$OUT"

Ok so it's a 7z not a zip, but some zip implementations (like Windows Explorer) only support shitty encryption so you were going to have to install 7zip anyway.

25 Jan 2017 / Windows post-install for developers

This is just a checklist for myself covering what to do with a fresh Windows installation. It covers disabling all the annoying crap Windows comes with by default, updating manually because Windows Update is broken in Windows 7 SP1, and a list of handy programs.

  1. Install drivers and reboot.
  2. Go to services.msc, stop and disable: Superfetch, Windows Defender, Windows Firewall, Windows Search.
  3. Go to Control Panel and view by small icons. Go to Administrative Tools, Computer Management, Local Users and Groups, Users, right click Administrator and enable it. Log in as Administrator. This is now your user account, and you can delete your old one.
  4. Right click the start menu, click properties, use small icons, never combine taskbar buttons, unlock, drag to left, lock. Under the start menu tab, click customise, then disable Devices and Printers, Games, Help, Highlight newly installed programs, Music, Pictures, and Use large icons.
  5. Click the start menu, right click Computer, Advanced system settings (in the sidebar), Startup and Recovery settings, disable Automatically restart. Close that window and go to Performance settings. Uncheck lots of crap.
  6. Right click the desktop, enable Windows Classic theme.
  7. Go to Control Panel, then Action Center and disable UAC (in the sidebar), then go to Change Action Center settings (also in the sidebar) and disable problem reporting and all the messages (except for maybe Windows Update). Go to AutoPlay and disable that too. Go to Mouse and disable enhanced pointer precision. Go to Sound and select the No Sounds sound scheme. Go to Power Options and disable monitor/PC sleeping.
  8. In the start menu, search for folder options. Go to view and show hidden files and known extensions.
  9. In Computer, right click your C drive, go to Security, Advanced, Change Permissions, click your name, Edit, check full control, click OK, check "Replace all child permissions...", click OK.
  10. Install Firefox.
  11. Install MSVC Community 2013. Uncheck all the optional features. This download is huge so start it first!
  12. Install the Windows 7 convenience rollup update and its dependencies.
  13. Download Autoruns, Process Explorer, and Process Monitor.
  14. Install Color Cop, AutoHotkey, 7-Zip, Everything.
  15. Install Git for Windows, Notepad++, and Vim.
  16. Install Renderdoc, Apitrace, Intel GPA, and the DirectX SDK.
  17. Install GIMP, Inkscape, Blender, Wings3D, and Meshlab.
  18. Download the Cygwin installer. Install tmux, openssh, lua and vim.
  19. In a cygwin shell, run ssh-host-config, and follow the prompts. chown cyg_server: /var/empty; chmod 700 /var/empty; net start sshd.

29 Dec 2016 / Billions

When people are trying to sell candidates on their company, they like to throw around statistics like "we do billions of Xs per year", or if they want to pull out the really big guns, "we do one billion Xs per day".

I (and presumably everyone else) hear these statistics and think "wow a billion is a big number that's impressive" and don't think too much more about it.

Until now! Let's figure out just how impressive these numbers are. Assuming 365 * 24 * 60 * 60 seconds in a year, one billion Xs per year is 32 Xs per second, which is actually not impressive at all. If we try again with one billion per day (86400 seconds in a day) we get 11.5k per second, which is back into impressive territory... until you think about it.
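If you want the compiler to check my maths, here it is as constants. The names are made up for the example and the year is a plain 365 days like above.

```cpp
constexpr double SECONDS_PER_DAY = 24.0 * 60.0 * 60.0;       // 86400
constexpr double SECONDS_PER_YEAR = 365.0 * SECONDS_PER_DAY; // 31536000

constexpr double BILLION = 1e9;

// One billion Xs per year, expressed per second: a bit under 32.
constexpr double per_second_if_billion_per_year = BILLION / SECONDS_PER_YEAR;

// One billion Xs per day, expressed per second: a bit over 11.5k.
constexpr double per_second_if_billion_per_day = BILLION / SECONDS_PER_DAY;

static_assert( per_second_if_billion_per_year > 31.0 && per_second_if_billion_per_year < 32.0, "about 32 per second" );
static_assert( per_second_if_billion_per_day > 11500.0 && per_second_if_billion_per_day < 11600.0, "about 11.5k per second" );
```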

My crappy game engine on my laptop with integrated graphics does 1M verts per second without the fans spinning up. AAA games routinely do 100x that - 4 orders of magnitude higher than 1b/day.

I should sell my engine by telling people it can do QUADRILLIONS of verts per year!!!!!!!!

23 Apr 2016 / Auto-mounting removable drives

Put this in /etc/udev/rules.d/10-automount.rules:

KERNEL!="sd[c-z][1-9]", GOTO="media_by_label_auto_mount_end"

# Global mount options
ACTION=="add", ENV{mount_options}="relatime,users,sync"

# Filesystem specific options
ACTION=="add", PROGRAM=="/lib/initcpio/udev/vol_id -t %N", RESULT=="vfat|ntfs", ENV{mount_options}="$env{mount_options},utf8,gid=100,umask=002"
ACTION=="add", PROGRAM=="/lib/initcpio/udev/vol_id --label %N", ENV{dir_name}="%c"
ACTION=="add", PROGRAM!="/lib/initcpio/udev/vol_id --label %N", ENV{dir_name}="usbhd-%k"
ACTION=="add", RUN+="/bin/mkdir -p /mnt/%E{dir_name}", RUN+="/bin/mount -o $env{mount_options} /dev/%k /mnt/%E{dir_name}"
ACTION=="remove", ENV{dir_name}=="?*", RUN+="/bin/umount -l /mnt/%E{dir_name}", RUN+="/bin/rmdir /mnt/%E{dir_name}"

You need to change the first line (specifically the [c-z] bit) if you have more (or less) than two non-removable drives. I don't know exactly how it works, but it does the job. I copied it from the arch wiki years ago and I'm putting it here for my own reference.

27 Dec 2015 / Moving to OpenBSD

My last VPS provider got bought out, so I figured now would be a good time to buy a more respectable (i.e. not found on lowendstock) server, which also gives me the opportunity to experiment with OpenBSD.

As expected, almost everything has gone smoothly. However, there have been a couple of pain points, which I'll document here for future me and any lucky Googlers.


If you copy and paste a working config from a Linux server, the clients can all ping the server, but you get an error like "No route to host" when you try it the other way around. This turned out to be a pf one-liner:

pass out on tun0 to

(where tun0 is my VPN's interface and is my VPN's subnet)


There isn't a luarocks port yet, and building can be slightly annoying. It goes like this:

./configure --sysconfdir=/etc/luarocks \
    --lua-version=5.1 \
    --lua-suffix=51 \
make build
make install

The configure lines for 5.2 and 5.3 look like you would expect. The install step creates a symlink from luarocks to luarocks-5.x so if you are lazy like me you should correct that to point at your favourite version.


The latest stable release doesn't work with Lua 5.3, and for some reason the lua-ev rock doesn't look in /usr/local/include. We can fix the former by installing the scm version, and the latter with CFLAGS:

luarocks install \
    https://luarocks.org/manifests/brimworks/lua-ev-scm-1.rockspec \