mikejsavage.co.uk

29 Jun 2024 • C++ tricks: enum arithmetic

It's annoying having to write stuff like

for( MyEnum i = MyEnum( 0 ); i < MyEnum_Count; i = MyEnum( i + 1 ) ) {

Doubly so with enums that are really bitfields. You can actually do arithmetic on those, but the operators return non-enum types and C++ doesn't allow the implicit cast back to the enum in all situations.

At some point you might get sick of this and start implementing arithmetic operators on your enums, but why not just do it for all of them?

template< typename E > concept IsEnum = __is_enum( E );
template< typename E > using UnderlyingType = __underlying_type( E );

template< IsEnum E > void operator++( E & x, int ) { x = E( UnderlyingType< E >( x ) + 1 ); }
template< IsEnum E > void operator&=( E & lhs, E rhs ) { lhs = E( UnderlyingType< E >( lhs ) & UnderlyingType< E >( rhs ) ); }
template< IsEnum E > void operator|=( E & lhs, E rhs ) { lhs = E( UnderlyingType< E >( lhs ) | UnderlyingType< E >( rhs ) ); }

// you can do these in base C++ but they return ints and MyEnum x = int; doesn't compile
template< IsEnum E > constexpr E operator&( E lhs, E rhs ) { return E( UnderlyingType< E >( lhs ) & UnderlyingType< E >( rhs ) ); }
template< IsEnum E > constexpr E operator|( E lhs, E rhs ) { return E( UnderlyingType< E >( lhs ) | UnderlyingType< E >( rhs ) ); }
template< IsEnum E > constexpr E operator~( E x ) { return E( ~UnderlyingType< E >( x ) ); }

GCC, clang and MSVC all support the same intrinsics, so no need for compiler specific code or the STL.
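Here is a minimal usage sketch. It is a variant, not the code above: the concept is swapped for a plain SFINAE helper so it should also build before C++20. IsEnumTrait, EnableIf and IfEnum are names made up for this example, as are the Fruit and Access enums.

```cpp
// Same intrinsics as the post; the SFINAE helper stands in for the concept (assumption: pre-C++20 build)
template< typename E > struct IsEnumTrait { static constexpr bool value = __is_enum( E ); };
template< bool B, typename T > struct EnableIf { };
template< typename T > struct EnableIf< true, T > { using Type = T; };
template< typename E > using IfEnum = typename EnableIf< IsEnumTrait< E >::value, E >::Type;
template< typename E > using UnderlyingType = __underlying_type( E );

template< typename E, typename = IfEnum< E > > void operator++( E & x, int ) { x = E( UnderlyingType< E >( x ) + 1 ); }
template< typename E > constexpr IfEnum< E > operator&( E lhs, E rhs ) { return E( UnderlyingType< E >( lhs ) & UnderlyingType< E >( rhs ) ); }
template< typename E > constexpr IfEnum< E > operator|( E lhs, E rhs ) { return E( UnderlyingType< E >( lhs ) | UnderlyingType< E >( rhs ) ); }
template< typename E > constexpr IfEnum< E > operator~( E x ) { return E( ~UnderlyingType< E >( x ) ); }

// Hypothetical enums for illustration
enum Fruit { Fruit_Apple, Fruit_Banana, Fruit_Cherry, Fruit_Count };
enum Access : unsigned char { Access_Read = 1, Access_Write = 2, Access_Execute = 4 };

// The result of | is an Access, so no cast is needed to store it
constexpr Access read_write = Access_Read | Access_Write;
static_assert( ( read_write & ~Access_Write ) == Access_Read, "clearing a flag stays in the enum type" );

// The loop from the top of the post, without the casts in the increment
int CountFruit() {
    int n = 0;
    for( Fruit f = Fruit( 0 ); f < Fruit_Count; f++ ) {
        n++;
    }
    return n;
}
```

CountFruit walks Apple, Banana, Cherry and stops at Fruit_Count.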

7 Jun 2024 • C++ tricks: STL-free type traits

I ended up not actually using these but it seemed a shame to throw them away so here they are for copy pasterity.

underlying_type:

template< typename T >
constexpr bool IsSigned() { return int( T( -1 ) ) == -1; }

template< size_t N, bool Signed > struct MakeIntType;
template<> struct MakeIntType< 1, true > { using T = s8; };
template<> struct MakeIntType< 2, true > { using T = s16; };
template<> struct MakeIntType< 4, true > { using T = s32; };
template<> struct MakeIntType< 8, true > { using T = s64; };
template<> struct MakeIntType< 1, false > { using T = u8; };
template<> struct MakeIntType< 2, false > { using T = u16; };
template<> struct MakeIntType< 4, false > { using T = u32; };
template<> struct MakeIntType< 8, false > { using T = u64; };

template< typename E >
using UnderlyingType = MakeIntType< sizeof( E ), IsSigned< E >() >::T;

ADDENDUM: that breaks in some cases, the __underlying_type intrinsic works on all compilers.
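One case where it seems to break, assuming a 32-bit int: enums backed by an unsigned type that is 32 bits or wider. T( -1 ) becomes all ones, and converting that back to int wraps around to -1 on the usual compilers, so IsSigned reports true. A small check, with made-up enum names and cstdint types standing in for u8/u32/s8:

```cpp
#include <cstdint>

template< typename T >
constexpr bool IsSigned() { return int( T( -1 ) ) == -1; }

// Hypothetical enums for illustration
enum Small : uint8_t { Small_A };
enum Wide : uint32_t { Wide_A };
enum Negative : int8_t { Negative_A };

static_assert( !IsSigned< Small >(), "255 != -1, so this one is detected as unsigned" );
static_assert( IsSigned< Negative >(), "detected as signed" );

// Wide( -1 ) is 0xffffffff, which converts back to -1 as a 32-bit int
static_assert( IsSigned< Wide >(), "misdetected as signed" );

// The intrinsic gets it right: -1 converted to the real underlying type is positive
using WideBase = __underlying_type( Wide );
static_assert( WideBase( -1 ) > 0, "underlying type is unsigned" );
```

With this detection MakeIntType would pick s32 for Wide rather than u32.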

numeric_limits:

template< typename T > constexpr T MaxInt;
template<> constexpr u8  MaxInt< u8  > = U8_MAX;
template<> constexpr u16 MaxInt< u16 > = U16_MAX;
template<> constexpr u32 MaxInt< u32 > = U32_MAX;
template<> constexpr u64 MaxInt< u64 > = U64_MAX;

21 Feb 2024 • sem_postmany

On Windows, ReleaseSemaphore lets you raise/post a semaphore n times with a single call, the idea being that the OS can implement it more efficiently than n syscalls in a loop. Indeed on Linux the underlying syscall has that functionality: futex can (amongst other things) make a thread wait for some token to be signalled, and wake some number of threads blocking on that token. So naturally they never updated the userspace API and you can only call sem_post in a loop.

Some years ago I decided to fill the gap myself. I had the delusions that both someone might want to pay money for this and that I would be able to find them and sell it to them, so I didn't read the glibc sources and instead figured it out with printf/strace/musl sources, making this code not GPL. That said you probably don't want to use this anyway because if you're using semaphores you already don't care about supermax performance.

struct GlibcSemaphore {
        // musl sets this to -1 if count == 0 && waiters > 0
        s32 saved_wakes;

        s32 waiters;

        // if pshared in sem_init is 0 then shared is 0. otherwise 128
        // musl is the other way around
        // used to set FUTEX_PRIVATE_FLAG, which was introduced in Linux 2.6.22
        s32 shared;
};

void sem_postmany( sem_t * sem, int n ) {
        GlibcSemaphore * gsem = ( GlibcSemaphore * ) sem;
        int old = __atomic_fetch_add( &gsem->saved_wakes, n, __ATOMIC_ACQ_REL );
        int waiters = gsem->waiters;
        int extra_wakes = Min2( waiters - old, n );
        if( extra_wakes > 0 ) {
                int op = gsem->shared == 0 ? FUTEX_WAKE_PRIVATE : FUTEX_WAKE;
                syscall( SYS_futex, &gsem->saved_wakes, op, extra_wakes );
        }
}
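The only arithmetic in there is the wake count, so here it is pulled out into a pure function with a few worked cases. WakesNeeded is a hypothetical helper, not part of the original, and the comments are my reading of the code above rather than anything glibc documents.

```cpp
// Hypothetical helper: how many threads sem_postmany asks futex to wake,
// or 0 when it skips the syscall entirely.
constexpr int WakesNeeded( int old_saved_wakes, int waiters, int n ) {
    // waiters that the previously saved wakes do not already account for
    int uncovered = waiters - old_saved_wakes;
    int extra_wakes = uncovered < n ? uncovered : n; // Min2( waiters - old, n )
    return extra_wakes > 0 ? extra_wakes : 0;
}

static_assert( WakesNeeded( 0, 3, 2 ) == 2, "more waiters than posts: wake n" );
static_assert( WakesNeeded( 0, 1, 5 ) == 1, "more posts than waiters: wake every waiter" );
static_assert( WakesNeeded( 2, 1, 3 ) == 0, "saved wakes already cover the waiter: no syscall" );
static_assert( WakesNeeded( 0, 0, 4 ) == 0, "nobody waiting: no syscall" );
```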

24 Jan 2024 • All perf zero quality BC4 encoding

BC4 encoding is pretty straightforward. You find the min/max to use as endpoints and do a little bit of funny arithmetic to compute the selectors. But what if we didn't care at all about quality?

The motivation behind this is we have a lot of single channel decals in Cocaine Diesel that are mostly pure white on a pure transparent background. For example:

and we were storing them as 4 channel RGBA PNGs because that's easy for artists to work with. 99% flat PNGs compress well on disk but always eat 32 bits per pixel in VRAM, or 8x as much space as BC4. Once we accumulated a few hundred decals it started to cause problems on 1GB GPUs which would otherwise have had no issues running the game, and was also a lot slower to render than more bandwidth efficient textures would have been. At this point we hadn't settled on an asset pipeline, I was hoping we could store source assets wherever possible and compile optimised assets along with the engine in CI for release builds, because giving a build system of terrible self written compilers to non-developers is pain. So we set about converting hundreds of PNGs to BC4 at runtime.

The obvious place to start is an existing DXT compression library. There are loads, stb_dxt, squish, rgbcx, etc. This was a very long time ago and I didn't keep notes so I have no benchmarks, but needless to say it was too slow. Even at 50ms per texture it adds 15s to the game startup time when you have 300 of them, which is way too much for a game that starts in two seconds in debug builds.

A simple way to make it faster is to just not compute accurate endpoints, and instead hardcode them to 0 and 255. Then we remap alpha from [0,256) to [0,8) to get our selectors, which is dividing by 32. Finally we remap them again to the actual non-linear order BC4 uses with a LUT and pack them tightly. That looks like this:

struct BC4Block {
    u8 endpoints[ 2 ];
    u8 indices[ 6 ];
};

static BC4Block FastBC4( Span2D< const RGBA8 > rgba ) {
    BC4Block result;

    result.endpoints[ 0 ] = 255;
    result.endpoints[ 1 ] = 0;

    constexpr u8 index_lut[] = { 1, 7, 6, 5, 4, 3, 2, 0 };

    u64 indices = 0;
    for( size_t i = 0; i < 16; i++ ) {
        u64 index = index_lut[ rgba( i % 4, i / 4 ).a >> 5 ];
        indices |= index << ( i * 3 );
    }

    memcpy( result.indices, &indices, sizeof( result.indices ) );

    return result;
}

static Span2D< BC4Block > RGBAToBC4( Span2D< const RGBA8 > rgba ) {
    Span2D< BC4Block > bc4 = AllocSpan2D< BC4Block >( sys_allocator, rgba.w / 4, rgba.h / 4 );

    for( u32 row = 0; row < bc4.h; row++ ) {
        for( u32 col = 0; col < bc4.w; col++ ) {
            Span2D< const RGBA8 > rgba_block = rgba.slice( col * 4, row * 4, 4, 4 );
            bc4( col, row ) = FastBC4( rgba_block );
        }
    }

    return bc4;
}

which was better, but still slow enough to be annoying. So the next obvious thing to try is vectorising it. Each row in the source texture is four pixels at four bytes per pixel stored contiguously, which is exactly a 128-bit SSE register. We can extract the alpha channel with PSHUFB and POR, shift it down by 5 bits and mask each pixel down to its bottom 3 bits to compute the selectors, do the LUT remap with another PSHUFB, and pack the resulting 3-bit selectors with PEXT. For a single block that's:

static BC4Block FastBC4( Span2D< const RGBA8 > rgba ) {
    BC4Block result;

    result.endpoints[ 0 ] = 255;
    result.endpoints[ 1 ] = 0;

    // in practice you would lift these out and not load them over and over
    __m128i alpha_lut_row0 = _mm_setr_epi8(  3,  7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 );
    __m128i alpha_lut_row1 = _mm_setr_epi8( -1, -1, -1, -1,  3,  7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1 );
    __m128i alpha_lut_row2 = _mm_setr_epi8( -1, -1, -1, -1, -1, -1, -1, -1,  3,  7, 11, 15, -1, -1, -1, -1 );
    __m128i alpha_lut_row3 = _mm_setr_epi8( -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  3,  7, 11, 15 );
    __m128i lut = _mm_setr_epi8( 1, 7, 6, 5, 4, 3, 2, 0, 9, 10, 11, 12, 13, 14, 15, 16 );
    __m128i mask = _mm_set1_epi8( 7 );

    __m128i row0 = _mm_load_si128( ( const __m128i * ) rgba.row( 0 ).ptr );
    __m128i row1 = _mm_load_si128( ( const __m128i * ) rgba.row( 1 ).ptr );
    __m128i row2 = _mm_load_si128( ( const __m128i * ) rgba.row( 2 ).ptr );
    __m128i row3 = _mm_load_si128( ( const __m128i * ) rgba.row( 3 ).ptr );

    __m128i block = _mm_or_si128(
        _mm_or_si128( _mm_shuffle_epi8( row0, alpha_lut_row0 ), _mm_shuffle_epi8( row1, alpha_lut_row1 ) ),
        _mm_or_si128( _mm_shuffle_epi8( row2, alpha_lut_row2 ), _mm_shuffle_epi8( row3, alpha_lut_row3 ) )
    );

    __m128i high_bits = _mm_and_si128( _mm_srli_epi64( block, 5 ), mask );

    __m128i selectors = _mm_shuffle_epi8( lut, high_bits );

    u64 packed0 = _pext_u64( _mm_extract_epi64( selectors, 0 ), 0x0707070707070707_u64 );
    u64 packed1 = _pext_u64( _mm_extract_epi64( selectors, 1 ), 0x0707070707070707_u64 );
    u64 packed = packed0 | ( packed1 << 24 );

    memcpy( result.indices, &packed, sizeof( result.indices ) );

    return result;
}
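If PEXT is unfamiliar, a slow portable stand-in makes the packing step easier to see. This is only a sketch for illustration: SoftwarePext and PackSelectors are made-up names, and the real code should keep using _pext_u64.

```cpp
#include <cstdint>

// Portable stand-in for _pext_u64: gathers the bits of src selected by mask
// into the low bits of the result, preserving their order.
constexpr uint64_t SoftwarePext( uint64_t src, uint64_t mask ) {
    uint64_t result = 0;
    int out_bit = 0;
    for( int bit = 0; bit < 64; bit++ ) {
        if( ( mask >> bit ) & 1 ) {
            result |= ( ( src >> bit ) & 1 ) << out_bit;
            out_bit++;
        }
    }
    return result;
}

// Same combination as packed0 | ( packed1 << 24 ) in FastBC4
constexpr uint64_t PackSelectors( uint64_t low_rows, uint64_t high_rows ) {
    return SoftwarePext( low_rows, 0x0707070707070707 ) | ( SoftwarePext( high_rows, 0x0707070707070707 ) << 24 );
}

// Eight selectors, one per byte, lowest byte first: 0, 1, 2, ..., 7.
// Keeping the low 3 bits of each byte gives 24 contiguous bits, octal 76543210.
static_assert( SoftwarePext( 0x0706050403020100, 0x0707070707070707 ) == 0xFAC688, "8 bytes pack into 24 bits" );

// Two halves of all-7 selectors fill exactly the 48 bits of BC4Block::indices
static_assert( PackSelectors( 0x0707070707070707, 0x0707070707070707 ) == 0xFFFFFFFFFFFF, "16 selectors pack into 48 bits" );
```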

Again I have no benchmarks, but this was... better again but still too slow. It ended up being a fun experiment, although ultimately not good enough, and now we store the PNGs in a separate source assets repo, compress them with rgbcx (i.e. a good BC4 encoder) and zstd, and copy the .dds.zst textures by hand to the main repo. It's not great but also really not that bad and automating it fully would be more trouble than it's worth for now.

We did keep the non-SIMD FastBC4 around for faster iterations when adding new textures, but nothing ever goes through it in release builds.
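To sanity check the index_lut in FastBC4, here is a sketch of the decode side. It assumes the standard BC4 palette for the endpoint0 > endpoint1 mode and uses plain integer division, so a GPU may round the interpolated values slightly differently. DecodeBC4Index, RoundTrip and IsMonotonic are names made up for this example.

```cpp
#include <cstdint>

// BC4 palette when endpoint0 > endpoint1: index 0 is endpoint0, index 1 is endpoint1,
// and indices 2..7 step from endpoint0 down towards endpoint1 in sevenths.
constexpr int DecodeBC4Index( int endpoint0, int endpoint1, int index ) {
    if( index == 0 ) return endpoint0;
    if( index == 1 ) return endpoint1;
    return ( ( 8 - index ) * endpoint0 + ( index - 1 ) * endpoint1 ) / 7;
}

constexpr uint8_t index_lut[] = { 1, 7, 6, 5, 4, 3, 2, 0 };

// What an 8-bit alpha decodes to after FastBC4's quantisation with endpoints 255 and 0
constexpr int RoundTrip( int alpha ) {
    return DecodeBC4Index( 255, 0, index_lut[ alpha >> 5 ] );
}

// Brighter input should never decode to something darker
constexpr bool IsMonotonic() {
    for( int alpha = 1; alpha < 256; alpha++ ) {
        if( RoundTrip( alpha ) < RoundTrip( alpha - 1 ) ) return false;
    }
    return true;
}

static_assert( RoundTrip( 0 ) == 0, "fully transparent stays transparent" );
static_assert( RoundTrip( 255 ) == 255, "fully opaque stays opaque" );
static_assert( RoundTrip( 128 ) == 145, "mid grey lands on 4/7 of 255" );
static_assert( IsMonotonic(), "the LUT undoes BC4's non-linear index order" );
```

The extremes come back exactly, which is what matters for mostly flat white-on-transparent decals.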

22 Jan 2024 • C++ tricks: NonRAIIDynamicArray

In Cocaine Diesel our DynamicArray class takes an Allocator in its constructor. One of our allocators is just malloc plus std::map< void *, AllocInfo > for leak checking, which we initialise statically. So it didn't take us long to run into the standard static initialisation order problem when we wanted to use file-scoped arrays.

At first we added a special constructor that put the array into a non-RAII mode, a trick I learned at Umbra:

enum NoInit { NO_INIT };

class DynamicArray {
public:
    DynamicArray( NoInit ) { ... }
};

To make this work you need to add explicit init/shutdown methods, and also not run the normal destructor because static destructors have the same problem. It works but it always seemed kind of clunky to me.

Another problem we ran into was that returning arrays from functions was always kind of annoying. Flipping the RVO/malloc coin is not good enough, so we had to write such functions like f( DynamicArray< int > * results ) { ... } which is also annoying.

So I split DynamicArray into NonRAIIDynamicArray and DynamicArray. By itself that would honestly be a nice trick and worthy of a post already, but C++ also lets you change the visibility of inherited members which makes the implementation very concise. So you can make DynamicArray derive from NonRAIIDynamicArray, hide the explicit init/shutdown methods, and add a constructor/destructor like so:

template< typename T >
class DynamicArray : public NonRAIIDynamicArray< T > {
    using NonRAIIDynamicArray< T >::init;
    using NonRAIIDynamicArray< T >::shutdown;

public:
    NONCOPYABLE( DynamicArray );

    DynamicArray( Allocator * a, size_t initial_capacity = 0 ) {
        init( a, initial_capacity );
    }

    ~DynamicArray() {
        shutdown();
    }
};

With this static dynamic arrays work with no magic, and you can write functions that return n results pretty much how you would in any scripting language:

Span< int > Square( Allocator * a, Span< const int > xs ) {
    NonRAIIDynamicArray< int > squares( a );
    for( int x : xs ) {
        squares.add( x * x );
    }
    return squares.span();
}

Cleanup is easy too with defer, or non-existent if you use an auto-freeing temp allocator.
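A cut-down sketch of the split, to show the visibility trick in isolation. It is not the real implementation: NonRAIIArray and RAIIArray are stand-in names, it uses malloc/realloc instead of an Allocator, skips error checking, and only handles trivially copyable types.

```cpp
#include <stddef.h>
#include <stdlib.h>

// Stand-in for NonRAIIDynamicArray: explicit init/shutdown, no constructor or destructor
template< typename T >
class NonRAIIArray {
public:
    void init() { elems = nullptr; n = 0; capacity = 0; }
    void shutdown() { free( elems ); init(); }

    void add( const T & x ) {
        if( n == capacity ) {
            capacity = capacity == 0 ? 4 : capacity * 2;
            elems = ( T * ) realloc( elems, capacity * sizeof( T ) ); // no NULL check in this sketch
        }
        elems[ n++ ] = x;
    }

    size_t size() const { return n; }
    const T & operator[]( size_t i ) const { return elems[ i ]; }

private:
    T * elems;
    size_t n;
    size_t capacity;
};

// Stand-in for DynamicArray
template< typename T >
class RAIIArray : public NonRAIIArray< T > {
    // class members are private by default, so these hide init/shutdown from callers
    using NonRAIIArray< T >::init;
    using NonRAIIArray< T >::shutdown;

public:
    RAIIArray() { init(); }
    ~RAIIArray() { shutdown(); }
};

// File-scope array: no constructor runs, so there is no initialisation order to get wrong
static NonRAIIArray< int > global_numbers;

int SumOfSquares() {
    global_numbers.init();
    RAIIArray< int > squares;
    for( int i = 1; i <= 3; i++ ) {
        global_numbers.add( i );
        squares.add( i * i );
    }
    // squares.shutdown(); // would not compile, shutdown is private in RAIIArray

    int sum = 0;
    for( size_t i = 0; i < squares.size(); i++ ) {
        sum += squares[ i ];
    }
    global_numbers.shutdown();
    return sum; // squares frees itself here
}
```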

22 Jan 2024 • C++ tricks: defer

I was writing another tricks post and was shocked to find I hadn't already posted this. It's super useful! C++ purists would call it heresy and tell you to use RAII, but in practice I've found RAII gets in the way as much as it helps, which is annoying and no good.

Credit to either Jonathan Blow or Ignacio Castaño, since this is either lifted from jblow's MSVC finder or this blog post.

// you probably have these macros already
#define CONCAT_HELPER( a, b ) a##b
#define CONCAT( a, b ) CONCAT_HELPER( a, b )
#define COUNTER_NAME( x ) CONCAT( x, __COUNTER__ )

template< typename F >
struct ScopeExit {
    ScopeExit( F f_ ) : f( f_ ) { }
    ~ScopeExit() { f(); }
    F f;
};

struct DeferHelper {
    template< typename F >
    ScopeExit< F > operator+( F f ) { return f; }
};

#define defer [[maybe_unused]] const auto & COUNTER_NAME( DEFER_ ) = DeferHelper() + [&]()

[[maybe_unused]] is to shut clang up. Use it like so:

void f() {
    void * p = malloc( 1 );
    defer { free( p ); };
}
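One thing worth knowing is the order things run in when a scope has several defers. The sketch below repeats the macro so it stands alone; DeferOrder is a made-up function that records the order as decimal digits.

```cpp
#define CONCAT_HELPER( a, b ) a##b
#define CONCAT( a, b ) CONCAT_HELPER( a, b )
#define COUNTER_NAME( x ) CONCAT( x, __COUNTER__ )

template< typename F >
struct ScopeExit {
    ScopeExit( F f_ ) : f( f_ ) { }
    ~ScopeExit() { f(); }
    F f;
};

struct DeferHelper {
    template< typename F >
    ScopeExit< F > operator+( F f ) { return f; }
};

#define defer [[maybe_unused]] const auto & COUNTER_NAME( DEFER_ ) = DeferHelper() + [&]()

// Each step appends its own digit, so the result spells out the execution order
int DeferOrder() {
    int order = 0;
    {
        defer { order = order * 10 + 1; };
        defer { order = order * 10 + 2; };
        order = order * 10 + 3;
    }
    return order;
}
```

The scope body runs first, then the defers in reverse order of declaration, same as destructors, so DeferOrder returns 321.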

27 Dec 2023 • Configuring launchd scheduled tasks with Nix home manager

macOS has a thing called launchd which amongst other things can run scripts periodically. I use it to schedule backups and auto-update youtube-dl. I thought it would be nice to manage that through home manager so it's one less thing to forget about. You can do it out of the box and it is ostensibly documented in places like this but that's hard to turn into something that actually works, so here are some examples:

launchd.agents = {
    restic = {
        enable = true;
        config = {
            Program = /Users/mike/.bin/run-or-notify;
            ProgramArguments = [ "/Users/mike/.bin/run-or-notify" "restic-snapshot failed!" "/Users/mike/.bin/restic-snapshot" ];
            StartInterval = 21600;
            EnvironmentVariables.PATH = "${pkgs.restic}/bin:/usr/bin";
            StandardErrorPath = "/Users/mike/restic-stderr.txt"; # these lines are how you debug stuff with launchd
            StandardOutPath = "/Users/mike/restic-stdout.txt";
        };
    };

    yt-dlp = {
        enable = true;
        config = {
            Program = /Users/mike/.bin/yt-dlp;
            ProgramArguments = [ "/Users/mike/.bin/yt-dlp" "-U" ];
            StartInterval = 86400;
        };
    };
};

Everything inside config is standard launchd stuff which is documented in man launchd.plist. StartInterval Just Works if you put your laptop to sleep etc which is nice.

run-or-notify is a script that runs a program and pops up a notification with its output if it fails:

#! /bin/sh

title="$1"
shift
output="$("$@" 2>&1)"
err="$?"

if [ "$err" -ne 0 ]; then
        osascript -e "display notification \"Output: $output\" with title \"$title\""
        exit "$err"
fi

30 Nov 2023 • C++ tricks: STL-free initializer list

Credit for this one to https://nitter.net/Donzanoid/status/1611315409596071936.

C++'s initializer_list is dogshit:

> cat a.cpp
#include <initializer_list>
> cl.exe /std:c++20 a.cpp /P > /dev/null; and wc -l a.i
Microsoft (R) C/C++ Optimizing Compiler Version 19.36.32534 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

a.cpp
8988 a.i
> g++ -E a.cpp -std=c++20 | wc -l
115
> clang++ -E a.cpp -std=c++20 | wc -l
393

There's actually no compiler magic involved here and initializer_list is pretty much just span so very simple to implement, but you need a few tricks to get it to actually work:

GCC doesn't let you do this at all
MSVC/clang have different constructors
You need to define some header guards so stray STL headers don't try to redefine it

On that last point you may be wondering why bother with this at all. You can't keep a codebase completely STL-free, be it because of certain things where the non-STL way is extremely painful (e.g. MSVC atomics), or unavoidable third party libraries pulling in STL headers (e.g. metal-cpp). So we need to make this not clash with the real thing.

#if COMPILER_GCC

// GCC refuses to compile if the implementation doesn't exactly match the STL
#include <initializer_list>

#else

#define _INITIALIZER_LIST_ // MSVC
#define _LIBCPP_INITIALIZER_LIST // clang

namespace std {
    template< typename T >
    class initializer_list {
    public:

#if COMPILER_MSVC
        constexpr initializer_list( const T * first_, const T * one_after_end_ ) : first( first_ ), one_after_end( one_after_end_ ) { }
#else
        constexpr initializer_list( const T * first_, size_t n ) : first( first_ ), one_after_end( first_ + n ) { }
#endif

        const T * begin() const { return first; }
        const T * end() const { return one_after_end; }
        size_t size() const { return one_after_end - first; }

    private:
        const T * first;
        const T * one_after_end;
    };
}

#endif

You may also want to add your own header guards against the STL defines, or you can just make sure you always include this first.
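For completeness, a sketch of the call sites that depend on this type. It includes the standard header so that it builds on GCC too; on MSVC/clang the replacement above would presumably be included first instead. SumOf and SmallIntArray are hypothetical names.

```cpp
#include <initializer_list>
#include <stddef.h>

// Only begin() and end() are needed, which both the real class and the replacement provide
int SumOf( std::initializer_list< int > xs ) {
    int sum = 0;
    for( const int * x = xs.begin(); x != xs.end(); x++ ) {
        sum += *x;
    }
    return sum;
}

// Hypothetical fixed-capacity container, only here to show brace initialisation
struct SmallIntArray {
    int elems[ 8 ];
    size_t n;

    SmallIntArray( std::initializer_list< int > xs ) : elems(), n( 0 ) {
        for( int x : xs ) {
            if( n < 8 ) {
                elems[ n++ ] = x;
            }
        }
    }
};
```

The compiler builds the backing array and the initializer_list itself whenever it sees a braced list at one of these call sites, e.g. SumOf( { 1, 2, 3 } ) or SmallIntArray a = { 4, 5 };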

29 Nov 2023 • Realloc and arena allocators

If you combine a dynamic array with an arena allocator, you get a dynamically growing buffer on top of a dynamically growing buffer. The easiest way to implement realloc is as malloc + memcpy + free, but that behaves sub-optimally here. Unfortunately the standard realloc interface prevents you from doing any better. You could fix that by storing extra metadata, specifically the allocation size, along with your allocations to determine if a given pointer is the topmost allocation, but arenas are meant to be very simple and that isn't. Instead, we can rely on realloc users necessarily tracking the allocation size anyway and make it part of the interface.

Let's start with a general allocator interface:

struct Allocator {
    virtual void * allocate( size_t size, size_t alignment ) = 0;
    void * reallocate( void * old_ptr, size_t new_size, size_t alignment ) {
        // allocate memcpy deallocate
    }
    virtual void deallocate( void * ptr ) = 0;
};

struct ArenaAllocator : public Allocator {
    u8 * memory;
    u8 * memory_end; // memory + size
    u8 * top;

    void * allocate( size_t size, size_t alignment ) {
        Assert( IsPowerOf2( alignment ) );
        u8 * aligned = ( u8 * ) ( size_t( top + alignment - 1 ) & ~( alignment - 1 ) );
        if( aligned + size > memory_end ) {
            abort();
        }
        top = aligned + size;
        return aligned;
    }

    void deallocate( void * ptr ) { }
};

And DynamicArray:

template< typename T >
struct DynamicArray {
    Allocator * a;
    size_t n = 0;
    size_t capacity = 0;
    T * elems = NULL;

    DynamicArray( Allocator * a_ ) { a = a_; }

    void add( const T & x ) {
        if( n == capacity ) {
            grow();
        }
        elems[ n ] = x;
        n++;
    }

    void grow() {
        capacity = Max2( 1, capacity * 2 );
        elems = ReallocMany< T >( a, elems, capacity );
    }
};

(Production ready code might use static dispatch instead of dynamic dispatch/split allocate into try_allocate that can return NULL and allocate that aborts/ASAN_(UN)POISON_MEMORY_REGION in both classes/SourceLocation/anything but 1 as the initial array size/etc but let's ignore all that.)
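The rounding in ArenaAllocator::allocate is the usual power-of-2 trick. Here it is on plain integers with a few worked values; AlignUp is a name made up for this sketch.

```cpp
#include <stddef.h>

// Rounds offset up to the next multiple of alignment, which must be a power of 2.
// Adding alignment - 1 overshoots into the next slot unless we were already aligned,
// then the mask clears the low bits to land on the boundary.
constexpr size_t AlignUp( size_t offset, size_t alignment ) {
    return ( offset + alignment - 1 ) & ~( alignment - 1 );
}

static_assert( AlignUp( 13, 8 ) == 16, "rounds up to the next multiple" );
static_assert( AlignUp( 16, 8 ) == 16, "already aligned values are unchanged" );
static_assert( AlignUp( 17, 16 ) == 32, "" );
static_assert( AlignUp( 5, 1 ) == 5, "alignment 1 is a noop" );
```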

Now let's see what happens when we put these together. Say we have an arena big enough to hold eight ints, and add 1 2 3 to the array. I'll use . to denote uninitialised memory and underlined text for the memory currently owned by the array.

array = [], arena = [. . . . . . . .]
add(1) -> realloc(1), array = [1],     arena = [1 . . . . . . .]
add(2) -> realloc(2), array = [1 2],   arena = [1 1 2 . . . . .]
add(3) -> realloc(4), array = [1 2 3], arena = [1 1 2 1 2 3 . .]

So we used 7 slots in the underlying buffer to represent an array with 3 out of 4 elements in use.

If we instead modify our realloc to take the old allocation size, we can check to see if we're reallocating the topmost allocation and grow it in place. Implementing that looks like this:

void * ArenaAllocator::reallocate( void * old_ptr, size_t old_size, size_t new_size, size_t alignment ) {
    if( old_ptr == NULL ) {
        return allocate( new_size, alignment );
    }

    // are we the topmost allocation and sufficiently aligned?
    if( old_ptr == top - old_size && size_t( old_ptr ) % alignment == 0 ) {
        u8 * new_top = top - old_size + new_size;
        if( new_top > memory_end ) {
            abort();
        }
        top = new_top;
        return old_ptr;
    }

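    // otherwise fall back to allocate + memcpy (deallocate is a noop for arenas)
    void * new_ptr = allocate( new_size, alignment );
    memcpy( new_ptr, old_ptr, old_size < new_size ? old_size : new_size );
    return new_ptr;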
}

template< typename T >
void DynamicArray< T >::grow() {
    size_t new_capacity = Max2( 1, capacity * 2 );
    elems = ReallocMany< T >( a, elems, capacity, new_capacity );
    capacity = new_capacity;
}

Which exhibits "just" O(1) instead of amortised O(1) behaviour:

array = [], arena = [. . . . . . . .]
add(1) -> realloc(0, 1), array = [1],     arena = [1 . . . . . . .]
add(2) -> realloc(1, 2), array = [1 2],   arena = [1 2 . . . . . .]
add(3) -> realloc(2, 4), array = [1 2 3], arena = [1 2 3 . . . . .]

This all obviously breaks down and reverts to the old behaviour if you allocate anything after the array, but I've found that in practice it doesn't happen very often, and the change is simple and non-intrusive enough to be a win.
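The two traces can be reproduced with a small standalone simulation. This is a sketch under simplifying assumptions: free functions instead of the virtual Allocator interface, assert instead of abort, and made-up names (Arena, NaiveReallocate, InPlaceReallocate, SlotsUsed).

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct Arena {
    uint8_t * memory;
    uint8_t * memory_end;
    uint8_t * top;
};

void * ArenaAllocate( Arena * arena, size_t size, size_t alignment ) {
    uint8_t * aligned = ( uint8_t * ) ( ( uintptr_t( arena->top ) + alignment - 1 ) & ~uintptr_t( alignment - 1 ) );
    assert( aligned + size <= arena->memory_end );
    arena->top = aligned + size;
    return aligned;
}

// realloc as allocate + memcpy, the old allocation is simply abandoned
void * NaiveReallocate( Arena * arena, void * old_ptr, size_t old_size, size_t new_size, size_t alignment ) {
    void * new_ptr = ArenaAllocate( arena, new_size, alignment );
    if( old_ptr != nullptr ) {
        memcpy( new_ptr, old_ptr, old_size < new_size ? old_size : new_size );
    }
    return new_ptr;
}

// grows the topmost allocation in place, otherwise falls back to the naive path
void * InPlaceReallocate( Arena * arena, void * old_ptr, size_t old_size, size_t new_size, size_t alignment ) {
    if( old_ptr != nullptr && old_ptr == arena->top - old_size && uintptr_t( old_ptr ) % alignment == 0 ) {
        uint8_t * new_top = arena->top - old_size + new_size;
        assert( new_top <= arena->memory_end );
        arena->top = new_top;
        return old_ptr;
    }
    return NaiveReallocate( arena, old_ptr, old_size, new_size, alignment );
}

using ReallocateFn = void * ( * )( Arena *, void *, size_t, size_t, size_t );

// Grows an int array through capacities 1, 2, 4 in an arena of eight ints
// and returns how many int-sized slots of the arena were consumed.
size_t SlotsUsed( ReallocateFn reallocate ) {
    alignas( int ) uint8_t buffer[ 8 * sizeof( int ) ] = { };
    Arena arena = { buffer, buffer + sizeof( buffer ), buffer };

    int * elems = nullptr;
    size_t capacity = 0;
    for( size_t new_capacity = 1; new_capacity <= 4; new_capacity *= 2 ) {
        elems = ( int * ) reallocate( &arena, elems, capacity * sizeof( int ), new_capacity * sizeof( int ), alignof( int ) );
        capacity = new_capacity;
    }

    return size_t( arena.top - arena.memory ) / sizeof( int );
}
```

SlotsUsed( NaiveReallocate ) comes out as 7 and SlotsUsed( InPlaceReallocate ) as 4, matching the two diagrams.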

29 Nov 2023 • C++ tricks: Production Ready TM aligned malloc

People generally implement aligned malloc by adding some metadata at ptr[-1] pointing to the actual allocation. In this post I will show you a simpler way.

void * Allocate( size_t size, size_t alignment ) {
    Assert( alignment <= 16 );
    return malloc( size );
}

This works because:
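Presumably the reasoning is that malloc already has to return memory aligned for any fundamental type, and on the common 64-bit platforms that means 16 bytes. The 16 is an assumption about the platform rather than something the language fixes, so a quick check like this sketch (MallocIsAlignedTo is a made-up name) is worth running on anything unusual:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// True if a fresh malloc of the given size happens to satisfy the requested alignment
bool MallocIsAlignedTo( size_t size, size_t alignment ) {
    void * p = malloc( size );
    bool aligned = p != nullptr && uintptr_t( p ) % alignment == 0;
    free( p );
    return aligned;
}
```

Calling MallocIsAlignedTo( 64, alignof( max_align_t ) ) should hold everywhere. Very small allocations are a separate question, since some allocators only guarantee 8 bytes for those.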

21 Nov 2023 • C++ tricks: STL-free source location

C++20's source_location is dogshit:

> cat a.cpp
#include <source_location>
> cl.exe /std:c++20 a.cpp /P
Microsoft (R) C/C++ Optimizing Compiler Version 19.36.32534 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

a.cpp
> wc -l a.i
6736 a.i
> g++ -E a.cpp -std=c++20 | wc -l
99
> clang++ -E a.cpp -std=c++20 | wc -l
460

but the compilers were kind enough to all use the same intrinsics, so the DIY implementation is trivial:

// zero headers required!
struct SourceLocation {
    const char * file;
    int line;
    const char * function;
};

constexpr SourceLocation CurrentSourceLocation( const char * file_ = __builtin_FILE(), int line_ = __builtin_LINE(), const char * function_ = __builtin_FUNCTION() ) {
    return {
        .file = file_,
        .line = line_,
        .function = function_,
    };
}

Now you can throw out your old allocation macros:

// before
#define ALLOC( a, T ) ( ( T * ) ( a )->allocate( sizeof( T ), alignof( T ), __PRETTY_FUNCTION__, __FILE__, __LINE__ ) )

// after
template< typename T >
T * Alloc( Allocator * a, SourceLocation src = CurrentSourceLocation() ) {
    return ( T * ) a->allocate( sizeof( T ), alignof( T ), src );
}

and also update helper functions to have useful source info:

template< typename T >
Span< T > CloneSpan( Allocator * a, Span< T > span, SourceLocation src = CurrentSourceLocation() ) {
    Span< T > copy = AllocSpan< T >( a, span.n, src );
    memcpy( copy.ptr, span.ptr, span.num_bytes() );
    return copy;
}

// const variant because templates don't do implicit casts
template< typename T >
Span< T > CloneSpan( Allocator * a, Span< const T > span, SourceLocation src = CurrentSourceLocation() ) {
    Span< T > copy = AllocSpan< T >( a, span.n, src );
    memcpy( copy.ptr, span.ptr, span.num_bytes() );
    return copy;
}
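A quick standalone check that the defaulted argument really reports the call site and not the helper's own location. The struct is repeated so the sketch compiles by itself, with positional initialisation instead of designated initialisers so it doesn't need C++20. WhoCalledMe and ReportsCallSite are hypothetical names.

```cpp
#include <string.h>

struct SourceLocation {
    const char * file;
    int line;
    const char * function;
};

constexpr SourceLocation CurrentSourceLocation( const char * file_ = __builtin_FILE(), int line_ = __builtin_LINE(), const char * function_ = __builtin_FUNCTION() ) {
    return { file_, line_, function_ };
}

// Hypothetical helper: the default argument is evaluated at each call site
SourceLocation WhoCalledMe( SourceLocation src = CurrentSourceLocation() ) {
    return src;
}

bool ReportsCallSite() {
    int expected_line = __LINE__ + 1;
    SourceLocation src = WhoCalledMe();
    return src.line == expected_line && strcmp( src.function, "ReportsCallSite" ) == 0;
}
```

This is the same pattern Alloc and CloneSpan rely on: the builtins are expanded where the outermost defaulted argument is filled in.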

18 Nov 2023 • 2023 Windows post-install checklist

This is a guide on how to set up Windows to not be annoying. You should set aside a few hours to go through everything. Most steps, but not all, are detailed enough that you can autopilot your way through it.

Initial setup and disabling security features

Install the correct version. You want Windows 10 IoT Enterprise LTSC 2021 (21H2), which is pretty stripped down out of the box and doesn't get feature updates. LTSC 2021 is more or less a strict downgrade from LTSC 2019 (1809) because you can't disable Windows Defender which tanks perf hard. You can't install Visual Studio on LTSC 2019 anymore and none of the solutions for installing on older versions of Windows actually work, so LTSC 2021 it is.
In the installer, do domain join instead of creating a web account. Say no to all the location/telemetry garbage.
Click the start button, Settings, Update & Security, Windows Update, Check for updates. Don't reboot yet.
Start, Edge, download another browser, e.g. Vivaldi.
Install Search Everything. Sort by descending run count, and close window on execute. Right click on things and set run count to seed them to appear at the top.
Install AutoHotKey. Put a shortcut to your AHK script in %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup. See below for an example.
Install video drivers. If you have an NVIDIA GPU, use NVCleanstall because it's easier and lets you skip GeForce Experience, unlike the official installer. If you used the official installer you can just uninstall it afterwards anyway.
Open Control Panel (search for it in the start menu or Control Panel.lnk/control.exe in Everything):
- User Accounts, User Accounts (again), Change User Account Control settings, disable it.
- Programs, Turn Windows features on or off, check Windows Subsystem for Linux, this takes ages, don't reboot yet.
- System and Security, Windows Defender Firewall, Turn Windows Defender Firewall on or off, turn it off. Don't kill the service or it breaks the Windows Store, which then breaks WSL.
- System and Security, System, Rename this PC.
- System and Security, Security and Maintenance, Change Security and Maintenance settings. Turn off all the security messages.
Open settings, Update and Security, Windows Security, Virus & threat protection, Virus & threat protection settings, Manage settings, disable everything, including tamper protection.
gpedit.msc, Computer Configuration, Administrative Templates, System, Power Management, Sleep settings, Require a Password when the computer wakes, Disabled.
secpol.msc, Local Policies, Security Options, UAC: Run all administrators in Admin Approval Mode, Disabled.
devmgmt.msc, Mice and other pointing devices, [your mouse], Power Management, uncheck Allow this device to wake the computer. This stops moving the mouse from waking your PC from hibernate.
Ctrl+alt+del, More details, Startup, disable Windows Security notification icon and Microsoft Edge.
Download the Sysinternals Suite to use in a moment.
Reboot into safe mode: Ctrl+Escape to open the start menu, power, shift click reboot, in the startup menu go Troubleshoot, Advanced Options, Startup Settings, Restart. Then select safe mode when it comes up again. If that doesn't work, Start your PC in safe mode in Windows.
Run autoruns from the Sysinternals Suite you just downloaded. Uncheck the Hide Microsoft/Windows entries "checkboxes" in the top bar, then disable the following:
- WSearch (Windows Search)
- WpnUserService and the one with an _asdf suffix (Windows Push Notification System Service/Windows Push Notification User Service). This is the only way to disable the firewall spam popups.
- SysMain (used to be called Superfetch)
- Sense/WdNisSvc/WinDefend (Windows Defender). This is quite pointless because it just turns itself back on.
- wscsvc (Security Center)
Reboot back to normal mode.

Disable everything else

Make a documents folder somewhere that isn't My Documents. Too much software uses it as a dumping ground so it's not really usable for its intended purpose nowadays. I used C:\Users\mike\Mike and psubst that to X:\, some people use stuff like C:\Docs.
Win+E, View, Options, View. Check Show hidden files, folders and drives. Uncheck Hide empty drives. Uncheck Hide extensions for known file types. Uncheck Hide protected operating system files. Go down to Navigation pane, check Expand to open folder.
Right click the desktop, Personalize, go through all of it including all the links. In particular:
- Themes, Sounds, Sound Scheme, No Sounds.
- Start, disable stuff
- Taskbar, Combine taskbar buttons, Never. Put it on the left. Turn system icons on or off, disable Action Center/Input Indicator/Meet Now.
Right click the taskbar, Search, Hidden. Uncheck Show Task View button.
Control Panel (control.exe):
- System and Security, System, Advanced system settings, Performance Settings..., disable almost everything under visual effects, Advanced tab, set the pagefile size to 800MB. Go back to the Advanced system settings window, Startup and Recovery Settings..., uncheck Automatically restart if you want.
Install the MarkC mouse acceleration fix.
Open Settings:
- System:
  - Display: Configure Night light.
  - Sound: Disable all audio devices except the one you actually use. TODO blacklist the others too so they don't get re-enabled randomly
  - Notifications & actions: Off.
  - Power & sleep: Probably set these to Never and Never.
- Devices. Typing, disable everything. AutoPlay, Off.
- Time & language. Set your time zone. Click Date, time & regional formatting, Change data formats, pick what you like. Language, add UK, remove US.
- Gaming: Xbox Game Bar, Off. Game Mode, Off.
- Ease of Access. Keyboard, disable everything.
- Privacy. Go through every tab and disable basically everything.
- Update & security. For developers in the sidebar, enable Developer Mode, disable Remote Desktop, disable never sleep when plugged in.
Open PowerShell and set some registry keys:
- Make startup entries run more quickly: reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Serialize" /v "StartupDelayInMSec" /t REG_DWORD /d "0" /f
- Don't show an option to search in Windows Store when opening unfamiliar file types: reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\Explorer" /v "NoUseStoreOpenWith" /t REG_DWORD /d "1" /f
- Disable window previews in the taskbar: reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced" /v "ExtendedUIHoverTime" /t REG_DWORD /d "30000" /f
- Disable the searching for a solution to this problem popup: reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting" /v "Disabled" /t REG_DWORD /d "1" /f
- Hide the Music folder: reg delete "HKLM\Software\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{3dfdf296-dbec-4fb4-81d1-6a3438bcf4de}" /f and reg delete "HKLM\Software\Wow6432Node\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{3dfdf296-dbec-4fb4-81d1-6a3438bcf4de}" /f
- Hide the Pictures folder: reg delete "HKLM\Software\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{24ad3ad4-a569-4530-98e1-ab02f9417aa8}" /f and reg delete "HKLM\Software\Wow6432Node\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{24ad3ad4-a569-4530-98e1-ab02f9417aa8}" /f
- Hide the Videos folder: reg delete "HKLM\Software\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{f86fa3ab-70d2-4fc7-9c99-fcbf05467f3a}" /f and reg delete "HKLM\Software\Wow6432Node\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{f86fa3ab-70d2-4fc7-9c99-fcbf05467f3a}" /f
- Hide the 3D Objects folder: reg delete "HKLM\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{0DB7E03F-FC29-4DC6-9020-FF41B59E513A}" /f and reg delete "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\MyComputer\NameSpace\{0DB7E03F-FC29-4DC6-9020-FF41B59E513A}" /f

WSL

Install WSLtty. It is the only reasonable terminal emulator on Windows. I have tried them all, everything else suffers from at least one of:
- Incapable of rendering text (alacritty, wezterm, ...)
- Drops to below 1FPS under normal use (Windows Terminal, ...)
- Clipboard issues (the old X server + Linux terminal trick)
- Renders itself inoperable under normal use (Windows Terminal)
If you pin it to the taskbar, right click the icon > right click WSL Terminal > Properties, set the icon to C:\Windows\System32\cmd.exe.
Install AlpineWSL. It has pretty comprehensive repos and comes with the least garbage (5MB!). Actually NixOS is the only Linux distro that won't irreparably self destruct under normal use and you should use that, but I haven't figured it out on WSL yet.
Run apk update; apk add openssh; ssh-keygen -t ed25519; ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -t ed25519. Disable PasswordAuthentication in /etc/ssh/sshd_config. Set up authorized_keys. Create sshd.vbs somewhere:
```
WScript.CreateObject( "shell.application" ).ShellExecute "WSL", "/usr/sbin/sshd", "", "open", 0
```
and copy a shortcut to %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup.
apk add bind-tools coreutils ctags curl fish fzf git grep htop less man man-pages mdocml-apropos p7zip the_silver_searcher tig tmux tree vim whois

Software I like

7-Zip. Go into settings and associate it with everything that isn't zip. Disable all the junk context menu items.
Create halt.bat somewhere containing shutdown /s /t 0. Create reboot.bat containing shutdown /r /t 0. Create hibernate.bat containing shutdown /h. Use Everything to run these.
Search Everything. Sort by descending run count, and close window on execute. Right click on things and set run count to seed them to appear at the top. halt.bat, reboot.bat, Control Panel.lnk, Snipping Tool.lnk, vivaldi.exe, etc.
Start Killer. Use Everything as a launcher instead.
AltBacktick.
Windows Auto Dark Mode.
Clink.
Dina font.
Download psubst. psubst X: C:\Users\mike\Mike /P.
Sumatra PDF.
Syncthing.
Obsidian.
Shairport4w. Airplay to your PC.
Apple Music for Windows 10. You need to modify setup.bat a bit to not kill any processes or install any dependencies and just install Apple Music. Maybe also install it to Program Files and not Documents.
mpv. Put yt-dlp.exe in the same folder. Put sponsorblock_minimal.lua in mpv/scripts. Periodically run yt-dlp -U.
ShareX. 21h1 LTSC nukes Win+Shift+S, you can point AutoHotKey at ShareX to do the same thing (see below).

Dev tools

Visual Studio 2022. Install C++ build tools, Windows 11 SDK, and HLSL tools. JIT debugging is quite broken in VS22 so just use RemedyBG.
RemedyBG. See this Github issue to register it as a JIT debugger.
Install Microsoft Store. Install WinDBG preview.
Git.
CMake, sadly.
NSIS.
Intel Architecture Code Analyzer.
Vulkan SDK.
Renderdoc.
Nsight Graphics.
Blender.
Zeal. Make a hotkey in your text editor to open dash://[selected text].
Path Editor. Add VS compiler stuff (cl.exe, MSBuild.exe, rc.exe, use Everything to find where they are), IACA, and NSIS to path.
Control Panel, System and Security, System, Advanced system settings, Environment Variables. Point INCLUDE and LIB at VS and the Windows SDK. I have:
INCLUDE:
- C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\include
- C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\shared
- C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\ucrt
- C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\um
- C:\Program Files (x86)\Windows Kits\10\Include\10.0.22000.0\winrt
LIB:
- C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\lib\x64
- C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\ucrt\x64
- C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22000.0\um\x64
This and the previous step are what vcvarsall.bat is supposed to do for you, except vcvarsall is extremely slow for no reason and this isn't.

startup.ahk

This script maps capslock to escape, adds some hotkeys for launching/closing programs, and doesn't open the start menu when you press the windows key.

InstallKeybdHook
#SingleInstance Force

SetCapsLockState( "Off" )
SetCapsLockState( "AlwaysOff" )
CapsLock::Escape

#e::Run( "X:\" )
#p::Run( "C:\Program Files\Everything\Everything.exe" )
#Enter::Run( "wt.exe" )
#x::WinClose( "A" )
#space::return

~LWin::Send( "{Blind}{vkE8}" )

#+s::Send( "^{PrintScreen}" ) ; then bind Ctrl+PrintScreen to "Capture region" in ShareX

#m:: { ; run "mpv.exe {clipboard}"
    Run( Format( '"C:\Program Files\mpv\mpv.exe" "{1}" "--window-maximized"', A_Clipboard ) )
    Tooltip( "mpv " . A_Clipboard )
    SetTimer( ToolTip, -1000 )
}

4 Nov 2023 • Running a business in Finland

I noticed I was getting a lot of spam from Finnish companies to firstname.lastname@. Presumably this comes from scraping the companies registry. Fastmail lets you block senders but not recipients from the GUI, so we have to nuke them from sieve instead:

### Reject firstname.lastname {{{
if anyof(
    address :matches :localpart "to" "firstname1.lastname1",
    address :matches :localpart "to" "firstname2.lastname2"
) {
    reject "haista vittu";
    stop;
}
### }}}

Put it in the top box, i.e. above the spam filtering rules, so it ends up fully nuked and not in your spam folder.

11 Sep 2023 • Least effort self-destructing email addresses with Fastmail

Fastmail has a builtin tempomail provider ("Masked Email") which works and is good, but is mostly just barely too clunky to use much. So I made a sieve filter that lets me sign up for things with e.g. blah.temp20230101@... and auto-rejects emails received after the date in the address.

### Reject tempYYYYMMDD@ after the date in the address {{{
if anyof(
	address :matches :localpart "to" "*temp????????",
	address :matches :localpart "cc" "*temp????????",
	address :matches :localpart "bcc" "*temp????????",
	address :matches :localpart "deliveredTo" "*temp????????"
) {
	if currentdate :value "gt" "date" "${2}${3}${4}${5}-${6}${7}-${8}${9}" {
		reject "This address auto-rejects emails sent after ${2}${3}${4}${5}-${6}${7}-${8}${9}";
		stop;
	}
}
### }}}

Use it by going Settings > Mail rules > Edit custom Sieve code (at the bottom), and pasting it into the top box above the regular spam filtering (if you look at the hardcoded mail rules you can see where I got the header list from). You can test it with Fastmail's sieve tester by adding require ["date", "reject", "relational", "variables"]; at the top and putting To: blahtemp20230101@blah.com in the message box.
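
The "gt" comparison works because YYYY-MM-DD strings sort in the same order as the dates they spell. Here's a rough C++ sketch of the same check, just to illustrate the logic. IsExpired and its parameters are made-up names, not anything Fastmail or Sieve provides:

```cpp
#include <stddef.h>
#include <string>

// Hypothetical helper mirroring the Sieve rule: returns true if localpart ends in
// tempYYYYMMDD and today (formatted as YYYY-MM-DD) is after that date.
bool IsExpired( const std::string & localpart, const std::string & today ) {
	const std::string marker = "temp";
	if( localpart.size() < marker.size() + 8 )
		return false;

	size_t date_start = localpart.size() - 8;
	if( localpart.compare( date_start - marker.size(), marker.size(), marker ) != 0 )
		return false;

	// same as "${2}${3}${4}${5}-${6}${7}-${8}${9}" in the Sieve rule
	std::string expiry = localpart.substr( date_start, 4 ) + "-"
		+ localpart.substr( date_start + 4, 2 ) + "-"
		+ localpart.substr( date_start + 6, 2 );

	// YYYY-MM-DD strings order the same way as the dates they represent
	return today > expiry;
}
```

Like the Sieve version, mail sent on the date in the address still gets through, and nothing checks that the eight characters are actually digits.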

Daily Mail was my previous attempt at this, but only allowing mail through on the same day was too restrictive in practice.

19 Jul 2021 • Building a userspace CSPRNG on top of Monocypher 3

Same idea as the code I wrote a few years ago, except for the latest version of Monocypher and it actually works.

The tl;dr of the last time I did this is that OS entropy APIs are annoying because they have vaguely defined failure conditions, and moving it to userspace sidesteps all of that. We still need to seed it with kernel entropy, which we'll do with ggentropy.

The code is way simpler this time:

u8 entropy[ 32 + 8 ]; // 32 byte key followed by an 8 byte nonce
u64 ctr;

bool Init() {
	if( !ggentropy( entropy, sizeof( entropy ) ) )
		return false;
	ctr = 0;
	return true;
}

void Shutdown() {
	crypto_wipe( entropy, sizeof( entropy ) );
}

void CSPRNG( void * buf, size_t n ) {
	ctr = crypto_chacha20_ctr( ( u8 * ) buf, NULL, n, entropy, entropy + 32, ctr );
}

although not foolproof:

- You need to actually check whether Init succeeded. Abort if the first init fails; if a periodic reseed fails you can maybe just print a warning.
- Thread safety is your problem.
- You need to reseed after forking on forky platforms, which you can do with pthread_atfork. This can fail, but so can fork, so it's not making things worse.

2 Dec 2020 • An equal but opposite reaction

Too many websites have adblock detection and whine banners now, let's stick it to the man.

👍 Thanks for blocking ads! 😍

⚠️ Uh-oh, you're not blocking ads! ⚠️

Please install one or more of the following:

Chrome/Edge: uBlock Origin
Firefox: uBlock Origin
iPhone/iOS: AdGuard
Android: Firefox and uBlock Origin

View source if you want to copy it. Zero JS. For licensing consider it public domain/MIT/"please distribute this as widely as possible", whatever is most convenient.

18 Aug 2020 • OpenSMTPD is excellent 2020 edition

It's been three years since I first posted this and OpenSMTPD still kicks dick. But the config format and plugin ecosystem have changed a lot since then so I thought I'd post my config again (minus some Mike stuff covered elsewhere on this blog). The biggest differences from last time are that smtpd has filters now and Jeff Bezos delivers my mail.

Behold:

pki mikejsavage.co.uk cert "/etc/ssl/mikejsavage.co.uk.fullchain.pem"
pki mikejsavage.co.uk key "/etc/ssl/private/mikejsavage.co.uk.key"

# Incoming
filter rspamd proc-exec "filter-rspamd"
listen on all tls pki mikejsavage.co.uk filter rspamd
action deliver_local maildir virtual { "@" => mike }
match from any for local action deliver_local

# Outgoing
filter "dkimsign" proc-exec "filter-dkimsign -d mikejsavage.co.uk -s dkim -k /etc/mail/dkim.mikejsavage.co.uk.key" user _dkimsign group _dkimsign
table ses_credentials file:/etc/mail/ses_credentials
listen on all port submission tls-require pki mikejsavage.co.uk auth filter dkimsign
action relay_ses relay host smtp+tls://ses@email-smtp.eu-west-1.amazonaws.com auth <ses_credentials>
match auth from any for any action relay_ses

14 Aug 2020 • C++ tricks: catching ASAN errors in a debugger

Add a breakpoint at __sanitizer::Die then try not to think about how ASAN could just call abort.

9 Jul 2020 • gg libraries index

gg libraries are little libraries that are easy to add to any project (one .cpp and one .h). So my contribution to stb land.

Library	What it does
ggformat	Better printf
ggentropy	Cross platform crypto secure entropy function
ggtime	High precision fixed point time function

9 Jul 2020 • ggtime

ggtime is a sane (i.e. not using <chrono>) implementation of Facebook's flicks library.

Basically it's a fixed point time library with roughly nanosecond precision, and whose unit divides all the FPS values we care about in games.

github

21 Jan 2020 • Daily Mail

Alternatively titled: least effort self hosted disposable mail

Email spam is widely considered to be a solved problem. If you open your inbox this is immediately obviously false. Scams and outright virus mail are long gone, but the spammers have adapted. Customer success mails, newsletters after you unchecked the box saying I want newsletters, terms of service updates when you haven't logged in for five years.

Unfortunately, this new problem is unsolvable. Also, the spammers buy DoubleClick Ads and run Google Analytics and let you sign in with Google on their websites, which are best viewed in Google Chrome.

My first idea was to drop any email containing an unsubscribe link, or ( "terms of service" || "privacy policy" ) && "update", but that seems like it might have too many false positives.

My second idea was to use the date as an email address and drop any emails sent on the wrong day. So you register with like daily21012020@ and they can only email you on the same day. The end result is like a tempo mail provider, but it's nicer because there is no UI or software. You just type the date in and emails go to your normal inbox.

I did briefly consider making a service out of this but everyone would immediately blacklist the domain and then it turns into a normal temp mail provider which you can get for free so whatever.

OpenSMTPD is great so the implementation is trivial. Drop this in /usr/local/libexec/smtpd/filter-dailymail as an executable:

#! /usr/bin/env lua

local function result( version, sid, token, data )
	if version == "0.4" then
		token, sid = sid, token
	end
	print( ( "filter-result|%s|%s|%s" ):format( sid, token, data ) )
end

for line in io.lines() do
	if line == "config|ready" then
		print( "register|filter|smtp-in|rcpt-to" )
		print( "register|ready" )
		break
	end
end

for line in io.lines() do
	local version, sid, token, rcpt = line:match( "^filter|([^|]*)|[^|]*|smtp%-in|rcpt%-to|([^|]*)|([^|]*)|(.*)" )
	if rcpt then
		local daily = rcpt:match( "^daily(%d+)@" )
		if daily and daily ~= os.date( "%d%m%Y" ) then
			result( version, sid, token, "reject|550 Email causes cancer" )
		else
			result( version, sid, token, "proceed" )
		end
	end
end

Then in smtpd.conf:

filter dailymail proc-exec "filter-dailymail"
filter rspamd proc-exec "filter-rspamd"
filter incoming-filters chain { rspamd, dailymail }

listen ... filter incoming-filters

action deliver_local maildir virtual { "@" => mike } # you probably also want something like this

26 Dec 2019 • C++ tricks: compound literals

C99 has this nice thing where you can initialise structs inline with named members, so like:

struct A { int a, b, c; };

...

struct A a = ( struct A ) { .c = 1, .a = 2 };
printf( "%d %d %d\n", a.a, a.b, a.c ); // { 2, 0, 1 }

It's useful so it's not available in C++, but we can hack it together with variadic templates and pointer-to-members:

template< typename T >
T compound_literal() {
	return T();
}

template< typename T, typename A, typename... Rest >
T compound_literal( A ( T::* m ), const A & v, const Rest & ... rest ) {
	T t = compound_literal< T >( rest... );
	t.*m = v;
	return t;
}

...

A a = compound_literal< A >( &A::c, 1, &A::a, 2 );

I assume this breaks under even minimal C++ feature usage and the errors are the worst shit of my life but this post is a joke and you shouldn't use it anyway.

31 Oct 2019 • GL tricks: glPushDebugGroup

KHR_debug can do more than just spam you every time you call glBufferData, it also has a cool thing called debug groups. They don't do anything normally but you can use them to group related draw calls into collapsible lists in renderdoc. For example, here's a diesel engine frame:

They're very simple to use:

glPushDebugGroup( GL_DEBUG_SOURCE_APPLICATION, 0, -1, "Hello" );
...
glPopDebugGroup();

Put those calls in your render pass setup code, also check GLAD_GL_KHR_debug != 0 if you're still on pleb GL like we are, and you're done.

31 Oct 2019 • C++ tricks: member array count

I recently wanted to make an array with the same size as some other array. Normally I would use ARRAY_COUNT, but this time the reference array was a struct member, and I was in a context where I had no struct instance I could use.

So this:

struct A {
	int a[ 4 ];
};

struct B {
	int b[ ??? ];
};

The solution is to use C++'s "pointer to data member" functionality:

template< typename T, typename M, size_t N >
constexpr size_t ARRAY_COUNT( M ( T::* )[ N ] ) {
	return N;
}

struct B {
	int b[ ARRAY_COUNT( &A::a ) ];
};

Weird but it works.
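
A small sketch showing the member version living alongside a regular ARRAY_COUNT. The plain-array overload and the weights array are assumptions for illustration, since the post doesn't show how the usual ARRAY_COUNT is defined:

```cpp
#include <stddef.h>

// the usual overload for plain arrays (an assumption, not shown in the post)
template< typename T, size_t N >
constexpr size_t ARRAY_COUNT( const T ( & )[ N ] ) {
	return N;
}

// the overload from the post, for pointers to array members
template< typename T, typename M, size_t N >
constexpr size_t ARRAY_COUNT( M ( T::* )[ N ] ) {
	return N;
}

struct A {
	int a[ 4 ];
};

struct B {
	int b[ ARRAY_COUNT( &A::a ) ];
};

static const float weights[ 3 ] = { 0.25f, 0.5f, 0.25f }; // hypothetical

static_assert( ARRAY_COUNT( weights ) == 3, "plain arrays pick the first overload" );
static_assert( ARRAY_COUNT( &A::a ) == 4, "member pointers pick the second" );
static_assert( sizeof( B::b ) == sizeof( A::a ), "B::b ends up the same size as A::a" );
```

The two overloads don't fight because a pointer to member never deduces against a reference to array and vice versa.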

25 May 2019 • Server side React

JSX is nice. Most templating languages start with HTML and add dumpy scripting on top, while JSX adds HTML syntax to javascript. It means you (mostly) get to use normal language constructs and flow control and it's much less annoying. I'd like to see more languages go this way.

If you only want to use it server side so people don't have to enable javascript and run 10MB of react, you can do this:

function flatten( arr ) : string[] {
	let flattened = [ ];

	function flatten_impl( flattened, arr ) {
		if( Array.isArray( arr ) ) {
			for( const e of arr ) {
				flatten_impl( flattened, e );
			}
		}
		else {
			flattened.push( arr );
		}
	}

	flatten_impl( flattened, arr );
	return flattened;
}

type Attributes = { [ key : string ] : string };
type Component = ( attributes : Attributes | null, children : any ) => any;

let React = {
	createElement: function( component : Component | string, attributes : Attributes | null, ...children ) : string {
		if( typeof component == "string" ) {
			let attr = "";
			if( attributes != null ) {
				for( const k in attributes ) {
					const key = k == "className" ? "class" : k;
					attr += ` ${key}="${attributes[k]}"`;
				}
			}
			return `<${component}${attr}>` + flatten( children ).join( "" ) + `</${component}>`;
		}
		else {
			return component( attributes, children );
		}
	}
};

Then write your JSX and build your code like tsc --jsx and run it to print your pages. Not sure it exactly matches React's behaviour since I don't use it, but it seems good enough.

Rant time

JSX is a big improvement over everything else but it's still not actually good. All HTML statements should be concatenated to some implicit return value so you can use normal language constructs everywhere:

function CountTo( props ) {
	for( let i = 0; i < props.n; i++ ) {
		<div>{i}</div>
	}
}

rather than having to jam everything into one statement:

function CountTo( props ) {
	return [ ...new Array( props.n ) ].map( ( _, i ) => <div>{i}</div> );
}

I think this also lets you delete custom components from the language since it seems to just be different syntax for calling a function.

29 Apr 2019 • SSH local discovery

tinc has a nice feature called local discovery, where if the endpoints can talk directly it will do that rather than routing packets out through my VPS.

Wireguard is the new hotness but it doesn't do this. The only thing I really use my VPN for is to SSH/scp between my computers though, so solving this for SSH solves 99% of the problem.

Fortunately it's easy:

Match originalhost pi exec "am-i-home"
	HostName 192.168.1.3
Host pi
	HostName 10.0.0.4

If I SSH to pi, it will run am-i-home to decide whether to use the local IP or the VPN IP. So you need to configure your router/VPN to use static IPs.

am-i-home just checks whether I'm connected by ethernet or on my home WiFi:

#! /bin/sh
[ "$(cat /sys/class/net/eth0/carrier 2> /dev/null)" = "1" ] && exit
[ "$(iwgetid -r)" = "homessid" ] && exit
exit 1

15 Apr 2019 • C++ tricks: compile time string hashing

While writing that last post I figured I should write about this too.

We use it in Cocaine Diesel to generate a unique version number from the version string (git shorthash or tag if there is one). We also plan to use it for assets so we can refer to them by name in the code but not have it be super slow at runtime (could use an enum but that's annoying to maintain).

In C++11 constexpr functions can't have bodies besides a return statement, so we have to write the hashes as recursive functions and it's a bit ugly. Here's FNV-1a:

constexpr uint32_t Hash32_CT( const char * str, size_t n, uint32_t basis = UINT32_C( 2166136261 ) ) {
	return n == 0 ? basis : Hash32_CT( str + 1, n - 1, ( basis ^ str[ 0 ] ) * UINT32_C( 16777619 ) );
}

constexpr uint64_t Hash64_CT( const char * str, size_t n, uint64_t basis = UINT64_C( 14695981039346656037 ) ) {
	return n == 0 ? basis : Hash64_CT( str + 1, n - 1, ( basis ^ str[ 0 ] ) * UINT64_C( 1099511628211 ) );
}

and then you can add some helper functions to make it a bit easier to use:

template< size_t N >
constexpr uint32_t Hash32_CT( const char ( &s )[ N ] ) {
	return Hash32_CT( s, N - 1 );
}

template< size_t N >
constexpr uint64_t Hash64_CT( const char ( &s )[ N ] ) {
	return Hash64_CT( s, N - 1 );
}

Errata: I missed the - 1 when I first wrote this. You need it so you don't hash the trailing '\0' char.
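
One place this is handy is switching on strings. A minimal sketch, assuming the 32 bit functions from this post; Command and ParseCommand are hypothetical names for illustration, and the two static_asserts check against the published FNV-1a test vectors:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

constexpr uint32_t Hash32_CT( const char * str, size_t n, uint32_t basis = UINT32_C( 2166136261 ) ) {
	return n == 0 ? basis : Hash32_CT( str + 1, n - 1, ( basis ^ str[ 0 ] ) * UINT32_C( 16777619 ) );
}

template< size_t N >
constexpr uint32_t Hash32_CT( const char ( &s )[ N ] ) {
	return Hash32_CT( s, N - 1 );
}

// FNV-1a test vectors, evaluated entirely at compile time
static_assert( Hash32_CT( "" ) == UINT32_C( 0x811c9dc5 ), "empty string hashes to the offset basis" );
static_assert( Hash32_CT( "a" ) == UINT32_C( 0xe40c292c ), "FNV-1a of \"a\"" );

// hypothetical command parser
enum Command { Command_Halt, Command_Reboot, Command_Unknown };

Command ParseCommand( const char * str ) {
	// runtime hash of the input, compile time hashes in the case labels
	switch( Hash32_CT( str, strlen( str ) ) ) {
		case Hash32_CT( "halt" ): return Command_Halt;
		case Hash32_CT( "reboot" ): return Command_Reboot;
	}
	return Command_Unknown;
}
```

If two case labels ever hash to the same value the switch fails to compile, so collisions between the known strings can't slip through silently. Unknown input can still collide with a known string at runtime, so compare the actual string too if that matters.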

15 Apr 2019 • C++ tricks: compile time type IDs

Here's a little trick for getting a unique ID for each type in your codebase, entirely at compile time.

typeid

C++ has typeid which is garbage. It works like this:

#include <stdio.h>
#include <typeinfo>

int main() {
	constexpr const std::type_info & a = typeid( int );
	printf( "%zu\n", a.hash_code() );
	return 0;
}

and compiles to

leaq    _ZTS1A(%rip), %rdi
movl    $3339675911, %edx
movl    $2, %esi
call    _ZSt11_Hash_bytesPKvmm@PLT

which looks not awesome. If we go look at gcc's implementation of hash_code we get

size_t hash_code() const noexcept {
	return _Hash_bytes(name(), __builtin_strlen(name()), static_cast<size_t>(0xc70f6907UL));
}

so this does a string hash at runtime every time you want to get the ID.

All of this could be trivially done at compile time and work fine with -fno-rtti, but it's C++ so they picked the absolute most useless implementation instead.

DIY

An actual solution is to use a constexpr hash function to hash the typename and (optionally) a little template to ensure the argument is actually a type.

#include <stdio.h>
#include <stdint.h>

// compile time FNV-1a
constexpr uint32_t Hash32_CT( const char * str, size_t n, uint32_t basis = UINT32_C( 2166136261 ) ) {
	return n == 0 ? basis : Hash32_CT( str + 1, n - 1, ( basis ^ str[ 0 ] ) * UINT32_C( 16777619 ) );
}

struct A {
	int a;
};

template< uint32_t id >
uint32_t typeid_helper() {
	return id;
}
#define TYPEID( T ) typeid_helper< Hash32_CT( #T, sizeof( #T ) - 1 ) >()

int main() {
	printf( "%u\n", TYPEID( A ) );
	// printf( "%u\n", TYPEID( 1 ) );
	return 0;
}

which compiles to

movl    $1735789992, %esi

This breaks in many situations. It doesn't understand namespaces and using and nested types, it doesn't understand typedef, and if you have structs in different files with the same name (technically UB, compilers don't warn about it so good luck with that) they get the same ID.

But it's good enough for ECS so good enough for me.
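
To make the typedef and namespace caveats concrete, here's a small sketch using the same TYPEID macro. AlsoA, ns and C are hypothetical names added for illustration:

```cpp
#include <stddef.h>
#include <stdint.h>

// compile time FNV-1a
constexpr uint32_t Hash32_CT( const char * str, size_t n, uint32_t basis = UINT32_C( 2166136261 ) ) {
	return n == 0 ? basis : Hash32_CT( str + 1, n - 1, ( basis ^ str[ 0 ] ) * UINT32_C( 16777619 ) );
}

template< uint32_t id >
uint32_t typeid_helper() {
	return id;
}
#define TYPEID( T ) typeid_helper< Hash32_CT( #T, sizeof( #T ) - 1 ) >()

struct A {
	int a;
};
typedef A AlsoA; // hypothetical alias for the same type

namespace ns {
	struct C { }; // hypothetical
}
using namespace ns;

// the macro hashes the spelling of its argument, not the type it names
bool TypedefGetsSameID() {
	return TYPEID( A ) == TYPEID( AlsoA ); // hashes "A" and "AlsoA"
}

bool QualifiedNameGetsSameID() {
	return TYPEID( C ) == TYPEID( ns::C ); // hashes "C" and "ns::C"
}
```

Both functions return false even though each pair names a single type, so pick one spelling per type and stick to it.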

Addendum: C++17

C++17 adds a feature called inline variables and you can implement typeid with them too:

#include <stdio.h>

// stuff this in a header somewhere
inline int type_id_seq = 0;
template< typename T > inline const int type_id = type_id_seq++;

int main() {
	printf( "%d\n", type_id< int > );
	printf( "%d\n", type_id< float > );
	printf( "%d\n", type_id< int > );
	return 0;
}

Basically type_id_seq = 0/type_id< int > = 0/etc (names get mangled but let's write them with template syntax for clarity) get put in .data, then some code runs before main that does type_id< int > = type_id_seq++; and so on. The advantages of doing it this way are that if you stick in some more templates to remove const/references/etc you can make it actually always accurate, and that it counts from 0 so you can use typeids as array indices. The disadvantages are that it compiles to a load rather than a constant, it needs C++17, and the implementation is a WTF.
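
A sketch of the "more templates" idea. It uses <type_traits> for brevity, and type_id_impl and TypeID are hypothetical names. It relies on the same before-main initialisation as the version above, so don't call it from other static initialisers:

```cpp
#include <type_traits>

inline int type_id_seq = 0;

// one counter value per distinct T, assigned during static initialisation
template< typename T > inline const int type_id_impl = type_id_seq++;

// strip references and const/volatile first, so int, const int and int & share one ID
template< typename T >
int TypeID() {
	return type_id_impl< std::remove_cv_t< std::remove_reference_t< T > > >;
}
```

With this, TypeID< int >() and TypeID< const int & >() give the same value while TypeID< float >() gives a different one. Which type gets which number depends on initialisation order, so don't hardcode the values.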

4 Mar 2019 • Useful tmux window titles

If you Google for this you'll find a lot of wrong answers. tmux rename-window is useless and will break if you do like sleep 1 and switch windows. Fortunately tmux has an escape sequence you can use too. First you have to enable it in tmux.conf:

set -g allow-rename on

Then to get the current working directory put this in your shell prompt:

echo -en "\ek$cwd\e\\"

And to show vim's current file you can put this in vimrc:

set title
set titlestring=%t
if !empty( $TMUX )
	set t_ts=^[k
	set t_fs=^[\
endif

^[ is an actual ESC (0x1B) character, which you can type in vim by doing ctrl+v then escape.

18 Feb 2019 • Sending mail through Amazon SES with OpenSMTPD

I've not had problems with delivery but sending your own mail is generally considered to be No Good, and SES is $0.10 per 1k messages, so why not.

You need to add a few DNS entries so SES can verify your domain, then grab your SMTP credentials from the dashboard and put them in /etc/mail/ses_credentials like ses username:password. Then in smtpd.conf:

table ses_credentials file:/etc/mail/ses_credentials
action relay_ses relay host smtp+tls://ses@email-smtp.eu-west-1.amazonaws.com auth <ses_credentials>

and you're done. Send a test mail to ooto@simulator.amazonses.com to check it works.

12 Jan 2019 • Detecting WSL in Makefiles

2024 update: WSLENV doesn't work anymore, do ifdef WSL_DISTRO_NAME instead.

There's a WSLENV variable, but it's empty by default so ifdef doesn't do what you want. ?= can distinguish between empty and undefined, so this works:

WSLENV ?= notwsl
ifndef WSLENV
	# this runs when you _are_ in WSL
endif

12 Jan 2019 • Windows 10 2019 post-install checklist

This is a guide on how to set up Windows 10 to not be annoying. You should set aside a few hours to go through everything. Most steps, but not all, are detailed enough that you can autopilot your way through it.

Initial setup and disabling security features

Install the correct version. You want Windows 10 LTSC, which is pretty stripped down out of the box and doesn't get feature updates.
In the installer, do domain join instead of creating a web account. Say no to all the location/telemetry garbage.
Click the start button, Settings, Update & Security, Windows Update, Check for updates.
Start, Windows Accessories, IE, download another browser. While you're at it, go into Internet Options, Security, drag the security level all the way down, Custom level..., Launching applications and unsafe files, check Enabled (not secure), ok out of all of that. (Firefox 52 ESR link)
If you installed an older Firefox, unplug your network cable before running it then go to settings and disable updates. Install uBlock Origin and NoScript (FF52 version). It's very important to install those before you do anything else on the web. Go into uBlock settings and enable all filter lists that sound good.
Install video drivers. Don't install Geforce Experience.
Open Control Panel (search for it in the start menu), User Accounts, User Accounts (again), Change User Account Control settings, disable it. Go back to Control Panel home, Programs, Turn Windows features on or off, check Windows Subsystem for Linux, don't reboot. Back to Control Panel home, System and Security, Windows Defender Firewall, Turn Windows Defender Firewall on or off, turn it off.
gpedit.msc, Computer Configuration, Administrative Templates, Windows Components, Windows Defender Antivirus. Double click Turn off Windows Defender Antivirus, check Enabled, click ok. The other Windows Defender entries are disabled and you can ignore them. Also go to Windows Components, OneDrive, Prevent the usage of OneDrive for file storage, Enabled, ok, Prevent OneDrive from generating network traffic..., Enabled, ok.
services.msc, disable and stop Windows Search and Windows Update, and anything else that offends you.
secpol.msc, Local Policies, Security Options, UAC: Run all administrators in Admin Approval Mode, Disabled.
Ctrl+shift+escape, Startup, disable Windows Security notification icon.
Reboot to BIOS (might need to fully shutdown), put Linux Boot Manager back at the top of the boot list. Reboot back to Windows and reboot until Windows Update is done.

Disable everything else

Win+E, View, Options, View. Check Show hidden files, folders and drives. Uncheck Hide empty drives. Uncheck Hide extensions for known file types. Uncheck Hide protected operating system files. Go down to Navigation pane, check Expand to open folder.
Install the Take Ownership Registry Hack. Ignore the tutorial, just scroll down to the zip and install it. Use it if you ever run into permissions errors. Don't use it on C: because it causes problems.
Install the Disable 3D Objects Hack.
Right click the desktop, Personalize, go through all of it including all the links. In particular go Themes, Sounds, Sound Scheme = No Sounds. Taskbar, Combine taskbar buttons, Never. Start, disable everything. Taskbar, Turn system icons on or off, disable Action Center and Input Indicator.
Right click the taskbar, Search, Hidden. Uncheck Show Task View button. Uncheck Show People on the taskbar.
Control Panel, System and Security, System, Advanced system settings, Performance Settings..., disable almost everything under visual effects, Advanced, Change..., set the pagefile size to 800MB. Go back to the Advanced system settings window, Startup and Recovery Settings..., uncheck Automatically restart if you want.
Control Panel, System and Security, Security and Maintenance. Click all the "Turn off messages about x" links.
Win+R, cmd.exe, powercfg -h off.
Set Windows to use UTC time.
Install the MarkC mouse acceleration fix.
Open Settings:
1. System. Notifications & actions, disable them. Power & sleep, Never and Never.
2. Devices. Typing, disable everything. AutoPlay, Off.
3. Time & language. Set your time zone. Click Date, time & regional formatting, Change data formats, pick what you like.
4. Ease of Access. Keyboard, disable everything.
5. Privacy. Go through every tab and disable basically everything.
6. Update & security. Windows Update, Advanced Options, Delivery Optimization, don't let other people download updates from your PC. Click For developers in the sidebar, enable Developer Mode, disable Remote Desktop.
Control Panel, Ease of Access, Change how your keyboard works, disable everything.
regedit.exe, HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer, create a key called Serialize, then create a DWORD called StartupDelayInMSec and set it to 0.
HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced, set ExtendedUIHoverTime to 30000 to disable taskbar preview windows.
HKCU\Control Panel\Accessibility\Keyboard Response, set Flags to 3 and AutoRepeatDelay/AutoRepeatRate to whatever. Never open the accessibility settings menu again.
HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting, make a DWORD called Disabled and set it to 1.
More Firefox things:
1. Go to Options. General. When Firefox starts, Show your windows and tabs from last time. Set up fonts, set minimum font size, uncheck Allow pages to choose their own fonts. Downloads, Always ask you where to save files. Applications, PDF, Always ask (disables pdf.js). Privacy, Use custom settings for history, never accept third-party cookies, remove all cookies you picked up so far. Security, uncheck the warnings/blockers. Advanced, uncheck smooth scrolling, check autoscrolling. Update, Never check for updates, don't update search engines.
2. about:config. extensions.update.autoUpdateDefault = false, extensions.update.enabled = false. browser.tabs.closeWindowWithLastTab = false.
3. Install Vimperator, Download Statusbar, Hide Tab Bar With One Tab, HideScrollbars, and bug489729. Click the links for Firefox 52 compatible versions. You might need to download through IE.
4. Put the NoScript icons in the tab bar. Go into NoScript options. Whitelist, remove everything. Notifications, uncheck Show message about blocked scripts, uncheck Display the release notes on updates. Advanced, XSS, disable.
Download the Sysinternals Suite. Run autoruns, disable anything you don't like the look of. MozillaMaintenance, NVIDIA telemetry, etc. Run procexp and check nothing dumb is running just in case. You will have lots of svchost.exes because MS doesn't run multiple services in one exe anymore.
Reboot.

WSL

We are going to use WSL because it's much faster than Cygwin.

Install AlpineWSL. It has pretty comprehensive repos and comes with the least garbage (5MB!)
Run apk update; apk add openssh; ssh-keygen -t ed25519; ssh-keygen -f /etc/ssh/ssh_host_ed25519_key -t ed25519. Disable PasswordAuthentication in /etc/ssh/sshd_config. Set up authorized_keys. Create sshd.vbs somewhere:

   WScript.CreateObject( "shell.application" ).ShellExecute "C:\Program Files\Alpine Linux\Alpine.exe", "run /usr/sbin/sshd", "", "open", 0

and copy a shortcut to %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup.
Install Xming. Run xlaunch, click ok a few times, click save configuration, save it to Startup.
Install a terminal in Alpine (I like st), then create a script to run that too:

   WScript.CreateObject( "shell.application" ).ShellExecute "C:\Program Files\Alpine Linux\Alpine.exe", "run env DISPLAY=:0 st", "", "open", 0

Check that copy paste between Windows and WSL works how you want. I had to swap primary/clipboard pastes in st for it to work out.
Create a shortcut to it, right click, put wscript.exe at the front of the target, give it a nice icon, then drag it to the taskbar. When you run it, it makes a new icon rather than opening in place, if anyone knows how to fix that please email me.
apk add bind-tools coreutils ctags curl fish fzf git grep htop less man man-pages mdocml-apropos p7zip the_silver_searcher tig tmux tree vim whois

Software I like

7-Zip. Go into settings and associate it with everything that isn't zip. Disable all the junk context menu items.
Create halt.bat somewhere containing shutdown /s /t 0. Create reboot.bat containing shutdown /r /t 0. Use Everything to run these.
Search Everything. Sort by descending run count, and close window on execute. Right click on things and set run count to seed them to appear at the top. halt.bat, reboot.bat, Control Panel.lnk, Snipping Tool.lnk, firefox.exe, etc.
Vim.
AutoHotKey. Put startup.ahk in %APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup. Include a hotkey to launch Everything. Scroll down for an example.
Start Killer. Use Everything as a launcher instead.
Clink.
Dina font.
Download psubst. psubst X: C:\Users\<user>\Documents /P.
Sumatra PDF.
f.lux.
Visual Studio 2019. Check the Graphics debugger and GPU profiler for DirectX box.
Apply this .reg file to fix JIT debugging.
NSIS.
Intel Architecture Code Analyzer.
Windows SDK. Make sure you check Windows Performance Toolkit (for GPUView), Debugging Tools for Windows (WinDBG), Windows SDK Signing Tools for Desktop Apps (SignTool), and probably the x86/amd64 SDKs.
DirectX SDK. You need it for XAudio 2.7, which you need if you want to ship software on Win7. You might need to reenable Windows Update for this.
Windows Store. Delete the PurchaseApp/xbox stuff. Get WinDbg Preview from the Windows Store.
Renderdoc. apitrace. GPU ShaderAnalyzer. Nsight Graphics.
Color Cop. GIMP. Inkscape. Milton. Blender. Wings3D.
mpv. Put youtube-dl in the same folder.
foobar2000.
Download Path Editor. Add VS compiler stuff, MSBuild, the Win10 Kit (with mt.exe), IACA, apitrace, NSIS, and mpv to path.
Control Panel, System and Security, System, Advanced system settings, Environment Variables. Point INCLUDE and LIB at VS and the SDKs. For VS2015 I have:

   INCLUDE:
   C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include;C:\Program Files (x86)\Windows Kits\10\Include\10.0.10240.0\ucrt;C:\Program Files (x86)\Windows Kits\8.1\Include\shared;C:\Program Files (x86)\Microsoft DirectX SDK (June 2010)\Include;C:\Program Files (x86)\Windows Kits\8.1\Include\um;C:\Program Files (x86)\Windows Kits\8.1\Include\winrt

and

   LIB:
   C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib\amd64;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.10240.0\ucrt\x64;C:\Program Files (x86)\Windows Kits\8.1\lib\winv6.3\um\x64

startup.ahk

This script maps capslock to escape, adds some hotkeys for launching/closing programs, and doesn't open the start menu when you press the windows key.

#SingleInstance force

SetCapsLockState, Off
SetCapsLockState, AlwaysOff
CapsLock::Escape

#e::Run C:\Users\mike\Documents
#p::Run C:\Program Files\Everything\Everything.exe
#Enter::
	Run C:\Program Files\Alpine Linux\st.vbs
	WinWait Xming,, 1
	WinMaximize
	return
#x::Winclose, A

LWin & vk07::return
LWin::return

Internet points

Write a blog post.
Email me if I forgot anything.

5 Jan 2019 • Immediate mode audio

In games you typically have a fire and forget API, where you start a sound and it plays to completion. Maybe it returns some handle so you can stop it later on.

PlayingSound ps = PlaySound( "path/to/sound" );
...
StopSound( ps );

Most of the time you want to play sounds to completion so this is nice and convenient, but sometimes you have sounds attached to objects and you want the sound to stop when the object is destroyed. Things like rocket thrusters and voice lines.

For those it's nicer to have an immediate mode API, where you have to keep calling the play function or it stops. For example:

void RocketThink() {
	if( collided with something ) {
		// destroys the rocket, so RocketThink doesn't get called anymore,
		// so PlaySoundImmediate doesn't get called anymore, so the sound stops playing
		RocketExplode();
		return;
	}

	DrawRocket();
	PlaySoundImmediate( "path/to/sound" );
}

It's trivial to implement and there's no way you can forget to stop a sound.
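
To back up the "trivial" claim, here's a minimal sketch of the bookkeeping. The names ImmediateSound, IsPlaying and EndSoundFrame are made up for illustration, sounds are keyed by path only (a real game would probably key on the entity too), and the actual mixer calls are left as comments.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical bookkeeping for an immediate mode sound API. Starting and
// stopping the actual audio is left to whatever mixer you already have.
struct ImmediateSound {
	bool touched_this_frame;
};

static std::unordered_map< std::string, ImmediateSound > immediate_sounds;

void PlaySoundImmediate( const std::string & path ) {
	auto it = immediate_sounds.find( path );
	if( it == immediate_sounds.end() ) {
		// first call: this is where you would start the sound in the mixer
		immediate_sounds[ path ] = { true };
		return;
	}
	it->second.touched_this_frame = true;
}

bool IsPlaying( const std::string & path ) {
	return immediate_sounds.count( path ) > 0;
}

// Call once per frame after all the Think functions have run.
void EndSoundFrame() {
	for( auto it = immediate_sounds.begin(); it != immediate_sounds.end(); ) {
		if( !it->second.touched_this_frame ) {
			// nobody asked for it this frame: stop it in the mixer and forget it
			it = immediate_sounds.erase( it );
		}
		else {
			it->second.touched_this_frame = false;
			++it;
		}
	}
}
```

The sound keeps going for as long as something calls PlaySoundImmediate every frame, and gets dropped by the first EndSoundFrame after the calls stop.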

18 Dec 2018 • How to get rid of ? globbing in fish

? globbing is useless and makes pasting URLs annoying, this is how you kill it:

git clone https://github.com/fish-shell/fish-shell
cd fish-shell
sed -i "s/set_from_string(opts.features)/set_from_string(L\"qmark-noglob\")/" src/fish.cpp

then make and install that. If you're on arch you can grab fish-git from AUR, and add the sed line to build() above autoreconf.

12 Sep 2018 • Branch prediction minutiae in LZ decoders

Say we have an LZSS inner decode loop like this (not good, just an example):

u8 ctrl = read_u8();

// match when MSB of ctrl is set, literal otherwise
if( ctrl >= 128 ) {
	u32 len = ctrl - 128;
	if( len == 127 )
		len += read_extra_length();
	len += MML;
	u16 offset = read_u16();
	memcpy( op, op - offset, len );
	op += len;
}
else {
	u32 len = ctrl;
	if( len == 127 )
		len += read_extra_length();
	memcpy( op, ip, len );
	op += len;
	ip += len;
}

We have an unpredictable branch to decide between literals and matches, and the branch misprediction penalties can eat a lot of time if you're hitting lots of short copies, which you do the majority of the time. There's also a branch to read spilled length bytes but we hit that less than 1% of the time and when we do hit it the branch misprediction isn't such a big deal because we get lots of data out of it, so we are going to ignore that in this post.

LZ4's solution to this is to always alternate between literals and matches, and send a 0-len literal when you need to send two matches in a row. That might look like this:

u8 ctrl = read_u8();
u32 literal_len = ctrl & 0xF;

if( literal_len == 15 )
	literal_len += read_extra_length();
memcpy( op, ip, literal_len );

op += literal_len;
ip += literal_len;

u32 match_len = ctrl >> 4;
u16 match_offset = read_u16();
if( match_len == 15 )
	match_len += read_extra_length();
match_len += MML;
memcpy( op, op - match_offset, match_len );

op += match_len;

So we got rid of the branch, but not really! memcpy has a loop, and now we're polluting that branch's statistics with 0-len literals. This does end up being an improvement on modern CPUs though, from the Haswell section in Agner Fog's uarch PDF:

3.8 Pattern recognition for conditional jumps
The processor is able to predict very long repetitive jump patterns with few or no mispredictions. I found no specific limit to the length of jump patterns that could be predicted. One study found that it stores a history of at least 29 branches. Loops are successfully predicted up to a count of 32 or a little more. Nested loops and branches inside loops are predicted reasonably well.

Modern CPUs are able to identify loops and perfectly predict the exit condition. A good memcpy copies 16 or 32 bytes at a time, so we don't pay any misprediction penalties until at least 512 bytes, at which point we don't care because we got so much data out of it.
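
To make the alternating literal/match layout concrete, here's a toy decoder for a made-up format in that style. It's a sketch and not LZ4: the format, DecodeToy and the MML value of 4 are all assumptions, extra length bytes are left out, and it copies byte by byte for clarity rather than speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format, one sequence after another:
//   ctrl byte: low nibble = literal length, high nibble = match length - MML
//   the literal bytes
//   u16 little-endian match offset
constexpr uint32_t MML = 4;

std::vector< uint8_t > DecodeToy( const std::vector< uint8_t > & in ) {
	std::vector< uint8_t > out;
	size_t ip = 0;

	while( ip < in.size() ) {
		uint8_t ctrl = in[ ip++ ];

		// literals: always present, possibly zero of them
		uint32_t literal_len = ctrl & 0xF;
		for( uint32_t i = 0; i < literal_len && ip < in.size(); i++ )
			out.push_back( in[ ip++ ] );

		// the final sequence is allowed to end after its literals
		if( ip + 2 > in.size() )
			break;

		uint32_t match_len = ( ctrl >> 4 ) + MML;
		uint32_t offset = in[ ip ] | ( uint32_t( in[ ip + 1 ] ) << 8 );
		ip += 2;
		if( offset == 0 || offset > out.size() )
			break; // malformed input

		// copy one byte at a time so overlapping matches repeat data
		size_t start = out.size() - offset;
		for( uint32_t i = 0; i < match_len; i++ ) {
			uint8_t b = out[ start + i ];
			out.push_back( b );
		}
	}

	return out;
}
```

Feeding it { 0x03, 'a', 'b', 'c', 0x03, 0x00, 0x10, 0x01, 0x00 } gives "abcabcaaaaaa". The second sequence has a 0-len literal because it's two matches in a row.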

8 Sep 2018 • Least effort image self-hosting

My requirements:

Self hosted
Easy to upload photos from my PC
Easy to upload photos from my phone
Don't require tons of new crap software on my VPS

I already have mail clients on my PC/phone, and a mail server running on my VPS, so writing it as an MDA seemed like the least effort approach. It ended up being one line in smtpd.conf, 70 lines of python, and a DNS entry.

The smtpd.conf entry looks like this:

accept from any for domain "topsecret.mikejsavage.co.uk" virtual { "alsotopsecret" => mike } deliver to mda "mkgallery" as gallery

The subdomain and recipient kind of act as passwords, obviously that's bogus but it's good enough to keep the kids out.

And the code: mkgallery.py

Attaching images in the iOS mail client is kind of annoying and you can only send a few images at once because smtpd rejects huge messages, but it's still better than everyone else's image hosting services and I knocked it out in like a day so whatever.

8 Sep 2018 • Using WSAAsyncSelect

This API is garbage, the docs are garbage, and every piece of code I could find that uses it is garbage. It works roughly like this:

You open a socket
You call WSAAsyncSelect on it
WndProc gets called with FD_READ every frame if there's still data on the socket
WndProc gets called with FD_CLOSE when it closes

which looks totally reasonable. But, FD_CLOSE happens when the socket closes and not when you're done reading it, so you can keep getting FD_READs after the FD_CLOSE. On top of that the API seems to be buggy shit and you can get random garbage FD_READs out of nowhere. (the Microsoft sample code runs into this too (key word in that comment being "usually") so it's not just me using it wrong)

So the correct usage code looks like this:

switch( WSAGETSELECTEVENT( lParam ) ) {
	case FD_READ:
	case FD_CLOSE: {
		SOCKET sock = ( SOCKET ) wParam;
		// check this isn't a random FD_READ
		if( !open_sockets.contains( sock ) )
			break;

		while( true ) {
			char buf[ 2048 ];
			int n = recv( sock, buf, sizeof( buf ), 0 );
			if( n > 0 ) {
				// do stuff with buf
			}
			else if( n == 0 ) {
				closesocket( sock );
				open_sockets.remove( sock );
				break;
			}
			else {
				int err = WSAGetLastError();
				if( err != WSAEWOULDBLOCK )
					abort();
				break;
			}
		}

	} break;

	// ...
}

At the top we have a check to make sure it's not a garbage FD_READ, and then we have a while loop to read until WSAEWOULDBLOCK. You need the loop so you know when to actually close the socket. You can't close it in FD_CLOSE because you can get FD_READs after that. Without the loop you'll stop receiving FD_READs after you read everything off the socket, which means you never see the zero length recv and you won't know when to close it. Technically you only need to start looping after FD_CLOSE, but it simplifies the code a bit if you treat both events in the same way.

Lots of samples you see just ignore any errors from recv. Don't do this, you should explicitly handle expected errors (in this case that's just WSAEWOULDBLOCK), and kill the program on anything else because it means your code is wrong.

24 Apr 2018 • GoAccess with OpenBSD httpd

httpd's logs are pretty similar to CLF but not exactly, so you need to put this in ~/.goaccessrc:

time-format %T
date-format %d/%b/%Y
log-format %v %h %^ %^ [%d:%t %^] "%r" %s %b

And then run zcat /var/www/logs/access.log.*.gz | cat /var/www/logs/access.log - | grep -v syslog | goaccess --no-global-config.

14 Apr 2018 • namespace is bad and should not be used

namespace in C++ is an odd one. Even anti-C++ people do not seem to complain about it, and it's not obvious exactly why it's bad. It's taken me some years of programming in C++ to come to that realisation myself.

The problem with namespace is that it brings no upsides and causes problems that are very annoying and somewhat time consuming to debug. You write code that looks fine, get some linker error about "can't find function f" when you literally just wrote function f, and unless you know from experience that it's going to be a namespace bug (you learn after a few times but these errors are so infrequent that it can take many years) you can waste a lot of time going down dead ends.

Even if you suspect it's a namespace bug it can be hard to figure out exactly what the problem is because code affected by a namespace is mostly indistinguishable from code that isn't. You generally wrap the whole file in namespace Company { }, and C++ people don't indent inside namespace blocks, so unless the start of the namespace block is on your screen you have no idea if a given function is namespaced or not. You just assume that it is because that's how it is 99% of the time.

To give a concrete example of a namespace problem, here's the one that prompted me to write this post:

We have some class A in the global namespace
We have other stuff in the same file that's in the company namespace
A is all private, so the whole header is in the company namespace
I wanted to change some method A::b to be a free function
I did the minimal changes to make that happen (deleted the A::, added it to the header)
I get a linker error trying to call the new function

A is not in the company namespace and I left the new function in amongst A's method definitions, so my new function was not in the company namespace either. The whole header is in the company namespace, so there was a mismatch, and it took a lot of WTFing to figure out what the problem was.
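
A cut down version of that situation might look like the following. It's a sketch squashed into one file: Company, A and b are stand-in names and the function bodies are invented.

```cpp
// --- header: everything is wrapped in the company namespace ---
namespace Company {
int b(); // this declares Company::b
}

// --- source file: A is private to this file and lives in the global namespace ---
struct A { int method(); };
int A::method() { return 1; }

// Writing `int b() { return 2; }` here, in amongst A's methods, defines ::b and
// not Company::b. Calls to Company::b() still compile against the header, then
// fail at link time with an undefined reference.
// Qualifying the definition makes the declaration and definition match:
int Company::b() { return 2; }
```

Nothing at the definition site tells you which of the two you wrote, which is the whole problem.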

Some people may disagree that namespace has no upsides, I know the point is to prevent naming collisions, but those never happen. Obviously when writing new code you can namespace the function name (renderer_init vs Renderer::init), which leaves third party libraries. In my experience libs just don't have naming collisions, so unless you're deep in dependency hell it's not a problem, and namespace is not the right solution in that case.

We could help the situation by indenting inside namespace blocks (like C# people do), or by having better compiler errors by suggesting near matches, but honestly the simplest solution is to just not use namespaces.

5 Apr 2018 • Never update anything #145432

I occasionally write posts like this and then delete them because it's just whining about things that are not interesting and nobody wants to read it. But this case was so annoying that I have to push it.

I wanted to install OBS so I could record my screen with sound. OBS has a lot of dependencies and thanks to dynamic linking that basically meant I had to update everything on my PC, so I did.

Then when I went back to working on compression I noticed my codec was running at half the speed it was before. WTF ok, I'll go disable the Meltdown patches, nope that made no difference, WTF.

Now for some reason my code runs like absolute shit unless I run it in a loop inside the program. Before I was checking perf by running the program lots of times, so like for i in {1..n}; do ./program; done, and now I have to do the loop from C or it runs half as fast.

This annoys me so much because I've made real breakthroughs in compression perf and I could have missed them over something as stupid as this. I actually did the first implementation on the plane on my laptop and got horrible results. If I got bad perf on my desktop too I'd probably have just given up.

The only thing I want from my computer is for it to work exactly like it did yesterday. I want to get my real work done and not waste time on things like this. Nobody seems to understand this besides the OpenBSD team. Reading the OpenBSD 6.3 changes there's nothing in there that affects how I use it, it's just like before but it works a bit better, so I know I can do the 10 minute upgrade and not have to worry about anything breaking. Everyone else is like "we introduced nothing of value but we sure did fuck up a bunch of other shit and yep we will berate you for not upgrading".

BTW audio is such a shitshow on Linux that OBS didn't work and I had to remove it anyway.

30 Mar 2018 • cmov

The reason cmov is not always a win is that you have to block on the branch condition and on both sides of the branch.

To make that more explicit, say you have some code like this, and let's assume we take the true side of the branch.

int x;
bool p = [pred code]; // let's say it ends up being true
if( p ) {
	[true code];
	x = something;
}
else {
	[false code];
	x = something;
}

With branch prediction and out of order you can run [pred code] in parallel with [true code], and you don't need to bother running [false code]. If [pred code] and [true code] can overlap perfectly then the branch becomes very nearly free.

With cmov, it looks like this:

bool p = [pred code]; // let's say it ends up being true
int xtrue = [true code];
int xfalse = [false code];
int x = xfalse;
x = cmov( x, xtrue, p );

The difference is that the end result of x depends not only on [true code], but also on [pred code] and [false code]. You have to wait for those results to come in before you can compute x. Whereas before you ran [pred code] in parallel and didn't run [false code] at all.
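
You can see the same dependency in plain C++ with a mask-based select. This is not cmov itself and the compiler is free to emit whatever it likes for it, but the shape is the same: the result needs p, xtrue and xfalse all to be ready. SelectBranchless is a made-up name.

```cpp
#include <cstdint>

// Branchless select: returns xtrue when p is true, xfalse otherwise.
// The result depends on all three inputs, so none of them can be skipped.
int32_t SelectBranchless( int32_t xfalse, int32_t xtrue, bool p ) {
	uint32_t mask = uint32_t( 0 ) - uint32_t( p ); // all ones if p, else zero
	return int32_t( ( uint32_t( xtrue ) & mask ) | ( uint32_t( xfalse ) & ~mask ) );
}
```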

To decide between an explicit branch and cmov you need to figure out the expected costs of both. For cmov you just go look in the instruction tables and figure it out, for a branch it's slightly more work:

expected_cost_of_branch =
	probability_of_misprediction * ( misprediction_penalty + latency_of_pred_code )
	+ ( 1 - probability_of_misprediction ) * cost_of_predicted_branch

misprediction_penalty is like 16-20 cycles, cost_of_predicted_branch can be 1 or 2 cycles depending on whether it's a branch taken or not taken (for Skylake) (look in the instruction tables). latency_of_pred_code is how long it takes to figure out you took the wrong side of the branch.

That formula is actually incomplete, you have four cases. You have "correctly predicted and takes true side", "mispredicted and takes true side", "correctly predicted and takes false side", and "mispredicted and takes false side". With those probabilities you can also add in the cost of the code in each side of the branch, but those probabilities are harder to estimate, and just having the cost of the branch op can give you a decent idea of whether cmov will help or not.
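
Here's the simple two-case formula as code so you can plug numbers in. The function name is made up, and the numbers in the note after it are illustrative guesses rather than measurements.

```cpp
// Expected cost in cycles of a branch, using the simple two-case formula.
double ExpectedBranchCost( double probability_of_misprediction, double misprediction_penalty,
	double latency_of_pred_code, double cost_of_predicted_branch ) {
	return probability_of_misprediction * ( misprediction_penalty + latency_of_pred_code )
		+ ( 1.0 - probability_of_misprediction ) * cost_of_predicted_branch;
}
```

With a coin flip branch, an 18 cycle penalty, 4 cycles of [pred code] latency and a 1 cycle predicted branch, ExpectedBranchCost( 0.5, 18, 4, 1 ) comes out at 0.5 * 22 + 0.5 * 1 = 11.5 cycles. Drop the misprediction rate to 25% and it's 6.25. That's the number to compare against whatever cmov costs you.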

Rant time

WTF isn't there a cmov intrinsic? WTF isn't there a not-cmov intrinsic? The compiler authors will of course say "oh you need to do a lot of measurements to make sure that it's a win so it's best to just leave it to us". Ok but I have done the measurements, I know my branch is unpredictable and I drew the pipeline diagrams so I know cmov will be faster on average. Why are you making me waste a ton of time guessing the syntax that will make the compiler generate the right code? Jesus.

21 Mar 2018 • Compression gold medalist

I have the world's fastest compression algorithm. Some quick preliminary benchmarks against LZ4:

Silesia mozilla

LZ4 -1: 717.1MB/s encode, 3057.9MB/s decode, 1.938 ratio
me: 553.5MB/s encode, 5305.8MB/s decode, 1.991 ratio

I'm nearly 75% faster to decode with slightly better ratio. My encoder hasn't received much love but will become faster than LZ4 with more work.

acidwdm2.bsp

LZ4 -1: 645.7MB/s encode, 2985.8MB/s decode, 3.033 ratio
me: 644.4MB/s encode, 5317.8MB/s decode, 2.924 ratio

78% faster with slightly worse ratio.

More to come.

3 Mar 2018 • Existential risk from artificial general intelligence

26 Feb 2018 • ggentropy

Code dump for a cross platform getentropy. Goes well with the CSPRNG post.

Update: deleted the inline code dump and put it on GitHub.

7 Feb 2018 • C++ tricks: dealing with 3rd party code

There are really only two options for this. You can copy the source code into your repository and add it to your build system. If it's too hard to add to your build system or takes too long to compile you can add it to CI separate from the main codebase and commit the binaries.

I feel like this is pretty well known, the only interesting thing I have to say is that you should really put the builds somewhere where other people can reproduce them. Building it on your PC/random servers and committing that is not good enough! CI works well for this. You make a new repo that contains the library source code and a script to build it, and set it up so people can click a button to run it.

I'm not sure of a nice way to get stuff like cmake onto the build machines though. Docker is not a solution unless you're Linux only. I guess there aren't that many build systems out there so making the images manually isn't too bad?

28 Jan 2018 • Building a userspace CSPRNG on top of Monocypher 2

Addendum July 2021

This was written against Monocypher 2 and doesn't work with Monocypher 3. Also the stirring logic is bad, when you encrypt something the chacha state is unaffected by the data itself, i.e. the entropy doesn't actually get mixed in. You need to reinitialise the chacha context! Go here for the Monocypher 3 version.

Original post

Monocypher is an excellent crypto library that comes with everything you need, except a cryptographically secure RNG. The Monocypher manual says use the OS's CSPRNG, but those are not ideal. So this post covers how to build an ideal userspace CSPRNG on top of Monocypher.

On OpenBSD you have the getentropy syscall, which cannot fail, and is the gold standard for OS RNGs. On new Linux you have getrandom, which might or might not fail (I can't tell from the manual). On old Linux and OSX you have urandom, which can fail in 100 different ways. On other platforms you have various other mechanisms that might or might not fail.

Making your RNG signal failure is not a solution because people are not going to check for and handle that case correctly. So we need an RNG which never fails, which we can do by moving the RNG to userspace. We need to seed it with entropy from the kernel, but we can do that once at startup when killing the program isn't such a big deal.

Also syscalls are slow.

Monocypher implements Chacha20 for encryption. It's a stream cipher, so it works by taking the output of a CSPRNG and XORing it with the plaintext, so we can use its CSPRNG as our CSPRNG. Monocypher also conveniently exposes the RNG directly, but with other libraries you can encrypt all zeroes and use that.

Seeding with getentropy

Getting this right on every platform isn't very hard, especially when you don't care about recovering from failure, but if you don't want to implement it you can just copy the code from OpenBSD.

Otherwise, see:

CryptGenRandom on Windows
getrandom on Linux (don't set any flags)
getentropy on OpenBSD
/dev/urandom on old Linux and OSX

Stirring/reseeding

OpenBSD's arc4random reseeds the Chacha20 context every 1.6MB for paranoia, and there may be other reasons why you would want to reseed the RNG (see below).

To reseed the RNG all you need to do is encrypt the seed. (add: no)

Threads

If you need random numbers in multiple threads there are a few approaches you can take:

Use a single RNG and wrap it in a mutex. This approach takes the least effort and is the easiest to get right
Create N RNGs and then never use more than N threads. This is totally valid if your code isn't out of control
Use thread-local storage. Create an RNG at startup, and then use that to seed the TLS RNGs since those can be created after startup and you don't want that to fail

fork

Fork works (roughly) by making a complete copy of the program's memory. That includes the RNG states, so both sides will output the same data from their RNGs, which is probably not desirable.

Using getpid doesn't work, because if

Program A forks to Program B
Program B forks to Program C
Program A exits
Program C forks to Program A

then Program A calls getpid. It will see that it's still Program A and assume it hasn't forked.

What we really want is some way to register a callback that gets called when the program forks, which is exactly what pthread_atfork is for.

Obviously if your program never forks, or only does fork+exec, you don't need to worry about this.
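
Registering the callback might look something like this. It's a sketch: pthread_atfork is the real POSIX call, but the flag and function names are made up, and the actual reseed (checking the flag before producing output and reinitialising from fresh entropy if it's set) is left out.

```cpp
#include <pthread.h>

// Set in the child after fork. Hypothetical flag: the RNG would check it
// before producing output and reinitialise itself from fresh entropy if set.
static volatile bool csprng_needs_reseed = false;

static void csprng_on_fork_child() {
	csprng_needs_reseed = true;
}

// Call once at startup, after seeding the RNG.
bool csprng_install_fork_handler() {
	// the prepare and parent handlers aren't needed here, only the child one
	return pthread_atfork( nullptr, nullptr, csprng_on_fork_child ) == 0;
}
```

Setting a flag keeps the handler itself tiny. You could reseed directly in the handler instead, in which case a failure there has to abort.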

The code

I've not included any of the getentropy/atfork/threading code, but this should get you started:

crypto_chacha_ctx ctx;

void csprng_init() {
	uint8_t entropy[ 32 + 8 ];
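	// assumes getentropy is your own wrapper that returns true on success,
	// the POSIX getentropy returns 0 on success instead
	bool ok = getentropy( entropy, sizeof( entropy ) );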
	if( !ok )
		abort();

	crypto_chacha20_init( &ctx, entropy, entropy + 32 );
	crypto_wipe( entropy, sizeof( entropy ) );
}

void csprng_random( void * buf, size_t n ) {
	crypto_chacha20_stream( &ctx, ( uint8_t * ) buf, n );
}

bool csprng_stir() {
	// if we're periodically stirring and this fails we can probably let it slide.
	// if we're in an atfork callback and this fails we have to abort.
	uint8_t entropy[ 32 ];
	bool ok = getentropy( entropy, sizeof( entropy ) );
	if( !ok ) {
		crypto_wipe( entropy, sizeof( entropy ) );
		return false;
	}

	uint8_t ciphertext[ 32 ];
	crypto_chacha20_encrypt( &ctx, ciphertext, entropy, sizeof( entropy ) ); // add: bad!
	crypto_wipe( entropy, sizeof( entropy ) );

	return true;
}

11 Jan 2018 • C++ tricks: named function arguments

You all know what they are and why they're useful. In C you can hack it with designated initialisers and macros so your calls look like f( .x = 1, .y = 2 );. C++ doesn't have designated initialisers, but you can do something pretty similar with lambdas:

struct FooArgs { int x, y; };
int foo_impl( const FooArgs & args ) {
	return args.x + args.y;
}

#define foo( ... ) [&]() { FooArgs A; __VA_ARGS__; return foo_impl( A ); }()

int main() {
	return foo( A.x = 1, A.y = 2 );
}

This is junk but nice to know it can be done.

30 Dec 2017 • Least effort self hosted dynamic DNS

I wanted to set up a reverse SSH tunnel from my work PC to home because I can't figure out how to use OpenVPN. I have a dynamic IP at home so I needed to set up some kind of dynamic DNS. I was hoping someone had already done it for me because my requirements are so simple:

it should point some mikejsavage.co.uk subdomain at my home IP
it should not depend on any new 3rd party services
conceptually dyndns is pretty trivial so the solution should not be a gigantic heap of script

but they haven't so here's my take on it.

My solution is to host an authoritative DNS server on my VPS and update the zone file from cron running on my home server.

Setting up the server

First, you need a DNS provider that lets you add NS records for subdomains (Vultr's DNS lets you do this). In my case I have a record saying that lookups for domains under dyn.mikejsavage.co.uk should get forwarded to my VPS.

Next, install nsd. nsd.conf looks like this:

server:
        username: _nsd
        database: "" # disable database

remote-control:
        control-enable: yes

zone:
        name: dyn.mikejsavage.co.uk
        zonefile: dyn.mikejsavage.co.uk

The dyn.mikejsavage.co.uk zone file looks like this: (Google for "zone file syntax" if you care, it's not very exciting)

@       IN      SOA     ns0.mikejsavage.co.uk.  mike.mikejsavage.co.uk. ( 0 21600 3600 43200 300 )
$INCLUDE /zones/dyn/test

and the dyn/test include section is just one line:

test 5m IN A 1.2.3.4

To update the DNS I ssh into the VPS, overwrite that dyn/test file, and reload nsd. So the zones/dyn directory needs to be writeable by whatever user, and you need a few doas entries so that user can reload nsd.

The script to update the zone file looks like this:

#! /bin/sh
ip=$(echo "$SSH_CLIENT" | cut -d " " -f 1)
echo "test 5m IN A $ip" > /var/nsd/zones/dyn/test
doas /usr/bin/touch /var/nsd/zones/dyn.mikejsavage.co.uk
doas -u _nsd /usr/sbin/nsd-control reload dyn.mikejsavage.co.uk

and the only part that's non-obvious is the touch. nsd checks file modified times to see if it needs to really reload zones, but it doesn't look at the modified times of included files. So you need to touch the main zone file or nsd won't reload it.

Setting up the router

All the computers I have sit behind a router, so they all actually have the same public IP address, which is the address the router negotiates when it boots.

To actually send packets to my computers, the router keeps a map from ports to private IPs, so packets that come in on certain ports get forwarded on to the right machine. The router automatically adds entries when my computers send packets, so things like TCP and UDP expecting replies work just fine.

But when an outside PC tries to open a connection, the router doesn't know who to forward it to and ignores it. So to make the reverse SSH tunnel work you need to explicitly add a NAT rule (and probably a firewall rule) to your router, which forwards connections to a specific port on the router to some port on some computer behind the router.

You'll need to look at your router docs for this one. Keep in mind that your router might or might not apply NAT rules before firewall rules. My ERX does NAT first, so even though I expose port X (not 22) on the router, I need to allow port 22 in the firewall.

Setting up the client

Add a cron job:

*/30 * * * * ssh mikejsavage.co.uk "/path/to/update-dyndns"

and there you have it.

Actually the title of this post is a lie. For my use case it would have been less effort to put the IP in a text file and serve it over HTTP, but this way I got to learn some networking junk.

27 Dec 2017 • Mesh generation checklist

This post contains a comprehensive checklist documenting all the steps you need to go through to generate meshes in code for use in any kind of real time or offline renderer.

Kind of guess what the code should look like. Get it wrong.
Guess again. Still wrong.
Draw for a bit.
Make a more informed guess. Still wrong.
Realise you were drawing the wrong thing. Draw some more.
Try again. It looks pretty good, but there are some random stray triangles and/or missing triangles.
Comb the code for ages to find the off by ones. This step often involves drawing.
Vow to never touch the code again.
At some point in the future, notice more random stray triangles and/or holes.
Comb the code for ages to find the off by ones.
Vow to never touch the code again.

26 Dec 2017 • Geometry clipmaps: simple terrain rendering with level of detail

Geometry clipmaps are a neat rendering technique for drawing huge terrains in real-time. They were first published in 2004, and a practical GPU implementation was described shortly after.

Those articles don't really explain everything you need to know. In particular there are some unexplained things that seem to add unnecessary complexity, but are actually crucial. The main motivation for writing this blog post is I tried to simplify those things away then ran directly into the problems they resolve which was annoying and took a lot of time.

This blog post by itself is not sufficient for you to write a complete implementation of geometry clipmaps. You should probably take a look at the paper and the GPU Gems article, and spend some time drawing to convince yourself that what I've written is correct.

Overview

The idea behind geometry clipmaps is you upload a mesh that's more finely tessellated in the middle than around the edges, draw the mesh centred on the camera, and then move the vertices of the mesh to the right height on the terrain. So you end up with more detail next to the camera, and less detail where it's so far away you couldn't see it anyway.

The mesh we submit to the GPU might look like this:

And then we set the Z coordinates in the vertex shader so it looks like this:

There are tons of other LOD techniques that achieve the same thing, but clipmaps stand out for their simplicity. You don't need to run complex decimation algorithms, you don't need to worry about stitching arbitrary meshes together, you don't need to select from discrete LOD levels at runtime, it's easy to tune the quality for different quality settings, and you don't need to send tons of data to the GPU every frame.

High level implementation details

In the original paper, they divide the terrain into a few different kinds of meshes and reuse those to draw the complete terrain. You can do it with a single mesh, but I'm not going to cover that. (see this post)

Let's start by looking at a top down view of the flat terrain geometry, with each mesh coloured according to its type:

You can see that there are several rings, or levels. Each ring has the same number of vertices as the ring inside it but is twice as large, so effectively the resolution halves with each level.

You can also see that each level mainly consists of a 4x4 grid of square tiles (the blue ones). Obviously you don't draw the inner 2x2 squares except for the innermost level.

Each level also has filler meshes that effectively split the level into a 2x2 grid of 2x2 squares (the red cross that gets fatter as you move outwards), and an L shaped trim mesh that separates each level (the green meshes). I will explain why these are required in a minute!

To draw the terrain, you pretty much centre the rings on the camera and put all the pieces in the right place and that's it. I will go into more detail on this later, but for now there is one important thing to note.

The position of each mesh needs to be snapped to be a multiple of its resolution. So if a mesh has a vertex every two units, it needs to be snapped to positions that are a multiple of two. If you don't do this, vertices move up and down as they swim over the terrain and the terrain looks like it's shimmering or waving which looks terrible.

With snapping, vertices can be added and removed as the level of detail changes, but they never move around, which is a lot less noticeable.

One consequence of this is that clipmap levels move half as fast as their inside neighbour. To prevent tiles from overlapping, you need to add a gap so each ring can encircle not only the 4x4 grid of square meshes of the level inside, but also some extra padding for the inner level to move around while the outer level remains stationary. This is why you need the filler and trim meshes! This video should show you what I mean:

I'm not drawing the trim so you can more clearly see what's going on. This video should demonstrate why we need the red filler meshes. Each side of the ring needs to be padded by one unit so there's enough room for the inner level to move around without any overlaps. (try drawing it if you don't believe me)

In the original paper the filler meshes are two squares wide and the trim meshes twice as big but still only one square wide. I have one square wide fillers and narrow trim, which is the one difference between my implementation and the paper that actually ended up working. It makes positioning trim meshes ever so slightly more difficult but that's the only difference I can think of.

That's pretty much everything. In pseudocode rendering looks something like this:

for( int l = 0; l < levels; l++ ) {
	v2 snapped = compute snapped camera position;

	for( int x = 0; x < 4; x++ ) {
		for( int y = 0; y < 4; y++ ) {
			if( innermost level or not in the middle 2x2 ) {
				draw a square tile;
			}
		}
	}

	draw this level's fillers;
	draw this level's trim;
}

and below I will fill in the blanks.

Low level implementation details

Generating the meshes

Generating meshes for each piece is pretty dry and way too much code to dump on my blog (it's like 350 lines), so for now I'm just going to point you at the repository. (let me know if that link breaks)

There are a few non-obvious things to do at this point that make rendering simpler later on.

You should generate a single mesh with the four filler pieces for a single level. If you generate one mesh and rotate it you get triangulation flips at the 90/270 degree rotations, and that causes triangles to flip as you move around which is pretty noticeable.

On the other hand, you can generate a single trim mesh and rotate that. You still get triangulation flips (in fact, you can see the flipped triangles in the top down image. Look at the vertical green strip on the left), but they don't seem to produce any noticeable artifacts.

You need to generate a cross shaped filler mesh that gets centred on the camera. It needs to be its own mesh because the arms are not separated from each other and you need the extra quad in the middle so there's no hole.

Actually rendering the meshes can be made quite simple depending on how you place them in object space. If we say the square tiles have side length of TILE_RESOLUTION, they should be placed so their bottom left vertex is at (0, 0) and their top right vertex is at (TILE_RESOLUTION + 1, TILE_RESOLUTION + 1).

Assuming you made them one unit wide too, the filler mesh and filler cross should have the bottom left of the centre quad (the normal filler mesh doesn't actually meet in the middle but imagine it does) at (0, 0) so you only need to snap and scale them into place. If you made them two units wide they should be centred on (0, 0).

The trim mesh should be positioned so that all you need to do is rotate it to put it in the right place for rendering. I start with an L shape with the bottom left vertex at the origin, then transform it down and left by TILE_RESOLUTION / 2 + 0.5 units. You'll probably want to draw this one to convince yourself that's correct, and I expect the +0.5 is only correct if your fillers are one unit wide.

Seams

If you look again at the top down image you will notice there are T-junctions at the boundaries between clipmap levels, and T-junctions mean cracks in the terrain. I've set the background colour to red in the above image to make them stand out.

We aren't totally spared from having to deal with seams, but fortunately they are pretty simple. If we draw the clipmap levels slightly pulled away from each other we can draw the triangles that we need for the seam geometry. The black lines are tile borders, the grey lines are triangle borders, and the red lines are the seam triangles.

But drawing them separately like this is actually a bit misleading. There is no gap, and the vertices at the coarser clipmap level exactly match some of the vertices at the finer level. I've drawn green lines between vertices that share the same position, and I've drawn the triangles we actually need in red. It's around one third as many, and we don't need to do anything special at the corners.

I've not drawn a complete level, but this works so long as the length of a full clipmap side is even. And they will be even, because we have four square tiles, a one wide filler tile, and a one wide trim tile. So 4x + 2, which is even. (try drawing all of it if you don't believe me)

When you generate the seam mesh you should put the bottom left corner at (0, 0) in object space.

Rendering the mesh pieces

Since each level as you move outwards is twice the size of the previous level, we can compute the scale of each level as float scale = 1 << level;. Then we want to snap the camera position to be some multiple of scale, which can be done like v2 snapped_pos = floor( camera_pos / scale ) * scale;.
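
A minimal sketch of the snapping for a single axis, in standalone C++. level_scale and snap_to_level are names made up for this example, and real code would do the same thing on a v2:

```cpp
#include <cmath>

// unit size of a clipmap level: 1, 2, 4, ...
float level_scale( unsigned level ) {
	return float( 1u << level );
}

// round pos down to the nearest multiple of the level's unit size
float snap_to_level( float pos, unsigned level ) {
	float scale = level_scale( level );
	return std::floor( pos / scale ) * scale;
}
```

With this, snap_to_level( 5.5f, 1 ) gives 4 and snap_to_level( -0.5f, 1 ) gives -2. floor rounds towards negative infinity rather than towards zero, so the grid behaves the same on both sides of the origin.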

Placing the tiles is nice and easy. You find the bottom left corner of the bottom left tile and place each tile relative to that. Don't forget about the fillers though!

There's nothing specific to D3D/GL here, but you do need to understand my rendering API. renderer_uniforms appends all of its arguments to a big uniform buffer, and returns an offset into the buffer and the size of the data. renderer_draw_mesh enqueues a draw call, and draw calls include the offsets and sizes of the uniform data they need. At the end of a frame the entire constant buffer gets copied to the GPU then the draw calls get submitted. I've written more about it in this old post if that doesn't make sense.

(checked_cast is a cast that asserts it didn't trash anything)

// this should already be filled in
struct {
	Mesh tile;
	Mesh filler;
	Mesh trim;
	Mesh cross;
	Mesh seam;
	Texture heightmap;
} clipmap;

// RenderState represents the GPU state for a draw call. this gets
// reused for brevity since some of the parameters don't change
RenderState render_state;
render_state.textures[ 0 ] = clipmap.heightmap;
render_state.uniforms[ UNIFORMS_VIEW ] = renderer_uniforms( V, P, camera_pos );

for( u32 l = 0; l < NUM_CLIPMAP_LEVELS; l++ ) {
	// scale is the unit size for this clipmap level
	// tile_size is the size of a full tile mesh
	// snapped_pos is the camera position snapped to this level's resolution
	// base is the bottom left corner of the bottom left tile
	float scale = checked_cast< float >( u32( 1 ) << l );
	v2 snapped_pos = floor( camera_pos / scale ) * scale;

	// draw tiles
	v2 tile_size = v2( checked_cast< float >( TILE_RESOLUTION << l ) );
	v2 base = snapped_pos - tile_size * 2;

	for( int x = 0; x < 4; x++ ) {
		for( int y = 0; y < 4; y++ ) {
			// draw a 4x4 set of tiles. cut out the middle 2x2 unless we're at the finest level
			if( l != 0 && ( x == 1 || x == 2 ) && ( y == 1 || y == 2 ) )
				continue;

			// add space for the filler meshes
			v2 fill = v2( x >= 2 ? 1 : 0, y >= 2 ? 1 : 0 ) * scale;
			v2 tile_bl = base + v2( x, y ) * tile_size + fill;

			render_state.uniforms[ UNIFORMS_MODEL ] = renderer_uniforms( m4_identity() );
			render_state.uniforms[ UNIFORMS_CLIPMAP ] = renderer_uniforms( tile_bl, scale );
			renderer_draw_mesh( clipmap.tile, render_state );
		}
	}
}

Next up are the filler meshes, which are also nice and easy:

// draw filler cross
{
	v2 snapped_pos = floor( camera_pos.xy() );
	render_state.uniforms[ UNIFORMS_MODEL ] = renderer_uniforms( m4_identity() );
	render_state.uniforms[ UNIFORMS_CLIPMAP ] = renderer_uniforms( snapped_pos, 1.0f );
	renderer_draw_mesh( clipmap.cross, render_state );
}

for( u32 l = 0; l < NUM_CLIPMAP_LEVELS; l++ ) {
	float scale = checked_cast< float >( u32( 1 ) << l );
	v2 snapped_pos = floor( camera_pos / scale ) * scale;

	[draw tiles]

	// draw filler
	{
		render_state.uniforms[ UNIFORMS_MODEL ] = renderer_uniforms( m4_identity() );
		render_state.uniforms[ UNIFORMS_CLIPMAP ] = renderer_uniforms( snapped_pos, scale );
		renderer_draw_mesh( clipmap.filler, render_state );
	}
}

Seams are tougher. If you remember we pad each level with a trim mesh so it fits inside the outer level, and the seam has to go around the trim too, so we need to snap the seam mesh to the outer level's resolution. The code for that looks like this:

[draw filler cross]
for( u32 l = 0; l < NUM_CLIPMAP_LEVELS; l++ ) {
	[draw tiles]
	[draw filler]

	// no need to draw a seam around the outermost clipmap level
	if( l != NUM_CLIPMAP_LEVELS - 1 ) {
		float next_scale = scale * 2.0f;
		v2 next_snapped_pos = floor( camera_pos / next_scale ) * next_scale;

		// draw seam
		{
			v2 next_base = next_snapped_pos - v2( checked_cast< float >( TILE_RESOLUTION << ( l + 1 ) ) );

			render_state.uniforms[ UNIFORMS_MODEL ] = renderer_uniforms( m4_identity() );
			render_state.uniforms[ UNIFORMS_CLIPMAP ] = renderer_uniforms( next_base, scale );
			renderer_draw_mesh( clipmap.seam, render_state );
		}
	}
}

Finally we have the trim meshes. We need to rotate them into place which is a little more complicated than what we've seen so far.

There's a neat little trick though. Let's start with the L in the bottom left (like a normal L). Then take two bits, flipping the first bit flips the mesh horizontally, and flipping the other bit flips the mesh vertically. If we draw it that looks like this:

From that we can see that they are all equivalent to rotations about the Z axis:

And the two bits can be interpreted as decimal 0 to 3. So when rendering we can figure out which flips we need, and use those bits to index into an array of rotations.

To decide which bits to set, we need to figure out where the current clipmap level is placed relative to the outer level. If the current level is in the bottom left of the hole, the trim needs to go in the top right and we set both bits. If the current level is in the top right, the trim needs to go in the bottom left and we use 00. And so on.

The logic to figure out which bits to set is a bit tricky. I do it by looking at the difference between the current level's snapped camera position and the next outer level's snapped camera position. If there's less than one unit difference between the two in both the x and y axes, the tile will be placed in the bottom left and the trim should be placed in the top right, so both bits get set. If there's a full unit of difference on an axis, the bit for that axis stays clear.
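
Pulled out into a standalone function, that decision might look something like the sketch below. Vec2 and trim_rotation_index are names made up for illustration, and it assumes the same bit layout as the trim code further down (x controls the bit worth 2, y controls the bit worth 1):

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// returns 0-3: which of the four trim orientations to use for `level`
unsigned trim_rotation_index( Vec2 camera_pos, unsigned level ) {
	float scale = float( 1u << level );
	float next_scale = scale * 2.0f;

	// camera position snapped to the next coarser level's grid
	float next_x = std::floor( camera_pos.x / next_scale ) * next_scale;
	float next_y = std::floor( camera_pos.y / next_scale ) * next_scale;

	// each component is in [0, next_scale). less than scale means this
	// level sits in the low half of the hole on that axis, so the trim
	// has to go on the opposite side and we set the bit
	float dx = camera_pos.x - next_x;
	float dy = camera_pos.y - next_y;

	unsigned r = 0;
	if( dx < scale ) r |= 2;
	if( dy < scale ) r |= 1;
	return r;
}
```

For a camera at (0.25, 0.25) and level 0 both differences are under one unit, so you get 3 and the trim ends up in the top right. At (1.5, 1.5) neither is, so you get 0 and the trim stays in the bottom left.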

All of that looks like this:

StaticArray< UniformBinding, 4 > rotation_uniforms;
rotation_uniforms[ 0 ] = renderer_uniforms( m4_identity() );
rotation_uniforms[ 1 ] = renderer_uniforms( m4_rotz270() );
rotation_uniforms[ 2 ] = renderer_uniforms( m4_rotz90() );
rotation_uniforms[ 3 ] = renderer_uniforms( m4_rotz180() );

[draw filler cross]
for( u32 l = 0; l < NUM_CLIPMAP_LEVELS; l++ ) {
	[draw tiles]
	[draw filler]

	if( l != NUM_CLIPMAP_LEVELS - 1 ) {
		float next_scale = scale * 2.0f;
		v2 next_snapped_pos = floor( camera_pos / next_scale ) * next_scale;

		[draw seam]

		// draw trim
		{
			// +0.5 because the mesh is offset by half a unit to make rotations simpler
			// and we want it to lie on the grid when we draw it
			v2 tile_centre = snapped_pos + v2( scale * 0.5f );

			v2 d = camera_pos - next_snapped_pos;
			u32 r = 0;
			r |= d.x >= scale ? 0 : 2;
			r |= d.y >= scale ? 0 : 1;

			render_state.uniforms[ UNIFORMS_MODEL ] = rotation_uniforms[ r ];
			render_state.uniforms[ UNIFORMS_CLIPMAP ] = renderer_uniforms( tile_centre, scale );
			renderer_draw_mesh( clipmap.trim, render_state );
		}
	}
}

The rotations should be exact, like m4_rotz270 should return a hardcoded matrix of zeroes and ones rather than calling some generic rotation function. I suspect it may be possible to end up with cracks in the terrain if you go through a rotation function, and it's easy to hardcode it and be sure so why not.
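
To give an idea of what hardcoded means here, this is a sketch of just the 2x2 part that acts on x and y. Mat2, Vec2 and rotate are hypothetical names, the layout (row by row) is an assumption, and a real m4_rotz270 would return a full 4x4 matrix in whatever convention your maths library uses:

```cpp
// 2x2 rotation about Z acting on (x, y), stored row by row
struct Mat2 { float m00, m01, m10, m11; };
struct Vec2 { float x, y; };

// counterclockwise rotations, written out by hand so every entry is
// exactly 0, 1 or -1
constexpr Mat2 rotz0()   { return {  1.0f,  0.0f,  0.0f,  1.0f }; }
constexpr Mat2 rotz90()  { return {  0.0f, -1.0f,  1.0f,  0.0f }; }
constexpr Mat2 rotz180() { return { -1.0f,  0.0f,  0.0f, -1.0f }; }
constexpr Mat2 rotz270() { return {  0.0f,  1.0f, -1.0f,  0.0f }; }

constexpr Vec2 rotate( Mat2 m, Vec2 v ) {
	return { m.m00 * v.x + m.m01 * v.y, m.m10 * v.x + m.m11 * v.y };
}
```

A generic rotation function would go through sin and cos of a float approximation of the angle, which generally won't give you exact zeroes, so rotated vertices can land slightly off the grid.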

Shading

The vertex shader is pretty simple. It transforms the mesh position from object space to world space (it's a bit convoluted, it could all be done with a single matrix multiply), samples the heightmap using the world space xy coordinates, then finishes transforming the mesh into clip space.

struct VSOut {
	vec4 view_position;
	vec3 world_position;
	vec2 uv;
};

uniform sampler2D heightmap;

in vec3 position;
out VSOut v2f;

void main() {
	vec2 xy = offset + ( M * vec4( position, 1.0 ) ).xy * scale;

	// +0.5 so we sample from the centre of the texel
	// it's not relevant in the vertex shader but it does affect the fragment shader
	vec2 uv = ( xy + 0.5 ) / textureSize( heightmap, 0 );

	// heightmap is BC5, see next section
	vec2 height_sample = texelFetch( heightmap, ivec2( xy ), 0 ).rg;
	float z = 256.0 * height_sample.r + height_sample.g;

	v2f.view_position = V * vec4( xy, z, 1.0 );
	v2f.world_position = vec3( xy, z );
	v2f.uv = uv;
	gl_Position = P * v2f.view_position;
}

Then the fragment shader looks something like this:

uniform sampler2D normalmap;

in VSOut v2f;
out vec4 screen_colour;

void main() {
	// decode BC5 normal
	vec2 normal_xy = texture( normalmap, v2f.uv ).xy * 2.0 - 1.0;
	float normal_z = sqrt( 1.0 - normal_xy.x * normal_xy.x - normal_xy.y * normal_xy.y );
	vec3 normal = vec3( normal_xy.x, normal_xy.y, normal_z );

	// do all your normal shading

	screen_colour = whatever;
}

The only subtlety here is that if you have a normalmap etc you should sample it in the fragment shader and not in the vertex shader. If you do it in the vertex shader you lose lots of detail and it looks horrible.

The only difference between those two islands is the left island samples the normalmap in the vertex shader, and the right island does it in the fragment shader. The geometry is exactly the same!

Storing the heightmap efficiently

Obviously the answer is "as an image" but it's a bit more subtle than that.

We need 16 bits of precision for the heightmap because 8 bits looks blocky and bad, and we would really like it to be in a GPU compressed texture format because it helps with performance and VRAM usage.

BC5 has a pair of (roughly) 8-bit channels, so we can store h / 256 in one and h % 256 in the other to get 16 bits of precision. I didn't do any scientific tests but I did play around with swapping between lossless and BC5 terrains and it seemed fine so I stuck with it.
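
As a sketch of the split itself, before any texture compression happens, it could look like this. PackedHeight, pack_height and unpack_height are made up names, and this ignores the fact that BC5 is lossy so the bytes you read back on the GPU won't be exactly these:

```cpp
#include <cstdint>

struct PackedHeight { uint8_t hi, lo; };

// split a 16-bit height into the two bytes that go in the R and G channels
PackedHeight pack_height( uint16_t h ) {
	return { uint8_t( h / 256 ), uint8_t( h % 256 ) };
}

// put the two bytes back together
uint16_t unpack_height( PackedHeight p ) {
	return uint16_t( p.hi * 256 + p.lo );
}
```

For example a height of 0x1234 goes in as hi = 0x12 and lo = 0x34, and unpack_height gets the original value back.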

For a 4k terrain it ends up using 4k x 4k x 1 byte per pixel memory, so 16MB of VRAM for the heightmap. You'll probably want a few more channels than that, probably a normal map, maybe a horizonmap and AO map for the lighting. The normalmap and horizonmap are BC5 too, the AO map can be BC4 which makes it 3.5 bytes per pixel, so 59MB total. If we expand the terrain to 8k then that's 235MB, which is probably still ok. Going beyond that is probably too much though without more cleverness.
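
The arithmetic behind those numbers, as a sketch. It assumes square textures with no mipmaps, and the function names are made up:

```cpp
#include <cstdint>

// BC5 stores a 4x4 block in 16 bytes, BC4 stores it in 8
constexpr uint64_t BC5_BITS_PER_PIXEL = 8;
constexpr uint64_t BC4_BITS_PER_PIXEL = 4;

// bytes for one square texture without mipmaps
constexpr uint64_t texture_bytes( uint64_t side, uint64_t bits_per_pixel ) {
	return side * side * bits_per_pixel / 8;
}

// heightmap + normalmap + horizonmap in BC5, AO map in BC4
constexpr uint64_t terrain_bytes( uint64_t side ) {
	return 3 * texture_bytes( side, BC5_BITS_PER_PIXEL ) + texture_bytes( side, BC4_BITS_PER_PIXEL );
}
```

terrain_bytes( 4096 ) comes out as 58,720,256 bytes and terrain_bytes( 8192 ) as 234,881,024 bytes, which is where the 59MB and 235MB figures come from.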

We need to be able to decode the terrain image on the CPU too so we can use it for things like collision detection. BC5 should be simple to decode, and indeed the code for it is simple, but my decoder doesn't exactly match the GPU's! If anyone can see the problem please email me!

Extras

Skirts

It's pretty normal for games to be set on an island in the middle of an infinite ocean, because it neatly sidesteps the "invisible walls are immersion breaking" problem.

To help with the illusion, we want to really draw ocean all the way to the horizon. So we need some extra skirt geometry around the coarsest level clipmap. We could add more clipmap levels but that's wasting triangles since all of them will sample Z = 0.

Generating the mesh is pretty simple. You know how large the coarsest level clipmap is and how many vertices go along each edge, so you make a square that fits around the entire terrain and add triangles fanning out from it to some vertices arbitrarily far away.

There are tricks you can do to project vertices to the far plane, but I couldn't figure out how to make fog work with that so I just put a lot of zeroes. If anyone knows how to do this properly please get in touch!

Empty tiles

Continuing with the island idea, you probably want to be able to stand on one side of the island and have the clipmaps reach all the way to the other side. Which implies that they extend that far in the other direction too, meaning you have a lot of vertices over the ocean and outside the terrain.

It's easy enough to detect when a tile lies fully outside the world and swap in a simpler mesh. The only requirement is that it needs to have full resolution at the sides so we don't get T-junctions. A triangle fan works, and the code for swapping meshes at render time looks like this:

// draw a low poly tile if the tile is entirely outside the world
// tl = top left, br = bottom right
v2 tile_br = tile_tl + tile_size;
bool inside = true;
if( !intervals_overlap( tile_tl.x, tile_br.x, 0, clipmap.heightmap.w ) )
	inside = false;
if( !intervals_overlap( tile_tl.y, tile_br.y, 0, clipmap.heightmap.h ) )
	inside = false;
if( !inside )
	use the simpler mesh;

Make sure to copy intervals_overlap from the ryg blog!
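
For reference, the usual test for closed intervals is a one-liner. This is a sketch from memory rather than a copy of that post, and it assumes each interval is passed with its smaller end first:

```cpp
// true if the closed intervals [a0, a1] and [b0, b1] share at least one point
// assumes a0 <= a1 and b0 <= b1
bool intervals_overlap( float a0, float a1, float b0, float b1 ) {
	return a0 <= b1 && b0 <= a1;
}
```

Intervals that only touch at an endpoint count as overlapping here, so a tile that sits exactly on the edge of the world still gets the full mesh.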

Geomorphing

Even though the differences between adjacent clipmap levels are pretty small, the LOD transitions can be noticeable in some cases. The idea here is that you blend between clipmap levels as you get close to the clipmap boundary.

I haven't implemented it so I have nothing to say here, but if I ever get round to it I shall update this section.

Conclusion

You should now have the knowledge to get started on an implementation of geometry clipmaps without running into the problems that I did.

It's not a huge amount of code (less than 1 KLOC), but the implementation is tricky in some places, and especially mesh generation is an absolute slog to get right.

Here's some other links I looked at while writing this:

The original paper
The GPU Gems article on GPU clipmaps
This terrain rendering project uses a single mesh instead of dividing it into parts which is probably simpler
Crest ocean renderer. Uses clipmaps to render an ocean
The Witcher 3's landscape presentation. Clipmaps in a real game!
This gamedev.net post. You might find this on Google and it's pretty crappy. It draws tiles that overlap and nothing else

2 Dec 2017 • C++ tricks: macro to disable optimisations

This is pretty simple so here you go:

#if COMPILER_MSVC
#  define DISABLE_OPTIMISATIONS() __pragma( optimize( "", off ) )
#  define ENABLE_OPTIMISATIONS() __pragma( optimize( "", on ) )
#elif COMPILER_GCC
#  define DISABLE_OPTIMISATIONS() \
        _Pragma( "GCC push_options" ) \
        _Pragma( "GCC optimize (\"O0\")" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "GCC pop_options" )
#elif COMPILER_CLANG
#  define DISABLE_OPTIMISATIONS() _Pragma( "clang optimize off" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "clang optimize on" )
#else
#  error new compiler
#endif

I use it in ggformat to disable optimisations on the bits that use variadic templates. For some reason variadic templates generate completely ridiculous amounts of object code and making the compiler slog through that takes a while. Not a big deal though, printing is slow anyway so disabling optimisations isn't a huge issue.

It's also handy for debugging code that's too slow in debug mode. You put DISABLE_OPTIMISATIONS around the code you want to step through, and leave everything else optimised.
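
A usage sketch, with the compiler detection folded into the macros so it stands alone. sum_to and sum_to_fast are hypothetical functions, and unlike the real thing this version silently does nothing on unknown compilers instead of erroring:

```cpp
// same macros as above, keyed directly off the compiler's own defines
#if defined( _MSC_VER )
#  define DISABLE_OPTIMISATIONS() __pragma( optimize( "", off ) )
#  define ENABLE_OPTIMISATIONS() __pragma( optimize( "", on ) )
#elif defined( __clang__ )
#  define DISABLE_OPTIMISATIONS() _Pragma( "clang optimize off" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "clang optimize on" )
#elif defined( __GNUC__ )
#  define DISABLE_OPTIMISATIONS() \
	_Pragma( "GCC push_options" ) \
	_Pragma( "GCC optimize (\"O0\")" )
#  define ENABLE_OPTIMISATIONS() _Pragma( "GCC pop_options" )
#else
#  define DISABLE_OPTIMISATIONS()
#  define ENABLE_OPTIMISATIONS()
#endif

// the function we want to step through in a debugger
DISABLE_OPTIMISATIONS()
int sum_to( int n ) {
	int total = 0;
	for( int i = 1; i <= n; i++ ) {
		total += i;
	}
	return total;
}
ENABLE_OPTIMISATIONS()

// everything from here on is optimised as normal
int sum_to_fast( int n ) {
	return n * ( n + 1 ) / 2;
}
```

The macros go at file scope around whole functions, not around individual statements inside a function.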

BTW if you don't have COMPILER_MSVC etc macros, they look like this:

#if defined( _MSC_VER )
#  define COMPILER_MSVC 1
#elif defined( __clang__ )
#  define COMPILER_CLANG 1
#elif defined( __GNUC__ )
#  define COMPILER_GCC 1
#else
#  error new compiler
#endif

30 Nov 2017 • Preprocessor madness 2

More preprocessor craziness:

#define A 1
// do not define B
#if A == B
#endif
int main() { return 0; }

This shows no warnings with -Wall -Wextra. You have to go all the way to -Weverything, which turns on -Wundef, to get "warning: "B" is not defined, evaluates to 0".

This doesn't seem like a big deal but we got bit by it. A colleague renamed the OSX platform define to MACOS, then presumably he went down all the compile errors and fixed them. Unfortunately we had something like:

#if WINDOWS
#include "windows_semaphore.h"
#elif OSX
#include "osx_semaphore.h"
#else
#include "posix_semaphore.h"
#endif

which doesn't throw any errors if you change the define to MACOS because OSX defines the POSIX semaphore interface. It is broken though, because all the sem_ functions return "not implemented" errors at runtime!

It's very frustrating that the most basic building blocks we have in programming are full of shitty shit like this. I have no suggestions on how to improve things or any interesting commentary to add beyond that.

17 Nov 2017 • RSS feed

Someone asked for one, ole hyvä.

Email me if it doesn't work in your feed reader!

17 Nov 2017 • Deadlock

In a bounded lock-free multi-producer queue, pushing an element takes two steps. First you need to acquire a node for writing to avoid races with other producers, then you need to flag the node as fully written once you're done with it. Then to pop from the queue you check if the head node has been fully written, then try to acquire it if you're multi-consumer too.

(Look here or here for the details.)

One issue to note is that dequeues can fail, even though the queue is non-empty, in the sense that items have been successfully enqueued and not yet dequeued. Specifically this can happen:

Thread P1 acquires a node for writing, and is descheduled.
Thread P2 acquires a node for writing, then writes and releases it.
Thread C checks if the head node is ready to read, but it isn't and returns failure.
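
Here is a single-threaded model of that interleaving. TwoStepQueue is a toy made up for illustration: it has no atomics and doesn't check for a full queue, it only makes the two-step push explicit so you can play the steps through by hand:

```cpp
#include <stddef.h>

struct TwoStepQueue {
	struct Node {
		int value = 0;
		bool ready = false;
	};

	static constexpr size_t CAPACITY = 8;
	Node nodes[ CAPACITY ];
	size_t head = 0; // next node to pop
	size_t tail = 0; // next node to hand out to a producer

	// step one of a push: reserve a node so other producers can't take it
	size_t acquire() {
		return tail++ % CAPACITY;
	}

	// step two of a push: write the value and flag the node as fully written
	void commit( size_t slot, int value ) {
		nodes[ slot ].value = value;
		nodes[ slot ].ready = true;
	}

	// fails if the head node hasn't been committed yet, even if later nodes have
	bool pop( int * out ) {
		Node & node = nodes[ head % CAPACITY ];
		if( !node.ready )
			return false;
		*out = node.value;
		node.ready = false;
		head++;
		return true;
	}
};
```

P1 calls acquire() and stops, P2 calls acquire() and commit(), and now pop() returns false even though P2's push has completed. Once P1 commits, two pops succeed and return P1's item followed by P2's.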

It makes sense to guard such a queue with a semaphore. An obvious way to do that is to make the semaphore count how many elements are in the queue. You can't do that with a queue like this! You might write something like:

// producer
if( q.push( x ) )
	sem_post( &sem );

// consumer
while( true ) {
	sem_wait( &sem );
	T y;
	if( q.pop( &y ) ) {
		// do things with y
	}
}

but that's incorrect because you decrement sem even when you haven't dequeued anything, and the item pushed by P2 gets lost. So instead you might try:

// producer
if( q.push( x ) )
	sem_post( &sem );

// consumer
while( true ) {
	T y;
	if( q.pop( &y ) ) {
		// do things with y
		sem_wait( &sem );
	}
}

but then you're spinning while you wait for P1 to finish.

We ran into a particularly nasty instance of the first case of this at work. We have a queue which accepts various commands, and one of them is a "flush" command where the sender goes to sleep and the consumer wakes them up again (with an Event) once the flush is done. So something like this happened:

Thread C waits on queueSem.
Thread P1 acquires a node for writing, and is descheduled.
Thread P2 pushes a flush command, signals queueSem, and waits on flushEvent.
Thread C wakes up, checks if the head node is ready to read, but it isn't and goes back to sleep.
Thread P1 wakes up and completes its write, signals queueSem, and exits.
Thread C wakes up, processes P1's command, and waits on queueSem again.
P2 is waiting for C to signal flushEvent, C is waiting for P2 to signal queueSem, and we're toast.

The fix is to make queueSem count the (negative) number of consumer threads that are asleep. The change is very simple:

// consumer
while( true ) {
	T y;
	if( q.pop( &y ) ) {
		// do things with y
	}
	else {
		sem_wait( &sem );
	}
}

Which avoids the deadlock thusly:

Thread C sees the queue is empty and waits on queueSem.
Thread P1 acquires a node for writing, and is descheduled.
Thread P2 pushes a flush command, signals queueSem, and waits on flushEvent.
Thread C wakes up, checks if the head node is ready to read, but it isn't and goes back to sleep.
Thread P1 wakes up and completes its write, signals queueSem, and exits.
Thread C wakes up, processes commands until dequeue fails. This includes the flush, so C signals flushEvent, then waits on queueSem again.
Thread P2 wakes up and proceeds normally.

I'm not totally satisfied with that solution because I feel like I've misunderstood something more fundamental, something that will stop me running into similar problems in the future. Please email me if you know!

All in all writing an MPSC lock-free queue has been an enormous waste of time courtesy of shit like this. We need to enqueue things from a signal handler, which means locks and malloc are gone. Even so, since we're single-consumer and we only push in signal handlers, I believe we could have used a mutex to avoid races between producers and kept the consumer lock-free. I wasn't sure if you could get signalled while still inside a signal handler. FYI the answer is no, unless you set SA_NODEFER. I also don't know if pthread_mutex_lock etc are signal-safe, and of course if you try to Google it you just get pages of trash about "you can deadlock if the thread holding the lock gets signalled!!!!!". Presumably they are but I didn't want to risk it.

14 Nov 2017 • Preprocessor madness

This code (or similar) compiles and runs with every compiler I tested but one:

#define A( x ) 1
int main() {
	return A(); // -> "return 1;"
}

You can even use x in the macro body and it's fine:

#define A( x ) x + 1
int main() {
	return A(); // -> "return + 1;"
}

The only compiler that does the right thing and rejects this (if the spec says this is ok then the spec is fucked) is of course the AMD shader compiler.

3 Nov 2017 • C++ tricks: sized array arguments

In C if you write a function void f( char x[ 4 ] ); then the compiler ignores the 4 and treats it as char * x. This has two problems: sizeof( x ) gives you sizeof( char * ) and not 4 * sizeof( char ) (GCC does warn about this), and the compiler doesn't complain if you pass in an array that's too small.

In C++ you can write void f( char ( &x )[ 4 ] ); instead and it works.

A code example:

void f( char x[ 4 ] ) {
	// warning: 'sizeof' on array function parameter 'x' will return size of 'char*'
	printf( "%zu\n", sizeof( x ) ); // prints 8
}

void g( char ( &x )[ 4 ] ) {
	printf( "%zu\n", sizeof( x ) ); // prints 4
}

int main() {
	char char3[ 3 ] = { };
	char char4[ 4 ] = { };
	char * charp = NULL;

	f( char3 ); // fine
	f( char4 );
	f( charp ); // fine

	g( char3 ); // error: invalid initialization of reference of type 'char (&)[4]' from expression of type 'char [3]'
	g( char4 );
	g( charp ); // error: invalid initialization of reference of type 'char (&)[4]' from expression of type 'char*'

	return 0;
}
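
A related variation, not part of the example above: if you make the size a template parameter the compiler deduces it for you. array_count is a name made up for this sketch:

```cpp
#include <stddef.h>

// N is deduced from the argument, so this accepts char arrays of any
// size but still rejects plain pointers
template< size_t N >
constexpr size_t array_count( char ( & )[ N ] ) {
	return N;
}
```

With the arrays from the example, array_count( char3 ) is 3 and array_count( char4 ) is 4, and array_count( charp ) fails to compile because there is no N to deduce from a char *.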

1 Nov 2017 • Linux vs BSD in a man page

man gethostname on Linux:

The returned name shall be null-terminated, except that if namelen is
an insufficient length to hold the host name, then the returned name
shall be truncated and it is unspecified whether the returned name is
null-terminated.

and

RETURN VALUE
	Upon successful completion, 0 shall be returned; otherwise, −1 shall be returned.

ERRORS
	No errors are defined.

And on OpenBSD:

The returned name is always NUL terminated.

and

ERRORS
     The following errors may be returned by these calls:

     [EFAULT]           The name parameter gave an invalid address.

     [ENOMEM]           The namelen parameter was zero.

     [EPERM]            The caller tried to set the hostname and was not the
                        superuser.

31 Oct 2017 • Monocypher is excellent

Monocypher is by far the best C/C++ crypto library, probably by far the best crypto library full stop.

It's a single pair of .c/.h files. The interface has easy to use implementations of sensible primitives and algorithms. The manual is absolutely wonderful, with really clear descriptions of what each function does and what guarantees they provide.

ATM I'm using it in Medfall to sign updates. I sign a manifest that lists all the game files and their hashes, and the public key is hardcoded in the client. It's less than 100 lines of code for everything: the keygen and signing utilities, and the client side verification code.

If you rip out arc4random from the portable LibreSSL and pair it with monocypher you have everything you need to make an encrypted game networking protocol. Something like:

Client and server have hard coded signing keys, and know each other's public keys.
When opening a connection, both ends generate a new random x25519 key pair, sign the x25519 public key with their hardcoded signing key, send the signature and x25519 key to each other, and verify them with the hardcoded public signing keys.
Do crypto_key_exchange and use that to encrypt messages to each other.

The client hasn't really proven its identity to the server because you have to ship the same private key with every client so it's easy to fake, but that's not a big deal. On the other hand the client does know it's talking to the correct server, so you don't have to worry about sending your login credentials to random hackers.

I don't think you need to care about replay attacks here. To impersonate the server, you would need to take a signed x25519 public key and crack the secret key and that should be impossible. But you can stick a (signed) timestamp in there if you want. (doesn't totally mitigate it but you can reduce the time they have to crack a key to like a few seconds)

29 Oct 2017 • GL_FRAMEBUFFER_SRGB sucks

2023 addendum: this post is very bad, you need to mark your framebuffers as sRGB so alpha blending works correctly

I replaced GL_FRAMEBUFFER_SRGB with explicit linear-to-sRGB conversions in my shaders.

It's a little bit more code but having the extra control is worth it. The big wins are a UI that looks like it does in image editors, and being able to easily turn off sRGB for certain debug visualisations.

Some GLSL for myself to copy paste into future projects:

float linear_to_srgb( float linear ) {
	if( linear <= 0.0031308 )
		return 12.92 * linear;
	return 1.055 * pow( linear, 1.0 / 2.4 ) - 0.055;
}

vec3 linear_to_srgb( vec3 linear ) {
	return vec3( linear_to_srgb( linear.r ), linear_to_srgb( linear.g ), linear_to_srgb( linear.b ) );
}

vec4 linear_to_srgb( vec4 linear ) {
	return vec4( linear_to_srgb( linear.rgb ), linear.a );
}

24 Oct 2017 • Roadblocks to releasing Medfall on macOS

Releasing software for OSX is annoying.

I kind of want to go down the "here is a totally unsupported and untested package" route, just because I can and it would probably be fine. I don't want to have to boot into OSX and actually test it as part of my release process because it's already annoying having to do both Windows and Linux and I can do both at the same time with my desktop/laptop.

But even that is a nightmare, because OSX is extremely hard to get working in a VM and I can't do it. Also it's illegal.

So you can cross compile. Clang supports cross compilation out of the box so it's less annoying than you might expect, but you still have to get all the headers/libs from an OSX machine, and then you have to install the LLVM linker which for some reason is packaged separately, and then that might be illegal too but I haven't read the EULA.

Building an installer package is more difficult. .pkgs use a stupid format called XAR (I don't know if it's actually bad but WTF even if it was good you have to use zip or tar because everything supports those and does not support xar) so you have to download some Github project (maybe 7z can do it but it gives a scary warning trying to list the .pkg I built on OSX).

Inside the .pkg there are three files. There's a file called PackageInfo, which is some XML and looks easy enough to generate. There's a file called Bom which is some binary manifest, and you need to download another Github project to generate that. The last file is called Payload, which is another stupid archive format (cpio, which 7z seems ok with) + gzip, and contains the folder you pointed pkgbuild at.

I feel like I could probably get cross compilation and the .pkg stuff working but it would take ages and be boring so I'm not going to.

ADDENDUM: apparently you can jailbreak Apple TVs. Would be pretty funny to run OSX builds (it's cross-arch but not cross-OS so it's probably simpler) on the Apple equivalent of a Raspberry Pi. Of course I haven't tried it and I'm not going to but it amuses me that it might be possible.

The other major annoyance is that OSX doesn't support GL past 4.1, which means you miss out on:

Compute shaders
BC6 and BC7
glTexStorage2D, which lets you allocate immutable textures with space for mipmaps in a single call
Persistent mapping
Multidraw indirect
Explicit uniform locations
Clip control
DSA
Debug mode (lol)
Probably other stuff that I don't know is awesome

I'm actually only really upset about losing clip control, because you need that to do massive draw distances. The rest I can either live without (compute, BPTC, explicit uniforms) or are pretty simple to conditionally support, so it's not the end of the world but it is annoying. I also don't really understand why Apple wants to push Metal like this, if I ever write a second render backend it's obviously going to be D3D and not Metal.

24 Oct 2017 • Vim: peek definition

Visual Studio has this awesome thing called "Peek Definition", which lets you open a temporary window that shows you the definition of whatever you wanted to look at. That link has a screenshot so you can see what I mean.

I have something similar but crappier working in Vim. You need to go through the stupid ctags hoops, ATM I am using vim-gutentags which actually seems pretty good. Then you can do something like

map <C-]> :vsplit<CR>:execute "tag " . expand( "<cword>" )<CR>zz<C-w>p

and then press C-] to get an unfocused vsplit at the definition of whatever the cursor was over.

It's not as good because if it can't find the tag you get a vsplit of whatever you were looking at. It totally fails on member functions (like size) because ctags doesn't understand C++. With functions it asks you whether you want to jump to the body or the prototype, when you probably always just want to jump to the prototype. That last one might be fixable if I tweak the flags to ctags but meh.

18 Oct 2017 • OpenSMTPD is excellent

Addendum: this post has a sequel

Whenever I have to interact with email software that isn't OpenSMTPD I'm just so appalled by how shitty it is. Except maybe rspamd. Email software just seems to follow the 1980s Unix philosophy of "do one thing and completely suck dick at it".

My entire config looks like this:

pki mikejsavage.co.uk certificate "/etc/ssl/mikejsavage.co.uk.fullchain.pem"
pki mikejsavage.co.uk key "/etc/ssl/private/mikejsavage.co.uk.key"

listen on lo
listen on lo port 10028 tag DKIM
listen on egress tls pki mikejsavage.co.uk
listen on egress port 587 tls-require pki mikejsavage.co.uk auth

accept from any for local virtual { "@" => mike } deliver to mda "rspamc --mime --ucl --exec /usr/local/bin/dovecot-lda-mike" as mike
accept from local tagged DKIM for any relay
accept from local for any relay via smtp://127.0.0.1:10027

That's 9 lines of config. DKIMProxy's config is 8 lines. Dovecot's config is 2453 lines split across 34 files. WTF how can you suck so much? DKIMProxy's only job is to add a header to outgoing emails. Dovecot is probably of similar complexity to smtpd but has two orders of magnitude more config. I have a bunch of spamd/bgpd garbage lying around too and I have no idea if it does anything. Nuts.

pop3d looks extremely good, like this is how Dovecot should be, but it's POP and POP is useless. God damn. It's made me pretty tempted to do an imapd though. I'd have to keep Dovecot around for the MDA until smtpd gets filters, but after that I could drop everything but smtpd/rspamd/imapd and be happy.

The gmail spam filter does an extremely bad job of dealing with modern spam. The spam of old times is pretty much solved, the spam I got on this domain before I set up rspamd was all obviously fake invoices and Russian dating websites, really simple to filter out.

Modern spam is also trivial to spot, but the people spamming buy ads from Google so they'll never block it. I mean crap like newsletters after you explicitly unchecked the box that allows them to send you junk, the biweekly terms and conditions updates from shit startups, etc. If you blocked any email that contains an unsubscribe link or the phrase "Terms and Conditions" you would catch 100% of it with nearly no false positives. It's so easy but they won't do it.

Amusingly the gmail spam filter does a perfect job on my work inbox:

Change Management Training - Change Management Training in Paris, France Helsinki
Mike - Invitation to discuss how ex-Google/McKinsey team is replacing HR with bots
PMP Certification Workshop - PMP Certification Training in Helsinki, Finland
Data for Breakfast - Join us in Stockholm
PMP Certification Training
Data warehousing: Let the past tell your future
Webinar: Making Today's Data Rapidly Consumable
One-day Agile & Scrum training - Agile & Scrum training in Helsinki, Finland

15 Oct 2017 • Not even not upgrading can save me

A bunch of shit in Firefox has been breaking for me lately. :open in Vimperator has entirely stopped working on my laptop. Not even restarting helps.

The NoScript buttons in the tab bar randomly disappear and I have to restart.

Firefox randomly can't open web pages unless I try again. I thought it might have been a router problem or something but no other piece of software on my PC has this problem.

FFS I don't upgrade software so I can avoid garbage like this, but apparently not even that helps.

Actually I did upgrade this server to 6.2 today. I noticed it had 186 days of uptime before I took it down, so it's been alive since I did the update to 6.1. None of the long-running software I use has crashed a single time during that period. Upgrading probably took less than five minutes and everything came up and worked first time. Why can't all software be OpenBSD?

Couple of updates a few days later: my RSS cronjob is hanging for some unknown reason and I needed to reinstall rspamd (I think I built it myself before). Also the Firefox failures have spread to my desktop.

15 Oct 2017 • Optimising vs expanding to fill all available resources

Parallelising code does not make it faster.

You actually run slightly slower, because you have to deal with the overhead of dispatch and context switches and expensive futex calls. But we do it anyway because it makes code run in less time. So you trade CPU time for wall clock time. Or throughput for latency.

In games, people use thread pools to go wide when they have lots of the same work that must be done to get a frame out. Things like culling, broad-phase collision detection, skinning, etc.

It's not immediately obvious that high framerate corresponds to low latency rather than high throughput. If you think of a frame as taking the inputs at the start of the frame, like the state of the world last frame and player/network inputs, and then producing the next frame as output, it kind of makes sense. You're reducing the latency between receiving the inputs and spewing the output.

It's also really surprising how little you benefit from using multiple threads. A typical desktop PC has 2 or 4 cores. The Steam hardware survey says that's 95% of the market (the gamer market even!), so at best you're looking at less than a 4x speedup.
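To put rough numbers on that, Amdahl's law gives the best case speedup when only part of the work can go wide. This is just a sketch: amdahl_speedup is a made-up helper and the 80% figure is an illustrative assumption, not a measurement from any real frame.

```cpp
// Upper bound on speedup from Amdahl's law.
// parallel_fraction: share of the frame that can be spread across cores (0 to 1)
// cores: how many cores the parallel part gets to run on
constexpr double amdahl_speedup( double parallel_fraction, int cores ) {
	return 1.0 / ( ( 1.0 - parallel_fraction ) + parallel_fraction / cores );
}

// assumed for illustration: 80% of the frame is on the thread pool
constexpr double speedup_2_cores = amdahl_speedup( 0.8, 2 ); // about 1.67x
constexpr double speedup_4_cores = amdahl_speedup( 0.8, 4 ); // 2.5x
```

So even with most of the frame on the thread pool, 4 cores get you 2.5x at best, and that's before paying for dispatch and context switches.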

That's a bad habit I need to break. When something needs optimising one of the first things that comes to mind is "put it on the thread pool". On one hand it's easy (ADDENDUM: not gonna edit this out but of course threading is not easy), on the other it's junk speedup and other optimisation methods are not a huge amount harder. Parallelising my code should probably be the last optimisation I make!

Anyway I was thinking about this because of all the Firefox talk about having one thread per tab and GPU text rendering and GPU compositing and etc. Ok Firefox runs in less wall clock time because it has 4x more resources, but now my whole PC runs like trash. The reason multi-core CPUs were such a huge upgrade when they first came out was that shit apps didn't make your PC unusable anymore! But now the shit apps are becoming parallelised, we're going back to the bad old times.

The GPU stuff isn't in yet but I'm looking forward to the "we made our code 10x slower but put it on hardware that's 100x faster!!" post, swiftly followed by having to close my web browser whenever I want to play games.

7 Oct 2017 • Code for my intro to raytracing talk

I gave a talk about the basics of raytracing for the Catz Computer Science Society a while ago. I was drawing on my wacom so there are no slides and nobody recorded it, but the code is on Github and I'm still quite pleased with how simple it ended up being.

Feel free to use it for whatever.

7 Oct 2017 • C++ tricks: autogdb

One of the nice things about developing on Windows is that if your code crashes in debug mode, you get a popup asking if you want to break into the debugger, even if you ran it normally.

With some crap hacks we can achieve something pretty similar for Linux:

#pragma once

#include <sys/ptrace.h>
#include <sys/wait.h>

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <err.h>
#include <errno.h>

static void pause_forever( int signal ) {
	while( true ) {
		pause();
	}
}

static void uninstall_debug_signal_handlers() {
	signal( SIGINT, SIG_IGN );
	signal( SIGILL, pause_forever );
	signal( SIGTRAP, SIG_IGN );
	signal( SIGABRT, pause_forever );
	signal( SIGSEGV, pause_forever );
}

static void reset_debug_signal_handlers() {
	signal( SIGINT, SIG_DFL );
	signal( SIGILL, SIG_DFL );
	signal( SIGTRAP, SIG_DFL );
	signal( SIGABRT, SIG_DFL );
	signal( SIGSEGV, SIG_DFL );
}

static void prompt_to_run_gdb( int signal ) {
	uninstall_debug_signal_handlers();

	const char * signal_names[ NSIG ];
	signal_names[ SIGINT ] = "SIGINT";
	signal_names[ SIGILL ] = "SIGILL";
	signal_names[ SIGTRAP ] = "SIGTRAP";
	signal_names[ SIGABRT ] = "SIGABRT";
	signal_names[ SIGSEGV ] = "SIGSEGV";

	char crashed_pid[ 16 ];
	snprintf( crashed_pid, sizeof( crashed_pid ), "%d", getpid() );
	fprintf( stderr, "\nPID %s received %s. Debug? (y/n)\n", crashed_pid, signal_names[ signal ] );

	char buf[ 2 ];
	read( STDIN_FILENO, &buf, sizeof( buf ) );
	if( buf[ 0 ] != 'y' ) {
		exit( 1 );
	}

	// fork off and run gdb
	pid_t child_pid = fork();
	if( child_pid == -1 ) {
		err( 1, "fork" );
	}
	reset_debug_signal_handlers();

	if( child_pid == 0 ) {
		execlp( "cgdb", "cgdb", "--", "-q", "-p", crashed_pid, ( char * ) 0 );
		execlp( "gdb", "gdb", "-q", "-p", crashed_pid, ( char * ) 0 );
		err( 1, "execlp" );
	}

	if( signal != SIGINT && signal != SIGTRAP ) {
		waitpid( child_pid, NULL, 0 );
		exit( 1 );
	}
}

static bool being_debugged() {
	pid_t parent_pid = getpid();
	pid_t child_pid = fork();
	if( child_pid == -1 ) {
		err( 1, "fork" );
	}

	if( child_pid == 0 ) {
		// if we can't ptrace the parent then gdb is already there
		if( ptrace( PTRACE_ATTACH, parent_pid, NULL, NULL ) != 0 ) {
			if( errno == EPERM ) {
				printf( "! echo 0 > /proc/sys/kernel/yama/ptrace_scope\n" );
				printf( "! or\n" );
				printf( "! sysctl kernel.yama.ptrace_scope=0\n" );
			}
			exit( 1 );
		}

		// PTRACE_ATTACH stops the parent, so wait for that stop to be reported
		waitpid( parent_pid, NULL, 0 );

		// PTRACE_DETACH also resumes the parent, so no separate PTRACE_CONT is needed
		ptrace( PTRACE_DETACH, parent_pid, NULL, NULL );
		exit( 0 );
	}

	int status;
	waitpid( child_pid, &status, 0 );
	if( !WIFEXITED( status ) ) {
		err( 1, "WIFEXITED" );
	}

	return WEXITSTATUS( status ) == 1;
}

static void install_debug_signal_handlers( bool debug_on_sigint ) {
	if( being_debugged() ) return;

	if( debug_on_sigint ) {
		signal( SIGINT, prompt_to_run_gdb );
	}
	signal( SIGILL, prompt_to_run_gdb );
	signal( SIGTRAP, prompt_to_run_gdb );
	signal( SIGABRT, prompt_to_run_gdb );
	signal( SIGSEGV, prompt_to_run_gdb );
}

Include that somewhere in your code and stuff #if PLATFORM_LINUX install_debug_signal_handlers( true ); #endif at the top of main. Then when your program crashes you will get a prompt like PID 19418 received SIGINT. Debug? (y/n).

GDB often crashes and if you break with ctrl+c you can get problems when you quit GDB, but when it does work it's nice and it's definitely better than nothing.

BTW I wrote this ages ago and I can't remember many of the details around signal handling so don't ask me.

3 Oct 2017 • More installer junk

This is a bit of a followup to my last post about installers.

Turns out the "append /$MultiUser.InstallMode to the UninstallString" trick doesn't really work.

The problem is that when the user navigates to the folder with the uninstaller and double clicks it, as opposed to going through control panel, it will not have the command line switch and therefore will try to uninstall whatever you set as the default mode.

Coming up with a fix for this has been quite frustrating. A part of me wants to say "if the user wants to shoot themselves in the foot I can't do anything about it", but running the uninstaller manually seems pretty innocent to me, and we are the ones that lose money when our product doesn't work.

It's difficult because we are installing a plugin for some third-party software, so we have two installation folders and both of them move depending on whether it's a machine-wide or single user installation. If we only had our folder then we could just nuke whatever folder the uninstaller is in, but we need to locate the second folder too.

One suggestion was to remove the option for machine-wide installations. This would allow us to simplify the installer config (which is already pretty simple but christ I really hate putting any logic at all in these shitty crippled non-languages) but some of our clients have slow IT and requiring them to install the plugin for all of their users separately is a no go.

Another idea I tried was to look at whether the uninstaller exe was located under Program Files. It's ok but I don't think we can totally rule out users moving the installation folder somewhere else, like to another drive or something.

So I finally settled on writing an installmode.txt next to the uninstaller. It's robust against moving the folder around and running the uninstaller directly, but the user can still go and delete the txt file if they really want to, or the installer can fail to write it, etc.

I still don't like my solution because I really hate writing code that can fail. It's a huge relief when you can write a bit of code, no matter how trivial, and know that it can never go wrong. In this case I don't really have a choice because Windows doesn't provide a robust way to install software. (installers are literally just self extracting zips that also write registry keys)

It's especially upsetting because this code is going to be shipped to our non-technical customers. I dread the day when someone comes in with an insane installation problem, all of our suggestions take weeks to test and are expensive for the customer because they have to go through their outsourced IT, and then they either burn out and give up or we simply can't figure it out. Huge huge waste of time for everyone involved, and we lose the sale.

Someone noted a funny issue with the installer. If you did both a machine-wide and a single user installation at the same time, the W10 settings app would merge them together into a single entry and you couldn't choose which one to remove. They had the same name in control panel at that point too but at least both entries were there. The fix for that is to give them different registry key names. So something like HKLM\...\Uninstall\OurSoftwareAllUsers\UninstallString and HKCU\...\Uninstall\OurSoftwareCurrentUser\UninstallString. Or SHCTX\...\Uninstall\OurSoftware$MultiUser.InstallMode. And I guess give them different DisplayNames too so you can distinguish them.

1 Oct 2017 • Really finishing the job

This is a followup to Rust performance: finishing the job.

Outsmarted by a crustacean

Over on lobste.rs, pbsd points out that my approach is bad and mentions a very neat trick. Subtracting 255 is equivalent to adding one when using u8, so you can keep 16 counters in a register and increment them by subtracting the mask returned by _mm_cmpeq_epi8. You have to stop every 255 chunks to make sure the counters don't overflow, but other than that it's quite simple. The hot loop becomes:

__m128i needles = _mm_set1_epi8( needle );
while( haystack < one_past_end - 16 ) {
	__m128i counts = _mm_setzero_si128();

	for( int i = 0; i < 255; i++ ) {
		if( haystack >= one_past_end - 16 ) {
			break;
		}

		__m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
		__m128i cmp = _mm_cmpeq_epi8( needles, chunk );
		counts = _mm_sub_epi8( counts, cmp );

		haystack += 16;
	}

	__m128i sums = _mm_sad_epu8( counts, _mm_setzero_si128() );
	u16 sums_[ 8 ];
	_mm_storeu_si128( ( __m128i * ) sums_, sums );
	c += sums_[ 0 ] + sums_[ 4 ];
}

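Another neat trick is that we can use _mm_sad_epu8 to add up the 16 counts at the end. It sums each half of the register separately, which is why the code adds sums_[ 0 ] and sums_[ 4 ]. It's slightly faster than storing the counts to u8[ 16 ] and summing them normally.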
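For something you can paste and run, here's a self-contained sketch of the same idea. It is not the benchmarked code: count_bytes and count_scalar are made-up names, it uses unaligned loads so any pointer works, it handles the leftover bytes with a scalar loop, and it falls back to scalar entirely when SSE2 isn't available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined( __SSE2__ ) || defined( _M_X64 )
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// plain scalar version, used for the tail and as a reference
static size_t count_scalar( const uint8_t * p, size_t n, uint8_t needle ) {
	size_t c = 0;
	for( size_t i = 0; i < n; i++ ) {
		c += p[ i ] == needle ? 1 : 0;
	}
	return c;
}

static size_t count_bytes( const uint8_t * haystack, size_t n, uint8_t needle ) {
	assert( haystack != NULL || n == 0 );

	size_t c = 0;
	size_t i = 0;

#if HAVE_SSE2
	const __m128i needles = _mm_set1_epi8( char( needle ) );
	while( n - i >= 16 ) {
		__m128i counts = _mm_setzero_si128();

		// at most 255 chunks per block so the 16 u8 counters can't wrap
		for( int chunk = 0; chunk < 255 && n - i >= 16; chunk++ ) {
			__m128i data = _mm_loadu_si128( ( const __m128i * ) ( haystack + i ) );
			// matching lanes are 0xFF, i.e. -1, so subtracting the mask adds 1
			__m128i cmp = _mm_cmpeq_epi8( needles, data );
			counts = _mm_sub_epi8( counts, cmp );
			i += 16;
		}

		// sums bytes 0-7 into the low 64 bits and bytes 8-15 into the high 64 bits
		__m128i sums = _mm_sad_epu8( counts, _mm_setzero_si128() );
		uint64_t halves[ 2 ];
		_mm_storeu_si128( ( __m128i * ) halves, sums );
		c += size_t( halves[ 0 ] + halves[ 1 ] );
	}
#endif

	// leftover bytes, or everything if SSE2 isn't available
	return c + count_scalar( haystack + i, n - i, needle );
}
```

Feeding it a buffer where every byte matches is a decent smoke test, because that's the case where the counters would overflow if the inner loop ran too long.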

With the same test setup as last time, this runs in 2.01ms. Again it helps to unroll the loop manually. The inner loop is so simple now it actually helps to unroll 4x, and if we do that it runs in 1.92ms!

Branchless scalar code

The original code can be made branchless. The trick is you replace if( haystack[ i ] == needle ) c++; with c += haystack[ i ] == needle ? 1 : 0;, which can be computed with a CMP and SETZ.

GCC is smart enough to perform this optimisation already, even at -O2, so no benchmark for this one.

AVX2

AVX2 has 32 wide versions of all the instructions we used in the SSE version. The code changes are simple (basically find and replace) so I won't include them here.

Interestingly, the AVX2 version doesn't actually run any faster. I spoke to JM about it and he said I might be bottlenecked on memory bandwidth. The "lorem ipsum" string I search in is 42.6MB, so searching that in 1.92ms is 22.2GB/s. I have a single stick of 3000MHz RAM, so sure enough that is the bottleneck.
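The arithmetic behind that, as a sketch. It assumes "3000MHz" means DDR4-3000 (3000 million transfers per second) on a single 64-bit channel, which is an assumption about the hardware rather than something measured here.

```cpp
// what the benchmark achieved
constexpr double haystack_bytes = 42.6e6;
constexpr double seconds = 1.92e-3;
constexpr double achieved_gb_per_s = haystack_bytes / seconds / 1e9; // about 22.2

// theoretical peak of one memory channel
constexpr double transfers_per_s = 3000e6; // assumed DDR4-3000
constexpr double bytes_per_transfer = 8;   // assumed 64-bit wide channel
constexpr double peak_gb_per_s = transfers_per_s * bytes_per_transfer / 1e9; // 24

// fraction of the theoretical peak the search loop is hitting, a bit over 0.9
constexpr double fraction_of_peak = achieved_gb_per_s / peak_gb_per_s;
```

Under those assumptions the loop is already above 90% of what one stick can deliver, so wider registers have nothing left to win.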

Conclusion

Here's the same table as last time but with the new results added:

Version	Time	% of `-O2`	% of `-O3`
Scalar `-O2`	21.3ms	100%	-
Scalar `-O3`	7.09ms	33%	100%
Old vectorised	2.74ms	13%	39%
Old unrolled	2.45ms	12%	35%
New vectorised	2.01ms	9%	28%
New unrolled	1.92ms	9%	27%

The new vectorised code is over 10x faster than the original scalar -O2 build!

Full code

You can download the code I used to generate those results if you want to try it yourself. You'll also need the code from the last post if you want to compare before and after.

3 Sep 2017 • OpenGL uniforms and renderer design rambling

I recently did a bit of a renderer overhaul in my engine and I'm very pleased with how it turned out so now seems like a good time to blog about it. I don't think there's anything left in my renderer that's blatantly bad or unportable, yet there's still obvious improvements that can be made whenever I feel like working on that. (I like leaving things that are easy, fun and non-critical because if I ever get bored or stuck on something else I can go and work on them)

This post is going to roughly outline the evolution of setting OpenGL uniforms in my game engine. It's a simple sounding problem ("put some data on the GPU every frame") but OpenGL gives you several different ways to do it and it's not obvious which way is best. I assume it's common knowledge in the industry, but it took me a long time to figure it out and I don't recall ever seeing it written down anywhere.

glGetUniformLocation and glUniform

This is what everyone starts off with, it's what all the OpenGL tutorials describe, and it's what most free Github engines use. I'm not going to go over it in great detail because everyone else already has.

I will say though that the biggest problem by far with this method is that the book keeping becomes a pain in the ass once you move beyond anything trivial.

Uniform buffer objects

A step up from loose uniforms are UBOs. Basically you can stuff your uniforms in a buffer like you do with everything else, and use that like a struct from GLSL. The Learn OpenGL guy has a full explanation of how it works.

It's best to group uniforms by how frequently you update them. So like you have a view UBO with the view/projection matrices and camera position, a window UBO with window dimensions, a light view UBO with the light's VP matrix for shadow maps, a sky UBO with skybox parameters, etc. There's not actually very many different UBOs you need, so you can hardcode an enum of all the ones you use, and then bind the names to the enum with glUniformBlockBinding.

Of course you can have a "whatever" UBO that you just stuff everything in while prototyping too.

As a more concrete example you can do this:

// in a header somewhere
const u32 UNIFORMS_VIEW = 0;
const u32 UNIFORMS_LIGHT_VIEW = 1;
const u32 UNIFORMS_WINDOW = 2;
const u32 UNIFORMS_SKY = 3;

// when creating a new shader
const char * ubo_names[] = { "view", "light_view", "window", "sky" };
for( GLuint i = 0; i < ARRAY_COUNT( ubo_names ); i++ ) {
	GLuint idx = glGetUniformBlockIndex( program, ubo_names[ i ] );
	if( idx != GL_INVALID_INDEX ) {
		glUniformBlockBinding( program, idx, i );
	}
}

// rendering
GLuint ub_view;
glBindBuffer( GL_UNIFORM_BUFFER, ub_view );
glBufferData( GL_UNIFORM_BUFFER, ... );
// ...
glBindBufferBase( GL_UNIFORM_BUFFER, UNIFORMS_VIEW, ub_view );

I found it helpful to write a wrapper around glBufferData. UBOs have funny alignment requirements (stricter than C!) and it was annoying having to mirror structs between C and GLSL. So instead I wrote a variadic template that lets me write renderer_ub_easy( ub_view, V, P, camera_pos );, which copies its arguments to a buffer with the right alignment and uploads it. The implementation is kind of hairy but here you go:

template< typename T >
constexpr size_t renderer_ubo_alignment() {
	return min( align4( sizeof( T ) ), 4 * sizeof( float ) );
}

template<>
constexpr size_t renderer_ubo_alignment< v3 >() {
	return sizeof( float ) * 4;
}

template< typename T >
constexpr size_t renderer_ub_size( size_t size ) {
	return sizeof( T ) + align_power_of_2( size, renderer_ubo_alignment< T >() );
}

template< typename S, typename T, typename... Rest >
constexpr size_t renderer_ub_size( size_t size ) {
	return renderer_ub_size< T, Rest... >( sizeof( S ) + align_power_of_2( size, renderer_ubo_alignment< S >() ) );
}

inline void renderer_ub_easy_helper( char * buf, size_t len ) { }

template< typename T, typename... Rest >
inline void renderer_ub_easy_helper( char * buf, size_t len, const T & first, Rest... rest ) {
	len = align_power_of_2( len, renderer_ubo_alignment< T >() );
	memcpy( buf + len, &first, sizeof( first ) );
	renderer_ub_easy_helper( buf, len + sizeof( first ), rest... );
}

template< typename... Rest >
inline void renderer_ub_easy( GLuint ub, Rest... rest ) {
	constexpr size_t buf_size = renderer_ub_size< Rest... >( 0 );
	char buf[ buf_size ];
	memset( buf, 0, sizeof( buf ) );
	renderer_ub_easy_helper( buf, 0, rest... );
	glBindBuffer( GL_UNIFORM_BUFFER, ub );
	glBufferData( GL_UNIFORM_BUFFER, sizeof( buf ), buf, GL_STREAM_DRAW );
}

I'm not 100% sure I got the alignment stuff right but it works for everything I've thrown at it so far.
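The template leans on a few helpers that aren't shown. Here's a guess at what they could look like, plus the offsets worked out by hand for a view block holding V, P and camera_pos. The helper bodies, min_size, ubo_alignment and the stand-in v3/m4 structs are all assumptions for illustration, not the engine's real definitions.

```cpp
#include <cstddef>

// round x up to the next multiple of alignment, which must be a power of 2
constexpr size_t align_power_of_2( size_t x, size_t alignment ) {
	return ( x + alignment - 1 ) & ~( alignment - 1 );
}

constexpr size_t align4( size_t x ) { return align_power_of_2( x, 4 ); }
constexpr size_t min_size( size_t a, size_t b ) { return a < b ? a : b; }

// stand-ins for the engine's vector/matrix types
struct v3 { float x, y, z; };
struct m4 { float e[ 16 ]; };

// same rule as renderer_ubo_alignment: size rounded up to 4, capped at a vec4
template< typename T >
constexpr size_t ubo_alignment() {
	return min_size( align4( sizeof( T ) ), 4 * sizeof( float ) );
}

// vec3 is 12 bytes but has to be aligned like a vec4
template<>
constexpr size_t ubo_alignment< v3 >() {
	return 4 * sizeof( float );
}

// layout of a block like: mat4 V; mat4 P; vec3 camera_pos;
constexpr size_t V_offset = align_power_of_2( 0, ubo_alignment< m4 >() );
constexpr size_t P_offset = align_power_of_2( V_offset + sizeof( m4 ), ubo_alignment< m4 >() );
constexpr size_t camera_pos_offset = align_power_of_2( P_offset + sizeof( m4 ), ubo_alignment< v3 >() );
constexpr size_t view_ubo_size = camera_pos_offset + sizeof( v3 );

static_assert( V_offset == 0, "V starts the block" );
static_assert( P_offset == 64, "P follows V directly" );
static_assert( camera_pos_offset == 128, "camera_pos lands on a 16 byte boundary" );
static_assert( view_ubo_size == 140, "two matrices plus a vec3" );

// a float followed by a vec3 is where the padding shows up: 12 wasted bytes
static_assert( align_power_of_2( sizeof( float ), ubo_alignment< v3 >() ) == 16, "vec3 after float" );
```

That float followed by vec3 case is the kind of thing that goes wrong when you mirror structs by hand, since C would happily put the vec3 at offset 4.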

In terms of book keeping it is better than loose uniforms, but you still need to allocate/deallocate/keep track of all your uniform buffers. It's less but still non-zero.

glMapBuffer and glBindBufferRange

For this next one you actually need to reorganise your renderer a little. Instead of submitting draw calls to the GPU immediately, you build a list of draw calls and submit them all at once at the end of the frame. More specifically you should build a list of render passes, each of which has a target framebuffer, some flags saying whether you should clear depth/colour at the start of the pass, and a list of draw calls.

People do talk about this on the internet but they focus on the performance benefits:

You can sort your draw calls by pipeline state to minimise the number of costly state changes
You can submit all your draw calls on a background thread
I guess this is how D3D12/Vulkan work so it makes porting easier too
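As a rough sketch of the data involved, with every name here being hypothetical and a single shader id standing in for the whole pipeline state:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawCall {
	uint32_t shader; // stand-in for the full pipeline state
	uint32_t mesh;   // stand-in for a VAO
};

struct RenderPass {
	uint32_t target_framebuffer;
	bool clear_colour;
	bool clear_depth;
	std::vector< DrawCall > draw_calls;
};

// group draw calls that share state so submitting them changes state as rarely as possible
// stable_sort keeps the original order within each group
inline void sort_draw_calls( RenderPass * pass ) {
	assert( pass != NULL );
	std::stable_sort( pass->draw_calls.begin(), pass->draw_calls.end(),
		[]( const DrawCall & a, const DrawCall & b ) { return a.shader < b.shader; } );
}

// how many times the shader would need binding if the list was submitted in order
inline size_t count_shader_changes( const RenderPass & pass ) {
	size_t changes = 0;
	for( size_t i = 0; i < pass.draw_calls.size(); i++ ) {
		if( i == 0 || pass.draw_calls[ i ].shader != pass.draw_calls[ i - 1 ].shader ) {
			changes++;
		}
	}
	return changes;
}
```

A real renderer would sort on more than the shader, but the shape is the same: the frame is a list of passes, each pass is a list of draw calls, and nothing touches the GPU until the end.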

Neither loose uniforms nor UBOs really work with this model anymore though. Maybe you can pack uniform uploads into the draw call list, but that's a pain and ugly.

The pro secret is quite simple: map a huge UBO at the start of the frame, copy the entire frame's uniforms into it, then make the offsets/lengths part of your pipeline state and bind them with glBindBufferRange.

There's no book keeping beyond telling your renderer when to start/end the frame/passes. You can use the variadic template from above with few modifications so setting uniforms is still a one-liner. It's like going from a retained mode API to an immediate mode API. If you don't upload a set of uniforms for a given frame, they just don't exist.

To make it totally clear what I mean the game code looks like this:

renderer_begin_frame();

UniformBinding light_view_uniforms = renderer_uniforms( lightP * lightV, light_pos );

// fill shadow map
{
	renderer_begin_pass( shadow_fb, RENDERER_CLEAR_COLOUR_DONT, RENDERER_CLEAR_DEPTH_DO );

	RenderState render_state;
	render_state.shader = get_shader( SHADER_WRITE_SHADOW_MAP );
	render_state.uniforms[ UNIFORM_LIGHT_VIEW ] = light_view_uniforms;

	draw_scene( render_state );

	renderer_end_pass();
}

// draw world
{
	renderer_begin_pass( RENDERER_CLEAR_COLOUR_DO, RENDERER_CLEAR_DEPTH_DO );

	RenderState render_state;
	render_state.shader = get_shader( SHADER_SHADOWED_VERTEX_COLOURS );
	render_state.uniforms[ UNIFORM_VIEW ] = renderer_uniforms( V, P, game->pos );
	render_state.uniforms[ UNIFORM_LIGHT_VIEW ] = light_view_uniforms;
	render_state.textures[ 0 ] = shadow_fb.texture;

	draw_scene( render_state );

	renderer_end_pass();
}

renderer_end_frame();

renderer_begin_frame clears the list of render passes and maps the big UBO, renderer_begin_pass records the target framebuffer and what needs clearing, draw_scene contains a bunch of draw calls which basically copy the RenderState and Meshes (VAOs) into the render pass's list of draw calls, and finally renderer_end_frame unmaps the big UBO and submits everything.

One pitfall is that glMapBuffer is probably going to return a pointer to write combining memory, so you should make sure to write the entire buffer, including all the padding you use to align things (just write zeroes). It's probably not required on modern CPUs, but it's good for peace of mind.
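Stripped of the GL calls, the uniform side of this might look something like the sketch below. All the names are hypothetical. A plain pointer stands in for whatever glMapBuffer returns, and offset_alignment stands in for the value you'd query with GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, because offsets passed to glBindBufferRange have to be a multiple of it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// what ends up in the pipeline state for each uniform block
struct UniformBinding {
	size_t offset;
	size_t size;
};

struct FrameUniforms {
	char * mapped;           // would come from glMapBuffer( GL_UNIFORM_BUFFER, GL_WRITE_ONLY )
	size_t capacity;
	size_t used;
	size_t offset_alignment; // would come from GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT
};

inline size_t align_up( size_t x, size_t alignment ) {
	return ( x + alignment - 1 ) & ~( alignment - 1 );
}

// call at the start of the frame, right after mapping the buffer
inline void frame_uniforms_begin( FrameUniforms * fu ) {
	fu->used = 0;
}

// copy one block of uniforms into the mapped buffer and return where it landed
inline UniformBinding frame_uniforms_push( FrameUniforms * fu, const void * data, size_t size ) {
	size_t offset = align_up( fu->used, fu->offset_alignment );
	assert( offset + size <= fu->capacity );

	// zero the padding too, so the mapping gets written front to back with no holes
	memset( fu->mapped + fu->used, 0, offset - fu->used );
	memcpy( fu->mapped + offset, data, size );
	fu->used = offset + size;

	// at submit time the binding is used like:
	// glBindBufferRange( GL_UNIFORM_BUFFER, binding_point, ubo, offset, size );
	return UniformBinding { offset, size };
}
```

Nothing needs freeing: the next frame_uniforms_begin resets used to zero and the previous frame's uniforms are simply gone.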

In case I haven't explained this well you should probably just look at my implementation in renderer.cc and renderer.h. Or look at Dolphin which does something similar.

GL_MAP_PERSISTENT_BIT

For the sake of completeness, if you're using GL 4.4 or have ARB_buffer_storage you get to use a persistent map, which should be a tiny bit faster. But it's the same idea.

2 Sep 2017 • Detecting TCP server crashes

I was wondering what happens when an HTTP server gets killed in the middle of a request. The OS should close all the open sockets, but what happens on the client? Can it tell that the server was shut down forcefully, or does recv return 0 like it does for a normal shutdown?

I was wondering specifically because my HTTP client doesn't look at the response headers, and I wasn't sure if that would lead to problems down the line.

The Arch/OpenBSD man pages don't have anything to say about it. The junky and often outdated die.net man pages talk about ECONNRESET, and SO has a few questions that mention it, but nowhere else does. Grepping the OpenBSD kernel sources doesn't make it obvious.

So let's just test it. The client code is trivial and of no use to you because it's written against my libraries. On the server we can do nc -l 13337 and then pkill -9 nc from another terminal. The result:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(13337), sin_addr=inet_addr("108.61.209.87")}, 16) = 0
recvfrom(3, "", 16, 0, NULL, NULL)      = 0

It looks just like a normal shutdown! And that's a bug in my HTTP client: I don't look at Content-Length so truncated responses are not considered an error.

I guess the bigger picture here is that TCP is not as high level as you might expect. Your protocol has to be able to distinguish between premature and intentional shutdowns. Somewhat related is that you can't tell when the other party has processed your TCP packet. ACK just means it's sitting in a buffer somewhere in the TCP stack, and there's no guarantee that the application has actually seen your data yet. So if your application needs acks, you have to add them to your protocol too!
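A minimal sketch of the missing check, assuming the caller has already parsed Content-Length out of the headers. read_body and ReadResult are made-up names, not part of the client mentioned above, and it doesn't bother retrying on EINTR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum ReadResult { READ_COMPLETE, READ_TRUNCATED, READ_ERROR };

// read exactly expected_length bytes of body from fd
// expected_length would come from Content-Length in a real client
static ReadResult read_body( int fd, size_t expected_length, std::string * body ) {
	assert( body != NULL );

	char buf[ 4096 ];
	while( body->size() < expected_length ) {
		size_t want = expected_length - body->size();
		if( want > sizeof( buf ) ) {
			want = sizeof( buf );
		}

		ssize_t n = recv( fd, buf, want, 0 );
		if( n < 0 ) {
			return READ_ERROR;
		}
		// recv returning 0 before we have everything means the peer went away early
		// a crash and a clean close look identical here, only the length tells them apart
		if( n == 0 ) {
			return READ_TRUNCATED;
		}

		body->append( buf, size_t( n ) );
	}

	return READ_COMPLETE;
}
```

The point is that READ_TRUNCATED can only be detected because the protocol said up front how many bytes to expect.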

(The even bigger picture is that if you trust the other party at all, it will bite you)

24 Aug 2017 • ggformat

ggformat is the string formatting library I use in my game engine.

It's awesome because you can add your own types and you won't go grey waiting for it to compile. It's portable and doesn't allocate memory, so it's ideal for game engines. It's just printf, but better!

Get the real docs and the code from Github.

23 Aug 2017 • Saving scroll position when refreshing

I've noticed that when you refresh this blog (with Linux + Firefox) it always scrolls back up to the top. If you try to Google it you get a few results from people doing some insane Javascript infinite scrolling.

My blog doesn't do anything, so I've no idea what could be causing it.

23 Aug 2017 • Never update anything

I updated my Broadcom driver and my wifi stopped working. #archlinux made fun of me for running a two year old kernel, so I went ahead and upgraded that. It did fix my wifi but somehow it broke ASAN. WTF seriously

I caved to the stupid Whatsapp bug where it scrolls up to some random position whenever you open a conversation (which was introduced last time I updated) and the new version has some dumb crap merging photos together that is annoying and terribly implemented.

I always know in my head that literally nothing good can come from updating software but I still do it. It's a bad habit that I need to break. Of course on Linux that means I can never install new software because everything has to be dynamically linked against the latest versions and broken against everything else.

20 Aug 2017 • Ruoka Helsingissä

(heh that needed <meta charset="utf-8"> to render properly locally)

Some quick restaurant reviews for Helsinki, in no particular order:

La Soupe: can get a nice cup of soup and a quiche in a few minutes. The soup is slightly greasy but good. They used to just have a daily menu on the website but they changed it and now it's more confusing
Soppakeittö: better soup but more expensive and you have to eat it there
Roslund: Great burgers and great chips
Corretto: very good bolognese on wednesday, very good lasagne on friday. I assume the other stuff they do is good too but I never tried it
Smyg (Sinne): been twice and haven't really liked it either time
Stone's: nice chips but the burger was WTF AWFUL. Just really bad tasting meat
Fig: too slow at lunch but the dinners are solid
Karl Johan: used to do really nice pea soup on thursdays and had a really nice spicy beef soup once, but they seem to have dropped all that and replaced the lunch menu with fucking meatloaf every day. Really nice for dinner
Pompier: pretty good lunches but pretty greasy too. Super quick. They also have pea soup thursdays but I don't like it
Krog Madame: pretty good lunches that are not greasy. Also quick
Gran Delicato: other people seem to like it but I've only ever found it ok
Oba: the toasted flatbreads are decent, the meatballs are really excellent
Ateljé Finne: real good
Muru: big disappointment because other people said it is great. Firstly put the goddamn menu on paper, I don't want to have to decide with the waiter standing there and I'm going to forget exactly what they say. Secondly the food was just not that great. I can make better risotto than what I ate there and Muru is probably the most expensive restaurant I've been to so that just shouldn't happen
Limone: it's pretty good and only a few minutes away
Salve: massive portions of eh food
Blue Peter: 25 euro lunch buffet that is worse than any 10 euro buffet you can find in town
Karl Fazer cafe (the big one): nice but mad overpriced. Went a bunch when J was expensing everything
The basement of Stockmann: nice pastries
S-Market: if you can get the warm garlic bread or ciabatta it's great, 43 cent (previously 35 cent) croissants are great too.
K-Market: go to S-Market instead
Rivoletto: go to Corretto instead
Red Koi Thai: is a favourite at work but I thought it was meh
Mei Lin: eh
New Bamboo Center: eh
Fat Ramen: eh
Beijing 8: dumplings that don't taste of anything with some sauce to dip them in
Factory: eh
Kungfu Kitchen: eh, expensive for lunch
Ônam: I like the lemongrass chicken
GIWA: really nice simple food, S says it's legit Korean
Helkatti: the best
Surely other places I forgot...

16 Aug 2017 • SIGGRAPH 2017

I don't understand how anyone can say the food is so much better in America. I've been to NY and SF which are both supposed to be amazing for food, and just got back from LA.

Their reasoning is usually along the lines of "oh, if we Uber/subway for half an hour we can find a single restaurant that serves great food". Ok, but then if you try to pick somewhere at random it's very close to 100% chance that it's just awful. If you try to be smart, and only pick from places that have four or more stars online it's still very close to 100% chance that it's just awful.

It's fine if I'm planning a meal out somewhere and we can ask people what is actually good, but if I'm out and hungry and want to find somewhere quickly it's very annoying.

There's an absolutely ridiculous amount of signs in LA. Ads for everything, signs saying you must or must not do something plus some law reference code, all over the place. It's actually quite jarring going from Helsinki, which has less signage and even less that I can actually read, to the crazy visual pollution of LA.

Driving vehicles with unmuffled exhausts should be punishable with huge fines. Let's look at an extreme case: driving an unmuffled car through the middle of LA at night. Everyone within a block or two's radius will get woken up, and if you drive a few miles that's easily several thousand people. Now let's say that costs them $5 in lost productivity the next day. That's easily five digits of damage just because they wanted to be a shithead and do something which doesn't actually bring them any benefit.

The same logic goes for police sirens. If the guy you're chasing did less than $10k of damage, write the victim a cheque and be done with it, it's a net positive.

Don't share hotel rooms at conferences. I had plans to go see friends in SF after Siggraph, so I was very deliberately avoiding meeting people and shaking hands and going to bed on time so I could be healthy and fresh for fun times with my best friends. It very nearly backfired on me because one of the guys in my room got sick on the first day. I was lucky enough to stay good until I was waiting in SFO to go home. (but then the vacation days I booked to recover got wasted because I was sick and wouldn't have gone to work anyway)

America is big enough that NY -> LA takes almost as long as Helsinki -> NY. Fuck that! Helsinki -> SF is direct and I should hang out there both sides next time.

BTW the advances talks about cloud rendering in HZD and ocean rendering were awesome, as was the open problems one about using deep learning to learn TAA. The latter sounds like "oh god please no god", but the presenter was really very good at explaining everything and his videos were cool. Unfortunately I can't find any videos of the talks and only having slides is not so great.

The ocean rendering talk finally made clipmaps click for me. You upload a constant mesh with more triangles in the middle at init, then when rendering you snap it to integer coordinates and sample your heightmap texture in the vertex shader. Very simple! For performance I assume you can sample high mip levels at high lod levels to preserve locality when sampling. And if that's true you probably want to page out lower mip levels when you aren't using them to save memory.
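
Here's roughly what the snapping step might look like on the CPU side. This is a sketch under my own assumptions, not code from the talk: Vec2, snap_to_grid and cell_size are made-up names, and it assumes a square grid where cell_size is the world-space spacing between heightmap samples at that lod level.

```cpp
#include <assert.h>
#include <math.h>

struct Vec2 { float x, y; };

// Snap the camera position down to a multiple of cell_size. The constant mesh
// gets drawn at this offset, so its vertices always land on the same heightmap
// samples no matter how the camera moves between them.
inline Vec2 snap_to_grid( Vec2 camera, float cell_size ) {
	return {
		floorf( camera.x / cell_size ) * cell_size,
		floorf( camera.y / cell_size ) * cell_size,
	};
}

inline void clipmap_example() {
	Vec2 camera = { 10.7f, -3.2f };
	Vec2 origin = snap_to_grid( camera, 4.0f );
	assert( origin.x == 8.0f );
	assert( origin.y == -4.0f ); // floor, not truncation, so negatives snap downwards too
}
```

The snapped origin is what you would hand to the vertex shader as the mesh offset. Coarser lod levels would presumably use a larger cell_size.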

14 Aug 2017 • Rust performance: finishing the job

Today I saw a story about profiling and optimising some Rust code. It's a nice little intro to perf, but the author stops early and leaves quite a lot on the table. The code he ends up with is:

pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
	let i = r / self.k;

	let mut count = 0;

	// count all the matching bytes b/t the closest checkpoint and our desired lookup
	for idx in (i * self.k) + 1 .. r + 1 {
		if bwt[idx] == a {
			count += 1;
		}
	}

	// return the sampled checkpoint for this character + the manual count we just did
	self.occ[i][a as usize] + count
}

Let's factor out the hot loop to make it dead clear what's going on:

// BWTSlice is just [u8]
pub fn count(bwt: &[u8], a: u8) -> usize {
	let mut c = 0;
	for x in bwt {
		if *x == a {
			c += 1;
		}
	}
	c
}

pub fn get(&self, bwt: &BWTSlice, r: usize, a: u8) -> usize {
	let i = r / self.k;
	self.occ[i][a as usize] + count(&bwt[(i * self.k) + 1 .. r + 1], a)
}

It's just counting the number of times a occurs in the array bwt. This code is totally reasonable and if it didn't show up in the profiler you could just leave it at that, but as we'll see it's not optimal.

BTW I want to be clear that I have no idea what context this code is used in or whether there are higher level code changes that would make a bigger difference, I just want to focus on optimising the snippet from the original post.

SIMD

x86 has had instructions to perform the same basic operation on more than one piece of data for quite a while now. For example there are instructions that operate on four floats at a time, instructions that operate on a pair of doubles, instructions that operate on 16 8bit ints, etc. Generally, these are called SIMD instructions, and on x86 they fall under the MMX/SSE/AVX instruction sets. Since the loop we want to optimise is doing the same operation to every element in the array independently of one another, it seems like a good candidate for vectorisation. (which is what we call rewriting normal code to use SIMD instructions)

Rewrite It In C++

I would have liked to have optimised the Rust code, and it is totally possible, but the benchmarking code for rust-bio does not compile with stable Rust, nor does the Rust SIMD library. There's not much I'd rather do less than spend ages dicking about downloading and installing other people's software to try and fix something that should really not be broken to begin with, so let's begin by rewriting the loop in C++. This is unfortunate because my timings aren't comparable to the numbers in the original blog post, and I'm not able to get numbers for the Rust version.

size_t count_slow( u8 * haystack, size_t n, u8 needle ) {
	size_t c = 0;
	for( size_t i = 0; i < n; i++ ) {
		if( haystack[ i ] == needle ) {
			c++;
		}
	}
	return c;
}

As a fairly contrived benchmark, let's use this to count how many times the letter 'o' appears in a string containing 10000 Lorem Ipsums. To actually perform the tests I disabled CPU frequency scaling (this saves about 1.5ms!), wrapped the code in some timing boilerplate, ran the code some times (by hand, so not that many), and recorded the fastest result. See the end of the post for the full code listing if you want to try it yourself.
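
The full listing isn't reproduced here, so as a stand-in this is a minimal sketch of what that kind of timing boilerplate could look like. make_haystack, fastest_ms and the placeholder Lorem Ipsum sentence are assumptions for illustration, and it uses std::chrono rather than whatever the original harness used.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <chrono>
#include <string>

typedef uint8_t u8;

size_t count_slow( const u8 * haystack, size_t n, u8 needle ) {
	size_t c = 0;
	for( size_t i = 0; i < n; i++ ) {
		if( haystack[ i ] == needle ) {
			c++;
		}
	}
	return c;
}

// Build the haystack by repeating a sentence. The text is a placeholder, not
// the exact Lorem Ipsum used for the numbers in the post.
std::string make_haystack( size_t repeats ) {
	const std::string lorem = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ";
	std::string haystack;
	haystack.reserve( lorem.size() * repeats );
	for( size_t i = 0; i < repeats; i++ ) {
		haystack += lorem;
	}
	return haystack;
}

// Run the count `runs` times and return the fastest time in milliseconds.
// The count itself is written to *result so the work can't be thrown away.
double fastest_ms( const std::string & haystack, u8 needle, int runs, size_t * result ) {
	assert( runs > 0 );
	double best = -1.0;
	for( int i = 0; i < runs; i++ ) {
		auto before = std::chrono::steady_clock::now();
		*result = count_slow( ( const u8 * ) haystack.data(), haystack.size(), needle );
		auto after = std::chrono::steady_clock::now();
		double ms = std::chrono::duration< double, std::milli >( after - before ).count();
		if( best < 0.0 || ms < best ) {
			best = ms;
		}
	}
	return best;
}
```

You would call it with something like make_haystack( 10000 ) and a handful of runs, then swap count_slow for whichever version you want to time.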

If we build with gcc -O2 -march=native (we really only need -mpopcnt. FWIW -march=native helps the scalar code more than it helps mine) the benchmark completes in 21.3ms. If we build with -O3 the autovectoriser kicks in and the benchmark completes in 7.09ms. Just to reiterate, it makes no sense to compare these with the numbers in the original article, but I expect if I was able to compile the Rust version it would be about the same as -O2.

Vectorising by hand

The algorithm we are going to use is as follows:

The instructions we want to use only work on data that's aligned to a 16 byte boundary, so we need to run the slow loop a few times if haystack is not aligned (this is called "loop peeling")
For each block of 16 bytes, we can compare all of them with needle at once to get a mask with [the PCMPEQB instruction](https://msdn.microsoft.com/en-us/library/bz5xk21a(v=vs.90).aspx). Note that matches are set to 0xff (eight ones), rather than just a single one like C comparisons.
We can count the number of ones in the mask, and divide it by eight to get the number of needles in that 16 byte block. x86 has POPCNT to count the number of ones in a 64 bit number, so we need to call that twice per 16 byte block.
When we're down to less than 16 bytes remaining, fall back to the slow loop again.
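
To make the "count the ones and divide by eight" step concrete, here's a scalar stand-in for steps two and three on eight bytes instead of sixteen. It is only a sketch: compare_mask8 and matches_in_mask are made-up names, and popcnt64 here is an assumed wrapper around the GCC/Clang builtin rather than the definition from the full listing.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u8;
typedef uint64_t u64;

// Assumed wrapper, GCC/Clang only
inline u64 popcnt64( u64 x ) {
	return ( u64 ) __builtin_popcountll( x );
}

// Scalar imitation of PCMPEQB on eight bytes: every byte that matches needle
// becomes 0xff in the mask, everything else becomes 0x00
inline u64 compare_mask8( const u8 * bytes, u8 needle ) {
	u64 mask = 0;
	for( size_t i = 0; i < 8; i++ ) {
		if( bytes[ i ] == needle ) {
			mask |= u64( 0xff ) << ( i * 8 );
		}
	}
	return mask;
}

// Each match contributes eight set bits, so matches = popcount / 8
inline size_t matches_in_mask( u64 mask ) {
	return size_t( popcnt64( mask ) / 8 );
}

inline void mask_example() {
	const u8 bytes[ 8 ] = { 'f', 'o', 'o', 'b', 'a', 'r', 'o', 'x' };
	u64 mask = compare_mask8( bytes, 'o' );
	assert( mask == 0x00ff000000ffff00ull ); // bytes 1, 2 and 6 matched
	assert( popcnt64( mask ) == 24 );
	assert( matches_in_mask( mask ) == 3 );
}
```

The real thing does the compare on all 16 bytes in one instruction and then needs two of these 64 bit popcounts per block.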

(BTW see the followup post for a better approach)

Unsurprisingly the implementation is quite a bit trickier:

size_t count_fast( const u8 * haystack, size_t n, u8 needle ) {
	const u8 * one_past_end = haystack + n;
	size_t c = 0;

	// peel
	while( uintptr_t( haystack ) % 16 != 0 && haystack < one_past_end ) {
		if( *haystack == needle ) {
			c++;
		}
		haystack++;
	}

	// haystack is now aligned to 16 bytes
	// loop as long as we have 16 bytes left in haystack
	__m128i needles = _mm_set1_epi8( needle );
	while( one_past_end - haystack >= 16 ) {
		__m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
		__m128i cmp = _mm_cmpeq_epi8( needles, chunk );
		u64 pophi = popcnt64( _mm_cvtsi128_si64( _mm_unpackhi_epi64( cmp, cmp ) ) );
		u64 poplo = popcnt64( _mm_cvtsi128_si64( cmp ) );
		c += ( pophi + poplo ) / 8;
		haystack += 16;
	}

	// remainder
	while( haystack < one_past_end ) {
		if( *haystack == needle ) {
			c++;
		}
		haystack++;
	}

	return c;
}

But it's totally worth it, because the new code runs in 2.74ms, which is 13% the time of -O2, and 39% the time of -O3!

Unrolling

Since the loop body is so short, evaluating the loop condition ends up consuming a non-negligible amount of time per iteration. The simplest fix for this is to check whether there are 32 bytes remaining instead, and run the loop body twice per iteration:

while( one_past_end - haystack >= 32 ) {
	{
		__m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
		haystack += 16; // note I also moved this up. seems to save some microseconds
		__m128i cmp = _mm_cmpeq_epi8( needles, chunk );
		u64 pophi = popcnt64( _mm_cvtsi128_si64( _mm_unpackhi_epi64( cmp, cmp ) ) );
		u64 poplo = popcnt64( _mm_cvtsi128_si64( cmp ) );
		c += ( pophi + poplo ) / 8;
	}
	{
		__m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
		haystack += 16;
		__m128i cmp = _mm_cmpeq_epi8( needles, chunk );
		u64 pophi = popcnt64( _mm_cvtsi128_si64( _mm_unpackhi_epi64( cmp, cmp ) ) );
		u64 poplo = popcnt64( _mm_cvtsi128_si64( cmp ) );
		c += ( pophi + poplo ) / 8;
	}
}

It's just a little bit faster, completing the benchmark in 2.45ms, which is 89% of the vectorised loop, 12% of -O2, and 35% of -O3.

Conclusion

For reference, here's a little table of results:

Version	Time	% of `-O2`	% of `-O3`
Scalar `-O2`	21.3ms	100%	-
Scalar `-O3`	7.09ms	33%	100%
Vectorised	2.74ms	13%	39%
Unrolled	2.45ms	12%	35%

Hopefully this post has served as a decent introduction to vectorisation, and has shown you that not only can you beat the compiler, but you can really do a lot better than the compiler without too much difficulty.

I am no expert on this so it's very possible that there's an even better approach (sure enough, there is). I should really try writing the loop in assembly by hand just to check the compiler hasn't done anything derpy.

Full code

For reference if you want to try it yourself. It should compile and run anywhere x86, let me know if you have problems.

14 Jul 2017 • Fixing the Visual Studio forms designer

More to file under "things that are excruciatingly stupid so nobody smart writes about them".

One thing that causes the designer to shit the bed is if your form isn't at the top of the non-designer file. So a specific example:

namespace Things {
	class FuckingEverythingUp { }
	class MyForm : Form {
		// ...
	}
}

will not work (and will give you a useless error message about dragging and dropping from the components window). You need to move MyForm above FuckingEverythingUp. Btw Microsoft made $85 billion revenue last year and has over 100k employees.

The other thing that's not so obvious to work around (but still pretty obvious) is custom form components. In our case we have a few hacked components to enable text anti-aliasing (lol), but the designer can't handle them. I got sick of going into the designer file and replacing them all with normal labels whenever I wanted to change the UI, so I added methods like ConvertLabelToHackLabel and call them in the form constructor. All they do is make a new thing and copy all the properties over.

The only things that are non-trivial are copying events, which is copied and pasted from StackOverflow thusly:

using System.Reflection;

// ...

var eventsField = typeof(Component).GetField("events", BindingFlags.NonPublic | BindingFlags.Instance);
var eventHandlerList = eventsField.GetValue(originalButton);
eventsField.SetValue(hackedButton, eventHandlerList);

and making sure you update the form's AcceptButton to point at the hacked button as needed.

Booooooooring!

4 Jul 2017 • Vim

Pressing o in visual mode moves the cursor to the other end of the selection. So if you're selecting downwards it moves the cursor to the top and lets you select upwards.

30 Jun 2017 • Wat

Found this in the arch repos:

core/mkinitcpio-nfs-utils 0.3-5
    ipconfig and nfsmount tools for NFS root support in mkinitcpio

Seems awesome!

"My internet is flaking and now I can't turn my PC on"

"My internet is flaking and now I can't run any programs"

"My internet is slow so running cowsay took half an hour"

30 Jun 2017 • C++ tricks: least effort conditional breakpoints

Let's say you want to place a breakpoint deep in some leaf code, but only when the user presses a key.

For a more concrete example, my recent refactoring broke collision detection on some parts of the map. I want to be able to point the camera at a broken spot, press a key, and step through the collision routines to see what went wrong. My terrain collision routines use quadtrees and hence are recursive, and I'd like to be able to break fairly close to the leaf nodes to minimise the amount of stepping, but still before the leaf nodes in case anything interesting happens.

Debuggers have conditional breakpoints but I doubt they can express something so complex, and I don't want to learn another shitty meta language on top of the real programming language I already use which is inevitably different for each debugger I use.

Obviously a simple hack is to add a global variable, but this happens so often it would be nice to leave them in the entire time. In my case I added extern bool break1; extern bool break2; etc to one of my common headers, put bool break1 = false; bool break2 = false; etc in breakbools.cc, and added that to my list of common objects.

(2024 update: you can just do inline bool break1 = false; in a header now)

Then adding the breakpoint I want is very simple. High up in my frame loop I add break1 = input->keys[ KEY_T ], and in my collision routine I add something like if( break1 && node bounding box is sufficiently small ) asm( "int $3" ), and it does exactly what I want. (for MSVC you need __debugbreak(); instead of int 3)
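
Put together it could look something like the sketch below. QuadNode, visit and break_below_size are hypothetical stand-ins for my actual collision code, and the non-MSVC branch uses raise( SIGTRAP ) in place of int 3 purely so the sketch isn't tied to x86.

```cpp
#include <assert.h>
#include <signal.h>

#if defined( _MSC_VER )
#include <intrin.h>
#define DEBUG_BREAK() __debugbreak()
#else
// assumption: POSIX. asm( "int $3" ) does the same job on x86
#define DEBUG_BREAK() raise( SIGTRAP )
#endif

// C++17 inline variable, lives in a common header
inline bool break1 = false;

// Hypothetical quadtree node
struct QuadNode {
	float size;
	const QuadNode * children[ 4 ];
};

// Walks the tree and returns how many nodes it visited. The breakpoint only
// fires when the flag is set and we're close to the leaves.
int visit( const QuadNode * node, float break_below_size ) {
	if( node == nullptr ) {
		return 0;
	}
	if( break1 && node->size < break_below_size ) {
		DEBUG_BREAK();
	}
	int visited = 1;
	for( const QuadNode * child : node->children ) {
		visited += visit( child, break_below_size );
	}
	return visited;
}

inline void breakpoint_example() {
	QuadNode leaf = { 1.0f, { nullptr, nullptr, nullptr, nullptr } };
	QuadNode root = { 2.0f, { &leaf, &leaf, nullptr, nullptr } };
	break1 = false; // in the frame loop this would be break1 = input->keys[ KEY_T ];
	assert( visit( &root, 1.5f ) == 3 );
}
```

With break1 false the check costs one branch per node, which is why leaving these in all the time is fine.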

29 Jun 2017 • Writing installers for Windows

Writing a software installer for Windows is apparently a slog of people with weird configs and requests asking for things that are impossible to implement nicely.

Everyone has to do this and it's conceptually so trivial (extract an archive) so it's baffling how this is so difficult to get right, and it's crazy to think about how much time is wasted on this shit. I'm not a fuckup, and in total I've probably wasted several days on this.

The biggest roadblock is that Google is just completely worthless. You try to search for something and the results are saturated with absolute shit that's totally unrelated because for whatever reason Google puts huge weight on popular/recent articles that are only very loosely related to what you want. "Oh he has Windows and uninstall in the query, let's return millions of forum posts asking how to uninstall software!" etc. Of course that means this blog post is excruciatingly dull to write with no benefit because nobody can find it.

The hardest part is getting a nice system-wide or single-user installation without running into UAC sadness.

I know this is an extremely boring topic, but that's exactly why I want to write about it. When I run into stuff this dull my brain switches off and it takes me 10x longer than it should. If someone told me exactly how to deal with this up front and I could just autopilot through it would have been a huge win.

The ideal way would be to only request admin rights if they want to do a system-wide installation, which requires you to re-exec the installer and ask for admin, then implement some hacks to jump to the right screen. Also cross your fingers that browsers don't delete the installer as soon as it exits if you click "run" instead of "save". Not sure if any of them actually do that but it's a huge amount of testing that nobody wants to do and very fragile against the instability of webshit.

So you give up on doing it properly and always present a UAC dialog when they run the installer. To save you some time Googling, the right way to do this is with MultiUser.nsh (*). The docs for it are ok, but it crucially doesn't cover how the uninstaller should identify what version it should remove. Ideally you should be able to install both system-wide and per-user at the same time, and be able to uninstall them both separately (not because this is a valuable thing to do but because it shows that your uninstaller can figure out the right thing to do). The answer is !define MULTIUSER_INSTALLMODE_COMMANDLINE and add /$MultiUser.InstallMode to the end of your UninstallString key (so the installer stores what mode control panel should run the uninstaller with). You DON'T need to do anything funny to make sure your uninstaller registry keys get written to the right place (HKLM for system-wide, HKCU for single-user), just use WriteRegStr SHCTX ... and it'll do the right thing.

(*): I came back and reread this post. By "the right way" I don't mean that MultiUser.nsh is good. It gets a lot of things wrong, but it's the least bad option. This other plugin looks a bit better but I didn't try it.

Btw have fun testing this. You have to log in and out after every one-line change (sloooooooow on Windows), and you'll never notice if you break something later on because surely you aren't going to leave UAC enabled.

Another topic that's annoying to get right: uninstaller signing. Make a stub installer that only writes the uninstaller and quits, sign the uninstaller, then add it with File, ...

25 Jun 2017 • C++ tricks: NO_INIT

This one is very simple and I'm surprised I've not written it down already.

Default initialisation is widely considered to be good, but if you're being a performance nut you might want to opt-out. In D you can do int x = void;. Rust apparently has mem::uninitialized(). You can do the same thing in C++ thusly:

enum class NoInit { DONT };
#define NO_INIT NoInit::DONT // if you want
struct v3 {
	float x, y, z;
	v3() { x = y = z = 0; }
	v3( NoInit ) { }
};

v3 a;
v3 b( NO_INIT );
// a = (0, 0, 0) b = (garbage)

On a similar note, in my code I prefer to design my structs so that all zeroes is the default state, and then my memory managers all memset( 0 ) when you allocate something. I find it easier than getting proper construction right, and I've heard that echoed by a few other people so I guess it's not a totally bunk idea.
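
As a rough illustration of the zeroed allocation idea, here's a sketch with a made-up alloc_zeroed wrapper and a made-up Entity struct. It assumes plain structs with no constructors, and it is not my actual memory manager.

```cpp
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical allocator wrapper: everything comes back memset to zero
template< typename T >
T * alloc_zeroed( size_t count = 1 ) {
	void * mem = malloc( sizeof( T ) * count );
	if( mem == nullptr ) {
		return nullptr;
	}
	memset( mem, 0, sizeof( T ) * count );
	return ( T * ) mem;
}

// Designed so that all zeroes is a sensible default: at the origin, no
// health, not alive
struct Entity {
	float x, y, z;
	int health;
	bool alive;
};

inline void alloc_example() {
	Entity * e = alloc_zeroed< Entity >();
	assert( e != nullptr );
	assert( e->x == 0.0f && e->y == 0.0f && e->z == 0.0f );
	assert( e->health == 0 );
	assert( !e->alive );
	free( e );
}
```

The catch is that every field has to be designed around it, e.g. storing alive rather than dead, so that zero means "nothing here yet".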

11 Apr 2017 • bug489729

bug489729 was an awesome Firefox extension that disabled the shit where dragging a tab (which happens all the time by accident) off the tab bar causes it to open in a new window (which takes like a full second on a 6700K and makes all my windows resize)

(and of course you can't turn it off without a fucking extension)

I'm rehosting it here, mostly for my own convenience: bug.xpi

1 Apr 2017 • C++ tricks: better casting

C style casts are not awesome. Their primary use is to shut up conversion warnings when you assign a float to an int etc. This is actually harmful and can mask actual errors down the line when you change the float to something else and it starts dropping values in the middle of your computation.

Some other nitpicks are that they are hard to grep for and can be hard to parse.

In typical C++ fashion, static_cast and friends solve the nitpicks but do absolutely nothing about the real problem. Fortunately, C++ gives us the machinery to solve the problem ourselves. This first one is copied from Charles Bloom:

template< typename To, typename From >
inline To checked_cast( const From & from ) {
	To result = To( from );
	ASSERT( From( result ) == from );
	return result;
}

If you're ever unsure about a cast, use checked_cast and it will assert if the cast starts eating values. Even if you are sure, use checked_cast anyway for peace of mind. It lets you change code freely without having to worry about introducing tricky bugs.
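
A quick sketch of how that behaves in practice. ASSERT is mapped to plain assert here because the real macro isn't shown, and cast_round_trips is a made-up helper that returns the same round-trip test instead of asserting on it, so the failing cases can be shown without crashing.

```cpp
#include <assert.h>

#define ASSERT( p ) assert( p ) // assumption: stands in for the codebase's own ASSERT

template< typename To, typename From >
inline To checked_cast( const From & from ) {
	To result = To( from );
	ASSERT( From( result ) == from );
	return result;
}

// Hypothetical helper: the same check as checked_cast, returned as a bool
template< typename To, typename From >
inline bool cast_round_trips( const From & from ) {
	return From( To( from ) ) == from;
}

inline void checked_cast_example() {
	// these survive the round trip
	assert( checked_cast< int >( 3.0f ) == 3 );
	assert( checked_cast< unsigned char >( 255 ) == 255 );

	// these would trip the ASSERT inside checked_cast
	assert( !cast_round_trips< int >( 3.5f ) ); // drops the .5
	assert( !cast_round_trips< unsigned char >( 256 ) ); // wraps to 0
}
```

One thing the round trip doesn't catch is a sign change: on mainstream compilers casting -1 to unsigned and back gives -1 again, so that passes the check even though the intermediate value is huge.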

Another solution is to specify the type you're casting from as well as the type you're casting to. The code for this is a bit trickier:

template< typename S, typename T >
struct SameType {
	enum { value = false };
};
template< typename T >
struct SameType< T, T > {
	enum { value = true };
};

#define SAME_TYPE( S, T ) SameType< S, T >::value

template< typename From, typename To, typename Inferred >
To strict_cast( const Inferred & from ) {
	STATIC_ASSERT( SAME_TYPE( From, Inferred ) );
	return To( from );
}

You pass the first two template arguments and leave the last one to template deduction, like int a = strict_cast< float, int >( 1 ); (which explodes). I've not actually encountered a situation where this is useful yet, but it was a fun exercise.
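
For completeness, a compilable sketch of the above with a usage example. STATIC_ASSERT isn't shown in this post, so it's defined here in terms of static_assert as an assumption.

```cpp
#include <assert.h>

// assumption: stands in for the codebase's own STATIC_ASSERT
#define STATIC_ASSERT( p ) static_assert( bool( p ), #p )

template< typename S, typename T >
struct SameType {
	enum { value = false };
};
template< typename T >
struct SameType< T, T > {
	enum { value = true };
};

#define SAME_TYPE( S, T ) SameType< S, T >::value

template< typename From, typename To, typename Inferred >
To strict_cast( const Inferred & from ) {
	STATIC_ASSERT( SAME_TYPE( From, Inferred ) );
	return To( from );
}

inline void strict_cast_example() {
	float f = 1.5f;
	int a = strict_cast< float, int >( f ); // fine, f really is a float
	assert( a == 1 );

	// int b = strict_cast< float, int >( 1 ); // doesn't compile, 1 is an int
}
```

So if f later changes to a double, every strict_cast< float, ... >( f ) stops compiling and you get to look at each one.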

Maybe it's good for casting pointers?

25 Mar 2017 • Least effort unit tests

I wanted a C++ unit testing library that isn't gigantic and impossible to understand, doesn't blow up compile times, doesn't need much boilerplate, doesn't put the testing code miles from the code being tested, and doesn't have its own silly build requirements that make it a huge pain in the ass to use. Unfortunately all the C++ testing libraries are either gigantic awful monoliths (e.g. googletest), or tiny C libraries that are a little too inconvenient to actually use (e.g. minunit).

Ideally it wouldn't give you awful compiler errors when you get it wrong but that's probably impossible.

Behold:

#pragma once

#if defined( UNITTESTS )

#include <stdio.h>

#define CONCAT_HELPER( a, b ) a##b
#define CONCAT( a, b ) CONCAT_HELPER( a, b )
#define COUNTER_NAME( x ) CONCAT( x, __COUNTER__ )

#define AT_STARTUP( code ) \
	namespace COUNTER_NAME( StartupCode ) { \
		static struct AtStartup { \
			AtStartup() { code; } \
		} AtStartupInstance; \
	}

#define UNITTEST( name, body ) \
	namespace { \
		AT_STARTUP( \
			int passed = 0; \
			int failed = 0; \
			puts( name ); \
			body; \
			printf( "%d passed, %d failed\n\n", passed, failed ); \
		) \
	}

#define TEST( p ) \
	if( !( p ) ) { \
		failed++; \
		puts( "    FAIL: " #p ); \
	} \
	else { \
		passed++; \
	}

#define private public
#define protected public

#else

#define UNITTEST( name, body )
#define TEST( p )

#endif

It uses the nifty cinit constructor trick to run your tests before main, and you can dump UNITTESTs anywhere you like (at global scope). Example usage:

#include <stdio.h>
#define UNITTESTS // -DUNITTESTS, etc
#include "ggunit.h"

int main() {
	printf( "main\n" );
	return 0;
}

UNITTEST( "testing some easy stuff", {
	TEST( 1 == 1 );
	TEST( 1 == 2 );
} );

UNITTEST( "testing some more easy stuff", {
	for( size_t i = 0; i <= 10; i++ ) {
		TEST( i < 10 );
	}
} );

which prints:

testing some easy stuff
    FAIL: 1 == 2
1 passed, 1 failed

testing some more easy stuff
    FAIL: i < 10
10 passed, 1 failed

main

It would be great if you could put UNITTESTs in the middle of classes etc to test private functionality, but you can't and the simplest workaround is #define private/protected public. Super cheesy but it works. It's not perfect but it works. It's ugly. It's not a Theoretically Awesome Injection Framework Blah Blah. It works and is two lines, you can't beat it.

2024 update: Carmack approved lol

23 Mar 2017 • Caches are fast, hashes are fast

Or, how to make a C/C++ build system in 2017

Here's a problem I've been having at work a lot lately:

Put a change set up for review
Someone asks me to split a commit into its own change set so we can merge it faster
Make a new branch, cherry-pick, and swap back
Keep working on my original changes. At some point I will want to build them
MSBuild/make see that lots of files have new timestamps because I checked out master, and rebuild all of them even though they haven't actually changed
C++ is stupid and my Windows build is in a VM so it takes fucking ages

Obviously the dream solution here would be to have good compilers(*) and/or a good language, but neither of those are going to happen any time soon.

*: as an aside that would solve one of the big pain points with C++. Everybody goes to these massive efforts splitting up the build so they can build incrementally which introduces all of its own headaches and tracking dependencies and making sure everything is up to date and etc and it's just awful. If compilers were fast we could just build everything at once every time and not have to deal with it.

Anyway since they aren't going to happen, the best we can do is throw computing power at the problem. Most build systems construct some kind of dependency graph, e.g. bin.exe depends on mod1.obj which depends on mod1.cpp, then look at what has been modified and recompile everything that depends on it. Specifically they look at the last modified times, and if any of a file's dependencies are newer than it then it needs rebuilding.

Maybe that was a good idea decades ago, but these days everything is in cache all of the time and CPUs are ungodly fast, so why not take advantage of that, and just actually check if a file's contents are different? I ran some experiments with this at work. We have 55MB of code (including all the 3rd party stuff we keep in the repo - btw the big offenders are like qt and the FBX SDK, it's not our codebase with the insane bloat), and catting it all takes 50ms. We have hashes that break 10GB/s (e.g. xxhash), which will only add like 5 ms on top of that. (and probably much closer to 0 if you parallelise it)

So 55ms. I'm pretty sure we don't have a single file in our codebase that builds in 55ms.

From this point it's pretty clear what to do: for each file you hash all of its inputs and munge them together, and if the final hash is different from last time you rebuild. Don't cache any of the intermediate results just do all the work every time, wasting 55ms per build is much less bad than getting it wrong and wasting minutes of my time. Btw inputs should also include things like command line flags, like how ninja does it.
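
A minimal sketch of the hashing part, under a pile of assumptions: BuildStep, hash_build_step and needs_rebuild are made-up names, file contents are passed in as strings rather than read from disk, and it uses FNV-1a only because it fits in a few lines. A real version would want something fast like xxhash.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string>
#include <vector>

// FNV-1a, swapped in for xxhash to keep the sketch self-contained
inline uint64_t fnv1a( const void * data, size_t n, uint64_t hash ) {
	const uint8_t * bytes = ( const uint8_t * ) data;
	for( size_t i = 0; i < n; i++ ) {
		hash ^= bytes[ i ];
		hash *= 0x100000001b3ull; // 64 bit FNV prime
	}
	return hash;
}

// Hypothetical description of one build step
struct BuildStep {
	std::string command_line; // compiler and flags
	std::vector< std::string > inputs; // contents of the source file and everything it includes
};

// Munge the command line and every input together into one hash
inline uint64_t hash_build_step( const BuildStep & step ) {
	uint64_t hash = 0xcbf29ce484222325ull; // 64 bit FNV offset basis
	hash = fnv1a( step.command_line.data(), step.command_line.size(), hash );
	for( const std::string & contents : step.inputs ) {
		// hash the length too, so shuffling bytes between neighbouring inputs changes the result
		uint64_t len = contents.size();
		hash = fnv1a( &len, sizeof( len ), hash );
		hash = fnv1a( contents.data(), contents.size(), hash );
	}
	return hash;
}

inline bool needs_rebuild( const BuildStep & step, uint64_t hash_from_last_build ) {
	return hash_build_step( step ) != hash_from_last_build;
}

inline void hash_example() {
	BuildStep step = { "g++ -O2 -c mod1.cpp", { "int f() { return 1; }", "#pragma once" } };
	uint64_t last = hash_build_step( step );
	assert( !needs_rebuild( step, last ) ); // same contents, timestamps don't come into it

	step.inputs[ 0 ] = "int f() { return 2; }";
	assert( needs_rebuild( step, last ) ); // contents changed

	step.inputs[ 0 ] = "int f() { return 1; }";
	step.command_line = "g++ -O3 -c mod1.cpp";
	assert( needs_rebuild( step, last ) ); // flags changed
}
```

Nothing is cached between runs here on purpose, the whole point is that redoing all of it every build is cheap enough.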

The only slightly hard part is making sure hashes don't get out of sync with reality. Luckily I'm a genius and solved that too: you just put the hash in the object file/binary. With ELF files you can just add a new section called .build_input_hash or something and dump it in there, presumably you can do the same on Windows too (maybe IMAGE_SCN_LNK_INFO? I spent a few minutes googling and couldn't find an immediate answer).

For codegen stages you would either just run them all the time or do the timestamp trick I guess, since we are ignoring their timestamps and hopefully your codegen is not slow enough for it to matter very much.

Anyone want to work on this? I sure don't because my god it's boring, but I wish someone else would.

UPDATE: I've been told you can work around my specific example with git branch and git rebase --onto (of course), but this would still be nice to have.

2 Mar 2017 • C++ tricks: ZERO

After writing memset( &x, 0, sizeof( x ) ); for the millionth time, you might start to get lazy and decide it's a good idea to #define ZERO( p ) memset( p, 0, sizeof( *p ) );. This turns out to be very easy to misuse:

int x;
ZERO( &x ); // cool
int y[ 8 ];
ZERO( &y ); // cool
ZERO( y ); // y[ 0 ] = 0, no warnings
int * z = y;
ZERO( z ); // y[ 0 ] = 0
ZERO( &z ); // z = NULL

You can try things like making ZERO take a pointer instead, but you still always end up with cases where the compiler won't tell you that you screwed up. The problem is that there's no way for ZERO to do the right thing to a pointer because it can't know how big the object being pointed at is. The simplest solution is to simply not allow that:

template< typename T > struct IsAPointer { enum { value = false }; };
template< typename T > struct IsAPointer< T * > { enum { value = true }; };

template< typename T >
void zero( T & x ) {
	static_assert( !IsAPointer< T >::value );
	memset( &x, 0, sizeof( x ) );
}

and as a bonus, we can use the same trick from last time to make it work on fixed-size arrays too:

template< typename T, size_t N >
void zero( T ( &x )[ N ] ) {
	memset( x, 0, sizeof( T ) * N );
}

Neat! (maybe)
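
Here's a self-contained sketch of both overloads with a usage example, assuming a compiler with static_assert. Note the array overload takes a reference to the array, which is what lets N be deduced.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

template< typename T > struct IsAPointer { enum { value = false }; };
template< typename T > struct IsAPointer< T * > { enum { value = true }; };

template< typename T >
void zero( T & x ) {
	static_assert( !IsAPointer< T >::value, "zero() can't know how big the pointed-at object is" );
	memset( &x, 0, sizeof( x ) );
}

// Picked over the overload above when you pass a fixed-size array
template< typename T, size_t N >
void zero( T ( &x )[ N ] ) {
	memset( x, 0, sizeof( T ) * N );
}

inline void zero_example() {
	int x = 5;
	zero( x );
	assert( x == 0 );

	int y[ 8 ] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	zero( y ); // clears all eight elements
	for( size_t i = 0; i < 8; i++ ) {
		assert( y[ i ] == 0 );
	}

	// int * z = y;
	// zero( z ); // doesn't compile, the static_assert fires
}
```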

2 Mar 2017 • C++ tricks: safe ARRAY_COUNT

Lots of C/C++ codebases have a macro for finding the number of elements in a fixed size array. It's usually defined as #define ARRAY_COUNT( a ) ( sizeof( a ) / sizeof( ( a )[ 0 ] ) ), which is great:

int asdf[ 4 ]; // ARRAY_COUNT( asdf ) == 4

until someone comes along and decides that asdf needs to be dynamically sized and changes it to be a pointer instead:

int * asdf; // ARRAY_COUNT( asdf ) == sizeof( int * ) / sizeof( int ) != 4

Now every piece of code that uses ARRAY_COUNT( asdf ) is broken, which is annoying by itself, but that still looks totally fine to the compiler and it's not even going to warn you about it.

Well you can fix it by doing this:

template< typename T, size_t N >
constexpr size_t ARRAY_COUNT( const T ( &arr )[ N ] ) {
	return N;
}

which correctly explodes when you pass it a pointer:

a.cc: In function ‘int main()’:
a.cc:9:27: error: no matching function for call to ‘ARRAY_COUNT(int*&)’
  return ARRAY_COUNT( asdf );
			   ^
a.cc:3:18: note: candidate: template<class T, long unsigned int N> constexpr size_t ARRAY_COUNT(const T (&)[N])
 constexpr size_t ARRAY_COUNT( const T ( &arr )[ N ] ) {
		  ^~~~~~~~~~~
a.cc:3:18: note:   template argument deduction/substitution failed:
a.cc:9:27: note:   mismatched types ‘const T [N]’ and ‘int*’
  return ARRAY_COUNT( asdf );

31 Jan 2017 • Dumping a git repository to an encrypted zip file

ADDENDUM: This is trash, use gitolite to give your work PC read only access instead.

I want to be able to access my dotfiles repository from anywhere without actually giving people public access. I don't want to fuck about making a new user account with restricted shell/setting up a massive web server/etc.

The simplest solution I can think of is making a post-receive hook that dumps the repository to an encrypted zip and copying that to the (static) web root, which is done like this:

#! /bin/sh

OUT=/path/to/webroot/dotfiles.7z

rm "$OUT"
git archive master | 7z a -sidotfiles.tar -ppassword -mhe=on "$OUT"

Ok so it's a 7z not a zip, but some zip implementations (like Windows Explorer) only support shitty encryption so you were going to have to install 7zip anyway.

25 Jan 2017 • Windows post-install for developers

This is just a checklist for myself covering what to do with a fresh Windows installation. It covers disabling all the annoying crap Windows comes with by default, updating manually because Windows Update is broken in Windows 7 SP1, and a list of handy programs.

Install drivers and reboot.
Go to services.msc, stop and disable: Superfetch, Windows Defender, Windows Firewall, Windows Search.
Go to Control Panel and view by small icons. Go to Administrative Tools, Computer Management, Local Users and Groups, Users, right click Administrator and enable it. Log in as Administrator. This is now your user account, and you can delete your old one.
Right click the start menu, click properties, use small icons, never combine taskbar buttons, unlock, drag to left, lock. Under the start menu tab, click customise, then disable Devices and Printers, Games, Help, Highlight newly installed programs, Music, Pictures, and Use large icons.
Click the start menu, right click Computer, Advanced system settings (in the sidebar), Startup and Recovery settings, disable Automatically restart. Close that window and go to Performance settings. Uncheck lots of crap.
Right click the desktop, enable Windows Classic theme.
Go to Control Panel, then Action Center and disable UAC (in the sidebar), then go to Change Action Center settings (also in the sidebar) and disable problem reporting and all the messages (except for maybe Windows Update). Go to AutoPlay and disable that too. Go to Mouse and disable enhanced pointer precision. Go to Sound and select the No Sounds sound scheme. Go to Power Options and disable monitor/PC sleeping.
In the start menu, search for folder options. Go to view and show hidden files and known extensions.
In Computer, right click your C drive, go to Security, Advanced, Change Permissions, click your name, Edit, check full control, click OK, check "Replace all child permissions...", click OK.
Install Firefox.
Install MSVC Community 2013. Uncheck all the optional features. This download is huge so start it first!
Install the Windows 7 convenience rollup update and its dependencies.
Download Autoruns, Process Explorer, and Process Monitor.
Install Color Cop, AutoHotkey, 7-Zip, Everything.
Install Git for Windows, Notepad++, and Vim.
Install Renderdoc, Apitrace, Intel GPA, and the DirectX SDK.
Install GIMP, Inkscape, Blender, Wings3D, and Meshlab.
Download the Cygwin installer. Install tmux, openssh, lua and vim.
In a Cygwin shell, run ssh-host-config and follow the prompts. Then run chown cyg_server: /var/empty; chmod 700 /var/empty; net start sshd.

29 Dec 2016 • Billions

When people are trying to sell candidates on their company, they like to throw around statistics like "we do billions of Xs per year", or if they want to pull out the really big guns, "we do one billion Xs per day".

I (and presumably everyone else) hear these statistics and think "wow a billion is a big number that's impressive" and don't think too much more about it.

Until now! Let's figure out just how impressive these numbers are. Assuming 365 * 24 * 60 * 60 = 31,536,000 seconds in a year, one billion Xs per year is about 32 Xs per second, which is actually not impressive at all. If we try again with one billion per day (86,400 seconds in a day) we get about 11.6k per second, which is back into impressive territory... until you think about it.

My crappy game engine on my laptop with integrated graphics does 1M verts per second without the fans spinning up. AAA games routinely do 100x that - 4 orders of magnitude higher than 1b/day.

I should sell my engine by telling people it can do TRILLIONS of verts per year!!!!!!!!
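
If you don't trust my mental arithmetic, here's a rough sketch that makes the compiler check it. The names (PerSecond, BILLION, etc.) are made up for this example, and the division is integer division so everything rounds down:

```cpp
#include <stdint.h>

constexpr int64_t SECONDS_PER_DAY = 24 * 60 * 60;
constexpr int64_t SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY;
constexpr int64_t BILLION = 1000000000;

// integer division, so the result is rounded down
constexpr int64_t PerSecond( int64_t count, int64_t seconds ) {
	return count / seconds;
}

static_assert( SECONDS_PER_YEAR == 31536000, "seconds in a 365 day year" );

// one billion per year: 31.7/s, rounded down to 31
static_assert( PerSecond( BILLION, SECONDS_PER_YEAR ) == 31, "1b/year" );

// one billion per day: 11574/s
static_assert( PerSecond( BILLION, SECONDS_PER_DAY ) == 11574, "1b/day" );

// 100M verts/s is 8640x the 1b/day rate, i.e. roughly 4 orders of magnitude
static_assert( 100 * 1000000 / PerSecond( BILLION, SECONDS_PER_DAY ) == 8640, "AAA vs 1b/day" );
```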

23 Apr 2016 • Auto-mounting removable drives

Put this in /etc/udev/rules.d/10-automount.rules:

KERNEL!="sd[c-z][1-9]", GOTO="media_by_label_auto_mount_end"

# Global mount options
ACTION=="add", ENV{mount_options}="relatime,users,sync"

# Filesystem specific options
ACTION=="add", PROGRAM=="/lib/initcpio/udev/vol_id -t %N", RESULT=="vfat|ntfs", ENV{mount_options}="$env{mount_options},utf8,gid=100,umask=002"
ACTION=="add", PROGRAM=="/lib/initcpio/udev/vol_id --label %N", ENV{dir_name}="%c"
ACTION=="add", PROGRAM!="/lib/initcpio/udev/vol_id --label %N", ENV{dir_name}="usbhd-%k"
ACTION=="add", RUN+="/bin/mkdir -p /mnt/%E{dir_name}", RUN+="/bin/mount -o $env{mount_options} /dev/%k /mnt/%E{dir_name}"
ACTION=="remove", ENV{dir_name}=="?*", RUN+="/bin/umount -l /mnt/%E{dir_name}", RUN+="/bin/rmdir /mnt/%E{dir_name}"
LABEL="media_by_label_auto_mount_end"

The first line only matches partitions on sdc through sdz, on the assumption that sda and sdb are your non-removable drives, so you need to change it (specifically the [c-z] bit) if you have more (or fewer) than two of them. I don't know exactly how the rest works, but it does the job. I copied it from the Arch wiki years ago and I'm putting it here for my own reference.

27 Dec 2015 • Moving to OpenBSD

My last VPS provider got bought out, so I figured now would be a good time to buy a more respectable (i.e. not found on lowendstock) server, which also gives me the opportunity to experiment with OpenBSD.

As expected, almost everything has gone smoothly. However, there have been a couple of pain points, which I'll document here for future me and any lucky Googlers.

tinc

If you copy and paste a working config from a Linux server, the clients can all ping the server, but you get an error like "No route to host" when you try it the other way around. The fix turned out to be a pf one-liner:

pass out on tun0 to 10.0.0.0/8

(where tun0 is my VPN's interface and 10.0.0.0/8 is my VPN's subnet)

luarocks

There isn't a luarocks port yet, and building can be slightly annoying. It goes like this:

./configure --sysconfdir=/etc/luarocks \
	--lua-version=5.1 \
	--lua-suffix=51 \
	--with-lua-include=/usr/local/include/lua-5.1
make build
make install

The configure lines for 5.2 and 5.3 look like you would expect. The install step creates a symlink from luarocks to luarocks-5.x so if you are lazy like me you should correct that to point at your favourite version.

lua-ev

The latest stable release doesn't work with Lua 5.3, and for some reason the lua-ev rock doesn't look in /usr/local/include. We can fix the former by installing the scm version, and the latter with CFLAGS:

luarocks install \
	https://luarocks.org/manifests/brimworks/lua-ev-scm-1.rockspec \
	CFLAGS=-I/usr/local/include

11 Nov 1111 • About me meta-post

This is Mike's blog and also the landing page for Thanks Again Oy. It sits at the intersection of automated trading and gamedev, which is to say I mostly shitpost about C++.