24 Jan 2024 • All perf zero quality BC4 encoding

BC4 encoding is pretty straightforward. You find the min/max to use as endpoints and do a little bit of funny arithmetic to compute the selectors. But what if we didn't care at all about quality?
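
For the record, the quality-focused version of that looks roughly like the sketch below. It's minimal and unoptimised, and not any particular library's code: it brute-forces the nearest palette entry instead of doing the usual selector arithmetic, and it reuses the same Span2D/RGBA8/BC4Block types as the real snippets further down.

static BC4Block ReferenceBC4( Span2D< const RGBA8 > rgba ) {
    // find the min/max alpha to use as the endpoints. BC4 uses its
    // eight-value mode when endpoint 0 > endpoint 1
    u8 lo = 255;
    u8 hi = 0;
    for( size_t i = 0; i < 16; i++ ) {
        u8 a = rgba( i % 4, i / 4 ).a;
        if( a < lo ) lo = a;
        if( a > hi ) hi = a;
    }

    BC4Block result;
    result.endpoints[ 0 ] = hi;
    result.endpoints[ 1 ] = lo;

    // the palette in selector order: both endpoints first, then six values
    // interpolated from endpoint 0 down towards endpoint 1
    u8 palette[ 8 ];
    palette[ 0 ] = hi;
    palette[ 1 ] = lo;
    for( int i = 2; i < 8; i++ ) {
        palette[ i ] = u8( ( ( 8 - i ) * hi + ( i - 1 ) * lo ) / 7 );
    }

    // pick the closest palette entry for every pixel and pack 16 3-bit selectors
    u64 indices = 0;
    for( size_t i = 0; i < 16; i++ ) {
        int a = rgba( i % 4, i / 4 ).a;
        u64 best = 0;
        int best_dist = 256;
        for( int j = 0; j < 8; j++ ) {
            int dist = a - palette[ j ];
            if( dist < 0 ) dist = -dist;
            if( dist < best_dist ) {
                best_dist = dist;
                best = j;
            }
        }
        indices |= best << ( i * 3 );
    }

    memcpy( result.indices, &indices, sizeof( result.indices ) );

    return result;
}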

The motivation for not caring about quality at all is that we have a lot of single-channel decals in Cocaine Diesel that are mostly pure white on a pure transparent background. For example:

and we were storing them as 4-channel RGBA PNGs because that's easy for artists to work with. PNGs that are 99% flat compress well on disk, but they always eat 32 bits per pixel in VRAM, which is 8x as much space as BC4. Once we accumulated a few hundred decals they started to cause problems on 1GB GPUs that would otherwise have had no trouble running the game, and they were also a lot slower to render than more bandwidth-efficient textures would have been. At this point we hadn't settled on an asset pipeline; I was hoping we could store source assets wherever possible and compile optimised assets along with the engine in CI for release builds, because handing a build system of terrible self-written compilers to non-developers is pain. So we set about converting hundreds of PNGs to BC4 at runtime.

The obvious place to start is an existing DXT compression library. There are loads: stb_dxt, squish, rgbcx, etc. This was a very long time ago and I didn't keep notes, so I have no benchmarks, but needless to say it was too slow. Even at 50ms per texture it adds 15s to the game startup time when you have 300 of them, which is way too much for a game that starts in two seconds in debug builds.

A simple way to make it faster is to just not compute accurate endpoints, and instead hardcode them to 0 and 255. Then we remap alpha from [0,256) to [0,8) to get our selectors, which is just a divide by 32, i.e. a right shift by 5. Finally we remap them again to the actual non-linear order BC4 uses with a LUT and pack them tightly. That looks like this:

struct BC4Block {
    u8 endpoints[ 2 ];
    u8 indices[ 6 ];
};

static BC4Block FastBC4( Span2D< const RGBA8 > rgba ) {
    BC4Block result;

    result.endpoints[ 0 ] = 255;
    result.endpoints[ 1 ] = 0;

    // BC4's eight-value mode (endpoint 0 > endpoint 1) decodes selector 0 to
    // endpoint 0, selector 1 to endpoint 1, and selectors 2-7 to six values
    // interpolated from endpoint 0 down to endpoint 1. With endpoints 255 and 0
    // that's roughly 255, 0, 219, 182, 146, 109, 73, 36, so this LUT maps each
    // alpha bucket (alpha >> 5) to the selector whose value lands back in that bucket
    constexpr u8 index_lut[] = { 1, 7, 6, 5, 4, 3, 2, 0 };

    u64 indices = 0;
    for( size_t i = 0; i < 16; i++ ) {
        // the top 3 bits of alpha pick the bucket, the LUT turns that into a selector
        u64 index = index_lut[ rgba( i % 4, i / 4 ).a >> 5 ];
        indices |= index << ( i * 3 );
    }

    memcpy( result.indices, &indices, sizeof( result.indices ) );

    return result;
}

static Span2D< BC4Block > RGBAToBC4( Span2D< const RGBA8 > rgba ) {
    Span2D< BC4Block > bc4 = AllocSpan2D< BC4Block >( sys_allocator, rgba.w / 4, rgba.h / 4 );

    for( u32 row = 0; row < bc4.h; row++ ) {
        for( u32 col = 0; col < bc4.w; col++ ) {
            Span2D< const RGBA8 > rgba_block = rgba.slice( col * 4, row * 4, 4, 4 );
            bc4( col, row ) = FastBC4( rgba_block );
        }
    }

    return bc4;
}
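
A quick way to convince yourself the LUT is right is to decode a block back out. Ignoring exact rounding, BC4's eight-value mode reconstructs like this, and with endpoints 255 and 0 every alpha bucket decodes back to a value inside the same bucket. This is just a throwaway sketch for checking, nothing in the engine needs it:

static void DecodeBC4Block( const BC4Block & block, u8 out_values[ 16 ] ) {
    // rebuild the palette the same way the hardware does: the two endpoints
    // at selectors 0 and 1, six interpolated values at selectors 2-7
    u8 palette[ 8 ];
    palette[ 0 ] = block.endpoints[ 0 ];
    palette[ 1 ] = block.endpoints[ 1 ];
    for( int i = 2; i < 8; i++ ) {
        palette[ i ] = u8( ( ( 8 - i ) * block.endpoints[ 0 ] + ( i - 1 ) * block.endpoints[ 1 ] ) / 7 );
    }

    u64 indices = 0;
    memcpy( &indices, block.indices, sizeof( block.indices ) );

    for( int i = 0; i < 16; i++ ) {
        out_values[ i ] = palette[ ( indices >> ( i * 3 ) ) & 7 ];
    }
}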

That was better, but still slow enough to be annoying. So the next obvious thing to try is vectorising it. Each row of a block is four pixels at four bytes per pixel stored contiguously, which is exactly a 128-bit SSE register. We can extract the alpha channel with PSHUFB and POR, shift it right by 5 and keep the bottom 3 bits of each byte to compute the selectors, do the LUT remap with another PSHUFB, and pack the resulting 3-bit selectors with PEXT. For a single block that's:

static BC4Block FastBC4( Span2D< const RGBA8 > rgba ) {
    BC4Block result;

    result.endpoints[ 0 ] = 255;
    result.endpoints[ 1 ] = 0;

    // in practice you would lift these out and not load them over and over.
    // each of these shuffle masks grabs the four alpha bytes (offsets 3, 7, 11, 15)
    // of one row and drops them into that row's 4-byte slot of the combined register
    __m128i alpha_lut_row0 = _mm_setr_epi8(  3,  7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 );
    __m128i alpha_lut_row1 = _mm_setr_epi8( -1, -1, -1, -1,  3,  7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1 );
    __m128i alpha_lut_row2 = _mm_setr_epi8( -1, -1, -1, -1, -1, -1, -1, -1,  3,  7, 11, 15, -1, -1, -1, -1 );
    __m128i alpha_lut_row3 = _mm_setr_epi8( -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  3,  7, 11, 15 );
    // same remap table as the scalar version. the upper eight entries are never
    // used because the lookup indices get masked to 0-7
    __m128i lut = _mm_setr_epi8( 1, 7, 6, 5, 4, 3, 2, 0, 9, 10, 11, 12, 13, 14, 15, 16 );
    __m128i mask = _mm_set1_epi8( 7 );

    // each row of the block is 16 contiguous bytes of RGBA, i.e. one full register
    __m128i row0 = _mm_load_si128( ( const __m128i * ) rgba.row( 0 ).ptr );
    __m128i row1 = _mm_load_si128( ( const __m128i * ) rgba.row( 1 ).ptr );
    __m128i row2 = _mm_load_si128( ( const __m128i * ) rgba.row( 2 ).ptr );
    __m128i row3 = _mm_load_si128( ( const __m128i * ) rgba.row( 3 ).ptr );

    // PSHUFB + POR to gather all 16 alpha bytes into one register
    __m128i block = _mm_or_si128(
        _mm_or_si128( _mm_shuffle_epi8( row0, alpha_lut_row0 ), _mm_shuffle_epi8( row1, alpha_lut_row1 ) ),
        _mm_or_si128( _mm_shuffle_epi8( row2, alpha_lut_row2 ), _mm_shuffle_epi8( row3, alpha_lut_row3 ) )
    );

    // there's no per-byte shift, so shift the 64-bit lanes and mask away the bits
    // that leak in from the neighbouring byte, leaving alpha >> 5 in each byte
    __m128i high_bits = _mm_and_si128( _mm_srli_epi64( block, 5 ), mask );

    // PSHUFB doubles as a 16-entry byte LUT, giving the remapped selectors
    __m128i selectors = _mm_shuffle_epi8( lut, high_bits );

    // PEXT with 0x07 in every byte packs the low 3 bits of each byte together,
    // 8 pixels per 64-bit half = 24 bits each
    u64 packed0 = _pext_u64( _mm_extract_epi64( selectors, 0 ), 0x0707070707070707_u64 );
    u64 packed1 = _pext_u64( _mm_extract_epi64( selectors, 1 ), 0x0707070707070707_u64 );
    u64 packed = packed0 | ( packed1 << 24 );

    memcpy( result.indices, &packed, sizeof( result.indices ) );

    return result;
}
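
As the comment says, in real code you'd lift the constants out of the per-block function and push the whole image through one loop, same shape as RGBAToBC4 above. Something along these lines, with unaligned loads so the source rows don't have to be 16-byte aligned:

static Span2D< BC4Block > RGBAToBC4_SIMD( Span2D< const RGBA8 > rgba ) {
    Span2D< BC4Block > bc4 = AllocSpan2D< BC4Block >( sys_allocator, rgba.w / 4, rgba.h / 4 );

    // built once, reused for every block
    const __m128i alpha_lut_rows[ 4 ] = {
        _mm_setr_epi8(  3,  7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 ),
        _mm_setr_epi8( -1, -1, -1, -1,  3,  7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1 ),
        _mm_setr_epi8( -1, -1, -1, -1, -1, -1, -1, -1,  3,  7, 11, 15, -1, -1, -1, -1 ),
        _mm_setr_epi8( -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  3,  7, 11, 15 ),
    };
    const __m128i lut = _mm_setr_epi8( 1, 7, 6, 5, 4, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0 ); // upper half unused
    const __m128i mask = _mm_set1_epi8( 7 );

    for( u32 row = 0; row < bc4.h; row++ ) {
        for( u32 col = 0; col < bc4.w; col++ ) {
            Span2D< const RGBA8 > block_pixels = rgba.slice( col * 4, row * 4, 4, 4 );

            // gather the 16 alpha bytes, one row at a time
            __m128i block = _mm_setzero_si128();
            for( int i = 0; i < 4; i++ ) {
                __m128i r = _mm_loadu_si128( ( const __m128i * ) block_pixels.row( i ).ptr );
                block = _mm_or_si128( block, _mm_shuffle_epi8( r, alpha_lut_rows[ i ] ) );
            }

            // bucket, remap, pack, exactly as in FastBC4
            __m128i selectors = _mm_shuffle_epi8( lut, _mm_and_si128( _mm_srli_epi64( block, 5 ), mask ) );
            u64 packed = _pext_u64( _mm_extract_epi64( selectors, 0 ), 0x0707070707070707_u64 )
                | ( _pext_u64( _mm_extract_epi64( selectors, 1 ), 0x0707070707070707_u64 ) << 24 );

            BC4Block out;
            out.endpoints[ 0 ] = 255;
            out.endpoints[ 1 ] = 0;
            memcpy( out.indices, &packed, sizeof( out.indices ) );
            bc4( col, row ) = out;
        }
    }

    return bc4;
}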

Again I have no benchmarks, but this was... better again but still too slow. It ended up being a fun experiment, although ultimately not good enough. These days we store the PNGs in a separate source assets repo, compress them with rgbcx (i.e. a good BC4 encoder) and zstd, and copy the resulting .dds.zst textures into the main repo by hand. It's not great, but it's also really not that bad, and automating it fully would be more trouble than it's worth for now.
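
The runtime side of that is just a zstd decompress before handing the DDS payload to the texture loader. A minimal sketch using the plain zstd one-shot API, with file reading, DDS header parsing and proper allocator use all elided:

#include <zstd.h>

// decompress a .dds.zst that has already been read into memory.
// returns NULL on any error. real code should also sanity check dds_size
static u8 * DecompressDDS( const u8 * zst, size_t zst_size, size_t * out_size ) {
    unsigned long long dds_size = ZSTD_getFrameContentSize( zst, zst_size );
    if( dds_size == ZSTD_CONTENTSIZE_UNKNOWN || dds_size == ZSTD_CONTENTSIZE_ERROR )
        return NULL;

    u8 * dds = ( u8 * ) malloc( dds_size );
    if( dds == NULL )
        return NULL;

    size_t decompressed = ZSTD_decompress( dds, dds_size, zst, zst_size );
    if( ZSTD_isError( decompressed ) || decompressed != dds_size ) {
        free( dds );
        return NULL;
    }

    *out_size = dds_size;
    return dds;
}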

We did keep the non-SIMD FastBC4 around for faster iteration when adding new textures, but nothing ever goes through it in release builds.