12 Sep 2018 • Branch prediction minutiae in LZ decoders

Say we have an LZSS inner decode loop like this (not good, just an example):

u8 ctrl = read_u8();

// match when MSB of ctrl is set, literal otherwise
if( ctrl >= 128 ) {
	u32 len = ctrl - 128;
	if( len == 127 )
		len += read_extra_length();
	len += MML;
	u16 offset = read_u16();
	// note: a real decoder needs an overlap-safe copy when offset < len
	memcpy( op, op - offset, len );
	op += len;
}
else {
	u32 len = ctrl;
	if( len == 127 )
		len += read_extra_length();
	memcpy( op, ip, len );
	op += len;
	ip += len;
}

We have an unpredictable branch deciding between literals and matches, and its misprediction penalties can eat a lot of time when you're hitting lots of short copies, which is the common case. There's also a branch for reading spilled length bytes, but we hit that less than 1% of the time, and when we do, the misprediction isn't such a big deal because we get lots of data out of it, so we're going to ignore it in this post.
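Both paths call read_extra_length when the length field saturates. We're ignoring its cost, but for concreteness, it might be a continuation-byte scheme like LZ4's, where each 255 byte means "add 255 and keep reading". The name and signature here are assumptions, not from the original:

```c
#include <stdint.h>

// Hypothetical read_extra_length(): LZ4-style continuation bytes.
// Accumulates bytes until one is < 255, advancing the input pointer.
static uint32_t read_extra_length( const uint8_t **ip )
{
	uint32_t extra = 0;
	uint8_t b;
	do {
		b = *(*ip)++;
		extra += b;
	} while( b == 255 );
	return extra;
}
```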

LZ4's solution to this is to always alternate between literals and matches, and send a 0-len literal when you need to send two matches in a row. That might look like this:

u8 ctrl = read_u8();
u32 literal_len = ctrl & 0xF;

if( literal_len == 15 )
	literal_len += read_extra_length();
memcpy( op, ip, literal_len );

op += literal_len;
ip += literal_len;

u32 match_len = ctrl >> 4;
u16 match_offset = read_u16();
if( match_len == 15 )
	match_len += read_extra_length();
match_len += MML;
memcpy( op, op - match_offset, match_len );

op += match_len;

So we got rid of the branch, but not really! memcpy has a loop of its own, and now we're polluting that loop branch's statistics with 0-len literals. This does end up being an improvement on modern CPUs though. From the Haswell section of Agner Fog's microarchitecture PDF:

3.8 Pattern recognition for conditional jumps

The processor is able to predict very long repetitive jump patterns with few or no mispredictions. I found no specific limit to the length of jump patterns that could be predicted. One study found that it stores a history of at least 29 branches. Loops are successfully predicted up to a count of 32 or a little more. Nested loops and branches inside loops are predicted reasonably well.

Modern CPUs are able to identify loops and perfectly predict the exit condition. A good memcpy copies 16 or 32 bytes per iteration, so we don't pay any misprediction penalty until the copy is at least 512 bytes long, at which point we don't care because we got so much data out of it.
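To make that concrete, here's a sketch of such a wide copy. It relies on the usual LZ-decoder assumption that both buffers have at least 15 bytes of slack past the region being copied, so overcopying is safe; the function name is made up:

```c
#include <stdint.h>
#include <string.h>

// Sketch of a wide copy: 16 bytes per iteration, rounding the length up,
// so the loop branch runs ceil(len/16) times instead of len times.
// Assumes both buffers have >= 15 bytes of slack past the end.
// NOT safe for overlapping match copies with offset < 16; those need
// a separate slow path.
static void wild_copy16( uint8_t *dst, const uint8_t *src, size_t len )
{
	uint8_t *end = dst + len;
	while( dst < end ) {
		memcpy( dst, src, 16 );  // one or two vector loads/stores after inlining
		dst += 16;
		src += 16;
	}
}
```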