1 Oct 2017 • Really finishing the job

This is a follow-up to Rust performance: finishing the job.

Outsmarted by a crustacean

Over on lobste.rs, pbsd points out that my approach is bad and mentions a very neat trick. _mm_cmpeq_epi8 sets matching bytes to 0xFF (255), and subtracting 255 is equivalent to adding one in u8 arithmetic, so you can keep 16 counters in a register and increment them by subtracting the comparison mask. You have to stop every 255 chunks to make sure the counters don't overflow, but other than that it's quite simple. The hot loop becomes:

__m128i needles = _mm_set1_epi8( needle );
while( haystack < one_past_end - 16 ) {
	// 16 u8 counters, one per byte lane
	__m128i counts = _mm_setzero_si128();

	// at most 255 chunks per block so the u8 counters can't overflow
	for( int i = 0; i < 255; i++ ) {
		if( haystack >= one_past_end - 16 ) {
			break;
		}

		// aligned load: haystack must be 16-byte aligned here
		__m128i chunk = _mm_load_si128( ( const __m128i * ) haystack );
		// matching bytes become 0xFF, so subtracting adds 1 to those counters
		__m128i cmp = _mm_cmpeq_epi8( needles, chunk );
		counts = _mm_sub_epi8( counts, cmp );

		haystack += 16;
	}

	// horizontal sum: two u16 partial sums end up in lanes 0 and 4
	__m128i sums = _mm_sad_epu8( counts, _mm_setzero_si128() );
	u16 sums_[ 8 ];
	_mm_storeu_si128( ( __m128i * ) sums_, sums );
	c += sums_[ 0 ] + sums_[ 4 ];
}

Another neat trick is that we can use _mm_sad_epu8 against zero to add up the 16 byte counters at the end. It's slightly faster than storing the counts to u8[ 16 ] and summing them normally.
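For comparison, the straightforward horizontal sum that _mm_sad_epu8 replaces would look something like this. It's just a sketch, reusing the counts and c variables from the loop above:

	// spill the 16 u8 counters and add them up with scalar code --
	// correct, but slightly slower than the _mm_sad_epu8 version
	u8 counts_[ 16 ];
	_mm_storeu_si128( ( __m128i * ) counts_, counts );
	for( int i = 0; i < 16; i++ ) {
		c += counts_[ i ];
	}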

With the same test setup as last time, this runs in 2.01ms. Again it helps to unroll the loop manually: the inner loop is now simple enough that a 4x unroll pays off, and with that it runs in 1.92ms!
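For reference, here's one way the 4x unroll could look. This is a sketch rather than the exact code behind the 1.92ms figure: it keeps four separate counter registers so each one still sees at most 255 increments per block, and it assumes the same haystack, one_past_end, needle and c variables as before.

__m128i needles = _mm_set1_epi8( needle );
// processes 64 bytes per iteration; leftover bytes would be handled
// separately, e.g. by the 16-wide loop and then the scalar tail
while( haystack < one_past_end - 64 ) {
	__m128i counts0 = _mm_setzero_si128();
	__m128i counts1 = _mm_setzero_si128();
	__m128i counts2 = _mm_setzero_si128();
	__m128i counts3 = _mm_setzero_si128();

	// each counter register sees at most 255 increments, so no overflow
	for( int i = 0; i < 255; i++ ) {
		if( haystack >= one_past_end - 64 ) {
			break;
		}

		__m128i chunk0 = _mm_load_si128( ( const __m128i * ) haystack );
		__m128i chunk1 = _mm_load_si128( ( const __m128i * ) ( haystack + 16 ) );
		__m128i chunk2 = _mm_load_si128( ( const __m128i * ) ( haystack + 32 ) );
		__m128i chunk3 = _mm_load_si128( ( const __m128i * ) ( haystack + 48 ) );

		counts0 = _mm_sub_epi8( counts0, _mm_cmpeq_epi8( needles, chunk0 ) );
		counts1 = _mm_sub_epi8( counts1, _mm_cmpeq_epi8( needles, chunk1 ) );
		counts2 = _mm_sub_epi8( counts2, _mm_cmpeq_epi8( needles, chunk2 ) );
		counts3 = _mm_sub_epi8( counts3, _mm_cmpeq_epi8( needles, chunk3 ) );

		haystack += 64;
	}

	// each _mm_sad_epu8 result is at most 8 * 255 = 2040 per 64-bit lane,
	// so adding four of them still fits comfortably in a u16
	__m128i sums = _mm_sad_epu8( counts0, _mm_setzero_si128() );
	sums = _mm_add_epi64( sums, _mm_sad_epu8( counts1, _mm_setzero_si128() ) );
	sums = _mm_add_epi64( sums, _mm_sad_epu8( counts2, _mm_setzero_si128() ) );
	sums = _mm_add_epi64( sums, _mm_sad_epu8( counts3, _mm_setzero_si128() ) );

	u16 sums_[ 8 ];
	_mm_storeu_si128( ( __m128i * ) sums_, sums );
	c += sums_[ 0 ] + sums_[ 4 ];
}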

Branchless scalar code

The original code can be made branchless. The trick is to replace if( haystack[ i ] == needle ) c++; with c += haystack[ i ] == needle ? 1 : 0;, which can be computed with a CMP and a SETZ instead of a branch.
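Spelled out as a full function, the branchless scalar version looks like this (the function name and signature here are just for illustration):

#include <stddef.h>

// branchless scalar count: the comparison produces 0 or 1, which is added
// to the running total instead of being branched on
size_t count_scalar( const char * haystack, size_t n, char needle ) {
	size_t c = 0;
	for( size_t i = 0; i < n; i++ ) {
		c += haystack[ i ] == needle ? 1 : 0;
	}
	return c;
}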

GCC is smart enough to perform this optimisation already, even at -O2, so no benchmark for this one.

AVX2

AVX2 has 32-byte-wide (256-bit) versions of all the instructions we used in the SSE version, so the code changes are basically a find and replace.
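For reference, here's a rough sketch of what that find and replace produces. It isn't the exact code behind the measurement, and it uses an unaligned load (_mm256_loadu_si256) to sidestep the 32-byte alignment requirement:

__m256i needles = _mm256_set1_epi8( needle );
while( haystack < one_past_end - 32 ) {
	__m256i counts = _mm256_setzero_si256();

	for( int i = 0; i < 255; i++ ) {
		if( haystack >= one_past_end - 32 ) {
			break;
		}

		__m256i chunk = _mm256_loadu_si256( ( const __m256i * ) haystack );
		__m256i cmp = _mm256_cmpeq_epi8( needles, chunk );
		counts = _mm256_sub_epi8( counts, cmp );

		haystack += 32;
	}

	// _mm256_sad_epu8 gives four partial sums, one per 64-bit lane
	__m256i sums = _mm256_sad_epu8( counts, _mm256_setzero_si256() );
	u16 sums_[ 16 ];
	_mm256_storeu_si256( ( __m256i * ) sums_, sums );
	c += sums_[ 0 ] + sums_[ 4 ] + sums_[ 8 ] + sums_[ 12 ];
}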

Interestingly, the AVX2 version doesn't actually run any faster. I spoke to JM about it and he said I might be bottlenecked on memory bandwidth. The "lorem ipsum" string I search in is 42.6MB, so searching it in 1.92ms is 22.2GB/s. I have a single stick of 3000MHz RAM, which has a theoretical peak of about 24GB/s, so sure enough that is the bottleneck.

Conclusion

Here's the same table as last time but with the new results added:

Version          Time     % of -O2   % of -O3
Scalar -O2       21.3ms   100%       -
Scalar -O3       7.09ms   33%        100%
Old vectorised   2.74ms   13%        39%
Old unrolled     2.45ms   12%        35%
New vectorised   2.01ms   9%         28%
New unrolled     1.92ms   9%         27%

The new vectorised code is 10x faster than the original!

Full code

You can download the code I used to generate those results if you want to try it yourself. You'll also need the code from the last post if you want to compare before and after.