cmov

30 Mar 2018 • cmov

The reason cmov is not always is a win is that you have to block on the branch condition and on both sides of the branch.

To make that more explicit, say you have some code like this, and let's assume we take the true side of the branch.

int x;
bool p = [pred code]; // let's say it ends up being true
if( p ) {
	[true code];
	x = something;
}
else {
	[false code];
	x = something;
}

With branch prediction and out of order you can run [pred code] in parallel with [true code], and you don't need to bother running [false code]. If [pred code] and [true code] can overlap perfectly then the branch becomes very nearly free.

With cmov, it looks like this:

bool p = [pred code]; // let's say it ends up being true
int xtrue = [true code];
int xfalse = [false code];
int x = xfalse;
x = cmov( x, xtrue, p );

The difference is that the end result of x not only depends on just [true code], but also on [pred code] and [false code]. You have to wait for those results to come in before you can compute x. Whereas before you ran [pred code] in parallel and didn't run [false code] at all.

To decide between an explicit branch and cmov you need to figure out the expected costs of both. For cmov you just go look in the instruction tables and figure it out, for a branch it's slightly more work:

expected_cost_of_branch =
	probability_of_misprediction * ( misprediction_penalty + latency_of_pred_code )
	+ ( 1 - probability_of_misprediction ) * cost_of_predicted_branch

misprediction_penalty is like 16-20 cycles, cost_of_predicted_branch can be 1 or 2 cycles depending on whether it's a branch taken or not taken (for Skylake) (look in the instruction tables). latency_of_pred_code is how long it takes to figure out you took the wrong side of the branch.

That formula is actually incomplete, you have four cases. You have "correctly predicted and takes true side", "mispredicted and takes true side", "correctly predicated and takes false side", and "mispredicted and takes false side". With those probabilities you can also add in the cost of the code in each side of the branch, but those probabilities are harder to estimate, and just having the cost of the branch op can give you a decent idea of whether cmov will help or not.

Rant time

WTF isn't there a cmov intrinsic? WTF isn't there a not-cmov intrinsic? The compiler authors will of course say "oh you need to do a lot of measurements to make sure that it's a win so it's best to just leave it to us". Ok but I have done the measurements, I know my branch is unpredictable and I drew the pipeline diagrams so I know cmov will be faster on average. Why are you making me waste a ton of time guessing the syntax that will make the compiler generate the right code? Jesus.