mikejsavage.co.uk • About • Archive • RSS • Thanks for blocking ads! Blocking ads owns: AdGuard for Safari / uBlock Origin for everything else
The reason cmov is not always is a win is that you have to block on the branch condition and on both sides of the branch.
To make that more explicit, say you have some code like this, and let's assume we take the true side of the branch.
int x;
bool p = [pred code]; // let's say it ends up being true
if( p ) {
[true code];
x = something;
}
else {
[false code];
x = something;
}
With branch prediction and out of order you can run [pred code]
in
parallel with [true code]
, and you don't need to bother running
[false code]
. If [pred code]
and [true code]
can overlap perfectly
then the branch becomes very nearly free.
With cmov, it looks like this:
bool p = [pred code]; // let's say it ends up being true
int xtrue = [true code];
int xfalse = [false code];
int x = xfalse;
x = cmov( x, xtrue, p );
The difference is that the end result of x
not only depends on just
[true code]
, but also on [pred code]
and [false code]
. You have to
wait for those results to come in before you can compute x
. Whereas
before you ran [pred code]
in parallel and didn't run [false code]
at all.
To decide between an explicit branch and cmov you need to figure out the expected costs of both. For cmov you just go look in the instruction tables and figure it out, for a branch it's slightly more work:
expected_cost_of_branch =
probability_of_misprediction * ( misprediction_penalty + latency_of_pred_code )
+ ( 1 - probability_of_misprediction ) * cost_of_predicted_branch
misprediction_penalty
is like 16-20 cycles, cost_of_predicted_branch
can be 1 or 2 cycles depending on whether it's a branch taken or not
taken (for Skylake) (look in the instruction tables).
latency_of_pred_code
is how long it takes to figure out you took the
wrong side of the branch.
That formula is actually incomplete, you have four cases. You have "correctly predicted and takes true side", "mispredicted and takes true side", "correctly predicated and takes false side", and "mispredicted and takes false side". With those probabilities you can also add in the cost of the code in each side of the branch, but those probabilities are harder to estimate, and just having the cost of the branch op can give you a decent idea of whether cmov will help or not.
WTF isn't there a cmov intrinsic? WTF isn't there a not-cmov intrinsic? The compiler authors will of course say "oh you need to do a lot of measurements to make sure that it's a win so it's best to just leave it to us". Ok but I have done the measurements, I know my branch is unpredictable and I drew the pipeline diagrams so I know cmov will be faster on average. Why are you making me waste a ton of time guessing the syntax that will make the compiler generate the right code? Jesus.