Count the 1 bits in an integer – interview question and branch predictor crash course
Let's say the integer we want to count the "1" bits in is:
698 (decimal)
In this case the binary representation is:
0010 1011 1010 (binary)
so the answer is: 6
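The worked example can be double-checked with the JDK's built-in helpers (Long.toBinaryString and Long.bitCount are standard library methods; the class name here is mine):

```java
public class Check {
    public static void main(String[] args) {
        // Binary representation of 698, without leading zeros:
        System.out.println(Long.toBinaryString(698L)); // prints 1010111010
        // Number of 1 bits:
        System.out.println(Long.bitCount(698L));       // prints 6
    }
}
```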
Let’s look at three simple implementations and try to guess which one would be faster and why:
(Note: the code is in Java, where >>> is the unsigned right shift. In most other languages a right shift on an unsigned value is simply >>.)
Solution #1 – count to fixed size
int cnt = 0;
long l = getLongToCount();
int bits = 64;
for (int k = 0; k < bits; k++) {
    cnt += l & 1;
    l >>>= 1;
}
Solution #2 – count while non zero
int cnt = 0;
long l = getLongToCount();
while (l != 0) {
    cnt += l & 1;
    l >>>= 1;
}
Solution #3 – count while non zero and add 1 only
int cnt = 0;
long l = getLongToCount();
while (l != 0) {
    if ((l & 1) == 1) {
        cnt++;
    }
    l >>>= 1;
}
Why is there a difference?
Even though all three solutions are correct, there is a significant performance difference between them. A modern CPU executes instructions in a pipeline: ideally the pipeline is always full and every stage is doing meaningful work. Without a branch predictor, every branch (an “if”, or any other condition, such as a loop test) would cause severe performance issues, because the processor would have to stall the pipeline and wait until the branch condition was evaluated.
The branch predictor helps with this: when a branch lies ahead, it guesses which way execution will go and keeps the pipeline working on that path. If the guess turns out to be wrong, the pipeline has to be emptied and re-filled with the instructions of the other path. This is quite an expensive operation, as the pipeline of a modern CPU can hold dozens of instructions.
This means that blindly adding two numbers is much cheaper than causing a possible branch misprediction with an “if” statement:
cnt += l & 1;        // very fast, around one clock cycle or less
if ((l & 1) == 1) {  // fast if predicted correctly, otherwise quite expensive
    cnt++;
}
The results
(Run on 10,000,000 64-bit longs.)
As expected, the branched version is significantly slower, but the “while” version is still somewhat faster than the fixed “for” loop. The reason is that the “while” loop performs fewer operations on average: it stops as soon as there are no more “1” bits left in the integer, while the “for” loop always executes all 64 iterations; even though they are cheap, they still have to be executed.
So is a branch always slower?
Evaluating a branch condition is usually cheap and fast, so adding a branch in itself won’t make the code slower. However, if the branch outcome is close to random, or follows a pattern too complex for the predictor to learn, the predictor will miss frequently, and every miss is expensive. In summary, if the branch outcome is stable or follows a simple pattern, the code will run (almost) as fast as branch-free code.
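This effect is easy to demonstrate with the classic sorted-versus-unsorted experiment: the same branch, over the same values, is cheap when the data is sorted (the outcome is predictable) and expensive when it is shuffled. A minimal sketch (class and method names are mine; single-pass timings are only indicative because of JIT warmup):

```java
import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    // Sums the elements >= 128; the branch outcome depends on the data pattern.
    static long sumAbove(int[] data) {
        long sum = 0;
        for (int v : data) {
            if (v >= 128) {   // predictable if data is sorted, nearly random otherwise
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000, 0, 256).toArray();
        int[] sorted = data.clone();
        Arrays.sort(sorted);  // same values, but now the branch pattern is stable

        long t0 = System.nanoTime();
        long a = sumAbove(data);
        long t1 = System.nanoTime();
        long b = sumAbove(sorted);
        long t2 = System.nanoTime();

        // Same result either way; on most hardware the sorted pass runs noticeably faster.
        System.out.printf("unsorted: %d ms, sorted: %d ms, equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```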
What is the fastest way to count the bits then?
All of the above samples are impractical in terms of performance, as counting the 1s of a single 64-bit integer takes 64 operations. The fastest solution typically uses some kind of lookup table to reduce the number of operations. For example, if we create a mapping in a simple array from all 4-bit values to their bit counts, like this:
[0] -> 0
[1] -> 1
[2] -> 1
[3] -> 2
...
[15] -> 4
then we can do the 64-bit counting in just 16 operations (4 bits at a time). However, significantly increasing the dictionary size will not keep increasing performance: if the table grows too big to fit into the cache, cache misses come into play.
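A minimal sketch of this table-based approach (class and method names are mine):

```java
public class PopCount {
    // Lookup table: NIBBLE_COUNTS[v] = number of 1 bits in the 4-bit value v.
    private static final int[] NIBBLE_COUNTS = new int[16];
    static {
        for (int v = 0; v < 16; v++) {
            int c = 0, x = v;
            while (x != 0) { c += x & 1; x >>>= 1; }
            NIBBLE_COUNTS[v] = c;
        }
    }

    // Counts the 1 bits of a 64-bit long, one nibble at a time: 16 lookups.
    static int countBits(long l) {
        int cnt = 0;
        for (int k = 0; k < 16; k++) {
            cnt += NIBBLE_COUNTS[(int) (l & 0xF)];
            l >>>= 4;
        }
        return cnt;
    }

    public static void main(String[] args) {
        System.out.println(countBits(698L)); // prints 6
    }
}
```

In production Java code, Long.bitCount is the right tool: the JIT typically replaces it with a single hardware popcount instruction on CPUs that have one.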