Faster division and modulo operation - the power of two
The power of two - fast division and modulo operations
There are some - admittedly rare - cases, when the division and modulo operations are responsible for a great percent of the total runtime.
Such case might be when a high performance circular array positioning is calculated and the pointers use the modulo (ptr = curr % ArraySize) operation to calculate the real position in the current array.
The good news is, that if all the variables are known at compile time, the compiler will help to optimise the division/modulo. The bad news is, that with a dynamic sized array, the compiler won't help anything.
Even worse news is that both the DIV and MOD operations are pretty slow on every CPU (or at least slower than other atomic operations).
Speeding it all up
If the pointer calculation in the previous example is a concern, there is a way to speed it up: use the powers of two! If the divider is a power of two, both the division and modulo operations become extremely simple:
However, because D is a power of two, these can be rewritten as:
As both the shift (>>) and AND (&) operations are extremely fast, the division/modulo operations can be tuned if the divider is a power of two and the code is accordingly (the current CPUs will not do this on the fly for you, you have to rewrite the code).
Rewritten division
If the input can be anything, we need to determine whether the divider is a power of two:
If a number is a power of two, it means it has only a single bit set to 1. If we subtract 1 from this number, we have all the positions set to 1 on the right of the original bit:
If we AND these numbers and the result is zero, it was a power of two!
Be warned, that this operation takes time so do it only once, outside the loop otherwise all performance gain will be lost!
DIV
For dividing, we need to find the number of shifts:
Be warned, that this operation takes time so do it only once, outside the loop otherwise all performance gain will be lost!
If the divider is 8, we will have 3 if the "shift" variable (shifting 1 right is the same as dividing by 2).
From here on, all we need to do is substitute the division with the shifting:
MOD
For modulo operation, it's even easier. After deciding whether the number is a power of two, just create a simple AND mask:
Then replace the MOD operation with and AND:
Results
The results are actually quite impressive. I've tested on both an i7 and an Atom processor and the results where the between 10-20x faster than with simple division or modulo (Ref is the empty cycle to give reference for standard speed; Pow2DivMod ran when the divider was power of two and DivMod ran in all other cases).
There are some - admittedly rare - cases, when the division and modulo operations are responsible for a great percent of the total runtime.
Such case might be when a high performance circular array positioning is calculated and the pointers use the modulo (ptr = curr % ArraySize) operation to calculate the real position in the current array.
The good news is, that if all the variables are known at compile time, the compiler will help to optimise the division/modulo. The bad news is, that with a dynamic sized array, the compiler won't help anything.
Even worse news is that both the DIV and MOD operations are pretty slow on every CPU (or at least slower than other atomic operations).
Speeding it all up
If the pointer calculation in the previous example is a concern, there is a way to speed it up: use the powers of two! If the divider is a power of two, both the division and modulo operations become extremely simple:
The number X in decimal: 143 in binary: 1000 1111
The divisor D in decimal: 4 in binary: 0100
The result (with integers): X / D = 0010 0011 = 35 X % D = 0000 0011 = 3
However, because D is a power of two, these can be rewritten as:
X / D = X >> 2 = 0010 0011
(eliminating the last 2 bits; if the divider is 8 then the last 3 bits...) X % D = X & 0011 = 0000 0011
(looking at the bits on the right of the divider)
As both the shift (>>) and AND (&) operations are extremely fast, the division/modulo operations can be tuned if the divider is a power of two and the code is accordingly (the current CPUs will not do this on the fly for you, you have to rewrite the code).
Rewritten division
If the input can be anything, we need to determine whether the divider is a power of two:
if ((divider & (divider - 1)) == 0) { …
If a number is a power of two, it means it has only a single bit set to 1. If we subtract 1 from this number, we have all the positions set to 1 on the right of the original bit:
Number: 1000 Number-1: 0111
If we AND these numbers and the result is zero, it was a power of two!
Be warned, that this operation takes time so do it only once, outside the loop otherwise all performance gain will be lost!
DIV
For dividing, we need to find the number of shifts:
int shift = 0; int divtmp = divider; while (divtmp > 1) { divtmp = divtmp >> 1; shift++; }
Be warned, that this operation takes time so do it only once, outside the loop otherwise all performance gain will be lost!
If the divider is 8, we will have 3 if the "shift" variable (shifting 1 right is the same as dividing by 2).
From here on, all we need to do is substitute the division with the shifting:
// int x = i / divider; int x = i >> shift;
MOD
For modulo operation, it's even easier. After deciding whether the number is a power of two, just create a simple AND mask:
int and = (divider) – 1;
Then replace the MOD operation with and AND:
// int x = i % divider; int x = i & and;
Results
The results are actually quite impressive. I've tested on both an i7 and an Atom processor and the results where the between 10-20x faster than with simple division or modulo (Ref is the empty cycle to give reference for standard speed; Pow2DivMod ran when the divider was power of two and DivMod ran in all other cases).
I'd like to know why it has been so difficult to find an algebraic solution to the modular operation. Wouldn't this speed things up?
ReplyDelete