Friday, June 6, 2014

Beating the binary search algorithm – interpolation search, galloping search

Binary search is one of the simplest yet most efficient algorithms out there for looking up data in sorted arrays. The question is, can it be beaten by more sophisticated methods? Let’s have a look at the alternatives.

In some cases hashing the whole dataset is not feasible or the search needs to find the location as well as the data itself. In these cases the O(1) runtime cannot be achieved with hash tables, but O(log(n)) worst case runtime is generally available on sorted arrays with different divide and conquer approaches.

Before jumping to conclusions, it is worth to note that there are a lot of ways to “beat” an algorithm: required space, required runtime, and required accesses to the underlying data structure can be different priorities. For the following runtime and comparison tests different random arrays between 10,000 and 81,920,000 items were created with 4 byte integer elements. The keys were evenly distributed with an average increment of 5 between them.

Binary search

The binary search is a guaranteed runtime algorithm, whereas the search space is always halved at each step. Searching for a specific item in an array guaranteed to finish in O(log(n)) time, and if the middle point was selected luckily even sooner. It means that an array with 81,920,000 elements only needs 27 or less iterations to find the element’s location in the array.
Because of the random jumps of the binary search, this algorithm is not cache friendly so some fine tuned versions would switch back to linear search as long as the size of the search space is less than a specified value (64 or less typically). However, this final size is very much architecture dependent, so most frameworks don’t have this optimization.

Galloping search; galloping search with binary search fallback

If the length of the array is unknown for some reason, the galloping search can identify the initial range of the search scope. This algorithm starts at the first element and keeps doubling the upper limit of the search range until the value there is larger than the searched key. After this, depending on the implementation, the search either falls back to a standard binary search on the selected range, or restarts another galloping search. The former one guarantees an O(log(n)) runtime, the latter one is closer to O(n) runtime.
Galloping search is efficient if we expect to find the element closer to the beginning of the array.

Sampling search

The sampling search is somewhat similar to the binary search but takes several samples across the array before deciding which region to focus on. As a final step, if the range is small enough, it falls back to a standard binary search to identify the exact location of the element.
The theory is quite interesting but in practice the algorithm doesn’t perform too well.

Interpolation search; interpolation search with sequential fallback

The interpolation search supposed to be the “smartest” among the tested algorithms. It resembles the way humans are using phonebooks, as it tries to guess the location of the element by assuming that the elements are evenly distributed in the array.
As a first step it samples the beginning and the end of the search space and then guesses the element’s location. It keeps repeating this step until the element is found. If the guesses are accurate, the number of comparisons can be around O(log(log(n)), runtime around O(log(n)), but unlucky guesses easily push it up to O(n).
The smarter version switches back to linear search as soon as the guessed location of the element is presumably close to the final location. As every iteration is computationally expensive compared to binary search, falling back to linear search as the last step can easily outperform the complex calculations needed to guess the elements location on a short (around 10 elements) region.
One of the big confusions around interpolation search is that the O(log(log(n)) number of comparisons may yield O(log(log(n)) runtime. This is not the case, as there is a big tradeoff between storage access time versus CPU time needed to calculate the next guess. If the data is huge and storage access time is significant, like on an actual disk, interpolation search will easily beat binary search. However, as the tests show, if access time is very short, as in RAM, it may not yield any benefit at all.

Test results

All the code was hand written for the tests in Java (source code at the end); each test was run 10 times on the same array; the arrays were random, in memory integer arrays.

The interpolation search fell back to linear search if the assumed distance was 10 or fewer items, while the sampling search took 20 samples across the search space before deciding where to continue the search. Also, when the key space was less than 2000 items, it fell back to standard binary search.

As a reference, Java’s default Arrays.binarySearch was added to compare its runtime to the custom implementations.

Average search time / element, given the array size

Average comparisons / search, given the array size

Despite of the high expectations for interpolation search, the actual runtime did not really beat the default binary search. When storage access time is long, a combination of some kind of hashing and B+ tree probably would be a better choice, but it is worth to note that on uniformly distributed arrays the interpolation search combined with sequential search always beats the binary search on the number of comparisons required. It’s also interesting how efficient the platform’s binary search was, so for most cases it’s probably not worth it to replace it with something more sophisticated.

Raw data – average runtime per search

10,0001.50E-04 ms1.60E-04 ms2.50E-04 ms3.20E-04 ms5.00E-05 ms1.50E-04 ms1.00E-04 ms
20,0005.00E-05 ms5.50E-05 ms1.05E-04 ms2.35E-04 ms7.00E-05 ms1.15E-04 ms6.50E-05 ms
40,0004.75E-05 ms5.00E-05 ms9.00E-05 ms1.30E-04 ms5.25E-05 ms1.33E-04 ms8.75E-05 ms
80,0004.88E-05 ms5.88E-05 ms9.88E-05 ms1.95E-04 ms6.38E-05 ms1.53E-04 ms9.00E-05 ms
160,0005.25E-05 ms5.94E-05 ms1.01E-04 ms2.53E-04 ms6.56E-05 ms1.81E-04 ms9.38E-05 ms
320,0005.16E-05 ms6.13E-05 ms1.22E-04 ms2.19E-04 ms6.31E-05 ms2.45E-04 ms1.04E-04 ms
640,0005.30E-05 ms6.06E-05 ms9.61E-05 ms2.12E-04 ms7.27E-05 ms2.31E-04 ms1.16E-04 ms
1,280,0005.39E-05 ms6.06E-05 ms9.72E-05 ms2.59E-04 ms7.52E-05 ms2.72E-04 ms1.18E-04 ms
2,560,0005.53E-05 ms6.40E-05 ms1.11E-04 ms2.57E-04 ms7.37E-05 ms2.75E-04 ms1.05E-04 ms
5,120,0005.53E-05 ms6.30E-05 ms1.26E-04 ms2.69E-04 ms7.66E-05 ms3.32E-04 ms1.18E-04 ms
10,240,0005.66E-05 ms6.59E-05 ms1.22E-04 ms2.92E-04 ms8.07E-05 ms4.27E-04 ms1.42E-04 ms
20,480,0005.95E-05 ms6.54E-05 ms1.18E-04 ms3.50E-04 ms8.31E-05 ms4.88E-04 ms1.49E-04 ms
40,960,0005.87E-05 ms6.58E-05 ms1.15E-04 ms3.76E-04 ms8.59E-05 ms5.72E-04 ms1.75E-04 ms
81,920,0006.75E-05 ms6.83E-05 ms1.04E-04 ms3.86E-04 ms8.66E-05 ms6.89E-04 ms2.15E-04 ms

Raw data – average number of comparisons per search


Source code

The full Java source code for the search algorithms can be found here. Keep in mind, the code is not production quality; for instance, it may have too many or too few range checks in some cases!

1 comment:

  1. You state: "One of the big confusions around interpolation search is that the O(log(log(n)) number of comparisons may yield O(log(log(n)) runtime" and "If the guesses are accurate, the number of comparisons can be around O(log(log(n)), runtime around O(log(n))"

    This is not correct.

    Mind you that both 2*log(log(n)) and 1000000*log(log(n)) are both O(log(log(n)). For any constant c, c*f(n) is O(f(n)).

    The c tends to be comparatively large in case of interpolation search vs other usual search algorithms. And it depends on implementation[*]. But with large enough n asymptotically faster algorithm will win.

    If one looks at the performance data presented one could see that, when compared with example binary search code, example interpolation search has a lead of ~15% for n > 500000. Even built-in binary search (reliably faster that anything else) is virtually equal at the biggest n (n=81920000).

    So yes, interpolation search will be faster when data is big enough. How big depends on how optimized implementation is[*], here it looks that it would be for n somewhere between 0.5 million and 100+ million.

    There is other issue with interpolation search, not mentioned here:
    while it's expected cost is O(log(log(n))), it's pessimistic cost is O(n). Try filling test arrays with numbers obtained the following way (of course sort it):
    (int)Math.pow(62*Math.random(), 1.41)

    *] Example implementation is not optimized. It converts back and forth between int and float (which is slow), it has extraneous ifs, etc...