Sunday, February 5, 2012

Quicksort, binary search and linear search performance - far from what you believe


When to sort then binary search and when to do linear search?

We can all agree that binary search is faster than linear search – at least in general and if the searched array size is large enough. This means that everyone who learned the following equation will prefer to use binary search to linear search:

log(N) < N/2

And here is a good example when a nice education background may distort your decision: there are ample cases when in practice the above does not hold true!

In theory

To compare linear search to binary search, it makes sense to compare the sorting then searching to simply linear search. Let’s assume number of searches on an N sized array is K. In this case the following equations should be compared:

To get K results from an array:
  With sort and binary search time: N * log(N) + K * log(N)
  With linear search time: K * (N/2)

Looking at the above the sort/binary search is definitely not always smaller than linear search. Generally speaking, if your number of searches (K) is small enough, it does not make sense to sort and binary search: just do brute force linear searches on the array.

Do a little math on paper and calculate the inequality of the above equations (on a fixed N what is the minimum number of K searches to make the binary search faster than linear).

In practice

The problem with any kind of searches and sorts is that they are not optimal and require external function call whereas linear search is almost always done in place (or the compiler compiles our 3 lines of code into an inline function).

So, in practice that minimum K searches can be even further away from each other than what a pure mathematical approach would suggest. In practice, there are cases when binary search never catches up with linear search! Before reading the measurement results, please have a guess: at what array size does it makes absolutely no sense to sort and binary search instead of doing linear search? Well, for an array size of 1 it is true. How about bigger arrays? 10 maybe 100? Guess then read on!

Setting up measurements

To measure the performance of both approaches I created a routine that probes the times of searches: if linear is faster, increase the number of searches on the same array, giving a chance for the binary search to catch up (and eliminate the handicap of sorting).

1.05^X
The first probe was a single search (K=1) then increasing it by 5% all the time (1.05^X). On small numbers this gives a very precise guess while on larger numbers it’s increasing aggressively to finally reach a limit (a slight ‘bug’ is that on the first run it will re-try with the very small previous values a few times: it will try 1 for 9 times, 2 for 10 times, 3 for 5 times… It’s small performance drawback which did not really needed to be fixed as all further probes would be retried as: 1.05^(X*0.8))

If the TimeOf(Sort +KTimes(BinarySearch)) / TimeOf(KTimes(LinearSearch)) reached 1.00 the probes stopped and the test displayed the ArraySize and KTimes values indicating how many times a search need to be run on the array so the sort + binary search is faster than repeated linear searches.

The input arrays were randomized by a custom function. This is a very important aspect as a random, unsorted array takes more time to be sorted than just checking for a sorted order (in real life, a semi sorted array is more likely than a completely random so a smart sort routine can lower the minimal K value).


Platforms

For the measurement platforms I used Windows/.NET 4 and MacOSX/MONO 2.10.8 and the built-in Array.Sort method to sort the integer arrays. The linear search was a simple FOR cycle checking systematically every array value in the array and if a match was found it stopped.

Source code

The linear search code where ‘count’ represents the number of searches that should be repeated on the array:

public static int LinearSeek(int[] arr, int count)
{
 Random rnd = new Random();

 Stopwatch sw = Stopwatch.StartNew();

 for(int i = 0; i < count; i++)
 {
  int toSeek = rnd.Next(0, arr.Length);

  for(int j = 0; j < arr.Length; j++)
  {
   if(arr[j] == toSeek)
   {
    break;
   }
  }
 }

 return (int)sw.ElapsedMilliseconds;
}

The sort and binary search code:

public static int SortAndSeek(int[] arr, int count) 
{
 Random rnd = new Random();

 Stopwatch sw = Stopwatch.StartNew();

 Array.Sort(arr);

 for(int i = 0; i < count; i++)
 {
  int toSeek = rnd.Next(0, arr.Length);
  Array.BinarySearch(arr, toSeek);
 }

 return (int)sw.ElapsedMilliseconds;
}

The randomization code used the traditional shuffle method to make a perfectly random array:

public static void Randomize(int[] arr)
{
 Random rnd = new Random();

 for(int i = 0; i < arr.Length - 1; i++)
 {
  int place = rnd.Next(i+1, arr.Length);
  int tmp = arr[i];
  arr[i] = arr[place];
  arr[place] = tmp;
 }
}

To get the average time for linear search and binary search, both functions were called 100 times at every N/K probe.


Measurements

Before actually getting the results I was positive that an array with around 10 elements should be sorted if frequent access is required to take advantage of the binary search speed. It turns out that either on .NET or Mono platform


an array with less than 128 elements never takes advantage of the binary search: the linear search will always be faster!

This is very unexpected and really good to know!

Let’s see how many searches is required for the sort + binary search to catch up with the standard brute force method, the linear search (where data is missing the binary search was never able to catch up):



Array Size # of searches .NET # of searches Mono
4 Never catches up Never catches up
8 Never catches up Never catches up
16 Never catches up Never catches up
32 Never catches up Never catches up
64 Never catches up Never catches up
128 9,170 Never catches up
256 3,456 4,201
512 1,240 3,456
1,024 1,583 1,745
2,048 761 761
4,096 490 384
8,192 214 214
16,384 80 131
32,768 60 185
65,536 69 159
131,072 49 138
262,144 40 144
524,288 42 159
1,048,576 47 167
2,097,152 44 176
4,194,304 42 176
8,388,608 42 176

It is interesting to see the ratio of the theoretical limits of sort + binary search VS linear search. In every case we are supposed to get a value around 1 since the time required to the binary search and linear search was the same so the following equation should hold true:

N * log(N) + K * log(N) = K * (N/2)

Another unforeseen result is they are not the same! On the .NET platform they slowly converge to 1 (TimeOf(linear) / TimeOf(binary)) but on the Mono platform the ratio converges to 4 meaning the .NET implementation of the Array.Sort is around 4 times faster than on the Mono platform.

On small arrays these values can be very extreme, suggesting that sorting small arrays are very inefficient and the expected run time is nowhere near around the N * log(N) time.




Conclusion

Contrary to the popular belief, sorting and then performing binary search in some cases can be always slower than just performing consecutive linear searches.

One other interesting conclusion is that sorting small arrays is very expensive compared to sorting large (>200.000 elements) ones.

No comments:

Post a Comment