Quicksort, binary search and linear search performance - far from what you believe
We can all agree that binary search is faster than linear search – at least in general, and if the searched array is large enough. This means that everyone who learned the following inequality will prefer binary search to linear search:
log(N) < N/2
And here is a good example of how a solid educational background can distort your decisions: there are plenty of cases where, in practice, the above does not hold true!
In theory
To compare linear search with binary search fairly, we have to compare sorting followed by binary searching against plain linear searching. Let's say the number of searches performed on an array of size N is K. In this case the following costs should be compared.

To perform K searches on an array:

With sort and binary search, time: N * log(N) + K * log(N)

With linear search, time: K * (N/2)
Looking at the above, sort + binary search is definitely not always cheaper than linear search. Generally speaking, if your number of searches (K) is small enough, it does not make sense to sort and binary search: just do brute-force linear searches on the array.
Do a little math on paper and work out the inequality between the two expressions (for a fixed N, what is the minimum number of searches K that makes sort + binary search faster than linear search?).
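For illustration, here is a minimal sketch of that calculation in C# (a helper of my own, not part of the measurement code), using the idealized cost model above with base-2 logarithms and no constant factors:

using System;

public static class BreakEven
{
    // Under the idealized model, sort + binary search wins once
    //   N * log2(N) + K * log2(N) < K * (N / 2)
    // Solving for K gives: K > N * log2(N) / (N/2 - log2(N)).
    public static double MinSearches(int n)
    {
        double log2N = Math.Log(n, 2);
        // For N <= 4 the denominator is not positive: in this model
        // binary search never wins, no matter how large K gets.
        return n * log2N / (n / 2.0 - log2N);
    }

    public static void Main()
    {
        foreach (int n in new[] { 16, 128, 1024, 1048576 })
            Console.WriteLine("N = {0}: K > {1:F1}", n, MinSearches(n));
    }
}

Note how small these theoretical break-even values are (a few dozen searches at most); keep them in mind when reading the measured values below.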
In practice
The problem with library sorts and binary searches is that they carry constant overhead and require external function calls, whereas linear search is almost always done in place (or the compiler compiles our three lines of code into an inline function).
So, in practice that minimum K can be even further away than the pure mathematical approach suggests. Indeed, there are cases where binary search never catches up with linear search! Before reading the measurement results, please take a guess: at what array size does it make absolutely no sense to sort and binary search instead of doing linear searches? Well, for an array of size 1 it is obviously true. How about bigger arrays? 10, maybe 100? Guess, then read on!
Setting up measurements
To measure the performance of both approaches I created a routine that probes the timings: if linear search is faster, it increases the number of searches on the same array (growing geometrically, as 1.05^X), giving binary search a chance to catch up (and to amortize the handicap of sorting).

Once the ratio TimeOf(Sort + KTimes(BinarySearch)) / TimeOf(KTimes(LinearSearch)) reached 1.00, the probing stopped and the test displayed the ArraySize and KTimes values, indicating how many searches need to be run on the array before sort + binary search becomes faster than repeated linear searches.
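The probing driver itself is not shown in the post; the following is a rough reconstruction of the idea (the helper name FindBreakEven is mine, and it assumes the LinearSeek and SortAndSeek routines listed in the next section; only the 1.05 growth factor comes from the post):

public static int FindBreakEven(int[] template)
{
    int k = 1;
    int x = 0;
    while (k <= 10000000) // give up at some point: "never catches up"
    {
        // Each probe gets fresh copies, since SortAndSeek sorts its array.
        // (Averaging over repeated runs, described later, is omitted here.)
        int linear = LinearSeek((int[])template.Clone(), k);
        int binary = SortAndSeek((int[])template.Clone(), k);

        // Stop once sort + K binary searches is as fast as K linear searches.
        if (linear > 0 && binary <= linear)
            return k;

        // Grow the number of searches geometrically: K = 1.05^X.
        x++;
        k = Math.Max(k + 1, (int)Math.Pow(1.05, x));
    }
    return -1;
}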
The input arrays were randomized by a custom function. This is an important detail, as a random, unsorted array takes more time to sort than one that merely needs its order checked (in real life a semi-sorted array is more likely than a completely random one, so a smart sort routine can lower the minimal K value).
Platforms
For the measurement platforms I used Windows/.NET 4 and Mac OS X/Mono 2.10.8, with the built-in Array.Sort method to sort the integer arrays. The linear search was a simple for loop systematically checking every value in the array and stopping as soon as a match was found.
Source code
The linear search code where ‘count’ represents the number of searches that should be repeated on the array:
// Requires: using System; using System.Diagnostics;
public static int LinearSeek(int[] arr, int count)
{
    Random rnd = new Random();
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < count; i++)
    {
        // Pick a value in [0, N) to look for; the array is assumed
        // to hold the values 0..N-1, so the value is always present.
        int toSeek = rnd.Next(0, arr.Length);
        for (int j = 0; j < arr.Length; j++)
        {
            if (arr[j] == toSeek)
            {
                break; // found: stop scanning
            }
        }
    }
    return (int)sw.ElapsedMilliseconds;
}
The sort and binary search code:
public static int SortAndSeek(int[] arr, int count)
{
    Random rnd = new Random();
    Stopwatch sw = Stopwatch.StartNew();
    // The sort is inside the timed region on purpose: its one-off
    // cost is part of this approach.
    Array.Sort(arr);
    for (int i = 0; i < count; i++)
    {
        int toSeek = rnd.Next(0, arr.Length);
        Array.BinarySearch(arr, toSeek);
    }
    return (int)sw.ElapsedMilliseconds;
}
The randomization code used the traditional Fisher–Yates shuffle to produce a uniformly random permutation:
public static void Randomize(int[] arr)
{
    Random rnd = new Random();
    for (int i = 0; i < arr.Length - 1; i++)
    {
        // Fisher–Yates: draw from [i, N), including i itself. The original
        // snippet used rnd.Next(i + 1, arr.Length), which is Sattolo's
        // algorithm and yields only cyclic permutations, not uniform ones.
        int place = rnd.Next(i, arr.Length);
        int tmp = arr[i];
        arr[i] = arr[place];
        arr[place] = tmp;
    }
}
To get the average time for linear search and binary search, both functions were called 100 times at every N/K probe.
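The post does not show how the arrays were filled or how the 100 runs were averaged; here is a minimal sketch under the assumption that each array simply held the values 0..N-1 (so every value drawn by rnd.Next(0, arr.Length) actually exists):

// Hypothetical measurement harness; the array contents and the exact
// averaging scheme are assumptions, not shown in the original post.
public static double AverageMillis(Func<int[], int, int> probe, int n, int k)
{
    const int Runs = 100;
    long total = 0;
    for (int run = 0; run < Runs; run++)
    {
        // Assumed setup: fill with 0..N-1, then shuffle with Randomize.
        int[] arr = new int[n];
        for (int i = 0; i < n; i++)
            arr[i] = i;
        Randomize(arr);
        total += probe(arr, k);
    }
    return total / (double)Runs;
}

With this helper, the two timings for a probe would be, for example, AverageMillis(LinearSeek, n, k) and AverageMillis(SortAndSeek, n, k).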
Measurements
Before actually seeing the results I was positive that an array of around 10 elements is worth sorting if frequent access is required, to take advantage of binary search's speed. It turns out that on both the .NET and the Mono platform an array with fewer than 128 elements never benefits from binary search: linear search will always be faster!
This is very unexpected and really good to know!
Let's see how many searches are required for sort + binary search to catch up with the standard brute-force method, linear search (a "Never catches up" entry means sort + binary search never overtook repeated linear searches):
Array Size | # of searches (.NET) | # of searches (Mono)
---|---|---
4 | Never catches up | Never catches up |
8 | Never catches up | Never catches up |
16 | Never catches up | Never catches up |
32 | Never catches up | Never catches up |
64 | Never catches up | Never catches up |
128 | 9,170 | Never catches up |
256 | 3,456 | 4,201 |
512 | 1,240 | 3,456 |
1,024 | 1,583 | 1,745 |
2,048 | 761 | 761 |
4,096 | 490 | 384 |
8,192 | 214 | 214 |
16,384 | 80 | 131 |
32,768 | 60 | 185 |
65,536 | 69 | 159 |
131,072 | 49 | 138 |
262,144 | 40 | 144 |
524,288 | 42 | 159 |
1,048,576 | 47 | 167 |
2,097,152 | 44 | 176 |
4,194,304 | 42 | 176 |
8,388,608 | 42 | 176 |
It is also interesting to look at the ratio of the theoretical costs of sort + binary search vs. linear search at the measured break-even points. In every case we would expect a value around 1, since at those points the measured times of the two approaches were equal, so the following should hold:

N * log(N) + K * log(N) = K * (N/2)
Another unforeseen result is that the ratios are not the same! On the .NET platform they slowly converge to 1 (TimeOf(linear) / TimeOf(binary)), but on the Mono platform the ratio converges to 4, meaning the .NET implementation of Array.Sort is around 4 times faster than the Mono one.
On small arrays these values can be very extreme, suggesting that sorting small arrays is very inefficient and the actual run time is nowhere near the N * log(N) estimate.
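To sanity-check these ratios yourself, here is a sketch using the same idealized cost model (base-2 logarithms, no constant factors):

// Theoretical TimeOf(linear) / TimeOf(binary) at a measured break-even point.
static double TheoreticalRatio(int n, int k)
{
    double log2N = Math.Log(n, 2);
    return (k * (n / 2.0)) / (n * log2N + k * log2N);
}
// With the measured break-even values at N = 262,144 from the table:
//   TheoreticalRatio(262144, 40)  ~ 1.1  (.NET)
//   TheoreticalRatio(262144, 144) ~ 4.0  (Mono)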
Conclusion
Contrary to popular belief, sorting and then performing binary searches can, in some cases, be consistently slower than simply performing consecutive linear searches.
One other interesting conclusion is that sorting small arrays is very expensive compared to sorting large (>200,000 element) ones.