Quick select algorithm - find the Kth element in a list in linear time

July 23, 2012

Quick select algorithm (Hoare's selection algorithm) – select the Kth element or the first K elements from a list in linear time

Working with large datasets is always painful, especially when the data needs to be displayed in a ‘human readable’ format. It is a very frequent task to display only the largest, newest, most expensive etc. items. While sorting the whole dataset definitely gives a correct result, it is much slower than it needs to be: it needs at least O(n*log(n)) time and it often uses recursion for the sorting, so in practice it can be quite slow.

The quick select algorithm can get the top K elements from a list of N items in linear time on average, O(n), with a very reasonable constant factor (with consistently unlucky pivots the worst case is O(n^2)). The implementation shown here does not use recursion, so the performance is good even for large datasets.

Algorithm

The idea of the quick select is quite simple: just like with quicksort, select an element from the list as the pivot (a random one, or simply the middle one), and place every item that is smaller than the pivot in the first half of the array, and every element that is equal to or greater than the pivot in the second half (‘half’ is not entirely correct, as the two parts will usually not be exactly equal in size).

So a step would look like this:

Arr = [5 1 4 3 2]
Pivot = [4]

Steps:

swap [5] and [2] as 5>=4 and 2<4
[2 1 4 3 5]

swap [4] and [3] as 4>=4 and 3<4
[2 1 3 4 5]

When we finish with the first iteration, we know the following:
All elements <4 are on the left of 4
All elements >=4 are on the right of 4 (including the 4 itself)
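As a rough sketch of such a single step (this is a two-pointer variant chosen because it reproduces the swaps shown above, not the exact loop of the full code later in this post; the names PartitionStep and partition are made up for illustration):

```java
import java.util.Arrays;

class PartitionStep {
    // Rearranges arr[from..to] so that values < pivot come first and
    // values >= pivot come last. Returns the index where the
    // ">= pivot" part starts.
    static int partition(int[] arr, int from, int to, int pivot) {
        int i = from, j = to;
        while (true) {
            while (i <= j && arr[i] < pivot) i++;   // already on the left side
            while (i <= j && arr[j] >= pivot) j--;  // already on the right side
            if (i >= j) return i;                   // pointers crossed: done
            int tmp = arr[i];                       // arr[i] >= pivot, arr[j] < pivot
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    }

    public static void main(String[] args) {
        int[] arr = {5, 1, 4, 3, 2};
        int split = partition(arr, 0, arr.length - 1, 4);
        // prints [2, 1, 3, 4, 5] split=3
        System.out.println(Arrays.toString(arr) + " split=" + split);
    }
}
```

Everything before index 3 is smaller than 4 and everything from index 3 on is at least 4, which is exactly the property listed above.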

So, if we are looking for the first 3 elements, we can stop: we found them. If we are looking for the 3rd element, we need more iterations, but we know we must look for it in the first half of the array, hence we can ignore the rest:

Arr = [2 1 3 …]
Pivot = [1]

Steps:
swap [2] and [1] as 2>=1 (the pivot) and 1<2
[1 2 3 …]

When we finish this iteration, we know the following:
All elements <1 are on the left of 1 (none in this case)
All elements >=1 are on the right of 1 (including the 1 itself)

If we were looking for the 1st element, we are done, [1] is the first. However, we know the 3rd element must be to the right of [1] and to the left of [4]:

Arr = […2 3…]
Pivot= [2]
…

Just like with binary search, we keep dropping a segment from the array as we are getting closer to the solution. On average, we halve the search space, so it gives us a geometric series of operations. In the first step, we work with all the items, which is N. The next iteration works only with roughly half of the array, which is N/2, and so on:

Work = n + n/2 + n/4 + …

To sum it all up, we can use the self-similarity of the series (half of the series is the same series without its first term):

Work/2 = n/2 + n/4 + n/8 + …

Hence:

Work - (Work/2) = n
Work/2 = n
Work = 2n
Work = O(n)
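The sum can also be checked numerically. The following is only an illustration of the idealized case where every step halves the range exactly (GeometricWork and totalWork are names invented for this sketch):

```java
class GeometricWork {
    // Adds up n + n/2 + n/4 + ... (integer halving) until the range is empty.
    static long totalWork(long n) {
        long work = 0;
        for (long size = n; size >= 1; size /= 2) {
            work += size;
        }
        return work;
    }

    public static void main(String[] args) {
        System.out.println(totalWork(8)); // 8 + 4 + 2 + 1, prints 15
        long n = 1_000_000;
        // the total stays below 2n no matter how large n is
        System.out.println(totalWork(n) < 2 * n); // prints true
    }
}
```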

Running benchmark

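So, as long as the pivots split the range reasonably well, this algorithm runs in linear time on average. The quick selection algorithm Java code would look like this (k is a zero-based index, so k = 0 selects the smallest element):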

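public static int selectKth(int[] arr, int k) {
    // k is a zero-based index and must point inside the array
    if (arr == null || k < 0 || arr.length <= k)
        throw new IllegalArgumentException("k must be between 0 and arr.length - 1");

    int from = 0, to = arr.length - 1;

    // if from == to we reached the kth element
    while (from < to) {
        int r = from, w = to;
        // pivot value: the middle element of the current range
        // (computed without risking int overflow)
        int mid = arr[from + (to - from) / 2];

        // stop when the reader and the writer meet
        while (r < w) {

            if (arr[r] >= mid) { // put the large values at the end
                int tmp = arr[w];
                arr[w] = arr[r];
                arr[r] = tmp;
                w--;
            } else { // the value is smaller than the pivot, skip
                r++;
            }
        }

        // arr[r] is the one element that has not been compared yet; if it
        // is larger than the pivot it belongs to the upper part, so step
        // one down. r can become from - 1 here (when the pivot is the
        // smallest value in the range); the range is then not narrowed in
        // this pass, but r is only compared with k below and never used
        // as an index.
        if (arr[r] > mid)
            r--;

        // the r pointer is at the end of the lower part
        if (k <= r) {
            to = r;
        } else {
            from = r + 1;
        }
    }

    return arr[k];
}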

As the algorithm is nice and linear without recursion or complex branches, we expect a very good running time.

To test it, I’ve run the quick select against different array sizes between 1 and 20 million and checked the relative running times (the graph shows many runs summed on the arrays, as a single run was too quick to measure precisely):

The graph supports the idea that it is really linear, so that’s good. But how about sorting the array? The following graph shows the sorting and quick select in relative time compared to each other:

It is interesting to note that O(n*log(n)) is almost linear over this range (log2(1 million) ~= 20, log2(20 million) ~= 24), but sorting is still much slower than our quick select implementation.
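The two logarithm values are base 2 and easy to verify. A tiny sketch (LogGrowth and log2 are names made up here):

```java
class LogGrowth {
    // Base-2 logarithm via the change-of-base rule.
    static double log2(double n) {
        return Math.log(n) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(log2(1_000_000));  // about 19.9
        System.out.println(log2(20_000_000)); // about 24.3
    }
}
```

So while n grows 20 times, the log factor grows only by about 1.2 times, which is why the sorting curve looks nearly straight.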

Quick select then sorting, or a heap

As sorting the whole dataset is quite slow, it makes sense to select the top K items and sort only those few ‘top’ elements, giving the user the impression that the whole dataset was sorted as she pages through the result set. This will give a running time of O(k*log(k) + n) as opposed to O(n*log(n)), which is much faster if K is reasonably small (a few hundred, for example).
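A minimal sketch of this combination could look like the following. It assumes "top" means the K smallest values (matching the ordering used in the walkthrough), repeats a compact copy of the selectKth loop from the code above so that it runs on its own, and uses the invented names TopKSorted and smallestSorted:

```java
import java.util.Arrays;

class TopKSorted {
    // Compact copy of the post's selectKth (k is zero-based, no argument checks).
    static int selectKth(int[] arr, int k) {
        int from = 0, to = arr.length - 1;
        while (from < to) {
            int r = from, w = to;
            int mid = arr[from + (to - from) / 2];
            while (r < w) {
                if (arr[r] >= mid) {
                    int tmp = arr[w];
                    arr[w] = arr[r];
                    arr[r] = tmp;
                    w--;
                } else {
                    r++;
                }
            }
            if (arr[r] > mid) r--;
            if (k <= r) to = r; else from = r + 1;
        }
        return arr[k];
    }

    // Returns the k smallest values in ascending order.
    // Hypothetical helper; assumes 1 <= k <= data.length.
    static int[] smallestSorted(int[] data, int k) {
        int[] copy = data.clone();           // leave the caller's array untouched
        selectKth(copy, k - 1);              // the k smallest now sit in copy[0..k-1], unordered
        int[] top = Arrays.copyOf(copy, k);
        Arrays.sort(top);                    // O(k*log(k)) on the small part only
        return top;
    }

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 5};
        System.out.println(Arrays.toString(smallestSorted(data, 3))); // prints [1, 2, 3]
    }
}
```

Working on a copy costs an extra O(n) of memory; if the caller does not care about the original order, the selection can run on the array directly.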

Another approach would be to work with a heap of K elements: as we receive the N numbers as a stream, we keep popping the smallest number from the heap and putting back a larger one. This works in O(n*log(K)) running time, as the heap holds K elements so its height is log(K) while we test N numbers in total, although its expected running time is larger than that of the quick select and sort combination.
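The heap variant can be sketched with the standard java.util.PriorityQueue, which is a min-heap by default. The class and method names (HeapTopK, largestK) are made up for this example, and it keeps the K largest values, as described in the paragraph above:

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class HeapTopK {
    // Returns the k largest values of the stream in ascending order.
    static int[] largestK(int[] stream, int k) {
        if (k <= 0) return new int[0];
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest value on top
        for (int value : stream) {
            if (heap.size() < k) {
                heap.add(value);               // still filling the heap
            } else if (value > heap.peek()) {
                heap.poll();                   // drop the smallest of the current top k
                heap.add(value);               // and keep the larger newcomer
            }
        }
        int[] result = new int[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll();           // polling a min-heap yields ascending order
        }
        return result;
    }

    public static void main(String[] args) {
        int[] stream = {9, 3, 7, 1, 8, 2, 5};
        System.out.println(Arrays.toString(largestK(stream, 3))); // prints [7, 8, 9]
    }
}
```

Unlike quick select, this never needs the whole dataset in memory at once, which can matter more than the extra log(K) factor when the numbers really do arrive as a stream.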

Comments

Unknown, September 17, 2012 at 4:00 PM
Thanks for the blog post!

When you tested the performance, were your test values all non-repeating integers?

I'm doing something similar to your algorithm, and I find that if the median value has repeating values, I will go into an infinite loop. I didn't see any code to protect against this, so I'm wondering if your code naturally solves this problem, or if it is also susceptible to this?

Ashwin, October 4, 2012 at 10:46 PM
Could the algorithm be improved if you were to also walk leftwards with the "w" pointer for all a[w] > pivot. That way you could avoid some repeated swaps where a high value is first moved to "r" and then swapped back to a "--w".

Ashwin

gfs.reboucas, December 10, 2012 at 12:20 PM
Interesting algorithm, is it supposed to work when K is much smaller than N, right? Because I've implemented it in C++ for small arrays (N = 10) with different elements to get a feeling of how it works and I found that it fails for K > 4. I think my code is correct and I think this problem happens when K is near to the pivot but I didn't realize how to solve it.

Anyway, I think this isn't a big problem.

נעם, May 20, 2013 at 5:11 AM
Thanks, great blog!

Rong, August 6, 2013 at 1:29 PM
Came across your blog by searching "implementation of fast selection". It's a nice posting.

FYI: The complexity analysis on "Hoare's algorithm" here is too "idealized" (each time you assume an ideal pivot); the realistic complexity should be "O(n^2)". The real "fast selection" algorithm, which is O(n), goes to Blum, Floyd, Pratt, Rivest and Tarjan. See
https://en.wikipedia.org/wiki/Selection_algorithm if you are interested.

Unknown, September 15, 2013 at 11:15 PM
Thanks, very simple and clear explanation

Unknown, September 16, 2013 at 7:27 AM
Hi, I tried to implement the algorithm in C, on an ARM controller. After some successful testing, I found out there are many combinations that make the program fail:
It can easily happen that you are decrementing the r index, although it is still zero:
>>// if we stepped up (r++) we need to step one down
>> if (arr[r] > mid)
>> r--;
Try a median length of seven, and the values 94,83,82,1,192,122,251, for example
Fails in the first loop already, and sets r to -1, which crashes the program, in the next loop.

Haven't found a solution yet, but testing.

Best regards
Thomas

Unknown, September 16, 2013 at 7:31 AM
Forgot to mention:
For the cases it works, it is about four times faster than the ARM C library "qsort" function.
TL

Anonymous, January 6, 2014 at 1:41 PM
// if we stepped up (r++) we need to step one down
if (arr[r] > mid)
r--;
Could you clarify this portion of the code? I'm having trouble thinking of cases where this occurs.

Christina, January 11, 2014 at 12:43 PM
You mention that after the first iteration all elements to the left of the pivot are less than the pivot while all elements to the right are larger or equal. However, if your initial array is {5,4,1,3,2} your pivot happens to be the minimum and in this case after the first iteration it will be placed in the second position, i.e. you have {4,1,3,2,5}. The algorithm still works but because of this condition:
if (arr[r] >= mid)
when you get {1,4,3,2,5}
you swap 1 and 4 since 1>=1 (r is at 0 and w is at 1).
When the pivot is the minimum in the first iteration you never reach the else body (r++) so you constantly decrement w until it reaches 0 where the first iteration stops.

Erhan Onal, March 17, 2014 at 2:10 PM
Great post, the best I have seen on the topic!
There is a subtle problem, though... When you pass 4 as k, it finds the 5th smallest value. To fix this, a simple k-- at the very beginning of the algorithm will suffice; if I have to nit-pick...

jlshown, September 13, 2018 at 6:06 PM
Very nicely done. Thank you for posting this.

Adam Horvath's blog
