Monday, July 23, 2012

Quick select algorithm - find the Kth element in a list in linear time


Quick select algorithm (Hoare's selection algorithm) – select the Kth element or the first K element from a list in linear time

Working with large datasets is always painful, especially when it needs to be displayed in a ‘human readable’ format. It is a very frequent task to display only the largest, newest, most expensive etc. items. While sorting the whole dataset definitely gives a correct result, it is much slower than it needs to be – it needs at least O(n*log(n)) time and an it often uses recursion for the sorting, so in practice it can be quite slow.

The quick select algorithm can get the top K element from a list of N items in linear time, O(n), with a very reasonable multiplication factor. The quick select does not use recursion so the performance is great for even large datasets.

Algorithm

The idea of the quick select is quite simple: just like with quicksort, select a random element from the list, and place every item that is smaller to the first half of the array, and every element that is equal to or greater than the pivot, in the second half (the ‘half’ is not entirely correct, as it is possible that the result will not be exactly ‘half’).

So a step would look like this:

Arr = [5 1 4 3 2]
Pivot = [4]

Steps:

swap [5] and [2] as 5>=4 and 2<
[2 1 4 3 5]

swap [4] and [3] as 4>=4 and 3<4
[2 1 3 4 5]

When we finish with the first iteration, we know the followings:
All elements <4 are on the left of 4
All elements >=4 are on the right of 4 (including the 4 itself)

So, if we are looking for the first 3 elements, we can stop, we found them. If we are looking for the 3rd element, we need more iteration, but we know we must look for it in the first half of the array hence we can ignore the rest:

Arr = [2 1 3 …]
Pivot = [1]

Steps:
swap [2] and [1] as 2>=2 and 1<2
[1 2 3 …]

When we finish this iteration, we know the followings:
All elements <1 are on the left of 1 (none in this case)
All elements >=1 are on the right of 1 (including the 1 itself)

If we were looking for the 1st element, we are done, [1] is the first. However, we know the 3rd element must be right from the [1] and left from [4]:

Arr = […2 3…]
Pivot= [2]
…

Just like with binary search, we keep dropping a segment from the array as we are getting closer to the solution. On average, we halve the search space so it gives us a geometrical series of operations. In the first step, we work with all the items, which is N. The next iteration works only with roughly the half of the array, which is N/2 and so on:

Work = n + n/2 + n/4 + …

To sum it all up, we can use the similarity rule:

Work/2 = n/2 + n/4 + n/8 + …

Hence:

Work – (Work/2) = n
Work/2 = n
Work = 2n
Work = O(n)

Running benchmark

So it is quite clear that this algorithm runs in linear time. The quick selection algorithm Java code would look like this:

public static int selectKth(int[] arr, int k) {
 if (arr == null || arr.length <= k)
  throw new Error();

 int from = 0, to = arr.length - 1;

 // if from == to we reached the kth element
 while (from < to) {
  int r = from, w = to;
  int mid = arr[(r + w) / 2];

  // stop if the reader and writer meets
  while (r < w) {

   if (arr[r] >= mid) { // put the large values at the end
    int tmp = arr[w];
    arr[w] = arr[r];
    arr[r] = tmp;
    w--;
   } else { // the value is smaller than the pivot, skip
    r++;
   }
  }

  // if we stepped up (r++) we need to step one down
  if (arr[r] > mid)
   r--;

  // the r pointer is on the end of the first k elements
  if (k <= r) {
   to = r;
  } else {
   from = r + 1;
  }
 }

 return arr[k];
}

As the algorithm is nice and linear without recursion or complex branches, we expect a very good running time.

To test is, I’ve run the quick select against different array sizes between 1 and 20 million and checked the relative running times (the graph shows many runs summed on the arrays as a single run was too quick to measure precisely):



The graph supports the idea that it is really linear, so that’s good. But how about sorting the array? The following graph shows the sorting and quick select in relative time compared to each other:



It is interesting to note that O(n*log(n)) is almost linear (log(1million)~=20, log(20million)~=24) but still much slower than our quick select implementation.

Quick select than sorting or heap

As sorting the whole dataset is quite slow, it makes sense to select the top K items and sort only that few ‘top’ elements giving the impression to the user as the whole dataset was sorted as she pages through the result set. This will give a running time of O(k*log(k) + n) as opposed to O(n*log(n)) which is much faster if K is reasonably small (few hundreds for example).

An other approach would be to work with a heap and keep popping the smallest number while putting back a larger as we are receiving the N numbers as a stream. This would work with O(n*log(K)) running time as the heap holds K elements so the height is log(K) while we test N numbers in total, although it’s expected running time is larger than the quick select and sort combination.

21 comments:

  1. Thanks for the blog post!

    When you tested the performance, were your test values all non-repeating integers?

    I'm doing something similar to your algorithm, and I find that if the median value has repeating values, I will go into an infinite loop. I didn't see any code to protect against this, so I'm wondering if your code naturally solves this problem, or if it is also susceptible to this?

    ReplyDelete
    Replies
    1. I was testing with randomized arrays; The algorithm handles repeating values well. If you think of the general selection, repeating values do not make any difference; however, poorly written solutions will generally fail with repeating values. GIve it a go, it's open source ;)

      Delete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Could the algorithm be improved if you were to also walk leftwards with the "w" pointer for all a[w] > pivot. That way you could avoid some repeated swaps where a high value is first moved to "r" and then swapped back to a "--w".

    Ashwin

    ReplyDelete
    Replies
    1. Good point Ashwin. The above code is the purest form of the algorithm without additional tweaks - like the one you mentioned. Watch out though, depending on your input data, adding that extra branch might be very expensive if mispredicted compared to a branchless cached memory read/write operation!

      Delete
  4. Interesting algorithm, is it supposed to work when K is much smaller than N, right? Because I've implemented it in C++ for small arrays (N = 10) with different elements to get a feeling of how it works and I found that it fails for K > 4. I think my code is correct and I think this problem happens when K is near to the pivot but I didn't realize how to solve it.

    Anyway, I think this isn't a big problem.

    ReplyDelete
    Replies
    1. Maybe try translating line by line? The above algorithm works fine for small and large arrays :)

      Delete
  5. Came across your blog by searching "implementation of fast selection". It's a nice posting.

    FYI: The complexity analysis on "Hoare's algorithm" here is too "idealized" (each time you assume an ideal pivot); the realistic complexity should be "O(n^2)". The real "fast selection" algorithm, which is O(n), goes to Blum, Floyd, Pratt, Rivest and Tarjan. See
    https://en.wikipedia.org/wiki/Selection_algorithm if you are interested.

    ReplyDelete
  6. Thanks, very simple and clear explanation

    ReplyDelete
  7. Hi, I tried to implement the algorithm in C, on an ARM controller. After some successful testing, I found out there are many combinations that make the program fail:
    It can easily happen that you are decrementing the r index, although it is still zero:
    >>// if we stepped up (r++) we need to step one down
    >> if (arr[r] > mid)
    >> r--;
    Try a median length of seven, and the values 94,83,82,1,192,122,251, for example
    Fails in the first loop already, and sets r to -1, which crashes the program, in the next loop.


    Haven't found a solution yet, but testing.

    Best regards
    Thomas

    ReplyDelete
    Replies
    1. Hi Thomas,

      The Java implementation above correctly prints out the numbers from your sequence when k is set to 0..6:
      1
      82
      83
      94
      122
      192
      251

      I think your C code might be flaky?

      Delete
    2. Hello Adam,

      I think Java, being an interpreted language, checks for false indizes and probably ignores access to an array with a size of seven bytes with an index of (FFFF).
      The good thing with this ARM controller is that it has a very good debugger, and it can easily be traced down to the line that decrements r index while r is still zero. The microcontroller goes into a hard fault then when accessing the array with that index in the next loop.

      In some cases, mainly when the pivot is the smallest number of the set of numbers currently checked, the "else" path is never taken, the algorithm loops through the "if" part only:

      if (arr[r] >= mid) { // put the large values at the end
      int tmp = arr[w];
      arr[w] = arr[r];
      arr[r] = tmp;
      w--;
      } else { // the value is smaller than the pivot, skip
      r++;
      }

      In some cases, still the "r--" is executed, since "(arr[r] > mid)".

      I am not an expert in Java programming, so maybe a few things are handled differently, but I checked that arrays indices start also at 0, and the meaning of >, >= and <, <= are hopefully also the same.

      Thomas


      Delete
    3. This comment has been removed by the author.

      Delete
    4. This comment has been removed by the author.

      Delete
    5. I found the problem:

      The compiler did not handle the last comparisation correctly:

      if (k <= r) {
      to = r;
      } else {
      from = r + 1;
      }

      instead of setting "from" to "r+1" ==0 in the mentioned case, it set "to" to "r"== 0xFFFF.
      The problem seemed to be related to a 16 Bit/ 32 Bit implementation problem.

      Thomas


      Delete
  8. Forgot to mention:
    For the cases it works, it is about four times faster than the ARM C library "qsort" function.
    TL

    ReplyDelete
  9. This comment has been removed by the author.

    ReplyDelete
  10. // if we stepped up (r++) we need to step one down
    if (arr[r] > mid)
    r--;
    Could you clarify this portion of the code? I'm having trouble thinking of cases where this occurs.

    ReplyDelete
  11. You mention that after the first iteration all elements to the left of the pivot are less than the pivot while all element to the right are larger or equal. However, if your initial array is {5,4,1,3,2} your pivot happens to be the minimum and in this case after the first iteration it will be placed in in the second position, i.e. you have {4,1,3,2,5}. The algorithm still works but because of this condition:
    if (arr[r] >= mid)
    when you get {1,4,3,2,5}
    you swap 1 and 4 since 1>=1 (r is at 0 and w is at 1).
    When the pivot is the minimum in the fist iteration you never reach the else body (r++) so you constantly decrement w until it reaches 0 where the first iteration stops.

    ReplyDelete
  12. Great post, the best I have seen on the topic!
    There is a subtle problem, though... When you pass 4 as k, it finds the 5th smallest value. To fix this, a simple k-- at the very beginning of the algorithm will suffice; if I have to nit-pick...

    ReplyDelete