Monday, May 26, 2014

Java 8 parallel sort - internals

Java 8 is a quite big update to the platform and luckily they focused on multicore systems too. The new Arrays.parallelSort() implementation is quite interesting: it is using quicksort, merge sort, and Timsort to achieve the best performance, depending on the underlying data type and size.

Arrays.parallelSort() – not always parallel

The first thing that needs to be noted is that it’s not always a parallel sort. If the array size is small enough (<= 8192 elements) or the number of reported logical CPUs is 1, it always falls back to a dual pivot quick sort. Note, that the “logical CPU” here refers to hyper-threading, so an average i7 dual core CPU will report 4 logical CPUs to the Java environment.

To override the number of reported CPUs for the sort algorithm, we simply need to pass in any value to the following system property when starting the JVM:
java.util.concurrent.ForkJoinPool.common.parallelism

Sorting the chunks and chunk sizes

The algorithm (Arrays.java) will decide on the chunk size before submitting to sort the array using the helper (ArraysParallelSortHelpers.java). The helper also receives a working buffer as this parallel sort implementation will require the same amount of memory as the original array. This property itself may make it unsuitable for certain cases, like almost sorted arrays or extremely large arrays.

The chunk size that will be distributed among CPUs will be at least 8192 items, or Array size / (4 * reported CPUs) if the latter is larger.
So, an array of 500.000 items with a standard 2 core, 4 virtual cores machine would have:

500.000 / (4 * 4) = 31.250 items / chunk

The original array is split into these 4 logical blocks, which are sorted then merged together. With the 4 logical processing units, each quarter is split into another quarter (16th), then sorted and merged back to the original array.

The sorting itself is always the dual pivot quick sort for raw types (int, byte, short…) and TimSort for anything else.

Interestingly, the helper class has native implementation for all raw types for performance reasons, instead of using generics.

But why does the 8192 chunk size work so well?

From the Arrays.java documentation it seems like a totally empirical value: “Using smaller sizes typically results in memory contention across tasks that makes parallel speedups unlikely.” So contention is definitely a reason, but if we look at the current CPU architectures, there is something interesting about L1 cache sizes too:

Haswell (current) L1 64kb (32kb data, 32kb instructions) per core.
Skylake (future) L1 128kb (64kb+64kb) per core.

Fitting everything in L1 cache would be ideal, but 32kb for data on the typical 64 bit systems would mean only 4096 pointers (items in the array):
32768 * 8 / 64 = 4096

So why is the 8192 working that well? The answer is more around Java runtime: the
-XX:+UseCompressedOops 

flag is on by default enabled for systems less than 32gb of memory on 64bit JVM since Java SE 6u23. This flag “compresses” pointers and stores the 64 bit pointers as 32 bit, sparing significant memory.

How about Skylake and later?

The chunk size works very well for current CPU architectures, but as L1 cache size will grow fairly soon, this may not be the perfect size. Unfortunately the 8129 chunk size is hard wired and cannot be overridden, unlike the number of virtual CPUs. This would mean that on a newer generation CPUs the allocation of new working buffer, distributing workload across CPUs then re-merging the result may not be more efficient than single threaded sorting for not too large arrays.

Timsort

Timsort is part of Python since version 2.3 and Java since 7. The idea behind this algorithm is that with real world data the array is almost never random – it contains smaller and larger sorted sub-arrays, called “runs”. The algorithm keeps identifying these runs and merges them together – hence it needs a working buffer for the merge operations.

Although it’s not possible to beat the O(n*log(n)) lower bound in sorting, Timsort can be very efficient when the array is already sorted or has large "runs" that are sorted. It’s lower bound is O(n), unlike quicksort, which is O(n*log(n)); upper bound is still O(n*log(n)), unlike quicksort, which is O(n^2).

Apart from using working buffer, the other issue with Timsort is that the implementation is quite complex: finding the runs, merging them, then finding the correct place for the next run using galloping search requires much more complex implementation than quicksort. The Java implementation is almost 1000 lines, while a quicksort implementation can fit on a screen (around 100 lines).

For source code reference, have a look at Arrays.java and ArraysParallelSortHelpers.java

7 comments:

  1. Hi! When you say "The sorting itself is always the dual pivot quick sort for raw types (int, byte, short…) and TimSort for anything else," does that mean that the built in Integer type is treated as an object and therefore uses TimSort?

    Thank you!

    ReplyDelete
    Replies
    1. Good question! It will use the FJObject internally because Integer[] cannot be used for int[] (no auto unboxing for arrays).
      FJObject uses TimSort.

      Delete
  2. Thank you for the clarification!

    ReplyDelete
  3. Hello! I have another question for you: I'm running some experiments testing the parallelsort against other multithreaded quicksorts, and I'm trying to find out more information about this 8192 switch-over. I was wondering where you were able to find this information if you don't mind sharing? Also, I hope this is not an offensive question, but I wanted to ask about your credentials, reason being, I would like to cite your blog in the paper I'm writing. If you're open to sharing that information that would be awesome! If not though I completely understand. Thank you once again for your time.

    ReplyDelete
    Replies
    1. Hi Laura, can you please drop me a private message?

      Delete
  4. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. Hi,

      I was always confused why was the number 8192. This was by far one of the best explanation I've found.

      But I've a question in this. An Integer Variable is of 4Bytes. So ideally of the 8192 bits, we can store 8192/32 = 256 numbers in CPU cache at once.

      So the minimum granularity in this case is suppose to be 256, why are all parallelSort functions in Java (be it byte[], int[], long[], etc.) all use 8192 as the default number? Shouldn't it vary according to the function?

      Delete