Java 8 parallel sort - internals
Arrays.parallelSort() – not always parallel
The first thing to note is that it is not always a parallel sort. If the array is small enough (<= 8192 elements) or only one logical CPU is reported, it always falls back to a sequential dual-pivot quicksort. Note that "logical CPU" here includes hyper-threading, so an average dual-core i7 will report 4 logical CPUs to the Java environment.
To override the number of CPUs the sort algorithm sees, we simply need to set the common ForkJoinPool's parallelism system property when starting the JVM:
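The property in question is the common ForkJoinPool's parallelism setting (`MyApp` here is a placeholder for your own main class):

```shell
# Force parallelSort to see a single worker, so it falls back to the
# sequential dual-pivot quicksort / TimSort path:
java -Djava.util.concurrent.ForkJoinPool.common.parallelism=1 MyApp
```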
Sorting the chunks and chunk sizes
The algorithm (Arrays.java) decides on the chunk size before submitting the array to be sorted by the helper (ArraysParallelSortHelpers.java). The helper also receives a working buffer, since this parallel sort implementation requires as much extra memory as the original array. That property alone may make it unsuitable for certain cases, such as almost-sorted arrays or extremely large arrays.
The chunk size distributed among CPUs will be at least 8192 items, or array size / (4 * reported CPUs) if the latter is larger.
So, an array of 500,000 items on a standard machine with 2 cores / 4 logical CPUs would have:
500,000 / (4 * 4) = 31,250 items / chunk
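The granularity rule above can be sketched as follows. `MIN_ARRAY_SORT_GRAN` is the actual constant name in Arrays.java; the `chunkSize` method is an illustrative reconstruction, not the JDK's private code:

```java
public class ChunkSizeSketch {
    // Minimum granularity used by Arrays.parallelSort (1 << 13 = 8192)
    static final int MIN_ARRAY_SORT_GRAN = 1 << 13;

    // Illustrative version of the chunk-size decision in Arrays.java
    static int chunkSize(int n, int parallelism) {
        int g = n / (parallelism << 2);           // n / (4 * reported CPUs)
        return Math.max(g, MIN_ARRAY_SORT_GRAN);  // never below 8192
    }

    public static void main(String[] args) {
        System.out.println(chunkSize(500_000, 4)); // 31250, as in the example above
        System.out.println(chunkSize(100_000, 4)); // 8192: 100000/16 is below the minimum
    }
}
```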
The original array is split into 4 logical blocks, which are sorted and then merged together. With the 4 logical processing units, each quarter is split again into quarters (16ths of the original), then sorted and merged back into the original array.
The sorting itself is always a dual-pivot quicksort for primitive types (int, byte, short…) and TimSort for everything else.
Interestingly, for performance reasons the helper class has a dedicated, hand-specialized implementation for each primitive type instead of relying on generics.
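Both dispatch paths look identical at the call site (with arrays this small the calls actually fall back to the sequential sorts, but the primitive-vs-object split is the same):

```java
import java.util.Arrays;

public class SortDispatchDemo {
    public static void main(String[] args) {
        int[] primitives = {5, 3, 1, 4, 2};   // primitives: dual-pivot quicksort
        Integer[] boxed = {5, 3, 1, 4, 2};    // objects: TimSort

        Arrays.parallelSort(primitives);
        Arrays.parallelSort(boxed);

        System.out.println(Arrays.toString(primitives)); // [1, 2, 3, 4, 5]
        System.out.println(Arrays.toString(boxed));      // [1, 2, 3, 4, 5]
    }
}
```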
But why does the 8192 chunk size work so well?
From the Arrays.java documentation it appears to be a purely empirical value: "Using smaller sizes typically results in memory contention across tasks that makes parallel speedups unlikely." So contention is certainly one reason, but if we look at current CPU architectures, there is something interesting about L1 cache sizes too:
Haswell (current): 64 KB L1 (32 KB data + 32 KB instructions) per core.
Skylake (upcoming): reportedly 128 KB L1 (64 KB + 64 KB) per core.
Fitting everything into the L1 cache would be ideal, but 32 KB of data cache on a typical 64-bit system holds only 4096 pointers (items in the array):
32768 bytes * 8 bits / 64 bits per pointer = 4096
So why does 8192 work that well? The answer lies more in the Java runtime: the -XX:+UseCompressedOops flag has been enabled by default on 64-bit JVMs with heap sizes under 32 GB since Java SE 6u23. This flag "compresses" pointers, storing 64-bit references as 32 bits and sparing significant memory - so 8192 four-byte references occupy exactly 32 KB, filling the L1 data cache.
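The cache arithmetic works out exactly (a sketch of the numbers, not JDK code):

```java
public class CacheMath {
    public static void main(String[] args) {
        int l1DataBytes = 32 * 1024; // 32 KB L1 data cache per core (Haswell)

        // Without compressed oops: 8-byte references
        System.out.println(l1DataBytes / 8); // 4096 items fit

        // With -XX:+UseCompressedOops: 4-byte references
        System.out.println(l1DataBytes / 4); // 8192 items fit - the chunk size
    }
}
```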
How about Skylake and later?
The chunk size works very well for current CPU architectures, but as L1 cache sizes grow fairly soon, it may no longer be the ideal size. Unfortunately, the 8192 chunk size is hard-wired and cannot be overridden, unlike the number of virtual CPUs. This means that on newer-generation CPUs, for moderately sized arrays, allocating the working buffer, distributing the workload across CPUs and re-merging the result may not be more efficient than single-threaded sorting.
Timsort has been part of Python since version 2.3 and of Java since 7. The idea behind the algorithm is that real-world data is almost never random - it contains smaller and larger sorted sub-arrays, called "runs". The algorithm keeps identifying these runs and merges them together, which is why it needs a working buffer for the merge operations.
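To make "runs" concrete, here is a toy helper (not JDK code - the real Timsort also treats strictly descending sequences as runs by reversing them in place) that counts the maximal ascending runs in an array:

```java
public class RunCounter {
    // Counts maximal non-decreasing runs; a fully sorted array has exactly 1.
    static int countAscendingRuns(int[] a) {
        if (a.length == 0) return 0;
        int runs = 1;
        for (int i = 1; i < a.length; i++) {
            if (a[i] < a[i - 1]) runs++; // run boundary: the order breaks here
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(countAscendingRuns(new int[]{1, 2, 3, 1, 2, 9})); // 2
        System.out.println(countAscendingRuns(new int[]{1, 2, 3, 4}));       // 1
    }
}
```

The fewer the runs, the less merging Timsort has to do, which is where its near-O(n) behavior on almost-sorted data comes from.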
Although it’s not possible to beat the O(n*log(n)) lower bound of comparison sorting in general, Timsort can be very efficient when the array is already sorted or contains large sorted runs. Its best case is O(n), unlike quicksort's O(n*log(n)); its worst case is still O(n*log(n)), unlike quicksort's O(n^2).
Apart from needing a working buffer, the other issue with Timsort is that the implementation is quite complex: finding the runs, merging them, then locating the correct place for the next run using galloping search requires far more code than quicksort. The Java implementation is almost 1000 lines, while a quicksort implementation can fit on a screen (around 100 lines).
For source code reference, have a look at Arrays.java and ArraysParallelSortHelpers.java