Unexpected performance of parallel connection in Java 8
I am having a performance issue while using a stream created with spliterator()
over Iterable. i.e. how StreamSupport.stream(integerList.spliterator(), true)
. I wanted to prove it with a regular collection. Below are some test results.
Question: Why is a parallel stream created from an iterable so much slower than a stream created from an ArrayList or IntStream?
Out of range
public void testParallelFromIntRange() {
long start = System.nanoTime();
IntStream stream = IntStream.rangeClosed(1, Integer.MAX_VALUE).parallel();
System.out.println("Is Parallel: "+stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from range Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Parallel: true
Parallel stream from range Takes: 490 milliseconds
From Iterable
public void testParallelFromIterable() {
Set<Integer> integerList = ContiguousSet.create(Range.closed(1, Integer.MAX_VALUE), DiscreteDomain.integers());
long start = System.nanoTime();
Stream<Integer> stream = StreamSupport.stream(integerList.spliterator(), true);
System.out.println("Is Parallel: " + stream.isParallel());
stream.forEach(ParallelStreamSupportTest::calculate);
long end = System.nanoTime();
System.out.println("ParallelStream from Iterable Takes : " + TimeUnit.MILLISECONDS.convert((end - start),
TimeUnit.NANOSECONDS) + " milli seconds");
}
Parallel: true
ParallelStream from Iterable Takes: 12517 milliseconds
And such a trivial calculation method.
public static Integer calculate(Integer input) {
return input + 2;
}
source to share
Not all separators are created equal. One of the tasks of the splitter is to split the source into two parts that can be processed in parallel. A good separator will split the source roughly in half (and can continue to do so recursively.)
Now, imagine you are writing a source separator that is only described by an Iterator. What quality of decomposition can you get? Basically, all you can do is split the source into "first" and "rest". This is as bad as it gets. The result is a computation tree that is very "right hard".
The delimiter you get from the data structure has more options to work with; it knows the layout of the data and can use it for better separation and therefore better parallel performance. The delimiter for ArrayList can always be halved and stores information about how much data is in each half. This is really good. A B-tree splitter might get a good distribution (since each half of the tree has about half the elements), but not as good as an ArrayList splitter because it doesn't know the exact size. The delimiter for LinkedList is about as bad as it gets; all he can do is (first, rest). And the same for getting the delimiter from the iterator.
Now everything is not necessarily lost; if the work per element is high, you can overcome the bad splitting. But if you do little work per item, you will be limited by the quality of the splits from your splitter.
source to share
There are several problems with your benchmark.
-
Stream<Integer>
cannot be compared withIntStream
due to boxing overhead. - You don't do anything with the result of the computation, which makes it difficult to know if the code is actually executing.
- You are comparing with help
System.nanoTime
instead of using a proper comparison tool.
Here's an example of a JMH based test:
import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;
public class Ranges {
final static int SIZE = 10_000_000;
@Benchmark
public long intStream() {
Stream<Integer> st = IntStream.rangeClosed(1, SIZE).boxed();
return st.parallel().mapToInt(x -> x).sum();
}
@Benchmark
public long contiguousSet() {
ContiguousSet<Integer> cs = ContiguousSet.create(Range.closed(1, SIZE), DiscreteDomain.integers());
Stream<Integer> st = cs.stream();
return st.parallel().mapToInt(x -> x).sum();
}
public static void main(String[] args) throws RunnerException {
new Runner(
new OptionsBuilder()
.include(".*Ranges.*")
.forks(1)
.warmupIterations(5)
.measurementIterations(5)
.build()
).run();
}
}
And the output is:
Benchmark Mode Samples Score Score error Units
b.Ranges.contiguousSet thrpt 5 13.540 0.924 ops/s
b.Ranges.intStream thrpt 5 27.047 5.119 ops/s
So, IntStream.range
about twice as fast as ContiguousSet
, which is perfectly reasonable given that the ContiguousSet does not implement its own Spliterator and uses the default fromSet
source to share