Unexpected performance of parallel streams in Java 8

I am having a performance issue with a parallel stream created via spliterator() over an Iterable, i.e. StreamSupport.stream(integerList.spliterator(), true). I wanted to verify it against a regular collection, so below are some test results.

Question: Why is a parallel stream created from an Iterable so much slower than a stream created from an ArrayList or an IntStream range?

From IntStream range

public void testParallelFromIntRange() {
    long start = System.nanoTime();
    IntStream stream = IntStream.rangeClosed(1, Integer.MAX_VALUE).parallel();
    System.out.println("Is Parallel: " + stream.isParallel());
    stream.forEach(ParallelStreamSupportTest::calculate);
    long end = System.nanoTime();
    System.out.println("ParallelStream from range takes: " + TimeUnit.MILLISECONDS.convert((end - start),
            TimeUnit.NANOSECONDS) + " milliseconds");
}


Is Parallel: true
ParallelStream from range takes: 490 milliseconds

From Iterable

public void testParallelFromIterable() {
    Set<Integer> integerList = ContiguousSet.create(Range.closed(1, Integer.MAX_VALUE), DiscreteDomain.integers());
    long start = System.nanoTime();
    Stream<Integer> stream = StreamSupport.stream(integerList.spliterator(), true);
    System.out.println("Is Parallel: " + stream.isParallel());
    stream.forEach(ParallelStreamSupportTest::calculate);
    long end = System.nanoTime();
    System.out.println("ParallelStream from Iterable takes: " + TimeUnit.MILLISECONDS.convert((end - start),
            TimeUnit.NANOSECONDS) + " milliseconds");
}


Is Parallel: true
ParallelStream from Iterable takes: 12517 milliseconds

And the trivial calculation method:

public static Integer calculate(Integer input) {
    return input + 2;
}



2 answers


Not all spliterators are created equal. One of the jobs of a spliterator is to decompose the source into two parts that can be processed in parallel. A good spliterator will split the source roughly in half (and can continue to do so recursively).

Now, imagine you are writing a spliterator for a source that is only described by an Iterator. What quality of decomposition can you get? Basically, all you can do is split the source into "first element" and "rest". That is about as bad as it gets. The result is a computation tree that is very "right-heavy".



The spliterator you get from a data structure has more to work with; it knows the layout of the data and can use it for better splitting and therefore better parallel performance. The spliterator for ArrayList can always split roughly in half, and it knows exactly how much data is in each half. That is really good. The spliterator for a balanced tree, such as a TreeMap, can get a good distribution (since each half of the tree has about half the elements), but it is not quite as good as the ArrayList spliterator because it does not know the exact sizes. The spliterator for LinkedList is about as bad as it gets; all it can do is (first, rest). And the same goes for a spliterator derived from an iterator.
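
To make the difference concrete, here is a small sketch (not from the original answer; class name and sizes are made up for illustration) that compares the first trySplit() of an ArrayList spliterator with one built from a plain Iterator via Spliterators.spliterator:

import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

public class SplitQualityDemo {

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);
        }

        // ArrayList knows its layout, so the first split covers roughly half the elements.
        Spliterator<Integer> arraySpliterator = list.spliterator();
        Spliterator<Integer> firstHalf = arraySpliterator.trySplit();
        System.out.println("ArrayList split: " + firstHalf.estimateSize()
                + " / " + arraySpliterator.estimateSize());

        // A spliterator built only from an Iterator can merely peel a batch off the front
        // ("first, rest"), so the two pieces are badly unbalanced.
        Spliterator<Integer> iteratorSpliterator =
                Spliterators.spliterator(list.iterator(), list.size(), 0);
        Spliterator<Integer> firstBatch = iteratorSpliterator.trySplit();
        System.out.println("Iterator-based split: " + firstBatch.estimateSize()
                + " / " + iteratorSpliterator.estimateSize());
    }
}

On a typical JDK I would expect the ArrayList spliterator to report two halves of about 500,000 elements each, while the iterator-based spliterator only splits off a small front batch (on the order of a thousand elements) and keeps everything else on one side.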

Now, all is not necessarily lost; if the work per element is high, you can overcome bad splitting. But if you do little work per element, you will be limited by the quality of the splits from your spliterator.
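
For instance, a hypothetical heavier per-element function (not from the original post, and much more expensive than the calculate above) can still keep all cores busy even with poor splits, because each badly-sized chunk still carries plenty of CPU work:

// Hypothetical expensive per-element work (for illustration only).
// With work like this, even an unbalanced split keeps the worker threads busy,
// so the poor spliterator matters far less than with a trivial "input + 2".
public static int expensiveCalculate(int input) {
    int result = input;
    for (int i = 0; i < 10_000; i++) {
        result = (result * 31 + i) % 1_000_003;
    }
    return result;
}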



There are several problems with your benchmark.

  • Stream<Integer> cannot be compared with IntStream due to boxing overhead.
  • You don't do anything with the result of the computation, which makes it difficult to know whether the code is actually executed at all.
  • You are measuring with System.nanoTime instead of using a proper benchmarking tool.

Here's an example of a JMH-based benchmark:

import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Ranges {

    final static int SIZE = 10_000_000;

    @Benchmark
    public long intStream() {
        Stream<Integer> st = IntStream.rangeClosed(1, SIZE).boxed();

        return st.parallel().mapToInt(x -> x).sum();
    }

    @Benchmark
    public long contiguousSet() {
        ContiguousSet<Integer> cs = ContiguousSet.create(Range.closed(1, SIZE), DiscreteDomain.integers());
        Stream<Integer> st = cs.stream();

        return st.parallel().mapToInt(x -> x).sum();
    }

    public static void main(String[] args) throws RunnerException {
        new Runner(
                new OptionsBuilder()
                .include(".*Ranges.*")
                .forks(1)
                .warmupIterations(5)
                .measurementIterations(5)
                .build()
        ).run();
    }
}




And the output is:

Benchmark                  Mode   Samples        Score  Score error    Units
b.Ranges.contiguousSet    thrpt         5       13.540        0.924    ops/s
b.Ranges.intStream        thrpt         5       27.047        5.119    ops/s


So, IntStream.range is about twice as fast as ContiguousSet, which is perfectly reasonable given that ContiguousSet does not implement its own Spliterator and uses the default one it gets from Set (which is built on the iterator).
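
You can check this yourself with a quick sketch (not part of the benchmark above): compare the first trySplit() of the range spliterator with that of the ContiguousSet. If the set falls back to the default iterator-backed spliterator, its first split is only a small front batch, whereas the range splits in half; newer Guava releases may behave differently.

import com.google.common.collect.ContiguousSet;
import com.google.common.collect.DiscreteDomain;
import com.google.common.collect.Range;
import java.util.Spliterator;
import java.util.stream.IntStream;

public class SplitCheck {

    static final int SIZE = 10_000_000;

    public static void main(String[] args) {
        // The range knows its bounds, so the first split covers half the range.
        Spliterator.OfInt range = IntStream.rangeClosed(1, SIZE).spliterator();
        Spliterator.OfInt rangeHalf = range.trySplit();
        System.out.println("range split: " + rangeHalf.estimateSize()
                + " / " + range.estimateSize());

        // If ContiguousSet uses the default iterator-backed spliterator,
        // the first split only peels a small batch off the front.
        Spliterator<Integer> set = ContiguousSet
                .create(Range.closed(1, SIZE), DiscreteDomain.integers())
                .spliterator();
        Spliterator<Integer> setPiece = set.trySplit();
        System.out.println("set split: " + setPiece.estimateSize()
                + " / " + set.estimateSize());
    }
}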
