Java sequential implementation is 4 times faster than parallel implementation

I created a very simple scenario and ran into some really strange behavior that I can't figure out.

I created a sequential implementation at the following link: http://ideone.com/B8JYeA. It is basically a set of large fixed-size arrays; the algorithm iterates through them and updates each value.

for(int i = 0; i < numberOfCells; i++) {
    h0[i] =  h0[i] + 1;
    h1[i] =  h1[i] + 1;
    h2[i] =  h2[i] + 1;
    h3[i] =  h3[i] + 1;
    h4[i] =  h4[i] + 1;
}

If I run it on my workstation, it takes about 5 seconds.

I implemented the same algorithm in a parallel version, with 8 threads running it simultaneously. The code is thread safe and there are no dependencies between the threads.

But the code still runs about 4x slower on my workstation: http://ideone.com/yfwVmr

final int numberOfThreads = Runtime.getRuntime().availableProcessors();

ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);

for(int thread = 0; thread < numberOfThreads; thread++) {
    final int threadId = thread;
    exec.submit(new Runnable() {
        @Override
        public void run() {
            // interleaved split: thread t touches cells t, t+N, t+2N, ...
            for(int i = threadId; i < numberOfCells; i += numberOfThreads) {
                h0[i] =  h0[i] + 1;
                h1[i] =  h1[i] + 1;
                h2[i] =  h2[i] + 1;
                h3[i] =  h3[i] + 1;
                h4[i] =  h4[i] + 1;
            }
        }
    });
}

exec.shutdown();

Does anyone have any idea why this is happening?

Edit: This problem is different from the others because the cause appears to be a caching issue. How can I solve this caching problem?



2 answers


The biggest overhead is the time spent starting and stopping threads. If I reduce the array size from 10,000 to 10, it takes about the same amount of time.

If you keep the thread pool alive and give each thread its own local data set to write to, it is 4x faster on my 6-core machine.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;


public class ParallelImplementationOptimised {
    static final int numberOfThreads = Runtime.getRuntime().availableProcessors();
    final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);

    private int numberOfCells;

    public ParallelImplementationOptimised(int numberOfCells) {
        this.numberOfCells = numberOfCells;
    }

    public void update() throws ExecutionException, InterruptedException {

        List<Future<?>> futures = new ArrayList<>();
        for(int thread = 0; thread < numberOfThreads; thread++) {
            final int threadId = thread;
            futures.add(exec.submit(new Runnable() {
                @Override
                public void run() {
                    int num = numberOfCells / numberOfThreads;
                    // each thread writes only to its own freshly allocated
                    // arrays, so no cache line is ever shared between threads
                    double[] h0 = new double[num],
                            h1 = new double[num],
                            h2 = new double[num],
                            h3 = new double[num],
                            h4 = new double[num],
                            h5 = new double[num],
                            h6 = new double[num],
                            h7 = new double[num],
                            h8 = new double[num],
                            h9 = new double[num];
                    for (int i = 0; i < num; i++) {
                        h0[i] = h0[i] + 1;
                        h1[i] = h1[i] + 1;
                        h2[i] = h2[i] + 1;
                        h3[i] = h3[i] + 1;
                        h4[i] = h4[i] + 1;
                        h5[i] = h5[i] + 1;
                        h6[i] = h6[i] + 1;
                        h7[i] = h7[i] + 1;
                        h8[i] = h8[i] + 1;
                        h9[i] = h9[i] + 1;
                    }
                }
            }));
        }
        for (Future<?> future : futures) {
            future.get();
        }
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {

        ParallelImplementationOptimised si = new ParallelImplementationOptimised(10);

        long start = System.currentTimeMillis();

        for (int i = 0; i < 10000; i++) {
            if(i % 1000 == 0) {
                System.out.println(i);
            }
            si.update();
        }

        long stop = System.currentTimeMillis();
        System.out.println("Time: " + (stop - start));
        si.exec.shutdown();
    }

}

Sequential implementation: 3.3 s. Optimised parallel implementation: 0.8 s.




It seems that different threads are writing to the same cache lines. This means the data has to take a round trip through the L3 cache, which takes about 20 times longer than an L1 cache access. I suggest you try completely separate data structures, kept at least 128 bytes apart, to be sure you are not touching the same cache line.

Note: even if you intend to overwrite the entire cache line, x64 processors will still fetch the previous cache line values first.
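
To make the 128-byte advice concrete, here is a minimal sketch of the padding idea (my own illustration, not code from the question; the names PaddingSketch, PAD and cellsPerThread are made up). Every thread gets a contiguous region of one shared array, and consecutive regions are separated by 16 doubles (128 bytes), so two threads can never write to the same 64-byte cache line:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PaddingSketch {
    public static void main(String[] args) {
        final int numberOfThreads = Runtime.getRuntime().availableProcessors();
        final int cellsPerThread = 10000;
        // 16 doubles = 128 bytes of padding between regions, enough to
        // guarantee two threads never share a 64-byte cache line
        final int PAD = 16;

        final double[] data = new double[numberOfThreads * (cellsPerThread + PAD)];
        final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);

        for (int t = 0; t < numberOfThreads; t++) {
            final int base = t * (cellsPerThread + PAD);
            exec.submit(new Runnable() {
                @Override
                public void run() {
                    // contiguous region, private to this thread
                    for (int i = base; i < base + cellsPerThread; i++) {
                        data[i] = data[i] + 1;
                    }
                }
            });
        }
        exec.shutdown();
    }
}

Block partitioning, as in the other answer below, gets most of the same benefit, since only the elements at the region boundaries can ever fall on a shared line.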

Another question could be:

Why isn't this 20x slower?

The processor core that has acquired a cache line can run two hyper-threaded threads on it (i.e. both threads access the data locally), and the core may get through the loop several times before it loses the cache line to another core that needs it. This means the 20x penalty is not paid on every access or on every loop iteration, just often enough to produce a noticeably slower result.



Not really an answer, but: first I would try to maintain data access locality where possible:

final int numberOfCellsPerThread = numberOfCells / numberOfThreads;

public void run() {
    final int start = threadId * numberOfCellsPerThread;
    // hand any remainder cells to the last thread when the division is not exact
    final int end = (threadId == numberOfThreads - 1)
            ? numberOfCells
            : start + numberOfCellsPerThread;
    for(int i = start; i < end; i++) {
        h0[i] =  h0[i] + 1;
        h1[i] =  h1[i] + 1;
        h2[i] =  h2[i] + 1;
        h3[i] =  h3[i] + 1;
        h4[i] =  h4[i] + 1;
    }
}

For a more detailed explanation of why locality matters, see e.g. Why is cache locality important to array performance? or http://en.wikipedia.org/wiki/Locality_of_reference .



It is mostly about using data that is already in the cache wherever possible. Since the cache is limited in size, if a[i] is already in the cache (for example, because of a previous read), the probability that a[i+1] is in the cache as well is quite high; at least much higher than the chance that a[i+100] is, for example.

In addition, sequential reads from memory can potentially be batched by the hardware, and they are the access pattern most easily predicted by prefetching.
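
As a rough way to observe this yourself, here is a hypothetical micro-benchmark sketch (my own, not from the answer; it ignores JIT warm-up, and the numbers will vary by machine). It updates every element of a large array twice: once in a sequential pass, and once in a strided pattern that does the same amount of work but keeps jumping to a different cache line:

public class LocalitySketch {
    public static void main(String[] args) {
        final int n = 1 << 23; // 8M doubles (64 MB), far larger than the caches
        final int stride = 4096;
        final double[] a = new double[n];

        // sequential pass: every cache line that is loaded is fully used
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            a[i] += 1;
        }
        final long sequential = System.nanoTime() - t0;

        // strided pass: the same number of updates, but consecutive accesses
        // land on different cache lines, defeating prefetching and line reuse
        t0 = System.nanoTime();
        for (int offset = 0; offset < stride; offset++) {
            for (int i = offset; i < n; i += stride) {
                a[i] += 1;
            }
        }
        final long strided = System.nanoTime() - t0;

        System.out.println("sequential: " + sequential / 1000000 + " ms, "
                + "strided: " + strided / 1000000 + " ms");
    }
}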
