OpenMP implementation makes my code slow
I am trying to make quicksort run in parallel using OpenMP. After adding OpenMP, my attempt to speed up quicksort backfired: sorting the array became almost twice as slow. My code with the OpenMP implementation:
void quickSort(int a[], int l, int r) {
    int j;
    if (l < r) {
#pragma omp parallel
        {
            j = partition(a, l, r);
#pragma omp sections
            {
#pragma omp section
                {
                    quickSort(a, l, j - 1);
                }
#pragma omp section
                {
                    quickSort(a, j + 1, r);
                }
            }
        }
    }
}
The actual splitting of the array is done in the partition method; if you're wondering how it works, here's the code:
int partition(int a[], int l, int r) {
    int pivot, i, j, t;
    pivot = a[l];
    i = l; j = r + 1;
    while (1) {
        do ++i; while (a[i] <= pivot && i <= r);
        do --j; while (a[j] > pivot);
        if (i >= j) break;
        t = a[i]; a[i] = a[j]; a[j] = t;
    }
    t = a[l]; a[l] = a[j]; a[j] = t;
    return j;
}
I start a timer in main right before calling quickSort and stop it right after it returns, just before the printf. The number of threads is defined as 10 (I also tried 4, 2 and 1 on my computer).
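In simplified form, the timing part of my main looks roughly like this (the real code may differ in details; omp_get_wtime is just one convenient wall-clock timer):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1000000

/* quickSort is the function shown above */
void quickSort(int a[], int l, int r);

int main(void) {
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; ++i)
        a[i] = rand() % 101;          /* random numbers from 0 to 100 */

    double start = omp_get_wtime();   /* timer starts right before the sort */
    quickSort(a, 0, N - 1);
    double end = omp_get_wtime();     /* and stops right after it returns */

    printf("time: %f\n", end - start);
    free(a);
    return 0;
}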
My results after sorting an array of 1,000,000 random numbers from 0 to 100:

time without OpenMP: 6.48004 to 5.32001
time with OpenMP: 11.8309 and 10.6239 (with 2-4 threads)

How can this be?
The general idea behind quicksort is:
[......................]
This list of items is broken down into 2 tasks:
[..........][..........]
And each of the "tasks" is then split over and over again:
[..][..][..][..][..][..]
Now, a processor loves to work on data that is close together. But when each of your cores works on data that lies very close together, one core may write a piece of data that sits in the same cache line as data another core is using. Since the cores must not overwrite each other's data behind each other's backs, the first write invalidates that cache line in all the other cores, and those cores then have to fetch the data from RAM again.
|--- cache line ---|
[..][..][..][..][..][..]
^ ^ ^ ^
| | | |
c1 c2 c3 c4
So, whenever a core writes data belonging to that cache line, it first invalidates the line in all the other cores. Since you are working on small tasks [..] that lie quite close together, you raise the odds of many invalidated cache lines and of a lot of data being re-fetched from memory. The effect is explained much better here:
http://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share
Also read http://lwn.net/Articles/252125/, especially "3.3.4 Multi-Processor Support".
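To see the effect in isolation, here is a minimal toy sketch of my own (nothing to do with quicksort itself): four threads increment counters that either share one cache line or are padded out to separate lines. On most machines the packed version is noticeably slower; the exact numbers depend on your hardware:

#include <stdio.h>
#include <omp.h>

#define ITERS 100000000L

/* Four counters packed together: they all live in one cache line.
   volatile keeps the compiler from collapsing the increment loop. */
volatile long packed[4];

/* The same counters, each padded out to its own 64-byte cache line. */
struct padded_counter { volatile long value; char pad[56]; };
struct padded_counter padded[4];

int main(void) {
    double t0 = omp_get_wtime();
#pragma omp parallel for num_threads(4)
    for (int t = 0; t < 4; ++t)
        for (long i = 0; i < ITERS; ++i)
            packed[t]++;              /* every write invalidates the line in the other cores */
    double t1 = omp_get_wtime();

#pragma omp parallel for num_threads(4)
    for (int t = 0; t < 4; ++t)
        for (long i = 0; i < ITERS; ++i)
            padded[t].value++;        /* each core keeps its own line to itself */
    double t2 = omp_get_wtime();

    printf("packed: %f s, padded: %f s\n", t1 - t0, t2 - t1);
    return 0;
}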
All of this cache-line invalidation does not happen (at least not nearly as often) in your non-parallel version, since only one core is actively working on the data.
Thus, a possible solution is to stop splitting into separate tasks once they become too small to run efficiently on their own cores. Another effect you should take into account: OpenMP has to do a bit of management work for each task. If the tasks are too small, you also worsen the overhead-to-work ratio.
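A rough sketch of that idea using OpenMP tasks (it reuses the partition function from the question; THRESHOLD is only an illustrative cut-off, tune it for your machine):

/* from the question */
int partition(int a[], int l, int r);

#define THRESHOLD 1000

/* Plain serial quicksort for the small pieces. */
void quickSortSerial(int a[], int l, int r) {
    if (l < r) {
        int j = partition(a, l, r);
        quickSortSerial(a, l, j - 1);
        quickSortSerial(a, j + 1, r);
    }
}

void quickSortTask(int a[], int l, int r) {
    if (l < r) {
        if (r - l < THRESHOLD) {      /* small piece: a new task is not worth the overhead */
            quickSortSerial(a, l, r);
            return;
        }
        int j = partition(a, l, r);   /* partition once, on the current thread */
#pragma omp task                      /* a, l and j are firstprivate in the task by default */
        quickSortTask(a, l, j - 1);
#pragma omp task
        quickSortTask(a, j + 1, r);
#pragma omp taskwait
    }
}

/* Entry point: one thread creates the first tasks, the team works them off. */
void quickSortParallel(int a[], int n) {
#pragma omp parallel
#pragma omp single nowait
    quickSortTask(a, 0, n - 1);
}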
An OpenMP-based quicksort that a quick Google search turns up,
http://berenger.eu/blog/c-openmp-a-shared-memory-quick-sort-with-openmp-tasks-example-source-code/
may inspire you.