OpenMP loop gives different results for the same sequential loop
I have serial code:
double* a = malloc((1000000) * sizeof(double));
double* b = malloc((1000000) * sizeof(double));
double totalA = 0;
for (int i = 0; i < 1000000; i++) {
if (i == 0) {
a[i] = sin(i);
}
b[i] = sin(i+1);
if (i < 1000000-1) {
a[i+1] = b[i];
}
totalA += a[i];
}
Output totalA
after this sequential loop 0.232883978073
.
Then I have an OpenMP version (note: all variables are reinitialized):
double* a = malloc((1000000) * sizeof(double));
double* b = malloc((1000000) * sizeof(double));
double totalA = 0;
#pragma omp parallel for reduction(+:totalA)
for (int i = 0; i < 1000000; i++) {
if (i == 0) {
a[i] = sin(i);
}
b[i] = sin(i+1);
if (i < 1000000-1) {
a[i+1] = b[i];
}
totalA += a[i];
}
However, the output is totalA
from this code -0.733714826779
.
I cannot understand what my life is, why it is different.
Thank.
UPDATE
After playing around for a while, it seems like the statements if
in the loop are being ignored strangely. The actual statements in blocks if
are executed on all iterations of the loop (as if the clause if
does not exist).
For example, changing the block if
to:
if (i < 555555) {
a[i+1] = b[i];
}
doesn't seem to matter.
I'm still no wiser than what's going on here.
Your code contains a race condition . The conflicting statements are the assignment a[i+1] = b[i];
that writes to the array a
and the operator totalA += a[i];
that is read from a
.
There is no guarantee in your code that the iteration responsible for writing to a specific location in the array is performed prior to the iteration that is read from that location.
To further demonstrate this problem, ordering a segment of a loop containing conflicting statements solves the problem (but most likely destructive to your performance):
#pragma omp parallel for ordered reduction(+:totalA)
for (int i = 0; i < 1000000; i++) {
if (i == 0) {
a[i] = sin(i);
}
b[i] = sin(i+1);
#pragma omp ordered
{
if (i < 1000000-1) {
a[i+1] = b[i];
}
totalA += a[i];
}
}
It's probably best to avoid the problem altogether and rewrite your program to get rid of the loop-related dependency:
#define N 1000000
double* a = malloc(N * sizeof(double));
double* b = malloc(N * sizeof(double));
double totalA = 0;
a[0] = sin(0);
totalA += a[0];
#pragma omp parallel for reduction(+:totalA)
for (int i = 0; i < N - 1; i++) {
b[i] = sin(i + 1);
a[i + 1] = b[i];
totalA += a[i + 1];
}
b[N - 1] = sin(N);
Finally, please note that as a sin(0) ≈ 0.0
statement
a[0] = sin(0); totalA += a[0];
can simply be replaced with
a[0] = 0.0;
You can simplify your main loop:
a[0] = sin(0);
for (int i = 0; i < N-1; i++) {
b[i] = sin(i+1);
a[i+1] = b[i];
totalA += a[i];
}
totalA += a[N-1];
b[N-1] = sin(N);
Let's call the iterator for thread1 i1
and thread2 i2
. Then, when, for example, i2=i1+1
write to a[i1+1]
and read to a[i2]
will be the same address, and the value read or written will depend on which stream is the first. This is a race condition.
A simple fixation is to notice that totalA = a[0] + b[0] + b[1] + ... b[N-2]
.
a[0] = sin(0);
totalA += a[0];
#pragma omp parallel for reduction(+:totalA)
for (int i = 0; i < N-1; i++) {
b[i] = sin(i+1);
a[i+1] = b[i];
totalA += b[i];
}
b[N-1] = sin(N);
This creates 0.232883978073
for N=1000000
.