GCC auto-vectorisation (unhandled data-ref)
I don't understand why the following code is not vectorized with gcc 4.4.6:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
for (int i = 0; i < iSize; i++)
pfResult[i] = pfResult[i] + pfTab[iIndex];
}
note: not vectorized: unhandled data-ref
However, if I write the following code
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
float fTab = pfTab[iIndex];
for (int i = 0; i < iSize; i++)
pfResult[i] = pfResult[i] + fTab;
}
gcc succeeds in auto-vectorizing this loop
If I add an OpenMP directive:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
float fTab = pfTab[iIndex];
#pragma omp parallel for
for (int i = 0; i < iSize; i++)
pfResult[i] = pfResult[i] + fTab;
}
I get the following message: not vectorized: unhandled data-ref
Could you please help me understand why the first and third versions are not auto-vectorized?
Second question: math functions (exp, log, etc.) don't seem to be vectorized. This code, for example,
for (int i = 0; i < iSize; i++)
pfResult[i] = exp(pfResult[i]);
is not vectorized. Is it because of my gcc version?
Edit: with the newer gcc 4.8.1 and OpenMP 2011 (echo | cpp -fopenmp -dM | grep -i open) I get the following messages for all types of loops, even the simplest one:
for (iGID = 0; iGID < iSize; iGID++)
{
pfResult[iGID] = fValue;
}
note: not consecutive access *_144 = 5.0e-1;
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
Edit2:
#include<stdio.h>
#include<sys/time.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>
#include <unistd.h> /* getpagesize() */
int main()
{
int szGlobalWorkSize = 131072;
int iGID = 0;
int j = 0;
omp_set_dynamic(0);
// warmup
#if WARMUP
#pragma omp parallel
{
#pragma omp master
{
printf("%d threads\n", omp_get_num_threads());
}
}
#endif
printf("Pagesize=%d\n", getpagesize());
float *pfResult = (float *)malloc(szGlobalWorkSize * 100* sizeof(float));
float fValue = 0.5f;
struct timeval tim;
gettimeofday(&tim, NULL);
double tLaunch1=tim.tv_sec+(tim.tv_usec/1000000.0);
double time = omp_get_wtime();
int iChunk = getpagesize();
int iSize = ((int)szGlobalWorkSize * 100) / iChunk;
//#pragma omp parallel for
for (iGID = 0; iGID < iSize; iGID++)
{
pfResult[iGID] = fValue;
}
time = omp_get_wtime() - time;
gettimeofday(&tim, NULL);
double tLaunch2=tim.tv_sec+(tim.tv_usec/1000000.0);
printf("%.6lf Time1\n", tLaunch2-tLaunch1);
printf("%.6lf Time2\n", time);
}
Result with:
#define _OPENMP 201107
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
gcc -march=native -fopenmp -O3 -ftree-vectorizer-verbose=2 test.c -lm
I get a lot of:
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
and note: not consecutive access *_144 = 5.0e-1;
Thanks.
GCC cannot vectorize the first version of your loop because it cannot prove that pfTab[iIndex] is not contained somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). Indeed, if pfTab[iIndex] lies somewhere within that memory, then its value must be overwritten by the assignment in the loop body, and the new value must be used in the iterations that follow. You should use the restrict keyword to hint to the compiler that this can never happen, and then it should happily vectorize your code:
$ cat foo.c
int MyFunc(const float *restrict pfTab, float *restrict pfResult,
int iSize, int iIndex)
{
for (int i = 0; i < iSize; i++)
pfResult[i] = pfResult[i] + pfTab[iIndex];
}
$ gcc -v
...
gcc version 4.6.1 (GCC)
$ gcc -std=c99 -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:3: note: LOOP VECTORIZED.
foo.c:1: note: vectorized 1 loops in function.
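To make the aliasing concern concrete, here is a minimal sketch (the function names add_reload, add_hoisted and aliasing_changes_result are hypothetical, not from the question) in which pfTab deliberately points into pfResult. Without restrict, the compiler has to preserve this behaviour, which is why it cannot simply hoist the load out of the loop:

```c
/* Version 1 from the question: pfTab[iIndex] is re-read on every iteration. */
void add_reload(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}

/* Version 2 from the question: pfTab[iIndex] is read once, before the loop. */
void add_hoisted(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}

/* Hypothetical helper: returns 1 if the two versions disagree when
 * pfTab and pfResult refer to the same array, 0 otherwise. */
int aliasing_changes_result(void)
{
    float a[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};

    add_reload(a, a, 4, 0);   /* a[0] becomes 2, later iterations add 2 */
    add_hoisted(b, b, 4, 0);  /* every element gets the original 1 added */

    for (int i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 1;
    return 0;
}
```

With the aliased call, add_reload produces {2, 3, 3, 3} while add_hoisted produces {2, 2, 2, 2}. Declaring the pointers restrict is a promise that such a call never happens; making it anyway is undefined behaviour.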
The second version vectorizes because the value is copied into a variable with automatic storage duration. The general assumption here is that pfResult does not extend over the stack memory where fTab is stored (a cursory read of the C99 language specification does not make it clear whether that is a weak assumption or something the standard allows for).
The OpenMP version does not vectorize because of the way OpenMP is implemented in GCC. It uses code outlining for the parallel regions.
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
float fTab = pfTab[iIndex];
#pragma omp parallel for
for (int i = 0; i < iSize; i++)
pfResult[i] = pfResult[i] + fTab;
}
effectively becomes:
struct omp_data_s
{
float *pfResult;
int iSize;
float fTab; // shared scalar, copied in and out by value
};
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
float fTab = pfTab[iIndex];
struct omp_data_s omp_data_o;
omp_data_o.pfResult = pfResult;
omp_data_o.iSize = iSize;
omp_data_o.fTab = fTab;
GOMP_parallel_start (MyFunc_omp_fn0, &omp_data_o, 0);
MyFunc_omp_fn0 (&omp_data_o);
GOMP_parallel_end ();
pfResult = omp_data_o.pfResult;
iSize = omp_data_o.iSize;
fTab = omp_data_o.fTab;
}
void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i)
{
int start = ...; // compute starting iteration for current thread
int end = ...; // compute ending iteration for current thread
for (int i = start; i < end; i++)
omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}
MyFunc_omp_fn0 contains the outlined function code. The compiler cannot prove that omp_data_i->pfResult does not point to memory that aliases omp_data_i itself, and specifically its member fTab.
In order to vectorize that loop, you have to make fTab firstprivate. This turns it into an automatic variable in the outlined code, which is equivalent to your second case:
$ cat foo.c
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
float fTab = pfTab[iIndex];
#pragma omp parallel for firstprivate(fTab)
for (int i = 0; i < iSize; i++)
pfResult[i] = pfResult[i] + fTab;
}
$ gcc -std=c99 -fopenmp -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: LOOP VECTORIZED.
foo.c:4: note: vectorized 1 loops in function.
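As a rough illustration of why firstprivate helps, the two outlined bodies can be written out by hand. This is only a sketch: the struct layout, the function names outlined_shared and outlined_firstprivate, and the start/end parameters are hypothetical simplifications, not the code GCC actually generates.

```c
/* Hypothetical stand-in for the data block passed to the outlined function. */
struct omp_data_s
{
    float *pfResult;
    int    iSize;
    float  fTab;
};

/* Shared fTab: it is read through omp_data_i on every iteration, so each
 * store to pfResult[i] might, as far as the compiler knows, modify it. */
void outlined_shared(struct omp_data_s *omp_data_i, int start, int end)
{
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

/* firstprivate fTab: the value is copied into an automatic variable before
 * the loop, which mirrors the second version in the question. */
void outlined_firstprivate(struct omp_data_s *omp_data_i, int start, int end)
{
    float fTab = omp_data_i->fTab;
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + fTab;
}
```

Both functions compute the same result whenever pfResult does not overlap the struct; the difference is only in what the compiler is allowed to assume. In real OpenMP code the iteration bounds are computed inside the outlined function rather than passed in as arguments.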