GCC auto-vectorisation (unhandled data-ref)

Question

I don't understand why the following code is not vectorized with gcc 4.4.6:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + pfTab[iIndex];
}

 note: not vectorized: unhandled data-ref

However, if I write the following code:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

gcc succeeds in auto-vectorizing this loop.

If I add an OpenMP directive:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  #pragma omp parallel for
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

I get the following message: not vectorized: unhandled data-ref

Could you please help me understand why the first and the third versions are not auto-vectorized?

Second question: math functions (exp, log, etc.) don't seem to be vectorized. This code, for example:

for (int i = 0; i < iSize; i++)
         pfResult[i] = exp(pfResult[i]);

is not vectorized. Is it because of my gcc version?

Edit: with the newer gcc 4.8.1 and OpenMP 2011 (echo | cpp -fopenmp -dM | grep -i open), I get the following messages for all kinds of loops, even the simplest ones:

   for (iGID = 0; iGID < iSize; iGID++)
        {
             pfResult[iGID] = fValue;
        }


note: not consecutive access *_144 = 5.0e-1;
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.

Edit2:

#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>   /* getpagesize() */
#include <omp.h>

int main()
{
        int szGlobalWorkSize = 131072;
        int iGID = 0;
        int j = 0;
        omp_set_dynamic(0);
        // warmup
        #if WARMUP
        #pragma omp parallel
        {
        #pragma omp master
        {
        printf("%d threads\n", omp_get_num_threads());
        }
        }
        #endif
        printf("Pagesize=%d\n", getpagesize());
        float *pfResult = (float *)malloc(szGlobalWorkSize * 100* sizeof(float));
        float fValue = 0.5f;
        struct timeval tim;
        gettimeofday(&tim, NULL);
        double tLaunch1=tim.tv_sec+(tim.tv_usec/1000000.0);
        double time = omp_get_wtime();
        int iChunk = getpagesize();
        int iSize = ((int)szGlobalWorkSize * 100) / iChunk;
        //#pragma omp parallel for
        for (iGID = 0; iGID < iSize; iGID++)
        {
             pfResult[iGID] = fValue;
        }
        time = omp_get_wtime() - time;
        gettimeofday(&tim, NULL);
        double tLaunch2=tim.tv_sec+(tim.tv_usec/1000000.0);
        printf("%.6lf Time1\n", tLaunch2-tLaunch1);
        printf("%.6lf Time2\n", time);
}

Result with:

#define _OPENMP 201107
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)

gcc -march=native -fopenmp -O3 -ftree-vectorizer-verbose=2 test.c -lm

A lot of:

note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.

and

note: not consecutive access *_144 = 5.0e-1;

Thanks.

+3

gcc openmp

parisjohn 20 nov. '14 at 9:35

1 answer

Hristo Iliev · Accepted Answer · 2014-11-20T18:29:51+0000

GCC cannot vectorize the first version of your loop because it cannot prove that pfTab[iIndex] is not contained somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). Indeed, if pfTab[iIndex] is somewhere inside that memory, then its value must be overwritten by the assignment in the loop body, and the new value must be used in the iterations that follow. You should use the restrict keyword to hint to the compiler that this can never happen, and then it should happily vectorize your code:

$ cat foo.c
int MyFunc(const float *restrict pfTab, float *restrict pfResult,
           int iSize, int iIndex)
{
   for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + pfTab[iIndex];
}
$ gcc -v
...
gcc version 4.6.1 (GCC)
$ gcc -std=c99 -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:3: note: LOOP VECTORIZED.
foo.c:1: note: vectorized 1 loops in function.

The second version vectorizes because the value is copied into a variable with automatic storage duration. The general assumption here is that pfResult does not span the stack memory where fTab is stored (a cursory reading of the C99 language specification does not make it clear whether that assumption is weak or whether something in the standard permits it).
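To see why the compiler has to be conservative, here is a small sketch in which pfTab deliberately points into pfResult. The names add_reload, add_hoisted and demo_aliasing are made up for this illustration; the first two mirror the first and second versions of the loop. When the pointers overlap, the two versions produce different results, so GCC may not silently turn one into the other:

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the first version: pfTab[iIndex] is re-read on every iteration. */
void add_reload(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}

/* Mirrors the second version: the value is read once, before the loop. */
void add_hoisted(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}

/* Hypothetical driver: passes the same array as both pfTab and pfResult. */
void demo_aliasing(void)
{
    float a[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};

    add_reload(a, a, 4, 1);   /* a[1] changes at i == 1; later reads see it */
    add_hoisted(b, b, 4, 1);  /* always adds the original value 1.0f */

    printf("reload : %g %g %g %g\n", a[0], a[1], a[2], a[3]);
    printf("hoisted: %g %g %g %g\n", b[0], b[1], b[2], b[3]);

    /* The two versions disagree from index 2 onward. */
    assert(a[2] != b[2] && a[3] != b[3]);
}
```

Calling demo_aliasing() from main prints 2 2 3 3 for the reloading version and 2 2 2 2 for the hoisted one. With restrict, the caller promises that this kind of overlap never occurs.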

The OpenMP version does not vectorize because of the way OpenMP is implemented in GCC. It uses code outlining for parallel regions: the body of the region is moved into a separate function.

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  #pragma omp parallel for
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

effectively becomes:

struct omp_data_s
{
  float *pfResult;
  int iSize;
  float fTab;  // copied by value, matching the assignments below
};

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  struct omp_data_s omp_data_o;

  omp_data_o.pfResult = pfResult;
  omp_data_o.iSize = iSize;
  omp_data_o.fTab = fTab;

  GOMP_parallel_start (MyFunc_omp_fn0, &omp_data_o, 0);
  MyFunc_omp_fn0 (&omp_data_o);
  GOMP_parallel_end ();
  pfResult = omp_data_o.pfResult;
  iSize = omp_data_o.iSize;
  fTab = omp_data_o.fTab;
}

void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i)
{
  int start = ...; // compute starting iteration for current thread
  int end = ...; // compute ending iteration for current thread

  for (int i = start; i < end; i++)
    omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

MyFunc_omp_fn0 contains the outlined function code. The compiler cannot prove that omp_data_i->pfResult does not point to memory that aliases omp_data_i, and in particular its member fTab.

To vectorize this loop, you have to make fTab firstprivate. This turns it into an automatic variable in the outlined code, which makes it equivalent to your second case:

$ cat foo.c
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
   float fTab = pfTab[iIndex];
   #pragma omp parallel for firstprivate(fTab)
   for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}
$ gcc -std=c99 -fopenmp -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: LOOP VECTORIZED.
foo.c:4: note: vectorized 1 loops in function.
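As a rough, hand-written analogue (not GCC's actual generated code), the difference between the shared and the firstprivate case can be sketched with two outlined-style functions. The names body_shared, body_firstprivate and demo_outlined, the layout of omp_data_s, and the explicit start/end parameters are assumptions for illustration; in real code the iteration bounds come from the OpenMP runtime:

```c
#include <assert.h>

/* Hypothetical stand-in for the structure passed to the outlined function. */
struct omp_data_s
{
    float *pfResult;
    int iSize;
    float fTab;
};

/* Shared case: fTab is read through the struct inside the loop. A store
   through pfResult (a float *) could legally modify the float member fTab,
   so the value has to be treated as possibly changing. */
void body_shared(struct omp_data_s *omp_data_i, int start, int end)
{
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

/* firstprivate-like case: fTab is copied into an automatic variable first,
   which gives the loop the same shape as the second version. */
void body_firstprivate(struct omp_data_s *omp_data_i, int start, int end)
{
    float fTab = omp_data_i->fTab;
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + fTab;
}

/* Hypothetical driver: two chunks, as if two threads split the iterations. */
void demo_outlined(void)
{
    float x[8], y[8];
    for (int i = 0; i < 8; i++)
        x[i] = y[i] = (float)i;

    struct omp_data_s dx = {x, 8, 0.5f};
    struct omp_data_s dy = {y, 8, 0.5f};

    body_shared(&dx, 0, 4);
    body_shared(&dx, 4, 8);
    body_firstprivate(&dy, 0, 4);
    body_firstprivate(&dy, 4, 8);

    /* Without aliasing, both variants compute the same values. */
    for (int i = 0; i < 8; i++)
        assert(x[i] == y[i]);
}
```

Both functions give identical results when nothing overlaps; only the second has the form that the vectorizer accepted in the transcript. Whether a given GCC release vectorizes either hand-written function is best checked with -ftree-vectorizer-verbose=2.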
