Nested for loops with CUDA

I have a problem with some nested for loops that I need to convert from C/C++ to CUDA. Basically, I have four nested for loops that write to the same array and perform bit-shift operations.

#define N 65536

// ----------------------------------------------------------------------------------

int a1,a2,a3,a4, i1,i2,i3,i4;

int Bit4CBitmapLookUp[16] = {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};

int _cBitmapLookupTable[N];

int s = 0;  // index into the cBitmapLookupTable

for (i1 = 0; i1 < 16; i1++)
{
    // first customer
    a1 = Bit4CBitmapLookUp[i1] << 12;

    for (i2 = 0; i2 < 16; i2++)
    {
        // second customer
        a2 = Bit4CBitmapLookUp[i2] << 8;

        for (i3 = 0; i3 < 16; i3++)
        {
            // third customer
            a3 = Bit4CBitmapLookUp[i3] << 4;

            for (i4 = 0; i4 < 16; i4++)
            {
                // fourth customer
                a4 = Bit4CBitmapLookUp[i4];

                // now actually set the _cBitmapLookupTable value
                _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;

                s++;

            } // for i4
        } // for i3
    } // for i2
} // for i1

      

This is the code I have to convert to CUDA. I tried different approaches, but every time I got the wrong output. Here is my CUDA version (an excerpt from the kernel):

#define N 16

//----------------------------------------------------------------------------------

// index for the GPU
int i1 = blockDim.x * blockIdx.x + threadIdx.x;
int i2 = blockDim.y * blockIdx.y + threadIdx.y;
int i3 = i1;
int i4 = i2;

__syncthreads();
for(i1 = i2 = 0; i1 < N, i2 < N; i1++, i2++)
{
    // first customer
    a1 = Bit4CBitmapLookUp_device[i1] << 12;

    // second customer
    a2 = Bit4CBitmapLookUp_device[i2] << 8;

    for(i3 = i4 = 0; i3 < N, i4 < N; i3++, i4++){
        // third customer
        a3 = Bit4CBitmapLookUp_device[i3] << 4;

        // fourth customer
        a4 = Bit4CBitmapLookUp_device[i4];

        // now actually set the _cBitmapLookupTable value
        _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
        s++;
    }
} 

      

I am new to CUDA and still learning, and I cannot find a solution for these nested loops. Thank you in advance.

1 answer


As noted, there is an initialization problem: the for loops reset i1 and i2 to 0, which discards the thread indices computed just before them. I would recommend that you rewrite your program like this:

int i1 = blockDim.x * blockIdx.x + threadIdx.x;
int i2 = blockDim.y * blockIdx.y + threadIdx.y;
int i3;
int i4;

while(i1 < N && i2 < N){
  a1 = ..;
  a2 = ..;
  for(i3 = i4 = 0; i3 < N, i4 < N; i3++, i4++){
    // third customer
    a3 = Bit4CBitmapLookUp_device[i3] << 4;

    // fourth customer
    a4 = Bit4CBitmapLookUp_device[i4];

    // now actually set the _cBitmapLookupTable value
    _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
    s++;
  }
  s += blockDim.x*gridDim.x*blockDim.y*gridDim.y;
  i1 += blockDim.x*gridDim.x;
  i2 += blockDim.y*gridDim.y;
}

      



I have not tested it, so I cannot guarantee that the indexes are correct. I'll leave that to you.
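Since the code above is untested, here is a minimal sketch of how the same idea might look once the indexes are worked out. It relies on the fact that in the original host loops s always equals ((i1 * 16 + i2) * 16 + i3) * 16 + i4, so each thread can compute its own starting offset instead of sharing a counter. The names fillPair, fillTable2D and HOST_DEVICE are hypothetical, chosen for illustration, and the kernel part only compiles under nvcc.

```cpp
// HOST_DEVICE lets the helper compile both with nvcc and with a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: fills the 256 table entries that belong to one (i1, i2) pair.
// lookup is the 16-entry Bit4CBitmapLookUp table, table is the 65536-entry output.
HOST_DEVICE inline void fillPair(const int* lookup, int* table, int i1, int i2)
{
    int a1 = lookup[i1] << 12;        // first customer
    int a2 = lookup[i2] << 8;         // second customer
    int s  = (i1 * 16 + i2) * 256;    // first output index for this pair

    for (int i3 = 0; i3 < 16; i3++) {
        int a3 = lookup[i3] << 4;     // third customer
        for (int i4 = 0; i4 < 16; i4++) {
            table[s++] = a1 | a2 | a3 | lookup[i4];  // fourth customer
        }
    }
}

#ifdef __CUDACC__
// Hypothetical kernel: i1 and i2 come from the 2D thread index, i3 and i4 stay
// as sequential loops inside fillPair. The strided loops keep the result
// correct even if the grid has fewer than 16 x 16 threads.
__global__ void fillTable2D(const int* lookup, int* table)
{
    for (int i1 = blockIdx.x * blockDim.x + threadIdx.x; i1 < 16;
         i1 += blockDim.x * gridDim.x) {
        for (int i2 = blockIdx.y * blockDim.y + threadIdx.y; i2 < 16;
             i2 += blockDim.y * gridDim.y) {
            fillPair(lookup, table, i1, i2);
        }
    }
}
#endif
```

A single block of dim3(16, 16) threads would give one thread per (i1, i2) pair; both pointers are assumed to be device memory when the kernel is launched.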

A bit more explanation: in the above code, only the loops over i1 and i2 are parallelized. This assumes N ** 2 is large enough compared to the number of cores you have on your GPU. If it isn't, all four loops need to be parallelized to get an efficient program, and the approach will then be slightly different.
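As a rough sketch of what parallelizing all four loops could look like (this is not code from the answer above), note that the output index s runs from 0 to 65535 and its four 4-bit nibbles are exactly i1, i2, i3 and i4. Each thread can therefore take one value of s and recover the loop counters with shifts and masks. The names tableEntry, fillTable and HOST_DEVICE are hypothetical, and the kernel part only compiles under nvcc.

```cpp
// HOST_DEVICE lets the helper compile both with nvcc and with a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: computes table entry s (0..65535) directly.
// lookup is the 16-entry Bit4CBitmapLookUp table.
HOST_DEVICE inline int tableEntry(const int* lookup, int s)
{
    int i1 = (s >> 12) & 0xF;  // first customer
    int i2 = (s >> 8) & 0xF;   // second customer
    int i3 = (s >> 4) & 0xF;   // third customer
    int i4 = s & 0xF;          // fourth customer

    return (lookup[i1] << 12) | (lookup[i2] << 8) | (lookup[i3] << 4) | lookup[i4];
}

#ifdef __CUDACC__
// Hypothetical kernel: one output element per thread, with a grid-stride loop
// so that any 1D launch configuration covers all n entries.
__global__ void fillTable(const int* lookup, int* table, int n)
{
    for (int s = blockIdx.x * blockDim.x + threadIdx.x; s < n;
         s += blockDim.x * gridDim.x) {
        table[s] = tableEntry(lookup, s);
    }
}
#endif
```

A launch such as fillTable<<<256, 256>>>(d_lookup, d_table, 65536) would give one thread per entry, where d_lookup and d_table are assumed to be device pointers set up beforehand with cudaMalloc and cudaMemcpy. Because no counter is shared between threads, the result does not depend on the order in which threads run.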
