Nested for loops with CUDA

I have a problem with some nested for loops that I need to convert from C/C++ to CUDA. Basically, I have four nested for loops that use the same array and do bit-shift operations.

```
#define N 65536

// ----------------------------------------------------------------------------------

int a1, a2, a3, a4, i1, i2, i3, i4;

int Bit4CBitmapLookUp[16] = {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};

int _cBitmapLookupTable[N];

int s = 0;  // index into the cBitmapLookupTable

for (i1 = 0; i1 < 16; i1++)
{
    // first customer
    a1 = Bit4CBitmapLookUp[i1] << 12;

    for (i2 = 0; i2 < 16; i2++)
    {
        // second customer
        a2 = Bit4CBitmapLookUp[i2] << 8;

        for (i3 = 0; i3 < 16; i3++)
        {
            // third customer
            a3 = Bit4CBitmapLookUp[i3] << 4;

            for (i4 = 0; i4 < 16; i4++)
            {
                // fourth customer
                a4 = Bit4CBitmapLookUp[i4];

                // now actually set the sBitmapLookupTable value
                _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;

                s++;

            } // for i4
        } // for i3
    } // for i2
} // for i1
```

This is the code I have to convert to CUDA. I tried different approaches, but every time I got the wrong output. Here is my CUDA version (an excerpt from the kernel):

```
#define N 16

//----------------------------------------------------------------------------------

// index for the GPU
int i1 = blockDim.x * blockIdx.x + threadIdx.x;
int i2 = blockDim.y * blockIdx.y + threadIdx.y;
int i3 = i1;
int i4 = i2;

for (i1 = i2 = 0; i1 < N, i2 < N; i1++, i2++)
{
    // first customer
    a1 = Bit4CBitmapLookUp_device[i1] << 12;

    // second customer
    a2 = Bit4CBitmapLookUp_device[i2] << 8;

    for (i3 = i4 = 0; i3 < N, i4 < N; i3++, i4++)
    {
        // third customer
        a3 = Bit4CBitmapLookUp_device[i3] << 4;

        // fourth customer
        a4 = Bit4CBitmapLookUp_device[i4];

        // now actually set the sBitmapLookupTable value
        _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
        s++;
    }
}
```

I am new to CUDA and still learning, and I cannot find a solution for these nested loops. Thank you in advance.

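As noted, there is an initialization problem: the `for` loops reset `i1` and `i2` to 0, which overwrites the per-thread indices computed from `blockIdx` and `threadIdx`. I would recommend rewriting your kernel like this: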

```
int i1 = blockDim.x * blockIdx.x + threadIdx.x;
int i2 = blockDim.y * blockIdx.y + threadIdx.y;
int i3;
int i4;

while (i1 < N && i2 < N) {
    a1 = ..;
    a2 = ..;
    for (i3 = i4 = 0; i3 < N, i4 < N; i3++, i4++) {
        // third customer
        a3 = Bit4CBitmapLookUp_device[i3] << 4;

        // fourth customer
        a4 = Bit4CBitmapLookUp_device[i4];

        // now actually set the sBitmapLookupTable value
        _cBitmapLookupTable[s] = a1 | a2 | a3 | a4;
        s++;
    }
    s += blockDim.x * gridDim.x * blockDim.y * gridDim.y;
    i1 += blockDim.x * gridDim.x;
    i2 += blockDim.y * gridDim.y;
}
```

I have not tested it, so I cannot guarantee that the indexes are correct. I'll leave that to you.

A bit more explanation: in the above code, only the loops over i1 and i2 are parallelized. This assumes N ** 2 is large enough compared to the number of cores you have on your GPU. If it isn't, all four loops need to be parallelized to get an efficient program, and the approach will then be slightly different.
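For the case where all four loops are parallelized, here is one possible sketch. It is not the code from the question or from the answer above. It relies on the fact that, in the original nested loops, `s` is incremented once per innermost iteration, so `s = (i1 << 12) | (i2 << 8) | (i3 << 4) | i4` and each table entry can be computed from `s` alone. The names `table_entry`, `fill_table`, `fill_table_host`, `TABLE_SIZE`, and `HOST_DEVICE` are hypothetical and introduced only for illustration. The kernel is guarded by `__CUDACC__` so the same file also builds as plain C++ for checking the index arithmetic on the host.

```cpp
#include <vector>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE  // plain C++ build: no CUDA qualifiers
#endif

constexpr int TABLE_SIZE = 65536;  // 16 * 16 * 16 * 16 entries

// Hypothetical helper (not from the original post): computes the entry for
// flat index s, assuming s = (i1 << 12) | (i2 << 8) | (i3 << 4) | i4.
HOST_DEVICE inline int table_entry(int s)
{
    const int lut[16] = {0, 1, 3, 3, 7, 7, 7, 7, 15, 15, 15, 15, 15, 15, 15, 15};
    int a1 = lut[(s >> 12) & 15] << 12;  // first customer  (i1)
    int a2 = lut[(s >> 8) & 15] << 8;    // second customer (i2)
    int a3 = lut[(s >> 4) & 15] << 4;    // third customer  (i3)
    int a4 = lut[s & 15];                // fourth customer (i4)
    return a1 | a2 | a3 | a4;
}

#ifdef __CUDACC__
// Each thread fills the entries s, s + stride, s + 2 * stride, ...
// (a grid-stride loop), so any 1D launch configuration covers the table.
__global__ void fill_table(int *table)
{
    for (int s = blockIdx.x * blockDim.x + threadIdx.x; s < TABLE_SIZE;
         s += blockDim.x * gridDim.x)
        table[s] = table_entry(s);
}
#endif

// Host-side reference: the same flat-index computation, run sequentially.
std::vector<int> fill_table_host()
{
    std::vector<int> table(TABLE_SIZE);
    for (int s = 0; s < TABLE_SIZE; ++s)
        table[s] = table_entry(s);
    return table;
}
```

Assuming `d_table` is a device buffer of `TABLE_SIZE` ints allocated with `cudaMalloc`, a launch such as `fill_table<<<256, 256>>>(d_table);` would give one thread per entry. Because no thread shares a running counter like `s++`, the result does not depend on the order in which threads execute.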
