# Cumulative sum in two dimensions over an array in a nested loop - a CUDA implementation?

I was thinking about how to accomplish this operation in CUDA using abbreviations, but I am a bit confused about how to do it. The C code is below. The important part to keep in mind is that the precalculatedValue variable depends on the **loop iterators** . Also, the ngo variable is not unique for every value of m ... eg. m = 0,1,2 can have ngo = 1, while m = 4,5,6,7,8 can have ngo = 2, etc. I've included the sizes of the loop iterators in case that helps provide better implementation suggestions.

```
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
int Nobs = 60480;
int NgS = 1859;
int NgO = 900;
// ngo goes from [1,900]
// rInd is an initialized (and filled earlier) as:
// rInd = new long int [Nobs];
for (m=0; m<Nobs; m++) {
ngo=rInd[m]-1;
for (n=0; n<NgS; n++) {
Aggregation[idx(n,ngo,NgO)] += precalculatedValue;
}
}
```

In the previous case, when precalculatedValue was only a function of an internal loop variable, I stored the values ββin unique array indices and added them with Thrust after the fact. However, this case puzzled me: m values ββare not uniquely mapped to ngo values. Thus, I see no way to make this code efficient (or even workable) for using the shortcut. Any ideas are appreciated.

source to share

Just a blow ...

I suspect transposing your loops might help.

```
for (n=0; n<NgS; n++) {
for (m=0; m<Nobs; m++) {
ngo=rInd[m]-1;
Aggregation[idx(n,ngo,NgO)] += precalculatedValue(m,n);
}
}
```

The reason I did this is because `idx`

it changes faster with `ngo`

(function `m`

) than with `n`

, so doing the inner loop improves consistency. Note. I also made precalculatedValue function (m, n) because you said it - it makes the pseudocode clearer.

Then you can start by leaving the outer loop on the host and making the core for the inner loop (64,480-way parallelism is enough to fill most modern GPUs).

In the inner loop, just start by using atomAdd () to handle collisions. If they are infrequent, the performance on Fermi GPUs shouldn't be too bad. In any case, you will have bandwidth, since the arithmetic intensity of this calculation is low. So when it works, measure the bandwidth you are reaching and compare to the peak for your GPU. If you go, think about further optimization (perhaps parallelizing the outer loop - one iteration per thread block and doing the inner loop using some optimization on shared memory and thread sharing).

Key: Start simple, measure performance, then decide how to optimize.

Note that this calculation is very similar to the histogram calculation, which has similar issues, so you might want to google for GPU histograms to see how they were implemented.

One idea is to sort ( `rInd[m]`

, `m`

) pairs using `thrust::sort_by_key()`

and then (since the duplicates `rInd`

will be grouped together) you can iterate over them and do your contractions without collisions. (This is one way to make histograms.) You can even do it with `thrust::reduce_by_key()`

.

source to share