Summing over one dimension of a 3D array using shared memory

I need to compute A[x][y] = sum over z = 0 to n-1 of (B[x][y][z] + C[x][y][z]), where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].

The values are mapped to memory with something like:

index = 0;
for (z = 0; z<n; ++z)
    for(y = 0; y<width; ++y)
        for(x = 0; x<height; ++x) {
            matrix[index] = value;
            ++index;
        }


I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races, I use atomicAdd, something like this:

A piece of code in global memory:

dim3 block (n, 1, 1);
dim3 grid (height, width, 1);



atomicAdd( &(A[blockIdx.x + blockIdx.y*gridDim.y]), 
           B[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y] 
           + C[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y] );


I would like to use shared memory to calculate the sum and then copy that result to global memory.

I'm not sure how to write the shared memory part. Only one number (the sum) will be stored in the shared memory of each block. How do I copy this number to the desired location in matrix A in global memory?



1 answer

You probably don't need shared memory or atomic memory operations to perform the summation you are asking about. If I have understood this correctly, your data is in column-major order, so the logical approach is to have one thread per entry in the output matrix, with each thread traversing the z-axis of the input matrices and summing as it goes. The kernel for this might look something like this:

__global__ void kernel(float *A, const float *B, const float *C, 
        const int width, const int height, const int n)
{
    // One thread per output element: tidx runs along height, tidy along width
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int tidy = threadIdx.y + blockDim.y * blockIdx.y;

    if ( (tidx < height) && (tidy < width) ) {
        int stride = width * height;      // distance between consecutive z-planes
        int ipos = tidx + tidy * height;  // column-major offset within a plane

        float * oval = A + ipos;
        float sum = 0.f;
        for(int z=0; z<n; z++, ipos+=stride) {
            sum += B[ipos] + C[ipos];
        }
        *oval = sum;
    }
}
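
For reference, a host-side launch for a kernel of this shape might look like the following. This is a minimal sketch and not part of the original answer: the names sum_over_z and run_sum_over_z, the 32 x 8 block size, and the omission of CUDA error checking are all assumptions made for illustration. The x index is mapped to the height dimension because that is the fastest-varying index in the layout described in the question.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// One thread per (x, y) output element; each thread walks the z-axis.
// The kernel name is hypothetical; the logic mirrors the kernel above.
__global__ void sum_over_z(float *A, const float *B, const float *C,
                           int width, int height, int n)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    if (x < height && y < width) {
        int stride = width * height;
        int ipos = x + y * height;
        float sum = 0.f;
        for (int z = 0; z < n; z++)
            sum += B[ipos + z * stride] + C[ipos + z * stride];
        A[ipos] = sum;
    }
}

// Hypothetical host helper: copies B and C to the device, launches the
// kernel, and returns A. Error checking is omitted for brevity.
std::vector<float> run_sum_over_z(const std::vector<float> &B,
                                  const std::vector<float> &C,
                                  int width, int height, int n)
{
    size_t plane = (size_t)width * height;
    assert(B.size() == plane * n && C.size() == plane * n);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, plane * sizeof(float));
    cudaMalloc((void **)&dB, plane * n * sizeof(float));
    cudaMalloc((void **)&dC, plane * n * sizeof(float));
    cudaMemcpy(dB, B.data(), plane * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), plane * n * sizeof(float), cudaMemcpyHostToDevice);

    // Assumed block shape; round the grid up so every (x, y) is covered.
    dim3 block(32, 8);
    dim3 grid((height + block.x - 1) / block.x,
              (width + block.y - 1) / block.y);
    sum_over_z<<<grid, block>>>(dA, dB, dC, width, height, n);

    std::vector<float> A(plane);
    cudaMemcpy(A.data(), dA, plane * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return A;
}
```

Because the grid is rounded up, some threads fall outside the matrix, which is why the bounds check inside the kernel is needed.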


This approach should be optimal for column-major data when width * height >= n. There are no performance benefits to using shared memory, and there is no need to use atomic memory operations. If your problem has width * height << n, it might make sense to try a block-parallel reduction for the summation. But you didn't indicate what the typical dimensions of the problem are. Leave a comment if your problem is more like the latter and I can add a reduction-based kernel to the answer.
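
For the width * height << n case, a block-parallel reduction could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not code from the original answer: the names sum_over_z_reduce and run_sum_over_z_reduce and the fixed block size THREADS = 256 are hypothetical, and error checking is omitted. Each block handles one (x, y) output, its threads accumulate partial sums over z, a shared-memory tree reduction combines them, and thread 0 writes the single result to A, so no atomics are needed.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int THREADS = 256;  // assumed block size; must be a power of two

// One block per (x, y) output element (launch with blockDim.x == THREADS).
__global__ void sum_over_z_reduce(float *A, const float *B, const float *C,
                                  int width, int height, int n)
{
    __shared__ float partial[THREADS];

    int x = blockIdx.x;               // 0 .. height-1
    int y = blockIdx.y;               // 0 .. width-1
    int stride = width * height;      // distance between z-planes
    int ipos = x + y * height;        // column-major offset within a plane

    // Each thread sums every blockDim.x-th element along z.
    float sum = 0.f;
    for (int z = threadIdx.x; z < n; z += blockDim.x)
        sum += B[ipos + z * stride] + C[ipos + z * stride];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // A single thread copies the block's result to global memory.
    if (threadIdx.x == 0)
        A[ipos] = partial[0];
}

// Hypothetical host helper; grid.y = width must not exceed 65535.
std::vector<float> run_sum_over_z_reduce(const std::vector<float> &B,
                                         const std::vector<float> &C,
                                         int width, int height, int n)
{
    size_t plane = (size_t)width * height;
    assert(B.size() == plane * n && C.size() == plane * n);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, plane * sizeof(float));
    cudaMalloc((void **)&dB, plane * n * sizeof(float));
    cudaMalloc((void **)&dC, plane * n * sizeof(float));
    cudaMemcpy(dB, B.data(), plane * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), plane * n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(THREADS, 1, 1);
    dim3 grid(height, width, 1);
    sum_over_z_reduce<<<grid, block>>>(dA, dB, dC, width, height, n);

    std::vector<float> A(plane);
    cudaMemcpy(A.data(), dA, plane * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return A;
}
```

Note that with this memory layout the threads of a block read elements that are width * height apart, so the loads are not coalesced. That is one reason this variant is only likely to pay off when width * height is too small to keep the device busy with one thread per output.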


