# Theoretical GPU memory bandwidth

While profiling algorithms running on the GPU, I get the impression that I'm hitting a memory bandwidth limit.

I have a few complex kernels (sparse matrix multiplication, reduction, etc.) and some very simple ones. All the significant ones seem to top out at ~79 GB/s when I compute the total data read and written by each of them, regardless of their complexity, while the theoretical bandwidth of the GPU (NVIDIA GTX 960) is 112 GB/s.

The dataset is very large (vectors of ~10,000,000 floats), so I get good measurements/statistics from `clGetEventProfilingInfo` between `COMMAND_START` and `COMMAND_END`. All data remains in GPU memory during algorithm execution, so there is virtually no host/device memory transfer (nor would it be measured by the profiling counters).

Even for a very simple kernel (see below) that computes `x=x+alpha*b`, where x and b are huge vectors of ~10,000,000 elements, I don't get close to the theoretical bandwidth (112 GB/s), but only to ~70% of the maximum (~79 GB/s).

```
__kernel void add_vectors(int N,__global float *x,__global float const *b,float factor)
{
    // One work-item per element; the guard skips padding work-items past N.
    int gid = get_global_id(0);
    if(gid < N)
        x[gid]+=b[gid]*factor;
}
```

I calculate the data transferred by one run of this kernel as N * (2 + 1) * 4 bytes:

- N - vector size = ~10,000,000
- 2 loads and 1 store per vector element
- 4 for sizeof(float)
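To make the arithmetic above concrete, here is a minimal sketch in host-side C. The function names `add_vectors_bytes` and `effective_bandwidth_gbs` are made up for illustration; the only assumptions are that the elapsed time is the difference between the `COMMAND_END` and `COMMAND_START` timestamps, which OpenCL reports in nanoseconds, and that 1 GB means 10^9 bytes.

```c
#include <stddef.h>

/* Bytes moved by one run of add_vectors:
   per element, 2 loads (x and b) and 1 store (x), 4 bytes each. */
double add_vectors_bytes(size_t n)
{
    return (double)n * (2 + 1) * sizeof(float);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes).
   elapsed_ns is assumed to be COMMAND_END - COMMAND_START in nanoseconds;
   bytes per nanosecond is numerically the same as GB/s. */
double effective_bandwidth_gbs(double bytes, double elapsed_ns)
{
    return bytes / elapsed_ns;
}
```

For N = 10,000,000 this gives 120,000,000 bytes per run, so a run that takes about 1.5 ms corresponds to roughly 80 GB/s, which is in the range I am seeing.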

I was expecting such a simple kernel to get close to the bandwidth limit. What am I missing?

PS: I get similar numbers from a CUDA implementation of the same algorithm.


I think a more realistic way to assess whether you've reached the peak bandwidth is to compare what you get with a simple device-to-device (D2D) copy.

For example, your kernel reads x and b once and writes x once (three accesses per element), whereas a copy from b to x reads once and writes once (two accesses per element). The kernel's execution time should therefore be at most about 1.5 times the time of a single copy from b to x. If you find that the time is well over 1.5x, you probably have room for improvement. In this kernel the work is so simple that the overhead (starting and ending the function, calculating the index, etc.) may limit performance. If that is the problem, increasing the work per thread with a grid-stride loop may help:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
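As a rough illustration of the grid-stride idea, here is a sketch in plain host-side C that emulates the loop structure rather than running on a device. The names `add_vectors_work_item` and `add_vectors_emulated` are hypothetical; in a real OpenCL kernel, `gid` and `global_size` would come from `get_global_id(0)` and `get_global_size(0)`, and the work-items would run in parallel rather than in a host loop.

```c
#include <stddef.h>

/* Work done by ONE work-item in a grid-stride version of add_vectors.
   Instead of handling a single element, each work-item starts at its own
   index and advances by the total number of work-items until it passes n. */
void add_vectors_work_item(size_t gid, size_t global_size, size_t n,
                           float *x, const float *b, float factor)
{
    for (size_t i = gid; i < n; i += global_size)
        x[i] += b[i] * factor;
}

/* Host-side stand-in for launching global_size work-items
   (sequential here; a GPU would run them in parallel). */
void add_vectors_emulated(size_t global_size, size_t n,
                          float *x, const float *b, float factor)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        add_vectors_work_item(gid, global_size, n, x, b, factor);
}
```

With this structure the number of work-items no longer has to match N, so you can launch fewer of them and let each one process several elements. Whether that actually helps on your hardware is something to measure against the D2D copy time.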

As for the theoretical bandwidth, you should at least take the ECC overhead into account, if ECC is enabled.
