# How close to theoretical GPU memory bandwidth can you get?

Suppose you have a memory-bound GPU kernel: how close can you get to the GPU's stated theoretical bandwidth? Even in Mark Harris's "Optimizing Parallel Reduction in CUDA", he only reaches 63 GB/s, which is about 73% of the peak bandwidth of his test GPU (a G80, which he quotes at 86.4 GB/s). Could Harris have optimized his kernel further? Are there other techniques, perhaps not available at the time of that presentation — for example, warp-shuffle instructions like __shfl? Why doesn't the kernel achieve higher throughput?
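As an illustration of what a shuffle-based approach looks like (a sketch, not Harris's code; `__shfl_down_sync` is the modern synchronized form of the original `__shfl_down`), a warp-level sum reduction can exchange values between lanes without touching shared memory at all:

```
// Sketch: warp-level sum reduction using shuffle intrinsics (sm_30+).
// Each iteration halves the number of active summands; after the loop,
// lane 0 of the warp holds the sum of all 32 lanes' inputs.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```

Shuffles reduce shared-memory traffic and synchronization, but since the reduction is memory-bound, the main benefit is freeing up resources rather than raising the DRAM bandwidth ceiling.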

This article claims that, using a Tesla C2050 as a test case:

"Bandwidth-bound kernels sustain about 75% of the 144 GB/s peak memory bandwidth, compared to a practical peak of about 85% once overheads such as DRAM refresh are taken into account."

Is this right? The authors do not cite a source for the "85% practical bandwidth limit", and I could not find anything else that mentions it. If it is right, what other factors (assuming a very well optimized kernel) would prevent you from reaching the theoretical peak throughput?


Related topic: GPU memory bandwidth, theoretical vs. practical

Running a minimal kernel that only writes data to a large 1D vector:

```
__global__ void kernel( int *out ) {
    // Streaming write only, no global-memory reads.
    // Assumes the grid exactly covers the output vector.
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    out[idx] = idx % 4;
}
```

on a GeForce GT 710, I get about 0.9× the theoretical bandwidth:

practical: 12.9 GB/s

theoretical (spec): 14.4 GB/s
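A sketch of how such a figure can be measured, assuming the kernel above, a hypothetical vector size `N`, and cudaEvent timing:

```
// Sketch: measure the effective write bandwidth of the kernel above.
// Assumes N is a multiple of the block size (256 here).
const size_t N = 1 << 26;  // number of ints written
int *d_out;
cudaMalloc(&d_out, N * sizeof(int));

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<N / 256, 256>>>(d_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// One 4-byte store per thread: GB/s = bytes / (ms * 1e6)
double gbps = (N * sizeof(int)) / (ms * 1.0e6);
```

For a more stable number, run the kernel several times and average, since a single launch includes one-off overheads.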

One thing that can contribute to the gap is caching: on these GPUs, global stores pass through the L2 cache before reaching DRAM.
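One hedged experiment to probe this (assuming a device and toolkit where the `__stcs` cache-hint intrinsic is available) is to issue the store with a streaming hint, marking the data as unlikely to be accessed again:

```
// Sketch: same minimal write kernel, but with a streaming store hint.
// __stcs stores with an evict-first ("cache streaming") policy,
// reducing cache pollution for write-once data.
__global__ void kernel_streaming( int *out ) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    __stcs(&out[idx], idx % 4);
}
```

Whether this helps is hardware-dependent; comparing its measured bandwidth against the plain kernel's would show how much caching actually contributes on a given GPU.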
