Increase GPU Usage When Using Mathematica CUDADot?

I recently started using Mathematica's CUDALink with a GT430 and am using CUDADot to multiply a 150000x1038 matrix (encs) by a 1038x1 matrix (probe). Both encs and probe are registered with the memory manager:

mmEncs = CUDAMemoryLoad[encs];
mmProbe = CUDAMemoryLoad[probe];


I figured that this dot product would be about as demanding an operation as the GT430 could handle, so I tested the following:

For[i = 0, i < 10, i++,
 CUDADot[mmEncs, mmProbe];
]


While it is running, I use MSI's "Afterburner" utility to monitor GPU usage. The following screenshot shows the result:

[Screenshot: MSI Afterburner GPU usage graph recorded during the CUDADot loop]

There is a clear peak for each CUDADot call, and overall the screenshot suggests that I am using less than a quarter of the GPU's capacity. Two questions:

Q1: Why do the peaks top out at about 50%? That seems low.

Q2: Why are there such significant periods of inactivity between peaks?

Thanks in advance for any hints! I don't have a working theory for Q1, but for Q2 I wonder whether it could be caused by unintended memory transfers between the host and the device?
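
One way I thought of to check that (just a sketch; I am assuming CUDADot returns a CUDAMemory handle when both arguments are CUDAMemory, and that CUDAMemoryInformation accepts that handle):

res = CUDADot[mmEncs, mmProbe];
Head[res] (* expect CUDAMemory if the result was not copied back to the host *)
CUDAMemoryInformation[res] (* prints details about the registered buffer *)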

Additional information since posting: CUDAInformation[] reports "Core Count → 64", but the NVIDIA Control Panel reports "CUDA Cores: 96". Could CUDALink be underutilizing the GT430 because it is working from the false assumption that the card has 64 cores?
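
For what it is worth, the per-device properties can also be queried individually (a sketch; I am assuming device index 1 and that "Core Count" and "Name" are valid property names for CUDAInformation):

CUDAInformation[1, "Core Count"] (* 64 here, versus the 96 the Control Panel reports *)
CUDAInformation[1, "Name"] (* confirm which device is being used *)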





1 answer


I will preface this answer by noting that I have no idea what "MSI Afterburner" actually measures, or how frequently it samples whatever it measures, and I suspect you don't either. That means we don't know what either the x-axis or the y-axis of your screenshot really represents, which makes any quantification of performance nearly impossible.

1. Why do the peaks top out at about 50%? That seems low.

I don't believe you can say it "seems low" when you don't know what it is really measuring. If, for example, it is measuring instruction throughput, it could be that the Mathematica dot kernel is memory-bandwidth limited on your device. That would mean memory bandwidth, not SM instruction throughput, is the bottleneck for the code's throughput; if the tool were plotting memory bandwidth utilization instead, you would see 100%. I would expect a gemv operation like this to be memory-bandwidth bound, so the result is probably not too surprising.
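
As a rough back-of-the-envelope check (my own numbers, assuming single-precision 4-byte reals and counting only the dominant matrix read), the arithmetic intensity of this gemv is only about half a FLOP per byte:

m = 150000; n = 1038;
flops = 2. m n (* one multiply and one add per matrix element *)
bytes = 4. m n (* each single-precision matrix element is read once *)
flops/bytes (* roughly 0.5 FLOP per byte, so memory traffic dominates *)

At that intensity the kernel spends essentially all of its time waiting on memory, so a throughput-style counter hovering around 50% would not be surprising.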



2. Why are there such significant periods of inactivity between peaks?

The CUDA API has both device-side and host-side latency. On WDDM platforms (that is, Windows Vista, 7, 8 and the server versions derived from them), this host-side latency is quite high, and the CUDA driver batches work to help amortize it. That batching can produce "gaps" or "pauses" in GPU utilization, which I believe is what you are seeing here. To get around this limitation, NVIDIA provides a dedicated compute driver (TCC) for Tesla boards on the Windows platform.

The best way to gauge how well this operation is performing would be to time the loop yourself, compute the average time per call, work out the operation count (the dot product has a well-defined lower bound you can calculate from the matrix and vector dimensions), and from that compute a FLOP/s figure. You can compare that with the specifications of your GPU to see how well or badly it is doing.
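
A minimal sketch of that measurement (illustrative only; it reuses the registered CUDAMemory handles from the question, counts 2·m·n floating-point operations per CUDADot call, and assumes the call blocks until the device work completes):

m = 150000; n = 1038; calls = 10;
t = First@AbsoluteTiming[Do[CUDADot[mmEncs, mmProbe], {calls}]];
gflops = (calls*2.*m*n)/(t*10.^9) (* achieved GFLOP/s, to compare against the GT430's spec sheet *)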









