Calculating Maximum Parallel Work Groups
I was wondering if there is a standard way to programmatically determine the maximum number of workgroups that can run in parallel on a GPU.
For example, on an NVIDIA card with 5 compute units (SMs), where no more than 8 workgroups (blocks) can be resident per compute unit, the maximum number of workgroups that can run simultaneously is 5 × 8 = 40.
Since I can find the number of compute units with clGetDeviceInfo, I only need the maximum number of workgroups that can run on a single compute unit.
Thanks!
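To make the arithmetic in the question concrete, here is a minimal sketch in C. The function name and the `groups_per_cu` parameter are hypothetical; the sketch assumes the compute unit count has already been read with clGetDeviceInfo (CL_DEVICE_MAX_COMPUTE_UNITS) and that the per-compute-unit limit comes from vendor documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on workgroups resident on the device at the same time.
 * compute_units: value reported for CL_DEVICE_MAX_COMPUTE_UNITS.
 * groups_per_cu: per-compute-unit limit (assumption: taken from the
 *                vendor's documentation, e.g. 8 in the example above). */
size_t max_resident_groups(size_t compute_units, size_t groups_per_cu)
{
    assert(compute_units > 0);
    return compute_units * groups_per_cu;
}
```

With the numbers from the question, `max_resident_groups(5, 8)` gives 40.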
The maximum number of workgroups per compute unit / SM is limited by hardware resources. Take an Intel Gen8 GPU as an example. It has 16 barrier registers per subslice, so no more than 16 workgroups can run simultaneously on a subslice.
The amount of shared local memory available to each subslice (64 KB) is a second limit. If, for example, a workgroup requires 32 KB of shared local memory, only two such workgroups can run concurrently, regardless of the workgroup size.
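The two limits above combine as a minimum. The following is a rough sketch in C, not a vendor formula: the function and parameter names are hypothetical, and it only models the two resources mentioned here (barrier registers and shared local memory), ignoring any other limit the hardware may impose.

```c
#include <assert.h>
#include <stddef.h>

/* Estimate how many workgroups fit on one subslice at once.
 * barrier_limit: number of barrier registers (16 in the Gen8 example).
 * slm_bytes:     shared local memory per subslice (64 KB in the example).
 * slm_per_group: shared local memory one workgroup needs, in bytes. */
size_t groups_per_subslice(size_t barrier_limit, size_t slm_bytes,
                           size_t slm_per_group)
{
    size_t limit = barrier_limit;

    /* A kernel that uses no local memory is bounded only by barriers. */
    if (slm_per_group > 0) {
        size_t by_memory = slm_bytes / slm_per_group;
        if (by_memory < limit)
            limit = by_memory;
    }
    return limit;
}
```

For the case described above, `groups_per_subslice(16, 64 * 1024, 32 * 1024)` yields 2, while a workgroup that needs only 1 KB stays capped at 16 by the barrier registers.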
I usually use the number of compute units as the number of workgroups. I prefer to increase the workgroup size to saturate the hardware rather than forcing the GPU to schedule multiple workgroups concurrently.
I don't know of a way to determine the maximum number of workgroups without consulting the vendor's specifications.
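As an illustration of the "one workgroup per compute unit" approach, the global work size can be derived as below. This is only a sketch: the function name is hypothetical, and it assumes `local_size` has been chosen within the device's workgroup size limit and that the kernel loops over its input when there are more items than work-items.

```c
#include <assert.h>
#include <stddef.h>

/* Global work size that yields exactly one workgroup per compute unit.
 * compute_units: value reported for CL_DEVICE_MAX_COMPUTE_UNITS.
 * local_size:    chosen workgroup size (work-items per group). */
size_t global_size_one_group_per_cu(size_t compute_units, size_t local_size)
{
    assert(local_size > 0);
    return compute_units * local_size;
}
```

The result would be passed as the global work size, with `local_size` as the local work size, when enqueueing a one-dimensional kernel.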