Why does a CUDA GPU need only 8 active warps?
As stated in this work:
If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, on average three instructions can be executed before a hazard occurs) and the instruction pipeline is 22 stages deep, as few as eight active warps (22/3) may be sufficient to completely hide instruction latencies and achieve maximum arithmetic throughput.
I don't understand why this is enough.
If the scheduler can issue an instruction from the same warp on every cycle for 22 consecutive cycles, it has no reason to switch to another warp, and that one warp is enough to fill the pipeline. This would correspond to an ILP of at least 22.
But Real-World Code™ never exhibits such a high level of ILP: some instructions depend, for example, on the results of previous instructions or on outstanding memory requests. When the scheduler can no longer issue independent instructions from a warp, that warp stalls. The scheduler then picks another warp that is ready and issues as many instructions from it as it can until that warp stalls too, and so on.
So if warp #1 issues 3 instructions and then stalls, the scheduler picks warp #2, issues 3 instructions, and so on. By the time the scheduler reaches warp #8, there are already 21 instructions in the pipeline from the 7 stalled warps (7 × 3 = 21). Issuing a single instruction from warp #8 is then enough to fill the pipeline completely. By the time the pipeline starts to drain, warp #1 is ready again. Hence 8 warps with an ILP of 3 are needed to fill a 22-stage pipeline.
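The counting argument above can be checked with a small Python sketch. It is only a toy model, not a description of how real hardware schedules warps: the function names `min_warps` and `idle_cycles` are made up for illustration, and the model assumes each warp issues exactly `ilp` instructions back to back and becomes ready again `pipeline_depth` cycles after the first instruction of that burst was issued.

```python
import math


def min_warps(pipeline_depth, ilp):
    """Smallest number of warps whose independent instructions cover the pipeline."""
    return math.ceil(pipeline_depth / ilp)


def idle_cycles(num_warps, pipeline_depth=22, ilp=3, cycles=220):
    """Count cycles in which no instruction is issued (toy model).

    Assumption: a warp issues `ilp` instructions in a row, then stalls until
    `pipeline_depth` cycles after the first instruction of that burst.
    """
    ready_at = [0] * num_warps  # cycle at which each warp may start a new burst
    current = None              # warp that is currently issuing, if any
    remaining = 0               # instructions left in the current burst
    idle = 0
    for cycle in range(cycles):
        if current is None:
            # Pick the first warp that is ready to start a new burst.
            for w in range(num_warps):
                if ready_at[w] <= cycle:
                    current = w
                    remaining = ilp
                    ready_at[w] = cycle + pipeline_depth
                    break
        if current is None:
            idle += 1           # every warp is stalled: a pipeline bubble
            continue
        remaining -= 1          # issue one instruction from the current warp
        if remaining == 0:
            current = None      # burst finished, this warp now stalls
    return idle


if __name__ == "__main__":
    print(min_warps(22, 3))
    print(idle_cycles(7), idle_cycles(8))
```

Under these assumptions, `min_warps(22, 3)` gives 8, `idle_cycles(8)` reports no bubbles, and `idle_cycles(7)` reports a non-zero number of idle cycles, because seven warps put only 21 instructions in flight before warp #1 is ready again.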