What is Static vs Dynamic Scheduling on GPUs?

The GTX 4xx and 5xx series (Fermi) used dynamic scheduling, while the GTX 6xx series (Kepler) switched to static scheduling.

  • What are static and dynamic scheduling in the context of GPUs?
  • How does the static-versus-dynamic design choice affect real-world compute performance?
  • Is there anything that can be done in the code to optimize the static or dynamic scheduling algorithm?

1 answer


I am assuming you are referring to static/dynamic instruction scheduling in hardware.

Dynamic instruction scheduling means that the processor can reorder individual instructions at runtime. This is implemented by dedicated hardware that tries to pick the best issue order for the instructions currently in the pipeline. On the GPUs you mention, this refers to reordering instructions within each individual warp.
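To make the idea concrete, here is a toy model of a dynamic scheduler (plain Python, not actual GPU hardware; the instruction format and latencies are invented for illustration). Each cycle, the "scheduler" scans the instruction window and issues any instruction whose source registers are ready, regardless of program order:

```python
# Toy model of dynamic (hardware) scheduling: each cycle, scan the window
# and issue any instruction whose source registers hold valid data.
# Instructions are (name, dest_register, source_registers, latency).

def dynamic_schedule(window):
    ready = {}              # register -> cycle its value becomes available
    pending = list(window)
    order = []              # issue order chosen at "runtime"
    cycle = 0
    while pending:
        # Issue the first instruction whose sources are all available now.
        for instr in pending:
            name, dest, srcs, lat = instr
            if all(ready.get(r, 0) <= cycle for r in srcs):
                order.append(name)
                ready[dest] = cycle + lat
                pending.remove(instr)
                break
        cycle += 1
    return order

window = [
    ("load",  "r1", [],     10),   # long-latency load
    ("add_a", "r2", ["r1"],  2),   # depends on the load result
    ("add_b", "r3", [],      2),   # independent of the load
]
print(dynamic_schedule(window))    # -> ['load', 'add_b', 'add_a']
```

The scheduler issues `add_b` ahead of `add_a`, even though it comes later in program order, because `add_a` must wait for the load. This dependency tracking is what the hardware scoreboard does in a dynamically scheduled design.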

The reason for the transition from a dynamic to a static scheduler is described in the GK110 architecture whitepaper as follows:



We also looked for opportunities to optimize power in the SMX warp scheduler logic. For example, both the Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

  • Register scoreboarding for long-latency operations (texture and load)

  • Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)

  • Thread-block-level scheduling (e.g., the GigaThread engine)

However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker analyzes register usage across multiple fully decoded warp instructions against the scoreboard to determine which are eligible to issue.

For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and to provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the pre-determined latency information and uses it to mask out warps from eligibility at the inter-warp scheduling stage.

So they are essentially making a chip-design trade-off: a simpler scheduler in exchange for better power efficiency. The scheduling work removed from the hardware is taken up by the compiler, which can predict the best ordering, at least for the math pipeline.
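The key insight in the quoted passage, that fixed math-pipeline latencies let the compiler compute stall cycles ahead of time, can be sketched as a toy model (plain Python, not actual hardware; the instruction format and latencies are invented for illustration):

```python
# Toy model: static scheduling with known, fixed instruction latencies.
# Each instruction is (dest_register, source_registers, latency_in_cycles).
# Because latencies are deterministic, a "compiler" can compute, up front,
# the cycle at which each instruction may issue -- no runtime scoreboard needed.

def static_schedule(program):
    ready = {}          # register -> cycle its value becomes available
    cycle = 0
    issue_cycles = []   # pre-computed issue cycle for each instruction, in order
    for dest, srcs, latency in program:
        # Stall until every source register holds valid data.
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        issue_cycles.append(cycle)
        ready[dest] = cycle + latency
        cycle += 1      # one instruction issued per cycle
    return issue_cycles

program = [
    ("r1", [], 10),        # load-like op, 10-cycle latency
    ("r2", ["r1"], 2),     # depends on r1: stalls until cycle 10
    ("r3", [], 2),         # independent: the compiler could hoist it earlier
]
print(static_schedule(program))   # -> [0, 10, 11]
```

Because the whole schedule is computable at compile time, the compiler can also see that moving the independent third instruction before the dependent one would fill a stall slot. That reordering is exactly the work Kepler delegates from hardware to the compiler.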

Regarding your last question (what can be done in code to help a static or dynamic scheduling algorithm), my personal recommendation is not to use any inline assembler and to simply let the compiler/scheduler do its job.
