Why predict a branch and not just execute both paths in parallel?

My understanding is that in modern processors, a branch misprediction is one of the main sources of slowdown. So why do CPU designers pick one branch to execute speculatively instead of executing both paths and then discarding one once they know exactly which branch was taken?

I realize this could only go 2 or 3 branches deep within a short window of instructions before the number of parallel paths gets ridiculously large, so at some point you would still need branch prediction, since you will certainly run into longer chains of branches. But couldn't a few levels of this still make sense? It seems to me that it would speed things up considerably, at the cost of some tricky extra complexity.

Even going just one branch deep would roughly halve the time wasted on wrong branches, right?

Or maybe it's already done like this to some extent? Branches usually only split two ways once compiled, right?



1 answer


You are right to fear exponentially filling up the machine, but you underestimate how fast that happens. A common rule of thumb is that roughly 20% of dynamic instructions are branches, i.e. about one branch per 5 instructions. Most CPUs today have a deep out-of-order core that fetches and executes hundreds of instructions ahead. Take Intel's Haswell, for example: it has 192 ROB entries, which means you could hold at most 4 levels of branches (at that point you would have 16 parallel "fronts" and 31 basic blocks in total, counting the single root block where the first fork occurs; with each block holding ~5 instructions you have nearly filled your ROB, and one more level would exceed it). At that point you would have advanced only to an effective depth of ~20 instructions, rendering any further level of parallelism useless.
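A back-of-the-envelope sketch of that arithmetic, assuming only the figures quoted above (one branch per 5 instructions, a 192-entry ROB); the function and variable names are purely illustrative:

```python
# Rough arithmetic for eager (execute-both-paths) speculation.
# Assumes one branch per 5 instructions and a 192-entry ROB (Haswell),
# as stated above; names and structure here are illustrative only.

BLOCK_LEN = 5      # average instructions per basic block
ROB_SIZE = 192     # reorder-buffer entries

def eager_tree_cost(depth: int) -> tuple[int, int]:
    """Return (basic blocks, instructions) needed to fork both ways
    at every branch down to `depth` levels."""
    blocks = 2 ** (depth + 1) - 1          # full binary tree of blocks
    return blocks, blocks * BLOCK_LEN

for depth in range(1, 6):
    blocks, instrs = eager_tree_cost(depth)
    fits = "fits" if instrs <= ROB_SIZE else "exceeds ROB"
    print(f"depth {depth}: {blocks:2d} blocks, {instrs:3d} instructions ({fits})")

# depth 4 -> 31 blocks, 155 instructions: the ROB is nearly full,
# yet only ~20-25 useful instructions lie along the path actually taken.
```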

If you want to diverge over three levels of branches, that means eight concurrent contexts, each with only 24 ROB entries to run ahead in. And even if you ignore the overhead of rolling back 7/8 of your work, you would need to duplicate all the architectural state hardware (such as the registers, of which you have dozens), and to split the other resources into 8 parts just as you did with the ROB. That is without counting the memory-ordering bookkeeping, which would have to manage complicated versioning, forwarding, coherency, and so on. A rough sketch of how thin the resources get is shown below.
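Only the 192-entry ROB figure comes from the text above; the rest is illustrative arithmetic showing how each extra level of forking shrinks the per-path window and the fraction of work that can possibly survive:

```python
# Resource split for n-way eager execution (illustrative model; the
# 192-entry ROB is the Haswell figure cited above).

ROB_SIZE = 192

for levels in range(1, 5):
    contexts = 2 ** levels                 # paths kept alive
    per_context = ROB_SIZE // contexts     # ROB entries each path can use
    useful = 1 / contexts                  # at most one path's work survives
    print(f"{levels} level(s): {contexts} contexts, "
          f"{per_context} ROB entries each, "
          f"{useful:.1%} of the work kept at best")
```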

Never mind the power consumption: even if you could support this wasteful parallelism, spreading your resources that thin would choke you before you could push more than a few instructions down each path.



Now let's look at the smarter option of splitting over a single branch only - this starts to look like hyperthreading - you split/share your core resources between two contexts. That feature does provide some performance benefit, granted, but only because both contexts are non-speculative. As it stands, I believe the common estimate is a 10-30% gain over running the two contexts one after the other, depending on the workload mix (numbers from the AnandTech review here) - that's nice if you really intended to run both tasks anyway, but not when you are about to throw away the results of one of them. Even if you ignore the mode-switching overhead here, you are gaining 30% only to lose 50% - there is no point in that.
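To make that trade-off concrete, here is a toy calculation; the 1.3x combined throughput stands for the ~30% SMT gain cited above, and the even split of useful work between the two sides is an assumption of this simplified model:

```python
# Toy throughput comparison: predict-one-path vs. run-both-paths SMT-style.
# The 1.3x combined-throughput figure is the ~30% SMT gain cited above;
# everything else is a deliberately simplified model.

single_context = 1.0          # baseline: whole core runs the predicted path
smt_combined = 1.3            # two SMT contexts together (~+30%)

# If the two contexts are the two sides of one branch, only one side's
# results survive, so useful throughput is roughly half the combined rate.
useful_eager = smt_combined / 2

print(f"predicted single path : {single_context:.2f}")
print(f"both paths, SMT-style : {useful_eager:.2f}")   # ~0.65 of baseline
```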

On the other hand, you have the option of predicting the branches (modern predictors can reach over 95% accuracy on average) and paying a misprediction penalty that is already partially hidden by the out-of-order engine (some instructions preceding the branch can still execute after it is flushed; most out-of-order machines support this). This leaves the deep out-of-order engine free to roam ahead, speculating to its full potential depth and being right most of the time. The odds that the speculated work survives do decrease geometrically (95% after the first branch, ~90% after the second, etc.), but the flush penalty also decreases. It is still far better than a global efficiency of 1/n (for n levels of bifurcation).
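For comparison, a short sketch of the expected useful-work fraction at each speculation depth, assuming the ~95% per-branch accuracy quoted above for the predicted path versus the 1/2^k kept by eager execution at depth k:

```python
# Expected fraction of in-flight work that turns out useful, by speculation depth.
# Assumes the 95% per-branch prediction accuracy cited above; the eager case
# keeps at most one of the 2**k paths alive at depth k.

ACCURACY = 0.95

print("depth  predicted-path  eager-both-paths")
for k in range(1, 5):
    predicted = ACCURACY ** k      # probability the speculated path is still correct
    eager = 1 / (2 ** k)           # at most one of 2**k paths is the real one
    print(f"{k:>5}  {predicted:>13.0%}  {eager:>16.1%}")
```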
