CPU branch prediction evolution over years (2017) (danluu.com)
210 points by Keyframe on Dec 31, 2022 | 70 comments



>We also didn’t talk at all about how different workloads affect different branch predictors. Predictor performance varies not only based on table size but also based on which particular program is run.

One of my favorite little tradeoffs around this topic is around power usage.

This is because adding more hardware generally increases total power usage. I.e. adding more execution units makes the processor faster at the expense of increased power usage. Good branch predictors, however, are a very notable exception to this rule: even though they increase instantaneous power usage over a small delta-t, over a much larger one they tend to save a LOT of power by avoiding costly mispredictions. This means that, somewhat counterintuitively, adding more hardware decreases total power, which is really cool!


Similar with a well-designed cache. Lots of static power usage, but it saves you so many trips to memory (especially instruction caches)


You mean reducing energy usage. Energy is power integrated over time.


Related:

How and why CPUs do “branch prediction” (2017) - https://news.ycombinator.com/item?id=20324092 - July 2019 (33 comments)

A history of branch prediction - https://news.ycombinator.com/item?id=15078574 - Aug 2017 (65 comments)


Do you ever take a day off?


Ever considered that the related threads comment is easily scripted?


Perhaps, but I've never seen anything to reveal scripted moderator communications on HN. Unless Dang is an AGI, which increasingly seems not out of the question ^^.



> One way you might design a CPU is to have the CPU do all of the work for one instruction, then move on to the next instruction, do all of the work for the next instruction, and so on.

If a modern CPU did no branch prediction, how much slower would it be? Just curious to know a ballpark figure, like 5x slower or 100x slower?


One of the other comments around here[0] points to one truly classic StackOverflow question[1] where the answer is "branch prediction". In that example, the CPU is 6x slower at processing the data when the branch predictor is failing repeatedly. It is surely still succeeding some of the time, since that question wasn't specifically built to foil the branch predictor, so that isn't even the upper bound for performance loss.

If you designed a CPU to lack a branch predictor, you would certainly make different choices that would reduce the penalty of that, but there would still be a significant penalty.

Modern CPUs are all pipelined[2] because it allows a sequence of instructions to be executed (to some degree) in parallel. With out-of-order CPUs, the amount of parallelism that is possible for a sequence of instructions increases even more.

Without a branch predictor, you would have to stall the entire pipeline as soon as you see a branch instruction, until that instruction is finished. Obviously if you don't have a branch predictor, you're going to choose to have a shorter pipeline, but the penalty of each branch instruction will still be significant.

There's a lot of nuance to any proper answer to this question, and it's been years since I learned about this or thought deeply about this, so I'm probably not the right person to provide a deeper answer at this point. The performance impact would be significant.

[0]: https://news.ycombinator.com/item?id=34202230

[1]: https://stackoverflow.com/questions/11227809/why-is-processi...

[2]: https://en.wikipedia.org/wiki/Instruction_pipelining
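
For a concrete feel, here is a minimal sketch (not the asker's exact code) of the kind of experiment in [1]: summing only the bytes >= 128 from a random buffer, before and after sorting. The branch is mispredicted roughly half the time on random data and almost never once sorted; note that an aggressive compiler may emit branchless/SIMD code here and hide the effect.

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Sum elements >= 128; the if() is hard to predict on random data,
    // trivially predictable once the data is sorted.
    static long long sum_big(const std::vector<uint8_t>& v) {
        long long sum = 0;
        for (uint8_t x : v)
            if (x >= 128)
                sum += x;
        return sum;
    }

    int main() {
        std::vector<uint8_t> data(32 * 1024);
        std::mt19937 rng(42);
        for (auto& x : data) x = rng() & 0xFF;

        auto run = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            long long total = 0;
            for (int i = 0; i < 20000; ++i)       // repeat so the timing is measurable
                total += sum_big(data);
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
            std::printf("%-8s total=%lld  %lld ms\n", label, total, (long long)ms);
        };

        run("unsorted");                          // branch guessed wrong about half the time
        std::sort(data.begin(), data.end());
        run("sorted");                            // same work, predictable branch
    }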


Also, as I understand it, a big part of the "win" of branch prediction is to ensure those caches are filled.

If you can execute ahead, through branches, you can resolve cache misses while the pipeline catches up, effectively reducing the cache miss penalty. If instead you have to stop at a branch, you're always gonna pay the full cache miss penalty, which can be significant.


Of course, now you're opening the can of worms of speculative-execution leaks via cache timing.

Which, again, may not be an issue given your constraints.


That's mostly true, but the heat generated by calculations which were thrown away still impacts the CPU's thermal budget and the device's battery life. This, coupled with branch prediction overhead, makes failed branch predictions far worse than simply stalling would be without them.

However, even very dumb branch prediction tends to be well worth it for the vast majority of code.


A useful (very rough) rule of thumb here is that every 5th instruction is a branch. This limits all sorts of things - like branch cost and prediction algorithms and sizes, but also how many instructions it's worth decoding per clock, which puts practical limits on system issue rates (a predicted-taken branch typically reduces the issue rate).


If the branch predictor fails, does it have to take time undoing the steps it's already done? Or are pipelines long enough that the fail would be detected before registers/pointers are updated?


The results for partially-executed instructions aren't "committed" yet, so the pipeline is flushed (thrown away), and execution starts again with an empty pipeline at the corrected address. IIRC, each stage of a pipeline has its own registers. The registers your code knows about would not have been updated until the end of the pipeline, but that branch instruction is ahead of the speculative execution, so none of the speculative results would be stored before the mistake is identified.

So, there's no work required to "undo" those mistakes, but starting from an empty pipeline still means you're a dozen (or two) clock cycles from getting back to where you would have been with a successful branch prediction, which is why it is important for the CPU to predict correctly as often as it can.


Not undoing AFAIK, as it won't have committed the results before the branch resolves. However it will have to throw away all the work it did and start from the correct branch target. This is called a pipeline stall.

The Pentium 4 in particular suffered due to its long pipeline and hence long stalls. The successor to the Pentium 4 was an evolved Pentium III core with a much-improved branch predictor and larger cache[1], which helped it outperform the Pentium 4[2].

[1]: https://en.wikipedia.org/wiki/Pentium_M

[2]: https://en.wikipedia.org/wiki/Intel_Core_(microarchitecture)


Not much if you adjust the ISA for it!

https://arxiv.org/abs/2007.15919

This paper adds (more or less) a really big branch delay slot to fill the CPU with work while it's waiting for the branch to resolve.

Impossibility of Spectre-like attacks is a neat side-effect.


I've only glanced over that paper, but they're using a processor with a 5-stage pipeline, which is really short by modern standards (Zen 1 uses a 19-stage pipeline, and I couldn't quickly find the number for subsequent versions of Zen). Using a very short pipeline significantly reduces the advantage of control flow speculation (CFS)... but they still showed CFS offering up to a ~50% advantage over their best alternative, if I'm reading that right.

I wish they had included the geometric mean of their benchmarks, but I didn't see it anywhere, and I'm not going to run the numbers right now. Even if the speedup CFS offered on a 5-stage pipeline is "only" 25%, that is still huge... and on a larger pipeline, that delta would grow. "Not much" is drastically different from my interpretation of those results.

I do think security is extremely important, but I'm not convinced that things are currently so terrible that this is the only way forward, as the authors seemed to imply.

OTOH, I would enjoy seeing a return of an Itanium-style ISA that moves a lot of speculation from the hardware to the compiler. I think compilers are in a much better place now than they were when Itanium hit the scene, which did not help Itanium's problems.


In many ways we've seen that return to dumber hardware + smarter compilers in the GPGPU realm, although even there the hardware continues to get more capable over time.

Those applications tend to work, though, on the basis that either the compiler is generating fat binaries to support multiple architecture versions (e.g. Cuda), or some sort of IR, or compilation happens at runtime (e.g. OpenCL). It doesn't really work if you want to generate single binaries that will work performantly on a wide range of hardware versions - particularly important for users answering "how will application X work on future hardware Y", which really gets in the way of general-purpose use.

That's really the great advantage of putting more smarts in the hardware - you can evolve the processor design (often to improve performance) while executing the same binaries.


Thanks for the insights!

In my defense, 1.5X is "not much" when compared with 5 to 100X :-)


There’s no easy answer. Branch prediction is a key enabler for other kinds of speculation. Modern CPUs can have dozens of ops in a partially executed state. This lets them, e.g., simultaneously fetch many lines of memory that are going to be needed soon, under the assumption that a branch will/won’t be taken. If you can’t predict branches, you won’t be able to keep nearly so many ops in flight, and the whole apparatus for speculative execution starts to make a lot less sense. Everything is workload dependent, but I suspect you’re more in the 100x than 5x range after the knock-on effects.


Some back of the envelope math:

- there is a branch instruction every 10 instructions or so

- if you can't predict, assume you guess wrong half of the time

So every 20 instructions or so, you take the wrong branch. Sounds like it could be a significant overhead, from 10% - 50% depending on what the code inside the loop is doing.

For instance, if your code is doing only arithmetic operations, no load/store, such as computing factorial or other silly integer math, 20 instructions would probably take 20 cycles or less to execute, but the branch failure would cause a similar delay. So that's 50% overhead. In the other case, assume the 20 instructions are fairly slow (lots of memory accesses), then your branch penalty won't feel so bad.
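
Plugging those assumptions into a throwaway model (all numbers made up for illustration: a branch every 10 instructions, a coin-flip guess, a ~20-cycle refill, IPC of 1 for the ALU-only case) reproduces the ~50% figure:

    #include <cstdio>

    int main() {
        const double instrs_per_branch = 10.0;  // assumed branch density
        const double wrong_guess_rate  = 0.5;   // no predictor: coin flip
        const double refill_cycles     = 20.0;  // assumed pipeline refill cost
        const double baseline_ipc      = 1.0;   // ALU-only integer code

        double base     = instrs_per_branch / baseline_ipc;   // cycles per 10 instructions
        double penalty  = wrong_guess_rate * refill_cycles;   // expected refill cost per branch
        double slowdown = (base + penalty) / base;
        std::printf("slowdown: %.2fx, time lost to refills: %.0f%%\n",
                    slowdown, 100.0 * penalty / (base + penalty));
        // Prints 2.00x and 50%. With memory-bound code the base cycles per
        // instruction go up and the relative penalty shrinks, as noted above.
    }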


Hopefully others can correct me here, but my intuition is that the lower bound of performance would be determined by the pipeline length. I.e. if your instruction pipeline was 8 stages, worst-case scenario is flushing the pipeline on every instruction, which would cause it to run at 1/8th speed. (At one point I believe the Pentium 4 had a twenty-stage pipeline, yikes!).

In practice, branches aren't every instruction, so the impact will be less.


> At one point I believe the Pentium 4 had a twenty-stage pipeline, yikes!

Modern pipeline lengths aren't documented publicly as often as they could be, but Zen 1 was a 19 stage pipeline, and it performed great for the time. But, I think Pentium 4's full pipeline was 31 stages. Regardless, I think CPU designers these days have better tools and techniques for dealing with long pipelines.


Wouldn't any kind of measurement have to rely on a heuristic? We would need a baseline as to how often a branch prediction is accurate. If the branch prediction is always wrong, we can assume the speed would be the same. If the branch prediction is always correct, then it should be a pretty big speed-up. I would be interested to see a paper about this topic, so if anyone has one, please share!


> If the branch prediction is always wrong, we can assume the speed would be the same.

Not sure, because in that case the wrongly-predicting CPU is doing lots of wasted work, which contributes to eg waste heat. A non-predicting CPU could do more real work, than a wrong-predicting one?


classic StackOverflow question “Why is processing a sorted array faster than processing an unsorted array?” https://stackoverflow.com/questions/11227809/why-is-processi...


Is the moral of CPU branch prediction that generating dynamically branch-free code is preferable to generating statically branch-free?


I would think it is less on the "don't have branches" but more "try to have a happy/common path." This is oddly at odds with how humans work as operators. For human operators, you typically want each interaction to be as unpredictable as possible, to keep their attention. Falling into rote action is a sure way to get a mistake out of a person. (Or to get them to quit.)

However, for computers, it is good to view them as machines. Like most machines, they are best if they are doing the exact same thing over and over. And over.

As such, if you can contrive for the machine to take the same branch the vast majority of the time, you will get it to work faster. Whether this is done with a dynamic arrangement of the data or a static arrangement of the code depends largely on the specifics. Consider sorting networks (sketched below) as a fun exploration of the idea. They can be ridiculously fast, as they always do the same amount of work. (Provided you have enough execution units.)
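
As a concrete illustration, a tiny sorting network looks like this (a sketch; it assumes std::min/std::max compile to branchless cmov or SIMD min/max, which mainstream compilers usually manage):

    #include <algorithm>
    #include <array>
    #include <cstdio>

    // Compare-exchange: same work regardless of input, no data-dependent branch.
    static inline void cswap(int& a, int& b) {
        int lo = std::min(a, b);
        int hi = std::max(a, b);
        a = lo;
        b = hi;
    }

    // Optimal 5-comparator network for 4 inputs.
    static void sort4(std::array<int, 4>& v) {
        cswap(v[0], v[1]); cswap(v[2], v[3]);
        cswap(v[0], v[2]); cswap(v[1], v[3]);
        cswap(v[1], v[2]);
    }

    int main() {
        std::array<int, 4> v{3, 1, 4, 1};
        sort4(v);
        std::printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);   // 1 1 3 4
    }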


The more predictable a work process becomes, the quicker people can do it too.

Our issue is with correctness, which is something we don't even think about with computers. So it's not really surprising that this completely different issue has different constraints than speed.


> The more predictable a work process becomes, the quicker people can do it too.

The most highly-voted question on Stack Overflow[1] relates to branch predictability. There's an almost 10× speed-up when the branch predictor of the asker's CPU is fully engaged.

Modern branch predictors are extremely powerful, with TAGE and perceptrons achieving 99.5%+ prediction accuracy.

https://stackoverflow.com/questions/11227809/why-is-processi...


I suspect it is a curve?


This is related to an idea from data oriented design: Split up your data such that each piece of data needs exactly the same transform, and do all those at once.


The way Linus put it in an old newsgroup posting from his Transmeta days was something like: "the job of a cpu is to generate the minimum number of cache misses as early as possible."


That’s certainly true, a correctly predicted branch costs little more than a nop or unconditional jmp.

Code doesn’t have to be branch free, just branch in a predictable way.

Statically branch free code is great when you have things that are genuinely unpredictable, think branches driven by data. You might reach for min/max as an example of that, but in most cases these tend to be highly predictable. Control flow in a decompressor is probably a better example (eg. literal vs copy).


A correctly predicted branch, in a sense, can cost less than a nop in that it does actual work and requires no additional effort from the backend.


Nops do not hit the backend. And conditional branches do eventually make their way to the backend: at some point, you have to actually check that the condition was satisfied. Refer to Agner Fog's instruction tables (https://www.agner.org/optimize/instruction_tables.pdf): on Intel, conditional jumps can be handled by ports 0 and 6.


I mean, not really. All branches will need to hit the backend in order to verify that they were correctly predicted. RISC-V conditional branches, for example, require doing register-register comparisons for verification, which means they must be entered into the ROB and issued like any other instruction. This consumes real backend resources and is significantly more than a nop (which can be eliminated by the front end).


The "in a sense" was doing a lot of heavy lifting. They still hit the backend but the idea is that a nop also hits the backend but does no work.

Programmers, I think, sometimes still have a very backend-heavy mental model of branch prediction (e.g. old static predictors)


By "hitting the backend" do you mean generating μ-ops? Because nops don't generate μ-ops yet branches definitely do.


Well, reducing the sheer number of branches in your code has exactly the same impact as improving the predictor's performance. So, no, I don't think you can get this conclusion from the article.

The article is simply biased toward improving the CPU and ignores any opportunity to improve your program. Which is understandable, because it's about CPU architectures.

In practice, you will want to improve whatever part you have the freedom to change. You usually can not choose.


The branches still take up (instruction) cache space. So it is beneficial to reduce the number of branches, even if they could have been predicted accurately.


That said, I have seen people tricked into thinking branches are only if statements. If using polymorphism to reduce if statements, you'll still want to reduce the variations of the types sent through the code. Well, strictly you just want the variations to be predictable, I suppose?


I've been wondering about this recently. Modern branch predictors are really good at predicting correlated conditional branches. That is, if you have two if statements near each other that use the same condition, modern branch predictors have a very good chance of learning to predict the second one perfectly based on what the first one did.

Is the same true for indirect calls, i.e. virtual function calls? That could be quite the powerful optimization, but it's probably really hard to do.


My understanding is that this is, essentially, the optimization that you get with many common "entity component system" frameworks. That is, you typically try to keep homogeneous collections of entities so that when you are processing them, you are doing a similar processing in a loop.

Amusingly, as I try to google to make sure I am using the right terms, I find https://news.ycombinator.com/item?id=28200030 as a good result. Seems to cover the general idea and hints at other terms to look into. In particular, have fun with https://en.wikipedia.org/wiki/AoS_and_SoA. :D

Edit: I should add that I'm pretty sure my memory conjuring up ECS stuff was just mistaken, btw. https://en.wikipedia.org/wiki/Data-oriented_design is probably what I actually had in mind.
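
A hedged sketch of that "homogeneous batches" idea in plain C++ (the Entity/Ship/Rock names are made up for illustration): grouping objects by concrete type means the indirect call in the loop keeps hitting the same target for long stretches, which is exactly what an indirect predictor, or a data-oriented layout, exploits.

    #include <algorithm>
    #include <memory>
    #include <typeindex>
    #include <typeinfo>
    #include <vector>

    struct Entity { virtual ~Entity() = default; virtual void update() = 0; };
    struct Ship : Entity { void update() override { /* ... */ } };
    struct Rock : Entity { void update() override { /* ... */ } };

    void update_all(std::vector<std::unique_ptr<Entity>>& entities) {
        // Group by dynamic type so each stretch of the loop calls the same
        // vtable target (assumes the sort cost is amortized over many passes).
        std::sort(entities.begin(), entities.end(),
                  [](const auto& a, const auto& b) {
                      return std::type_index(typeid(*a)) < std::type_index(typeid(*b));
                  });
        for (auto& e : entities)
            e->update();   // indirect branch, now mostly monomorphic per batch
    }

    int main() {
        std::vector<std::unique_ptr<Entity>> entities;
        for (int i = 0; i < 1000; ++i) {
            if (i % 2) entities.push_back(std::make_unique<Ship>());
            else       entities.push_back(std::make_unique<Rock>());
        }
        update_all(entities);
    }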


Things are getting way too high-level here. Nobody will be able to answer your question because it depends on an overwhelming amount of details.

Your compiler's capabilities will have a much larger influence on your code's performance than the lower-level properties of your CPU. As always, it's a matter of profiling and discovering what is actually happening. Most times when people use polymorphism, it's free.


Agreed it is tough to answer. Highly polymorphic code is often not that diverse, for one. And many languages that encourage polymorphism are run with a JIT, which can be seen as a higher-level branch predictor. And, predictably, that benefits code paths that don't have a lot of diversity in the types being sent through the code.


Avoiding either > Easily predicted branch > conditional move > difficult to predict branch


> Avoiding either > Easily predicted branch

Branchless code can easily out-cost a well-predicted branch (and its content if any).


This is where I was trying to use sorting networks as a fun example. Without independent compare/swap units, they will go slower than sequential sorting. Despite not necessarily "branching."


Of course, perhaps this should be reduced to "keep an eye out for silly branches"


Wow, I really enjoyed reading this. It made me miss the old CPU Praxis articles that made me fall in love with the old Ars Technica.

I'd be curious what branch prediction algorithms more recent CPU architectures have employed. Arm in general, Apple's various Arm chips, etc.


I could have sworn that PPro used 2-bit saturating counters, but I checked and: "The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method[1]. This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken–taken–not taken."
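
For context, the Smith scheme mentioned in that quote is small enough to sketch from the textbook description (this is the generic two-bit saturating counter, not Intel's actual implementation):

    #include <cstdint>
    #include <cstdio>

    // Four states: 0 = strongly not-taken ... 3 = strongly taken.
    struct TwoBitCounter {
        uint8_t state = 2;                          // start weakly taken (convention varies)
        bool predict() const { return state >= 2; }
        void update(bool taken) {
            if (taken  && state < 3) ++state;       // saturate at strongly taken
            if (!taken && state > 0) --state;       // saturate at strongly not-taken
        }
    };

    int main() {
        // A repeating taken-taken-not-taken pattern: the Smith counter keeps
        // guessing "taken" and eats a mispredict every third branch, which is
        // exactly the kind of sequence the history-based Yeh scheme can learn.
        TwoBitCounter c;
        const bool pattern[] = {true, true, false};
        int wrong = 0;
        for (int i = 0; i < 300; ++i) {
            bool outcome = pattern[i % 3];
            if (c.predict() != outcome) ++wrong;
            c.update(outcome);
        }
        std::printf("mispredicts: %d / 300\n", wrong);   // 100 / 300 on this pattern
    }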

One other thing I wondered about is whether there have been improvements to the recovery of mispredicted branches that have not yet been retired. It seems like the huge OoO windows of the latest Apple and Intel CPUs would see diminishing returns otherwise. Anyone know?


Misprediction recovery is an important area of optimization. Mispredictions tend to flush and begin refetching ASAP, so rather than wait until retire, they will do that when they execute, including out of order (a newer branch could flush before an older one is resolved, and could later itself be flushed by that older one).

There are also several levels of predictor, and mistakes in the low-latency but less accurate ones can cause recovery events if they are overridden a few cycles later by the slower predictors.

Flush / recovery is made as lightweight as possible, only flushing instructions after the branch, and only the parts of the pipeline that the oldest one has reached.

There is some literature on controlling speculation if you are not confident. Say if you have a really poorly predicted indirect branch target where a mispredict is not 50/50 but 99/1, and it depends on a very slow source, then you might not want to speculate 500 instructions ahead of that but say okay, let's just wait for it to resolve. Or, if you have a 50/50 hard-to-predict branch, you might kick off some prefetching down the other path when you see it so recovery is faster (but this will cost more power). I don't know if anybody does anything with these "hard to predict branches" today. Apparently it's actually quite difficult to predict whether a branch is hard to predict, and the resources to make that prediction can be put to better use just trying to predict them instead of predicting that you can't predict them :) Although that might change. Point is, all kinds of misprediction optimization are actively being pursued.


Yeah, these days it's likely for only the frontend to stall during misprediction recovery. Already decoded µops from before the branch still make their way through the backend as normal. If you have enough of those (long dependency chains or the branch was reordered early enough), it's possible that the misprediction has effectively no penalty.

As a corollary, this means that when possible you want the dependency chain for branches to be as independent and short as possible, so the CPU can reorder and resolve branches as early as possible.


The original Pentium famously used a 2-bit counter except, IIRC, it was buggy and didn't actually saturate. It was fixed on the Pentium MMX.

A simple 2-bit predictor wouldn't have been enough for the PPro, as the deeper pipeline and the new OoO window required a better predictor.


Easy to follow up until "two-level adaptive, global", then it became harder to read. But I'm still going. Note that there is a typo here: "P(A not taken) + P(B taken)". Supposed to be: "P(A not taken) * P(B taken)".
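
In case it helps other readers over that hump: "two-level adaptive, global" just means the recent branch outcomes themselves are the index into a table of counters. A minimal sketch (a toy GAg-style predictor, not the article's exact design):

    #include <array>
    #include <cstdint>

    struct GlobalTwoLevel {
        static constexpr int HIST_BITS = 4;                 // 4 bits of global history (toy size)
        uint8_t history = 0;                                // last 4 outcomes, 1 = taken
        std::array<uint8_t, 1 << HIST_BITS> counters{};     // one 2-bit counter per history pattern

        bool predict() const { return counters[history] >= 2; }

        void update(bool taken) {
            uint8_t& c = counters[history];                 // level 2: counter selected by history
            if (taken  && c < 3) ++c;
            if (!taken && c > 0) --c;
            history = ((history << 1) | (taken ? 1 : 0))    // level 1: shift outcome into history
                      & ((1 << HIST_BITS) - 1);
        }
    };

    // After warm-up this learns repeating sequences like taken-taken-not-taken,
    // which a plain 2-bit counter never can. Real designs also mix in the branch
    // address (per-branch or hashed), which is where the naming variants come from.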


Is branch prediction even a good idea? I remember it blowing my mind when I learned about the existence of this stuff from Spectre. Should CPUs even try to be this "clever"? It seems like something better left to the kernel.


Branch prediction is pretty fundamental to CPU performance; unless we want much slower CPUs it’s necessary.

Although it’s interesting to think about how software could help with branch prediction, the simple answer about the kernel thing is that software is the wrong place in the architecture for branch prediction to happen. Maybe there’s some crazy alternative way things could be done, but in anything like the current computing model branch prediction has to happen in hardware.


> Maybe there’s some crazy alternative way things could be done

"Branchless programming" is an umbrella term for techniques in existing programming languages to avoid branches.


It almost always doesn’t work; if a branch can eliminate a single uncached memory read you’d want to keep that branch.

It’s mostly used for things like min/max operations in cases when you don’t have cmov. Or bitwise SIMD programming.
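
For the curious, the kind of cmov-free branchless min that comment alludes to usually looks something like this (a sketch; it assumes two's-complement arithmetic, an arithmetic right shift on signed values, and that a - b doesn't overflow):

    #include <cstdint>
    #include <cstdio>

    int32_t branchless_min(int32_t a, int32_t b) {
        int32_t diff = a - b;                 // negative iff a < b (assuming no overflow)
        int32_t mask = diff >> 31;            // arithmetic shift: all ones if a < b, else zero
        return b + (diff & mask);             // a when a < b, otherwise b
    }

    int main() {
        std::printf("%d %d\n", branchless_min(3, 7), branchless_min(-5, 2));  // 3 -5
    }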


    CPU Single Thread Rating

    Name                                pts     Date
    Intel Core2 Duo E7600 @ 3.06GHz   1,270     Q2 2009
    Intel Core Solo T1350 @ 1.86GHz     413     Q1 2009
    Intel Core Solo U1300 @ 1.06GHz     340     Q3 2009
    Intel Atom D425 @ 1.80GHz           320     Q1 2011
These Atoms were quite gutted to get a low die size and low TDP, and the first things that were cut were cache and branch prediction. You can see how the D425 from 2011 is on the same level as the U1300 from two years earlier, which is almost half a GHz slower.

> In the Bonnell microarchitecture, internal micro-ops can contain both a memory load and a memory store in connection with an ALU operation, thus being more similar to the x86 level and more powerful than the micro-ops used in previous designs.[3] This enables relatively good performance with only two integer ALUs, and without any instruction reordering, speculative execution or register renaming. A side effect of having no speculative execution is invulnerability against Meltdown and Spectre.

https://www.cpubenchmark.net/compare/948vs579vs593vs609/Inte...

https://en.wikipedia.org/wiki/Bonnell_(microarchitecture)


Bonnell has a branch predictor: https://en.wikichip.org/wiki/intel/microarchitectures/bonnel...

In-order execution does not mean that there is no branch prediction, it just means that the number of instructions speculated is much smaller.


Yes, it would be quite strange not to add it at all, especially since it was available since Pentium days.

No, it is not comparable with 'proper' CPUs because it was still gutted hard.

IMMSMV all that stuff consumes a lot of die space, so it makes sense to throw it out if the performance is enough.


Could the compiler put PGO (profile guided optimization) results into a table embedded into the executable binary and feed that to the CPU?


Some architectures such as Itanium, MIPS, and x86 have branch hint instructions that allow the compiler to indicate whether branches are taken. I'm not sure about Itanium, but they weren't really beneficial on MIPS or x86; the MIPS branch hint instructions are deprecated and Intel's compiler stopped emitting branch hints at least a decade ago.


The CPU can't necessarily use any per-branch hints, including special "unlikely" branch instructions or implying one direction is less likely than the other, because not all designs have that info available at the predictor. In particular, it usually doesn't even know the address of the branch.


Yes, the “table” is encoded into the code itself ;)
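
At the source level, what survives today is mostly likelihood hints that shape code layout (which is roughly what PGO automates), rather than anything fed to the hardware predictor. A sketch using the C++20 attributes (__builtin_expect is the older GCC/Clang spelling):

    #include <cstdio>

    int parse_byte(int byte) {
        if (byte < 0) [[unlikely]] {      // compiler moves this path out of the hot code layout
            return -1;
        }
        return byte * 2;                  // the likely path stays on the fall-through
    }

    int main() { std::printf("%d %d\n", parse_byte(21), parse_byte(-1)); }  // 42 -1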


I noticed that every scheme after BTFNT kept state. Did anyone ever come up with a better stateless branch predictor than it?



