Except for GPU manufacturers, who figured out the right way to do it 20 years ago.
20 years ago, it was extremely obvious to anyone who had to write forward/backward compatible parallelism that the-thing-nvidia-calls-SIMT was the correct approach. I thought CPU hardware manufacturers and language/compiler writers were so excessively stubborn that it would take them a decade to catch up. I was wrong. 20 years on, they still refuse to copy what works.
That's because the thing GPUs do just isn't what CPUs do. GPUs don't have to deal with ad-hoc 10..100-char strings. They don't have to deal with 20-iteration loops with necessarily-serial dependencies between iterations of those small loops. They don't have to deal with parallelizing mutable hashmap probing operations.
Indeed what GPUs do is good for what GPUs do. But we have a tool for doing things that GPUs do well - it's called, uhh, what's it, uhh... oh yeah, GPUs. Copying that into CPUs is somewhere between just completely unnecessary, and directly harmful to things that CPUs are actually supposed to be used for.
The GPU approach has pretty big downsides for anything other than the most embarrassingly-parallel code on very large inputs. Anything non-trivial (sorting, prefix sum) will typically require log(n) iterations and somewhere between twice as much and O(n*log(n)) memory access (and even summing requires stupid things like using memory for an accumulator instead of just a vector register), compared to the CPU SIMD approach of doing a single pass with some shuffles. GPUs handle this by trading off memory latency for more bandwidth, but any CPU that did that would immediately go right in the trash, because it would utterly kill scalar code performance.
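For concreteness, a rough sketch of the "single pass with some shuffles" idea: an in-register inclusive prefix sum over one SSE2 vector (the function name and the choice of SSE2 are just mine for illustration):

    #include <emmintrin.h>  /* SSE2 */

    /* Inclusive prefix sum of four 32-bit ints, entirely in a register:
       no memory accumulator, no log(n) passes over memory. */
    static __m128i prefix_sum_epi32(__m128i v)
    {
        v = _mm_add_epi32(v, _mm_slli_si128(v, 4));  /* shift by one lane, add  */
        v = _mm_add_epi32(v, _mm_slli_si128(v, 8));  /* shift by two lanes, add */
        return v;  /* lane i now holds v[0] + ... + v[i] */
    }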
This reads like you haven't tried CUDA. The whole point of CUDA is that your CUDA code has single-threaded semantics. The problem you assumed it has is the problem it doesn't have, and the fact that it doesn't have this problem is the reason why it works and the reason why it wins.
EDIT: ok, now you're talking about the hardware difference between CPUs and GPUs. This is relevant for the types of programs each can accelerate -- barrel processors are uniquely suited to embarrassingly parallel problems, obviously -- but it is not relevant to the question of "how to write code that is generic across block size, boundary conditions, and thread divergence." CUDA figured this out, non-embarrassingly-parallel programs still have this problem, and they should copy what works. The best time to copy what works was 20 years ago, but the second best time is today.
It has single-threaded semantics per element. Which is fine for anything that does completely independent computation for each element, but is quite annoying for everything else, requiring major algorithmic changes. And CPU SIMD is used for a lot of such things.
"Completely independent" except for anything that can be expressed using branches, queues, and locks. Which is everything. Again, are you sure you've tried CUDA? Past, like, the first tutorial?
I haven't used CUDA (I don't have an Nvidia GPU), but I've looked at example code before. And it doesn't look any simpler than CPU SIMD for anything non-trivial.
And pleasepleaseplease don't use locks in something operating over a 20-element array; I'm pretty damn sure that's simply going to be a suboptimal approach in any scenario. (Even if those are just "software" locks that force serialized computation without actually emitting memory atomics or extra instructions, needing them at all hints at awful approaches like log(n) passes over an n=20-element array, or in-memory accumulators, or something equally awful.)
As an extreme case of something I've had to do in CPU SIMD that I don't think would be sane in any other way:
How would I implement, in CUDA, code that does elementwise 32-bit integer addition of two input arrays into a third array (which may be one of the inputs), checks each addition for overflow, and, if any addition overflows (ideally early-exiting so as not to do useless work), reports in some way how much was processed, so that further code can redo the rest with a wider result type and can compute the full wider result array even where some of the original inputs are no longer available because the input overlaps the output (which is fine, since for those elements the already-computed results can be used directly)?
This is a pretty trivial CPU SIMD loop, maybe a dozen intrinsics (easily doable even via any of the generalized arch-independent SIMD libraries!), but I'm pretty sure it'd require a ton of synchronization in anything CUDA-like, probably being forced to early-exit in much larger blocks, and probably having to return a bitmask of which threads wrote their results, whereas the SIMD loop trivially guarantees that processed and unprocessed elements are split exactly where the loop stopped.
(For addition specifically you can also undo the addition to recover the input array, but that gets much worse for multiplication, where the inverse is division; and in the CUDA approach you might also want to split the work into separately checking for overflow and writing the results, but that's an inefficient two passes over memory just to split out a store of something the first pass already computed.)
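Roughly what that loop looks like with plain SSE2 intrinsics, as a sketch (assuming signed 32-bit inputs and n being a multiple of 4; a real version needs a scalar tail, and the function name is mine):

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    /* out[i] = a[i] + b[i]; out may alias a or b. Stops at the first 4-element
       block containing a signed overflow and returns how many elements were
       written, so follow-up code can redo [ret, n) with a wider result type. */
    static size_t add_i32_checked(int32_t *out, const int32_t *a,
                                  const int32_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128i va  = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb  = _mm_loadu_si128((const __m128i *)(b + i));
            __m128i sum = _mm_add_epi32(va, vb);
            /* Signed overflow iff both inputs have the same sign and the sum
               doesn't: sign bit of (a ^ sum) & (b ^ sum). */
            __m128i ovf = _mm_and_si128(_mm_xor_si128(va, sum),
                                        _mm_xor_si128(vb, sum));
            if (_mm_movemask_ps(_mm_castsi128_ps(ovf)) != 0)
                return i;  /* [0, i) written; inputs for [i, n) untouched */
            _mm_storeu_si128((__m128i *)(out + i), sum);
        }
        return n;
    }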
re edit - While the hardware differences are significant (some by necessity, some by tradition), that's not my point.
My specific point is that "how to write code that is generic across block size, boundary conditions, and thread divergence" is just not the correct question to ask for many CPU SIMD use-cases. Many of those Just Do Not Fit That Paradigm. If you think you can just squeeze CPU SIMD usage into that box, then I don't think you've actually done any CPU SIMD beyond very trivial things (see my example problem in the other thread).
You want to take advantage of block size on CPUs; it's sad that GPU programmers don't get to. Elsewhere I've seen multiple GPU programmers annoyed at not being able to use the CPU SIMD paradigm of explicit registers on GPUs. And doing anything about thread divergence on CPUs is just not gonna go well, due to the necessary focus on high clock rates (and thus branch mispredictions being relatively ridiculously expensive).
You of course don't need anything fancy if you have a purely embarrassingly-parallel problem, which is exactly what GPUs are made for. But for those, autovectorization does actually already work, given hardware that has the necessary instructions (memory gathers/scatters, masked loads/stores if necessary; no programming paradigm would magically make it work on hardware that doesn't). At worst you may need to add a _Pragma to tell the compiler to ignore memory aliasing, at which point the loop body is exactly the same programming paradigm as CUDA (with thread synchronization being roughly "} for (...) {", but with better control over how things happen).
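A sketch of what that ends up looking like (restrict for the no-aliasing promise plus OpenMP's simd pragma as one possible spelling; the function is just an illustrative example):

    #include <stddef.h>

    /* A plain elementwise loop the compiler can autovectorize once it knows
       the pointers don't alias. The body reads like a CUDA kernel body with
       i as the "thread" index; the loop itself plays the role of "} for (...) {". */
    void scale_add(float *restrict out, const float *restrict x,
                   const float *restrict y, float a, size_t n)
    {
    #pragma omp simd
        for (size_t i = 0; i < n; ++i)
            out[i] = a * x[i] + y[i];
    }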
Intel actually built and launched this 15 years ago. A GPU-like barrel processor with tons of memory bandwidth and wide vector instructions that ran x86. In later versions they even addressed the critical weakness of GPUs for many use cases (poor I/O bandwidth).
It went nowhere, aside from Intel doing Intel things, because most programmers struggle to write good code for those types of architectures, so all of that potential was wasted.