
> there's no fundamental reason why the same source code shouldn't be able to compile down to

Well, if you don't describe your code and dataflow in a way that caters to the shape of the SIMD hardware, it seems ridiculous to expect that.

But yes, as always, compilers could be magic programs that transform code into perfectly optimal programs. They could fix our bugs for us, too.

I'm saying that the shape of the SIMD is pretty much the same across platforms. Vector width differs between architectures, and whether the vector width is determined at compile time or at runtime differs between architectures, but you'll have to convince me that the vector width is such an essential component of the abstract description of the computation that you fundamentally can't abstract it away. (In fact, the success of RVV and ARM SVE should tell us that we can describe SIMD computation in a vector width-independent way.)

All vector instruction sets offer things like "multiply/add/subtract/divide the elements in two vector registers", so that is clearly not the part that's impossible to describe portably.
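For instance, here's roughly what a width-agnostic kernel looks like with Google's Highway library (a sketch; I'm omitting Highway's dynamic-dispatch boilerplate, and the function name is mine):

    // Width-agnostic elementwise multiply: the same source compiles to
    // SSE4/AVX2/AVX-512 on x86, NEON/SVE on Arm, and RVV on RISC-V.
    #include <cstddef>
    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    void Mul(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;  // "whatever width the target has"
      const size_t N = hn::Lanes(d);   // compile-time or runtime constant
      size_t i = 0;
      for (; i + N <= n; i += N) {
        hn::Store(hn::Mul(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
      }
      for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail, for brevity
    }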


It's really not. As an example, for string-processing tasks (including codecs, which various server software spends a significant percentage of its runtime on), NEON includes a deinterleaving load into 4 registers and byte-wise shuffles that accept 2, 3, or 4 registers' worth of lookup table. These primitives are quite different from those available on AVX2 or AVX-512, and the fact that they are available and cheap to use means you end up with somewhat different algorithms for the two types of targets. Even the knack of using the toys AVX2 does offer well for this sort of task is somewhat obscure. Folks who have worked on codec-type stuff but primarily used AVX-512 often have trouble figuring out how to do the same things in similar instruction counts when masked versions of the instructions aren't available.
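To make that concrete, this is the sort of thing NEON hands you in one or two instructions (AArch64 intrinsics; untested sketch, and the table contents are whatever your codec needs):

    #include <arm_neon.h>

    // LD4: one deinterleaving load splits 64 interleaved bytes across 4
    // registers (bytes 0,4,8,... -> val[0], bytes 1,5,9,... -> val[1], ...).
    // TBL4: a byte-wise shuffle indexing a 64-byte, 4-register lookup table;
    // out-of-range indices yield zero. AVX2's vpshufb only indexes within
    // 16-byte lanes, so the equivalent takes several instructions.
    void DeinterleaveLookup(const uint8_t* src, const uint8_t table[64],
                            uint8_t* dst) {
      uint8x16x4_t deint = vld4q_u8(src);
      uint8x16x4_t lut = vld1q_u8_x4(table);
      vst1q_u8(dst, vqtbl4q_u8(lut, deint.val[0]));
    }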

I made the same argument a while ago but a coworker changed my mind.

Can you afford to write and maintain a codepath per ISA (knowing that more keep coming, including RVV, LASX and HVX), to squeeze out the last X%? Is there no higher-impact use of developer time? If so, great.

If not, what's the alternative - scalar code? I'd think decent portable SIMD code is still better than nothing, and nothing (i.e. scalar) is all we have for new ISAs that haven't yet been hand-optimized. So it seems we should have a generic SIMD path anyway, in addition to any hand-optimized specializations.

BTW, Highway indeed provides decent emulations of LD2..4, and at least 2-table lookups. Note that some Arm uarchs are slow with the 3- and 4-register variants anyway.
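e.g. (sketch using names from Highway's public API; the function is illustrative):

    #include <cstdint>
    #include <hwy/highway.h>

    namespace hn = hwy::HWY_NAMESPACE;

    // LoadInterleaved4 lowers to LD4 on NEON and is emulated with shuffles
    // elsewhere; TableLookupBytesOr0 is TBL on NEON, PSHUFB on x86.
    void Example(const uint8_t* src, const uint8_t* table, uint8_t* dst) {
      const hn::ScalableTag<uint8_t> d;
      auto v0 = hn::Undefined(d), v1 = hn::Undefined(d),
           v2 = hn::Undefined(d), v3 = hn::Undefined(d);
      hn::LoadInterleaved4(d, src, v0, v1, v2, v3);
      const auto lut = hn::LoadDup128(d, table);  // 16-byte table, splatted
      hn::Store(hn::TableLookupBytesOr0(lut, v0), d, dst);
    }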


For now, at work, it's just some parts with AVX-512, some parts where AVX-512 is present but we can't really use it (so we should use AVX2), and some parts with NEON and SVE. So the SSE implementations are basically a courtesy to outside users of the libraries, and there are no RVV implementations.

If we were already depending on Highway or EVE, I would think it's great to ship the generic SIMD version instead of the SSE version, which probably compiles down to the same thing on the relevant targets. That way, if future maintainers need to make changes and don't want to deal with the several implementations I have left behind, the generic implementation would let them delete the specialized ones rather than make the same changes a bunch of times.


Makes sense :) Generic or fallback versions are also useful for correctness testing and benchmarking.

> if you don't describe your code and dataflow in a way that caters to the shape of the SIMD

But when I do describe code, dataflow, and memory layout in a SIMD-friendly way, it's pretty much the same for x86_64 and ARM.

Then I can just use `a + b` and `f32x4` (or its C equivalent) instead of `_mm_add_ps` and `__m128` (x86_64) or `vaddq_f32` and `float32x4_t` (ARM).

Portable SIMD means I don't need to write this code twice and memorize arcane runes for basic arithmetic operations.

For more specialized stuff you have intrinsics.
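Concretely, with the Clang/GCC vector extension (sketch; the typedef name is mine):

    // One definition; compiles to addps on x86_64 and fadd.4s on AArch64.
    typedef float f32x4 __attribute__((vector_size(16)));

    f32x4 add(f32x4 a, f32x4 b) {
      return a + b;  // not _mm_add_ps(a, b), not vaddq_f32(a, b)
    }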


So we write a lot of code in this agnostic fashion using typedefs and clang's vector attribute support, along with __builtin_shufflevector for all the permutations (along similar lines to Apple's simd.h). It works pretty well in terms of not needing to memorize or look up all the mnemonic intrinsics for a given platform, and it lets regular arithmetic operators work.
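Roughly like this (illustrative sketch, not our actual headers):

    typedef float f32x4 __attribute__((vector_size(16)));

    // Shuffle indices refer to the concatenation of the two arguments:
    // a is lanes 0..3, b is lanes 4..7.
    static inline f32x4 zip_lo(f32x4 a, f32x4 b) {
      return __builtin_shufflevector(a, b, 0, 4, 1, 5);  // {a0,b0,a1,b1}
    }

    static inline f32x4 reverse(f32x4 v) {
      return __builtin_shufflevector(v, v, 3, 2, 1, 0);
    }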

However, we still end up writing different code for different target SoCs, as the microarchitectures differ and we want to maximize throughput and take advantage of any dedicated instructions or type support the ISA offers.

One big challenge is that when targeting in-order cores, the compiler often does a terrible job of register allocation (we need to use pretty much all the architectural registers to cover vector instruction latencies), so the model breaks down somewhat there and we have to drop to inline assembly.


Your experience matches mine: you can get a lot done with the portable SIMD support in Clang/GCC/Rust, but you can't avoid the platform-specific stuff when you need specialized instructions.

How much you need to resort to platform-specific intrinsics depends on the domain you work in. For me, dabbling in computer graphics and game physics, almost all of the code is portable, with some rare specialized instructions here and there.

For someone working in specialized domains (like video codecs) or on specialized hardware (HPC supercomputers), the balance might be the other way around.



