Shuffle, pack/unpack, movemask, blends (SSE/AVX) and interleaved load/stores, by...

Const-me · on Aug 4, 2020

> the question is: how should we write code today?

The only thing that matters today is how fast the code works on today’s hardware. To illustrate, see how I have emulated neon’s vst3q_s16 with SSE: https://github.com/Const-me/DtsDecoder/blob/master/Utils/sto... However, not all of them can be emulated in a fast enough manner.

> specifying the memory-moves explicitly, and then working on a "pshufb compiler/optimizer" of sorts is what we need.

I’ll consider that approach when I’ll see that compiler/optimizer working, and producing output comparable to manually-written code. Until then I think that’s a “sufficiently smart compiler” class of problems. These are rarely ever solved, IMO. For example, we don’t have generally useful auto-vectorizers in C compilers despite two decades of R&D, even the best of them (clang, intel) are still very limited.

floatboth · on Aug 4, 2020

> how should we write code today?

Write in a GPU-style "compute shader" language and let the compiler pick the fastest ways of doing things on each ISA?

https://github.com/ispc/ispc

dragontamer · on Aug 4, 2020

Well, such a feature doesn't exist in ispc. I guess I'm proposing a hypothetical feature that doesn't exist yet in any compiler. But it'd be nice if it existed...

For it to exist, we need a combination of new assembly instructions as well as a smart enough compiler.