
Shuffles, pack/unpack, movemask, and blends (SSE/AVX), along with interleaved loads/stores and byte swaps (NEON), are "just" data-movement instructions.

All of them can be implemented (with obvious slowdowns) as a conditional write to memory followed by a conditional read from memory. Yeah, it's inefficient to do it this way, but this "write then read" pattern gives us a clear picture of what's really going on between the registers in a pack/pshufb/whatever instruction.
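To make the "write then read" pattern concrete, here's a scalar C sketch of pshufb done exactly that way: spill the source register to a buffer, then read back through the control mask. (This is an illustrative model, not how the hardware does it.)

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of SSSE3 pshufb via the "write then read" pattern:
   write the source register to memory, then gather bytes back
   through the control mask. */
static void pshufb_emulated(uint8_t dst[16], const uint8_t src[16],
                            const uint8_t mask[16]) {
    uint8_t buf[16];
    memcpy(buf, src, 16);                       /* the "write" half  */
    for (int i = 0; i < 16; i++) {
        /* pshufb semantics: a set high bit in the mask byte zeroes
           the lane, otherwise the low 4 bits index the source. */
        dst[i] = (mask[i] & 0x80) ? 0 : buf[mask[i] & 0x0F];
    }
}
```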

On AMD GPUs, there's a fully arbitrary crossbar between SIMD lanes allowing for arbitrary movement. The two instructions are just "permute" and "b-permute" (backwards permute), roughly corresponding to scatter and gather respectively.
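A scalar model of the two operations (on GCN these are the ds_permute_b32 / ds_bpermute_b32 instructions), using a small illustrative lane count:

```c
#include <stdint.h>

#define LANES 8  /* illustrative; real AMD waves are 32 or 64 lanes */

/* Forward permute ("push" / scatter-like): each lane writes its
   value to the lane named by its index operand. */
static void permute(uint32_t dst[LANES], const uint32_t src[LANES],
                    const uint32_t idx[LANES]) {
    for (int i = 0; i < LANES; i++)
        dst[idx[i] % LANES] = src[i];
}

/* Backward permute ("pull" / gather-like): each lane reads its
   value from the lane named by its index operand. */
static void bpermute(uint32_t dst[LANES], const uint32_t src[LANES],
                     const uint32_t idx[LANES]) {
    for (int i = 0; i < LANES; i++)
        dst[i] = src[idx[i] % LANES];
}
```

Note the asymmetry: bpermute (gather) always produces a well-defined result, while permute (scatter) can have collisions when two lanes target the same destination.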

On NVidia GPUs, perm and bperm both exist in PTX, but are implemented as reads/writes to L1 or __shared__ memory. NVidia GPUs likely have a crossbar to L1 memory to make these instructions very fast.

---------

The solution is to implement perm and bperm on AVX. It's already half-implemented: pshufb is equivalent to GPU-permute. CPUs are just missing the backwards permute.

I'm pretty confident that pack/unpack, blends, interleaved load/stores, and more could all be implemented as pshufb and a hypothetical "backwards pshufb". Version 1.0 could be an NVidia-like "write to L1 cache" sort of implementation too, if full crossbars are too expensive at the hardware layer.
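As one example of that decomposition, here's a scalar sketch expressing unpacklo (byte interleave of two registers) as a single gather-style permute over the concatenation of both sources. A two-register permute of this kind does exist elsewhere (vperm on POWER/AltiVec, TBL with two table registers on NEON); SSE's pshufb can't do it in one shot because it only takes one source.

```c
#include <stdint.h>

/* Express unpacklo(a, b) -- interleave the low 8 bytes of a and b --
   as one permute over the 32-byte concatenation a||b. */
static void unpacklo_as_permute(uint8_t dst[16], const uint8_t a[16],
                                const uint8_t b[16]) {
    uint8_t cat[32];
    for (int i = 0; i < 16; i++) { cat[i] = a[i]; cat[16 + i] = b[i]; }
    /* a0 b0 a1 b1 ... a7 b7: indices alternate between the halves */
    static const uint8_t mask[16] =
        {0,16, 1,17, 2,18, 3,19, 4,20, 5,21, 6,22, 7,23};
    for (int i = 0; i < 16; i++) dst[i] = cat[mask[i]];
}
```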

-----------

So the question is: how should we write code today? CPUs of today do not implement this feature, but CPUs of the future might. I think specifying the memory-moves explicitly, and then working on a "pshufb compiler/optimizer" of sorts is what we need.



> the question is: how should we write code today?

The only thing that matters today is how fast the code runs on today's hardware. To illustrate, see how I emulated NEON's vst3q_s16 with SSE: https://github.com/Const-me/DtsDecoder/blob/master/Utils/sto... However, not all of them can be emulated in a fast enough manner.
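For reference, what vst3q_s16 does is a 3-way interleaved store: three 8-lane vectors go to memory as a0 b0 c0 a1 b1 c1... A scalar C reference of the semantics (the linked SSE code has to reproduce this with a chain of shuffles, since SSE has no direct equivalent):

```c
#include <stdint.h>

/* Scalar reference for NEON vst3q_s16: store three 8-lane int16
   vectors interleaved into a 24-element buffer. */
static void vst3q_s16_scalar(int16_t *out, const int16_t a[8],
                             const int16_t b[8], const int16_t c[8]) {
    for (int i = 0; i < 8; i++) {
        out[3 * i + 0] = a[i];
        out[3 * i + 1] = b[i];
        out[3 * i + 2] = c[i];
    }
}
```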

> specifying the memory-moves explicitly, and then working on a "pshufb compiler/optimizer" of sorts is what we need.

I’ll consider that approach when I see that compiler/optimizer working, and producing output comparable to manually-written code. Until then I think that’s a “sufficiently smart compiler” class of problem. These are rarely ever solved, IMO. For example, we don’t have generally useful auto-vectorizers in C compilers despite two decades of R&D; even the best of them (clang, intel) are still very limited.


> how should we write code today?

Write in a GPU-style "compute shader" language and let the compiler pick the fastest ways of doing things on each ISA?

https://github.com/ispc/ispc


Well, such a feature doesn't exist in ispc. I guess I'm proposing a hypothetical feature that doesn't exist yet in any compiler. But it'd be nice if it existed...

For it to exist, we need a combination of new assembly instructions as well as a smart enough compiler.



