If you're doing new development and not opposed to using C++, I recommend xsimd, which provides a higher-level interface to architecture-specific SIMD instructions: https://github.com/xtensor-stack/xsimd
Not sure where you got that from... xsimd will detect your instruction set automatically. Do you mean that if you're distributing a single binary then you'll need to compile for the lowest common denominator?
If so, that's not necessarily true either. A few patterns exist here. One is what the Intel compilers do, where you conditionally call variants of a function based on the instruction set detected at runtime. Another is to compile SIMD-accelerated functionality into shared libs that are dynamically loaded at launch based on the instruction set.
> Not sure where you got that from... xsimd will detect your instruction set automatically. Do you mean that if you're distributing a single binary then you'll need to compile for the lowest common denominator?
No, what I mean is that since xsimd is an abstraction layer you can't really use the "full" ISA extension; you're limited to composing operations based on a simpler subset that is supported across multiple architectures.
For example, consider `_mm_maddubs_epi16`, which is a favorite example of mine because it's so specific… I honestly have no idea when this is useful, but I'm sure Intel had a particular use case in mind when they added it. It multiplies each unsigned 8-bit integer by the corresponding signed 8-bit integer, producing a signed 16-bit intermediate result for each lane. Then it adds each horizontal pair of those intermediates with signed saturation and returns the result.
Now I'm not that familiar with xsimd's API, but I can't imagine they have a single function that does all that. It's much more likely that you have to call a few functions in xsimd: maybe one for each input to widen to 16 bits, then at least one multiplication. For pairwise addition there might be a function; if not, you'll need some shuffles to extract the even and odd values. Then perform saturated addition on those, which [isn't supported by xsimd](https://github.com/xtensor-stack/xsimd/issues/314), so you'll need a couple of comparisons and blends to implement that.
That's basically what we have to do in SIMDe in the fallback code; I don't have a problem with that at all. However, even if you're targeting SSSE3, it's pretty unlikely xsimd will be able to fuse that into a single `_mm_maddubs_epi16`.
OTOH, in SIMDe we can also add optimized implementations of various functions, and `_mm_maddubs_epi16` is no exception. There is already an AArch64 implementation which should be pretty fast, and an ARMv7 NEON implementation which isn't too bad.
With SIMDe what you get isn't the lowest common denominator of functionality, it's the union of everything that's available. SIMDe's `_mm_maddubs_epi16` may not be any faster than xsimd if you're not targeting SSSE3, but if you are targeting SSSE3 or greater SIMDe is going to be a lot faster.
SIMDe's approach isn't without drawbacks, of course. For one, it can be hard to know whether a particular function will be fast or slow on a given architecture, whereas lowest-common-denominator libraries will pretty much be fast everywhere but functionality will be a bit more basic. It's also a lot more work… there are around 6500 SIMD functions in x86 alone, and IIRC NEON is at around 2500.
SIMDe is sort of the reverse: running existing code using SIMD intrinsics of platform A (e.g., Intel SSE) on another platform B (e.g., ARM). It's a great boon for portability of existing SIMD code.
+1 to xsimd. It's part of the amazing xtensor ecosystem and makes writing SIMD-accelerated C++ dead simple (though if you're doing lin alg stuff just use xtensor).