If you're doing new development and not opposed to using C++, I recommend xsimd, which provides a higher-level interface to architecture-specific SIMD instructions: https://github.com/xtensor-stack/xsimd
Not sure where you got that from... xsimd will detect your instruction set automatically. Do you mean that if you're distributing a single binary then you'll need to compile for the lowest common denominator?
If so, that's not necessarily true either. A few patterns exist here. One is what the Intel compilers do, where you conditionally call variants of a function based on the instruction set detected at runtime. Another is to compile SIMD-accelerated functionality into shared libs that are dynamically loaded at launch based on the instruction set.
> Not sure where you got that from... xsimd will detect your instruction set automatically. Do you mean that if you're distributing a single binary then you'll need to compile for the lowest common denominator?
No, what I mean is that since xsimd is an abstraction layer you can't really use the "full" ISA extension; you're limited to composing operations based on a simpler subset that is supported across multiple architectures.
For example, consider `_mm_maddubs_epi16`, which is a favorite example of mine because it's so specific… I honestly have no idea when this is useful, but I'm sure Intel had a particular use case in mind when they added it. It multiplies each unsigned 8-bit integer by the corresponding signed 8-bit integer, producing a signed 16-bit intermediate result for each lane. Then it adds each horizontal pair of those intermediates with signed saturation and returns the result.
Now I'm not that familiar with xsimd's API, but I can't imagine they have a single function that does all that. It's much more likely that you have to call a few functions in xsimd: maybe one for each input to widen to 16 bits, then at least one multiplication. For pairwise addition there might be a function; if not, you'll need some shuffles to extract the even and odd values. Then perform saturated addition on those, which [isn't supported by xsimd](https://github.com/xtensor-stack/xsimd/issues/314), so you'll need a couple of comparisons and blends to implement that.
That's basically what we have to do in SIMDe in the fallback code; I don't have a problem with that at all. However, even if you're targeting SSSE3, it's pretty unlikely xsimd will be able to fuse that into a single `_mm_maddubs_epi16`.
OTOH, in SIMDe we can also add optimized implementations of various functions, and `_mm_maddubs_epi16` is no exception. There is already an AArch64 implementation which should be pretty fast, and an ARMv7 NEON implementation which isn't too bad.
With SIMDe what you get isn't the lowest common denominator of functionality, it's the union of everything that's available. SIMDe's `_mm_maddubs_epi16` may not be any faster than xsimd if you're not targeting SSSE3, but if you are targeting SSSE3 or greater SIMDe is going to be a lot faster.
SIMDe's approach isn't without drawbacks, of course. For one, it can be hard to know whether a particular function will be fast or slow on a given architecture, whereas lowest-common-denominator libraries will pretty much be fast everywhere but functionality will be a bit more basic. It's also a lot more work… there are around 6500 SIMD functions in x86 alone, and IIRC NEON is at around 2500.
SIMDe is sort of the reverse: running existing code using SIMD intrinsics of platform A (e.g., Intel SSE) on another platform B (e.g., ARM). It's a great boon for portability of existing SIMD code.
+1 to xsimd. It's part of the amazing xtensor ecosystem and makes writing SIMD-accelerated C++ dead simple (though if you're doing lin alg stuff just use xtensor).