I sometimes write code that has to be portable across SSE4 and NEON, and I'm not sure this is going to work fast enough for that. Each ISA has important unique features.
SSE has shuffles, pack/unpack, movemask, 64-bit doubles, testzero, float rounds, blends, integer averages, float square roots and dot product.
NEON has interleaved RAM loads/stores, vector operators that take a scalar as the other argument, byte swap, rotate, bit scan and population count, and versions of all instructions that process 8-byte vectors.
That's enough differences that I have to adjust both algorithms and data structures to be portable between them. I'm not convinced it's possible to do automatically.
> I'm not sure this is going to work fast enough for that.
From reading the SIMD Everywhere description, it seems to me that SIMDe is a way to allow code that targets only one platform to work on other platforms as well. As a nice byproduct, you get _some_ speed up if the architecture targeted by the code is similar to the architecture that will run the code.
Portability is the main focus, not speed.
Obviously, once you have a good emulation of an architecture the first question is going to be: can I make it faster?
I wouldn't say that portability is the main focus. The first step is to get portable implementations up and running, but a huge number of functions have optimized implementations for NEON, AltiVec/VSX, and WASM SIMD 128, and we're working on adding more. We go to a lot of trouble to get good performance on multiple architectures, basically writing each implementation several times and using ifdefs to switch depending on what the fastest version available to a given architecture will be.
Even just for the portable implementations, we use a lot of hints to help the compiler auto-vectorize. Almost every portable implementation has a loop which uses a pragma to try to get the compiler to do the right thing (OpenMP SIMD, clang loop-specific pragmas, GCC ivdep, etc.). On top of that we take advantage of lots of compiler-specific features to speed things up where possible, including GCC-style vector extensions, __builtin_shuffle/__builtin_shufflevector, and __builtin_convertvector.
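A minimal example of that portable-implementation style (a made-up sketch, not SIMDe's actual source; real SIMDe loops use a wider set of pragmas and hints, and `add_epi32_portable` is a hypothetical name):

```c
#include <stdint.h>

/* Portable fallback for an _mm_add_epi32-style operation: a plain loop,
 * plus a hint so the compiler's auto-vectorizer does the right thing.
 * The guard keeps the pragma from warning when OpenMP isn't enabled. */
static void add_epi32_portable(int32_t r[4], const int32_t a[4],
                               const int32_t b[4])
{
#if defined(_OPENMP)
    #pragma omp simd
#endif
    for (int i = 0; i < 4; i++)
        r[i] = a[i] + b[i];
}
```

With `-fopenmp-simd` (GCC/Clang) the pragma nudges the compiler to emit a single vector add for the whole loop; without it, the code still compiles and runs correctly as scalar.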
SIMDe is never going to be as fast as someone who knows what they're doing writing an optimized implementation for a given target. However, it should be as fast as (or faster than) someone who is just trying to do a direct port where they try to match the existing code as closely as possible.
I don't have experience with SIMD on platforms other than x86/amd64, but a few data-shuffling-type functions [0] have SIMD versions that are not much faster than the scalar C implementation, and the overhead of translation might make them slower.
This doesn't quite apply to SIMDe; the problem that post is talking about is really at a higher level… whether it is faster to do a bunch of shuffles or use some scalar code. Once SIMDe is called you've already made your decision, and at that level the hardware-based shuffles are much faster than scalar code. For example, see the decompression speed benchmarks for LZSSE-SIMDe (<https://github.com/nemequ/LZSSE-SIMDe>) (they're in the README).
It sounds like what that post really needs is a fast 16-bit gather operation. AVX2 has some 32-bit gather functions which you may be usable (2 gathers + a blend could emulate 16-bit gathers). For NEON, you could probably use one of the `vtbl` functions; they're all 8-bit, but that just means you have separate index entries for high and low bytes… it's a bit more code, but there shouldn't be any runtime overhead.
That is a common problem. SIMD can be slower than non-SIMD. Consider this problem:
The goal was to emulate a 4-way PowerPC TLB on x86-64. Four uint32_t values had to be compared to find a match. The data structure was roughly "uint32_t array[512][4][4]", laid out so that the 4 uint32_t values would be adjacent.
It simply didn't perform well. Getting the equality test results out of SIMD was lengthy, awkward, and slow.
That task was so perfect for SIMD, and yet SIMD failed at it. The data was the exactly correct size of an SSE XMM register. It was aligned. The task was a simple parallel operation.
Based on your description, here’s what you should do to vectorize that code.
1. If you don’t have AVX, a good way to broadcast integer from scalar register to vector is _mm_cvtsi32_si128 followed by _mm_shuffle_epi32( v, 0 )
2. To compare them for equality, _mm_cmpeq_epi32
3. Getting index of the first match is 2 instructions, MOVMSKPS and BSF.
Getting the compiler to emit them is a bit awkward, though. You first need _mm_castsi128_ps to be able to call _mm_movemask_ps. Test the integer for 0 afterwards; if zero, none of the 4 lanes were equal.
The portable way to emit BSF was only introduced in C++20 (std::countr_zero). In the current version of the language you have to use the preprocessor to detect the compiler: _BitScanForward for MSVC, __builtin_ctz for GCC/Clang.
If you want the count of matches instead, replace BSF with POPCNT. Again, in the current version of the language it’s compiler-specific: __popcnt for MSVC, __builtin_popcount for GCC/Clang.
P.S. If you only need a single boolean saying if none of the 4 lanes matched / any of the lanes matched, use _mm_test_all_zeros / _mm_test_mix_ones_zeros instead of _mm_movemask_ps. Or if you want to test more than 1 cache entry, leave the comparison result in a vector register, compare more entries, combine results with bitwise instructions.
Update: If you don’t need the index or count of matches but want to individually test all 4 matches with scalar code, on old CPUs _mm_movemask_epi8 is slightly faster because of cross-domain latency; test the result for bits 1, 0x10, 0x100, 0x1000.
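Put together, steps 1-3 above look roughly like this (a hedged SSE2 sketch; `tlb_lookup4` and its exact signature are my own invention, not GP's actual code):

```c
#if defined(_MSC_VER)
#include <intrin.h>     /* _BitScanForward */
#endif
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Find which of 4 adjacent 32-bit tags (one TLB set) equals `tag`.
 * Returns the lane index 0-3, or -1 if no lane matches. */
static int tlb_lookup4(const uint32_t *tags, uint32_t tag)
{
    /* 1. Broadcast the scalar to all 4 lanes (no AVX required). */
    __m128i needle = _mm_shuffle_epi32(_mm_cvtsi32_si128((int)tag), 0);
    /* One vector load replaces 4 scalar loads
     * (use _mm_load_si128 if the data is 16-byte aligned). */
    __m128i entry = _mm_loadu_si128((const __m128i *)tags);
    /* 2. Per-lane 32-bit equality: matching lanes become all-ones. */
    __m128i eq = _mm_cmpeq_epi32(entry, needle);
    /* 3. MOVMSKPS: one sign bit per lane into a scalar register. */
    int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
    if (mask == 0)
        return -1;  /* none of the 4 lanes were equal */
    /* BSF, spelled per-compiler until C++20's std::countr_zero. */
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, (unsigned long)mask);
    return (int)idx;
#else
    return __builtin_ctz((unsigned)mask);
#endif
}
```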
I wouldn't characterize this as "perfect" for SIMD!
Perfect for SIMD usually means a significant amount of calculation that can be done vector-wise (you could include contiguous data movement in that definition).
Here, you are doing exactly one (cheap) calculation (the compare) plus one vectorized load, and you presumably want to feed the results to a branch.
You are only saving a few instructions versus scalar, and you pay a vector-to-GP penalty.
The penalty is quite small, 1-3 cycles each direction. RAM latency is 1-2 orders of magnitude more than that, even L1D level of cache is many cycles away. Replacing multiple scalar RAM loads with 1 vector load is usually a good idea performance wise. This is true even if you’ll then use extract instructions to access the lanes. Extract latency is 2-3 cycles, much faster than RAM.
I think what might have happened is that GP tried to use SSE for dealing with individual lanes. A better approach for that use case is moving the comparison results to a scalar register with a single movmskps, pmovmskb, or ptest instruction, just once for the complete vector.
Yes, the penalty is small, but the total amount of vectorized work is also very small!
L1D is not many cycles away: it is 4 or 5 for scalar loads, 6 or 7 for xmm or ymm loads. If the load misses, it doesn't much matter if it's a scalar or vector load: the time to fetch the cache line is the same.
So a scalar load of 5 cycles looks much better, latency-wise, than a vector load of 6 cycles, plus an extract of 1-3 cycles.
Of course, you need only 1 vector load vs 4 GP loads, but the latencies are overlapped.
Furthermore, the extracts can happen on a single port: so even though you have 512 bits/cycle of "contiguous" vector load bandwidth, you then suck those loads through a 32 bits/cycle extract straw [1]? 32-bit GP loads have 64 bits/cycle bandwidth and the value goes directly to the GP register, or is even micro-fused with the ALU op.
So no, it is not an obvious win to load 4x32-bit values with a vector load and then bring them over to GP registers. Even if it might sometimes be slightly better, this is hardly "perfect" for vectorization; rather, I'd say it is a quite poor candidate for vectorization.
Also, if the goal is to set a flag and jump on it, you'll still end up needing a scalar comparison anyway, so actually for the computation part there is no savings.
Don't forget the thing you are comparing to: presumably it starts in a GP register, so you need some kind of GP->SIMD move and then a broadcast to prepare the comparison.
> I think what might have happened is that GP tried to use SSE for dealing with individual lanes. A better approach for that use case is moving the comparison results to a scalar register with a single movmskps, pmovmskb, or ptest instruction, just once for the complete vector.
Right, well who knows what they tried to do or how the surrounding code works. I agree the approach you suggest sounds like it should be a slight win sometimes, but the key word is "slight". If the surrounding code is general-purpose code and the inputs and outputs come from and go to GP registers, this is just "too small" to vectorize well. It's a common misconception that, say, comparing the values is the bulk of the work, so of course vectorization will be a 4x win; but actually all the surrounding stuff takes most of the work, much more than a comparison which can execute 4 per cycle on the scalar side.
---
[1] You can try other tricks like extracting 64-bits and then messing around in the GP reg to split the halves, but it's basically a wash.
Hi there! It looks like you've been shadowbanned, I vouched a few of your comments because they didn't seem terrible, but I didn't scroll very far. It might be worth sending a mail to dang to ask him to take a look at your account or something!
It sounds like you actually know what you're doing, so in this case you're probably right, at least if all you do is compile your x86 code with SIMDe.
That said, SIMDe also provides support for other architectures, notably WASM SIMD 128 and AltiVec/VSX, as well as portable implementations which work everywhere, including on CPUs I'd never heard of until people told me SIMDe was working well on them (I'm thinking of Kalray, which supports vectors but doesn't have an API and instead relies on compiler auto-vectorization support).
One use case for SIMDe which may be interesting for you is that you can freely mix calls to different APIs. Say, for example, that you already have a bunch of x86 code written and want a NEON port. You can add SIMDe and you get a NEON port basically for free, then you can start adding some ifdefs to add optimizations for NEON without having to rewrite the whole thing. SIMDe doesn't in any way prevent you from optimizing your NEON (or whatever) port.
The way I tend to look at it is that SIMDe never makes your code slower, only more portable.
I might give it a try next time I need to do something for AMD64 + NEON. Not going to rewrite code I have already written, debugged, and shipped.
Also, about this
> we have an extensive test suite to verify our implementations
Don’t forget about the MXCSR register in that suite, esp. the rounding bits. I avoid changing it as much as possible ‘coz the state is preserved across context switches and causes funny things in OpenMP and other thread pools, but not all people are aware of that. Also, there’s a non-trivial amount of code written for SSE < 4.1 (4.1 introduced a proper rounding instruction, roundps) where you’re sometimes forced to mess with the MXCSR rounding bits because the alternatives are much slower.
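For anyone unfamiliar with the pre-SSE4.1 pattern being described, it looks roughly like this (a sketch; `cvt_truncate_via_mxcsr` is a made-up name, and the restore step is exactly what's easy to forget):

```c
#include <xmmintrin.h>  /* _MM_GET/SET_ROUNDING_MODE wrap the MXCSR bits */

/* Truncating float->int conversion without SSE4.1's ROUNDPS/ROUNDSS.
 * CVTSS2SI honors the current MXCSR rounding mode, so we flip it to
 * round-toward-zero... and must restore it afterwards, because MXCSR is
 * per-thread state that leaks into any later code on this thread
 * (OpenMP workers and other thread pools included). */
static int cvt_truncate_via_mxcsr(float x)
{
    unsigned saved = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    int r = _mm_cvtss_si32(_mm_set_ss(x));
    _MM_SET_ROUNDING_MODE(saved);  /* don't leak the mode change */
    return r;
}
```

(For this particular case SSE2's `_mm_cvttss_si32` always truncates; the MXCSR dance matters for rounding modes that have no dedicated instruction pre-4.1.)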
> Not going to rewrite code I have already written, debugged and shipped.
I wouldn't, either. At least unless you want a WASM/AltiVec/etc. version. But if you already have good implementations SIMDe probably won't help.
OTOH, if all you have is a x86 implementation and a portable fallback, the SIMDe version of the x86 implementation will probably be faster than your portable version. That's what happened with MMseqs2 (<https://github.com/soedinglab/MMseqs2>).
> Don’t forget about MXCSR register in that suite
Yeah, rounding is definitely a PITA. It's actually something I completely screwed up on in the beginning of the project and had to go back and correct :(. We do have some tests now which fiddle with the rounding mode to verify correctness, but could definitely use more, and obviously we can't always set a dedicated register to control behavior, so on some platforms `_mm_getcsr`/`_mm_setcsr`/`_MM_GET_ROUNDING_MODE`/`_MM_SET_ROUNDING_MODE` becomes `fegetround`/`fesetround`, which probably won't be a problem but still makes me uncomfortable.
The other area where we could really use more tests is replicating behavior for NaNs. By default we try to replicate the behavior of the function we're trying to emulate, but we currently only test NaN handling on a few functions :(. If you use -ffast-math or -ffinite-math-only we disable that code (compilers define __FINITE_MATH_ONLY__), though, and just use the fastest implementation we can.
Shuffle, pack/unpack, movemask, blends (SSE/AVX) and interleaved load/stores, byte swap (NEON) are "just" data-movement instructions.
All of them can be implemented (with obvious slowdowns) as a conditional write to memory followed by a conditional read from memory. Yeah, it's inefficient to do it like this, but this "write then read" pattern gives us an idea of what's really going on between the registers in a pack/pshufb/whatever instruction.
On AMD GPUs, there's a fully arbitrary crossbar between SIMD-lanes allowing for arbitrary movement. The two instructions are just "permute" and "b-permute" (backwards permute), roughly correlating to gather and scatter respectively.
On NVidia GPUs, perm and bperm are both implemented in PTX, but instead read/write to L1 or __shared__ memory. NVidia GPUs likely have a crossbar to L1 memory to make this instruction very fast.
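In scalar terms, the two primitives are just the "write then read" pattern with the indirection on opposite sides (my own sketch; the `& 15` wrap is one possible out-of-range policy, and real ISAs differ on that detail):

```c
#include <stdint.h>

/* permute = gather: each OUTPUT lane chooses which input lane it reads.
 * This is what PSHUFB / GPU "permute" does. */
static void permute16(uint8_t dst[16], const uint8_t src[16],
                      const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[idx[i] & 15];
}

/* bpermute = scatter: each INPUT lane chooses which output lane it
 * writes. This is the "backwards pshufb" that CPUs are missing. */
static void bpermute16(uint8_t dst[16], const uint8_t src[16],
                       const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[idx[i] & 15] = src[i];
}
```

Note bpermute is only well-defined when the indices are a permutation; with duplicate indices some output lanes are written twice and others not at all, which is part of why hardware prefers the gather form.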
---------
The solution is to implement perm and bperm on AVX. It's already half-implemented: pshufb is equivalent to GPU-permute. CPUs are just missing the backwards permute.
I'm pretty confident that pack/unpack, blends, interleaved load/stores, and more could all be implemented as pshufb and a hypothetical "backwards pshufb". Version 1.0 could be an NVidia-like "write to L1 cache" sort of implementation too, if full crossbars are too expensive at the hardware layer.
-----------
So the question is: how should we write code today? CPUs of today do not implement this feature, but CPUs of the future might. I think specifying the memory-moves explicitly, and then working on a "pshufb compiler/optimizer" of sorts is what we need.
> the question is: how should we write code today?
The only thing that matters today is how fast the code works on today’s hardware. To illustrate, see how I emulated NEON’s vst3q_s16 with SSE: https://github.com/Const-me/DtsDecoder/blob/master/Utils/sto... However, not all of them can be emulated in a fast enough manner.
> specifying the memory-moves explicitly, and then working on a "pshufb compiler/optimizer" of sorts is what we need.
I’ll consider that approach when I see that compiler/optimizer working and producing output comparable to manually-written code. Until then, I think that’s a “sufficiently smart compiler” class of problem. These are rarely ever solved, IMO. For example, we don’t have generally useful auto-vectorizers in C compilers despite two decades of R&D; even the best of them (clang, intel) are still very limited.
Well, such a feature doesn't exist in ispc. I guess I'm proposing a hypothetical feature that doesn't exist yet in any compiler. But it'd be nice if it existed...
For it to exist, we need a combination of new assembly instructions as well as a smart enough compiler.
Are compilers smart enough to interleave load/stores? I've noticed when using NEON intrinsics that GCC occasionally did clever things (and of course sometimes I got huge speed improvements by making small changes, like adding a PLD a few instructions before a hunk of memory is used)
If you're doing new development and not opposed to using C++, I recommend xsimd, which provides a higher-level interface to architecture-specific SIMD instructions: https://github.com/xtensor-stack/xsimd
Not sure where you got that from... xsimd will detect your instruction set automatically. Do you mean that if you're distributing a single binary then you'll need to compile for the lowest common denominator?
If so, that's not necessarily true either. A few patterns exist here. One is what the intel compilers do where you conditionally call variants of a function based on the instruction set. Another is to compile SIMD-accelerated functionality into shared libs that are dynamically loaded at launch based on the instruction set.
> Not sure where you got that from... xsimd will detect your instruction set automatically. Do you mean that if you're distributing a single binary then you'll need to compile for the lowest common denominator?
No, what I mean is that since xsimd is an abstraction layer you can't really use the "full" ISA extension; you're limited to composing operations based on a simpler subset that is supported across multiple architectures.
For example, consider `_mm_maddubs_epi16`, which is a favorite example of mine because it's so specific… I honestly have no idea when this is useful, but I'm sure Intel had a particular use case in mind when they added it. It multiplies each unsigned 8-bit integer in the first argument by the corresponding signed 8-bit integer in the second, producing a signed 16-bit intermediate for each lane. Then it adds each horizontal pair of intermediates with signed saturation and returns the result.
Now I'm not that familiar with xsimd's API, but I can't imagine they have a single function that does all that. It's much more likely that you have to call a few functions in xsimd; maybe one for each input to widen to 16 bits, then a multiplication. For pairwise addition there might be a function; if not, you'll need some shuffles to extract the even and odd values. Then perform saturated addition on those, which [isn't supported by xsimd](https://github.com/xtensor-stack/xsimd/issues/314), so you'll need a couple of comparisons and blends to implement that.
That's basically what we have to do in SIMDe in the fallback code; I don't have a problem with that at all. However, even if you're targeting SSSE3, it's pretty unlikely xsimd will be able to fuse all that into a single `_mm_maddubs_epi16`.
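For reference, the whole decomposition computes the same thing as this scalar sketch (my own illustration of what a portable `_mm_maddubs_epi16` fallback has to produce, not SIMDe's or xsimd's actual code):

```c
#include <stdint.h>

/* Scalar equivalent of PMADDUBSW (_mm_maddubs_epi16): multiply
 * unsigned-by-signed bytes, then add adjacent pairs with signed
 * saturation, yielding 8 x int16_t from 16+16 input bytes. */
static void maddubs_epi16_portable(int16_t dst[8],
                                   const uint8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        int32_t sum = (int32_t)a[2 * i]     * b[2 * i]
                    + (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Saturate to the int16_t range, as the hardware does. */
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        dst[i] = (int16_t)sum;
    }
}
```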
OTOH, in SIMDe we can also add optimized implementations of various functions, and `_mm_maddubs_epi16` is no exception. There is already an AArch64 implementation which should be pretty fast, and an ARMv7 NEON implementation which isn't too bad.
With SIMDe what you get isn't the lowest common denominator of functionality, it's the union of everything that's available. SIMDe's `_mm_maddubs_epi16` may not be any faster than xsimd if you're not targeting SSSE3, but if you are targeting SSSE3 or greater SIMDe is going to be a lot faster.
SIMDe's approach isn't without drawbacks, of course. For one, it can be hard to know whether a particular function will be fast or slow on a given architecture, whereas lowest-common-denominator libraries will pretty much be fast everywhere but functionality will be a bit more basic. It's also a lot more work… there are around 6500 SIMD functions in x86 alone, and IIRC NEON is at around 2500.
SIMDe is sort of the reverse: running existing code using SIMD intrinsics of platform A (e.g., Intel SSE) on another platform B (e.g., ARM). It's a great boon for portability of existing SIMD code.
+1 to xsimd. It's part of the amazing xtensor ecosystem and makes writing SIMD-accelerated C++ dead simple (though if you're doing lin alg stuff just use xtensor).
I'd like to commend the authors for embarking on this. Complex ISAs are an unfortunate reality for performance as advances in cycles-per-second on a single core are negligible. The divergence of these increasingly complex ISAs among platforms weighs heavily on competitive application developers.
As someone who is interested in writing cross-platform SIMD code, the most valuable asset to me is a library or compiler that can generate the instructions dynamically from otherwise normal'ish looking C/C++. This is the most powerful mode of development in my experience. Clang already does this remarkably well: I can write with standard C++ syntax (albeit awkwardly), maybe using a few custom types with `__attribute__((vector_size(x)))`, and not have to involve explicit intrinsics except perhaps for a very small number of leaf operations that cannot be expressed.
At this time Clang has the upper hand on GCC: the latter cannot generate code which scales between platforms utilizing different vector sizes. For example, if you try to perform an operation on a 256-bit vector using a 128-bit target, Clang will seamlessly generate two 128-bit operations; GCC will fall back entirely to scalar. My assumption is that developments in Clang for ARM's SVE have carried over to generating scalable code for other platforms, but nevertheless it is remarkable.
I don't believe that writing functions comprised of hand-crafted lists of intrinsics is the best way forward. Undoubtedly it's worked for projects, even quite well to ship stable software -- but it scales and adapts poorly in a fast-developing and diverse market of hardware. For example, years ago I wrote a simple `tolower(string)` implementation using an assemblage of 128-bit standard-Intel SSE2 statements and today the instructions it produces are exactly the same as the day that I wrote it. All I can hope for is that 256-bit capable architectures can execute two of my operations at once. That's not ideal.
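To make the contrast concrete, that style looks roughly like this (a toy sketch of an ASCII `tolower` kernel using GCC/Clang vector extensions; `tolower16` and `splat_u8` are my own names, not the original code):

```c
#include <stdint.h>

/* 16-lane byte vectors; the compiler maps these to an XMM register on
 * x86, a NEON Q register on ARM, or scalar code as a last resort. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));
typedef int8_t  v16i8 __attribute__((vector_size(16)));

/* Broadcast a scalar to all lanes (mixed vector/scalar + is portable
 * across GCC and Clang). */
static v16u8 splat_u8(uint8_t x) { v16u8 v = {0}; return v + x; }

/* Branch-free ASCII tolower over 16 bytes at once, no intrinsics. */
static v16u8 tolower16(v16u8 c)
{
    /* Lanes holding 'A'..'Z' compare to all-ones (-1), others to 0. */
    v16i8 is_upper = (v16i8)((c >= splat_u8('A')) & (c <= splat_u8('Z')));
    /* In ASCII, bit 5 (0x20) distinguishes lower from upper case. */
    return c | ((v16u8)is_upper & splat_u8(0x20));
}
```

The same source compiles to SSE2 on x86 and NEON on AArch64, and when wider hardware (or a wider vector typedef) comes along, only the types change, not the logic.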
> I'd like to commend the authors for embarking on this.
As the crazy bastard who started SIMDe, thanks!
I'm very concerned about portability, so unfortunately just using clang isn't really an option. Most of my code has to work not only on GCC and clang, but also on MSVC (/me cries) and ICC, and I generally try to make it work on other compilers (PGI, IAR, etc.).
In my experience you're right, clang does do much better with vectors that don't match the hardware by default, but in SIMDe we actually have explicit fallbacks which call shorter functions twice and the result is pretty good on both compilers. For example, here is what `_mm256_add_ps` looks like on GCC and clang when targeting SSE2: <https://godbolt.org/z/n68Ecn>.
Length-agnostic instruction sets like SVE are definitely very interesting, but honestly I'd rather see them de-emphasize non-portable APIs like SVE and instead work on improving the compiler's ability to recognize the relevant code patterns to work with things like OpenMP SIMD (which, to be clear, does not require the OpenMP runtime). I'd also be happy to see more builtins which work cross-platform… for example, I'd love to see builtins for saturated operations which could easily be auto-vectorized by the compiler when used in an OpenMP SIMD loop.
Currently for new SIMD code, I start with an OpenMP SIMD implementation, then profile. If I see any spots which perform particularly poorly and/or are particularly hot, I'll work on some optimized implementations for those spots using intrinsics. In the future I hope to need to hand-optimize less code, but for now I think this offers a pretty good trade-off between portability, performance, and development time.
Very cool. This is effectively a replacement for the now unfortunately abandoned Yeppp library which I used with C# - though modern .NET has SIMD now too. https://news.ycombinator.com/item?id=10232395
Some matrix math libraries like Eigen[1] support vectorization via SSE, AVX, NEON, etc. and also use cache friendly algorithms for larger matrices. Highly recommended when you don't need to go quite as low level as individual instructions.
If you need to work on large matrices, Eigen is highly recommended. If you need to do a stream of custom processing in small steps, Eigen is not a good fit. Eigen is a super complicated, template-heavy library that can compile very heavyweight tasks into very efficient code (eventually).
> Eigen is a super complicated, template-heavy library
I agree, but that feature allows you to apply optimizations by specializing these templates.
Works for both micro-optimizations (in their pbroadcast4<__m256d> they do 4 loads; on many CPUs AVX2 can do better with a single load + shuffles) and for replacing large parts of Eigen (I was able to improve the performance of a conjugate gradient solver by moving the sparse matrix into a SIMD-optimized structure).
That library is not even remotely similar to SIMDe. The goal of SIMDeez is to provide an abstraction over different SIMD instruction sets (different versions of the x86 SIMD instructions: SSE2, SSE4.1, AVX2). The goal of SIMDe is to let you run code using platform-specific intrinsics on machines that don't have these instructions, e.g. run code with SSE/AVX intrinsics on an ARM-based CPU. They're very different things.
Careful when invoking (some) AVX-512 instructions, because having a process using them on just one hardware thread can cripple your entire system and hurt overall performance in workloads where the kernel or another process on the system is doing a lot of the work.
While that's true, I don't see how this is relevant to SIMDe? SIMDe lets you compile code using (e.g.) SSE/AVX intrinsics for ARM targets, using the target platform's SIMD intrinsics when possible. It doesn't even officially include support for AVX512 yet.
It does include support for AVX-512, it's just still a work in progress. AVX-512 is enormous (IIRC ~ 4k functions), so it will be a while before it is fully supported.
If you're targeting AVX-512 but don't have hardware that supports AVX-512 (like all AMD CPUs), SIMDe can be quite nice. The result is much faster than Intel's SDE and it's just native code that you can use your normal debugger on.
While that may be technically true, since Intel is the only CPU vendor implementing AVX-512, the reason that severe downclocking is warranted is rooted in physics, and may not be resolvable given this ISA design. General vector ISAs like RISC-V "V" may have less trouble, since it is clearer how to design a vector unit not to interfere with scalar access to memory/caches.
also, in the article they specifically mention adding shims for several AVX-512 ISAs:
> We also have rapidly progressing implementations of many other extensions including NEON, AVX2, SVML, and several AVX-512 extensions (AVX-512F, AVX-512BW, AVX-512VL, etc.).
if it's not obvious to you, this means that machines with AVX-512 hardware will be invoking those instructions directly.
Not sure how many cloud vendors are running CPUs with AVX-512 support, but I imagine that if they do, you either have to pay for exclusive use of a single CPU (all cores), or that they indeed disable AVX-512 instructions by patching the cpuid flags?
Note that the downclocking is only really a problem with AVX-512 and much less so with AVX2 and (AFAIK) not at all with AVX.
The cloud vendors (at least the ones I checked) have SKUs with fairly limited downclocking, but more importantly the license-based downclocking is per-core. So as long as you get at least a core to yourself, the noisy-neighbor problem is reduced in that you affect your own CPU speed only.
For more fine-grained sharing, sure, if the guy before you was running AVX-512 you'll get a lower frequency for the initial part of your timeslice, but it should pop back to full speed in under 1 ms.