Rust Faster: SIMD edition
It’s been a while since I last played the benchmarksgame with Rust. But I recently found an interesting crate called `packed_simd` which has SIMD-ified versions of some benchmarks, so as Rust stable now has `stdsimd`, we should be able to speed up our benchmarks quite a bit.
If only it were that easy.
For a start, I took the n-body benchmark, because I knew the current entry wasn’t the fastest. I tried, and failed, to reproduce their algorithm using `stdsimd` types. On `opt_level=2` I got almost-correct to somewhat-off results, whereas on `opt_level=3` the optimizer decided that my calculations should be rearranged to yield `-Inf` on every step. I then decided to take a step back and rip out the `stdsimd` types and functions and replace them with simple scalar implementations.
So basically I took their algorithm, and instead of using `packed_simd` I just implemented the types `struct F64x4(f64, f64, f64, f64)` and `struct F64x2(f64, f64)` by hand. This time I got correct results. So I figured I should see how much slower the autovectorized code really is – and found a surprise: it’s a good deal faster than our current entry and should perform about the same as the GCC entry if the benefits translate linearly to the benchmarksgame machine.
Awesome!
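To illustrate the idea (a minimal sketch with assumed names, not the exact benchmark code), such a hand-rolled type only needs lane-wise operator impls for LLVM to have a chance at autovectorizing it:

```rust
use std::ops::{Add, Mul};

/// A hand-rolled four-lane f64 "vector". LLVM can often compile
/// lane-wise operations on a struct like this down to SIMD code.
#[derive(Clone, Copy)]
struct F64x4(f64, f64, f64, f64);

impl Add for F64x4 {
    type Output = Self;
    fn add(self, o: Self) -> Self {
        F64x4(self.0 + o.0, self.1 + o.1, self.2 + o.2, self.3 + o.3)
    }
}

impl Mul for F64x4 {
    type Output = Self;
    fn mul(self, o: Self) -> Self {
        F64x4(self.0 * o.0, self.1 * o.1, self.2 * o.2, self.3 * o.3)
    }
}
```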
I should note that this solution is only good for now, as it’s possible, nay, likely that future Rust/LLVM versions will fail to autovectorize this particular algorithm. I might still change the code to use `stdsimd` later in order to avoid regressions like the one we had when upgrading from Rust 1.7 to 1.8.
Update: Unfortunately, the autovectorization fails to work on the benchmarksgame server’s older CPU, as it lacks 256-bit SIMD registers. Pity. I’ll see if I find another solution.
Spectral Form
The `spectral_norm` benchmark similarly relies on autovectorization and could benefit from the same trick. Out with the `[f64; 2]`, in with the `F64x2` from the n-body benchmark, and we run a whopping 27% faster!
Previous: 0m1,076s Real, 0m4,216s User, 0m0,007s System
With F64x2: 0m0,783s Real, 0m3,121s User, 0m0,002s System
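To sketch the pattern (assumed names, not the actual benchmark code): the two-lane type lets the hot loops accumulate two results at a time and reduce only at the very end:

```rust
use std::ops::{Add, Mul};

#[derive(Clone, Copy)]
struct F64x2(f64, f64);

impl Add for F64x2 {
    type Output = Self;
    fn add(self, o: Self) -> Self { F64x2(self.0 + o.0, self.1 + o.1) }
}

impl Mul for F64x2 {
    type Output = Self;
    fn mul(self, o: Self) -> Self { F64x2(self.0 * o.0, self.1 * o.1) }
}

/// Dot product, two lanes at a time, with a single final reduction.
fn dot(xs: &[F64x2], ys: &[F64x2]) -> f64 {
    let mut acc = F64x2(0.0, 0.0);
    for (&x, &y) in xs.iter().zip(ys) {
        acc = acc + x * y;
    }
    acc.0 + acc.1
}
```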
Naturally, once we have a SIMD implementation of n-body, we can adapt this code, too.
Update: leonardo pointed out that an `#[inline(never)]` is no longer needed. Removing it roughly doubled the performance on my system.
Update 2: The `#[inline(never)]` actually was needed to get LLVM to vectorize the hot method on the Core 2; I have reinstated it and also added a manual SIMD version, which is now tied with the GCC one for the fastest benchmark.
Single Instruction, Multiple Pfannkuchen
For the `fannkuch_redux` benchmark, Cristi Cobzarenco’s rayon-powered version was the one to beat, and this time I couldn’t rely on autovectorization, as the current LLVM optimizer doesn’t vectorize shuffles. So I again took the `packed_simd` version of the algorithm and implemented my own `U8x16` type underneath (I uppercased the `U` to silence the lowercase-types lint) using `stdsimd`’s `__m128i` type.
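A sketch of what such a wrapper might look like (assumed method bodies, not the exact benchmark code; the real entry also needs the target-feature guards discussed in the update below):

```rust
use std::arch::x86_64::*;

/// A 16-lane byte vector wrapping the 128-bit integer register type.
#[derive(Clone, Copy)]
struct U8x16(__m128i);

impl U8x16 {
    /// Load 16 bytes from the start of `s` (must hold at least 16).
    fn from_slice_unaligned(s: &[u8]) -> Self {
        assert!(s.len() >= 16);
        U8x16(unsafe { _mm_loadu_si128(s.as_ptr() as *const __m128i) })
    }

    /// Store the 16 lanes back to the start of `s`.
    fn write_to_slice_unaligned(self, s: &mut [u8]) {
        assert!(s.len() >= 16);
        unsafe { _mm_storeu_si128(s.as_mut_ptr() as *mut __m128i, self.0) }
    }

    /// Reorder our lanes by the byte indices in `mask` (SSSE3 pshufb).
    fn permute_dyn(self, mask: U8x16) -> Self {
        U8x16(unsafe { _mm_shuffle_epi8(self.0, mask.0) })
    }
}
```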
The result, unsurprisingly, was roughly double the per-core performance of Cristi’s version, but still slower on four cores. rayon is very good. So I needed to rayon-ify my SIMD version following Cristi’s example. However, the two algorithms diverged on a few counts, so I couldn’t simply merge in everything:
- The simple algorithm from `packed_simd` updates the checksum on every flip, whereas the rayon’d version keeps a `flip_count` and updates checksum and maxflips afterwards (see the sketch after this list). I find the latter version improves separation of concerns.
- The simple algorithm first generates masks for all rotations and flips.
- The simple algorithm uses a `State` struct with multiple methods that don’t easily correspond to the rayon’d version’s state. On the other hand, the latter uses `[u32; 16]` arrays instead of the smaller `U8x16` types used by the SIMD version.
- Despite having a good deal of comments, the rayon’d version put most things directly into the `fannkuch` function, which made it much harder to understand.
- Owing to the chunk splitting of the rayon’d algorithm, it needs to do stuff differently to the naive version, which I found out when trying to substitute the predefined SIMD shuffle masks for rotate/flip implementations in the rayon’d version.
In the end I decided to use Cristi’s version as a starting point and try to replace scalar operations with SIMD code one by one, keeping the old code around to be able to check for correctness.
The easiest way to get a speedup was to just use SIMD for flipping, replacing this:

```rust
temp[1..i].reverse();
```

with this:

```rust
U8x16::from_slice_unaligned(&temp).permute_dyn(flip_masks[first_value])
    .write_to_slice_unaligned(&mut temp);
```
Previous: 0m9,046s Real, 0m33,847s User, 0m0,017s System
SIMD-flip: 0m6,990s Real, 0m27,356s User, 0m0,005s System
Roughly double the user time the single-core version takes, but half the wall-clock time on four cores. Can we go faster? You bet. The old version had an optimization to only call `reverse` with more than two values, which required me to store and load the `U8x16`. As flipping our pancake was now SIMD-fast, this no longer gave us any gains, and ripping it out allowed me to stay in SIMD land in the hottest loop, which now looked like the following:
```rust
flip = flip.permute_dyn(flip_masks[flip.extract0()]);
flip_count += 1;
```
One interesting wrinkle here is that the `_mm_extract_epi8` intrinsic isn’t inlined, so the resulting assembly has a `callq core::coresimd::x86::sse41::_mm_extract_epi8`. However, after changing the loop to only do the extraction once, this accounts for less than 0.1% of the runtime, so there’s no need to worry about it.
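For reference, `extract0` would just wrap that intrinsic; a sketch reusing the `U8x16` wrapper from above, written against today’s `std::arch`, where the lane index is a const generic parameter:

```rust
impl U8x16 {
    /// Extract lane 0, for use as a shuffle-mask index (SSE4.1).
    fn extract0(self) -> usize {
        (unsafe { _mm_extract_epi8::<0>(self.0) }) as usize
    }
}
```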
SIMD extract once: 0m5,576s Real, 0m22,021s User, 0m0,006s System
Next step: moving the permutation to SIMD code. This allows us to keep the current permutation in `U8x16`, so the only non-SIMD shuffling that remains is the calculation of the initial permutation. I may have some ideas to optimize this, but for now I’ll leave it as is:
SIMD loop: 0m5,129s Real, 0m20,240s User, 0m0,003s System
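As an aside, the shuffle masks driving that loop are precomputed; a sketch of how a rotate-left-by-one mask for the first `n` pancakes might be built (assumed layout, not the exact benchmark code):

```rust
/// Mask that rotates the first `n` bytes left by one and leaves the
/// rest in place; load it as a `U8x16` and feed it to `permute_dyn`.
fn rotate_mask(n: usize) -> [u8; 16] {
    let mut mask = [0u8; 16];
    for i in 0..16 {
        mask[i] = if i < n { ((i + 1) % n) as u8 } else { i as u8 };
    }
    mask
}
```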
It’s going to be interesting to see how this version stacks up on the benchmarksgame server.
Update: My first version used an SSE4.1 instruction which the benchmarksgame server didn’t like. Also, following Andrew Gallant’s advice, I added proper `#[cfg(..)]`s, and now the inlining issue went away, improving performance even more:
cfg’d: 0m4,640s Real, 0m18,324s User, 0m0,008s System
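A sketch of the kind of guard this means (assumed function name, reusing the `U8x16` sketch from above; the real entry pairs it with a scalar twin for other targets):

```rust
// Only compile the SIMD path where the needed features exist; a
// scalar fallback with the same signature covers everything else.
#[cfg(all(target_arch = "x86_64", target_feature = "ssse3"))]
fn flip_simd(temp: &mut [u8; 16], mask: U8x16) {
    U8x16::from_slice_unaligned(&temp[..])
        .permute_dyn(mask)
        .write_to_slice_unaligned(&mut temp[..]);
}
```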
Update 2: The maintainer removed my SIMD `fannkuch_redux` entry to avoid a SIMD arms race. I understand the reasoning, and have submitted a scalar version that uses the knowledge I gained from writing the SIMD one. We’ll see how fast it is.