Rust Faster: SIMD edition
It’s been a while since I last played the benchmarksgame with Rust. But I recently found an interesting crate called `packed_simd` which has SIMD-ified versions of some benchmarks, so as Rust stable now has `stdsimd`, we should be able to speed up our benchmarks quite a bit.
If only it were that easy.
For a start, I took the n-body benchmark, because I knew the current entry wasn’t the fastest. I tried, and failed, to reproduce their algorithm using `stdsimd` types. On `opt_level=2` I got almost-correct to somewhat-off results, whereas on `opt_level=3` the optimizer decided that my calculations should be rearranged to yield `-Inf` on every step. I then decided to take a step back and rip out the `stdsimd` types and functions and replace them with simple scalar implementations.
So basically I took their algorithm, and instead of using `packed_simd` I just implemented the types `struct F64x4(f64, f64, f64, f64)` and `struct F64x2(f64, f64)` by hand. This time I got correct results. So I figured I should see how much slower the autovectorized code really is – and found a surprise: it’s a good deal faster than our current entry and should perform about the same as the GCC entry if the benefits translate linearly to the benchmarksgame machine.
Awesome!
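To illustrate the idea (a minimal sketch with assumed names, not the exact benchmark code), such a hand-rolled type only needs lane-wise operator impls for LLVM to have a chance at autovectorizing it:

```rust
use std::ops::{Add, Mul};

/// A hand-rolled four-lane f64 "vector". LLVM can often compile
/// lane-wise operations on a struct like this down to SIMD code.
#[derive(Clone, Copy)]
struct F64x4(f64, f64, f64, f64);

impl Add for F64x4 {
    type Output = Self;
    fn add(self, o: Self) -> Self {
        F64x4(self.0 + o.0, self.1 + o.1, self.2 + o.2, self.3 + o.3)
    }
}

impl Mul for F64x4 {
    type Output = Self;
    fn mul(self, o: Self) -> Self {
        F64x4(self.0 * o.0, self.1 * o.1, self.2 * o.2, self.3 * o.3)
    }
}
```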
I should note that this solution is only good for now, as it’s possible, nay, likely that future Rust/LLVM versions will fail to autovectorize this particular algorithm. I might still change the code to use `stdsimd` later in order to avoid regressions like the one we had when upgrading from Rust 1.7 to 1.8.
Update: Unfortunately, the autovectorization fails to work on the benchmarksgame server’s older CPU, as it lacks 256-bit SIMD registers. Pity. I’ll see if I find another solution.
Spectral Form
The `spectral_norm` benchmark similarly relies on autovectorization and could benefit from the same trick. Out with the `[f64; 2]`, in with the `F64x2` from the n-body benchmark, and we run a whopping 27% faster!
Previous: 0m1,076s Real, 0m4,216s User, 0m0,007s System
With F64x2: 0m0,783s Real, 0m3,121s User, 0m0,002s System
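To sketch the pattern (assumed names, not the actual benchmark code): the two-lane type lets the hot loops accumulate two results at a time and reduce only at the very end:

```rust
use std::ops::{Add, Mul};

#[derive(Clone, Copy)]
struct F64x2(f64, f64);

impl Add for F64x2 {
    type Output = Self;
    fn add(self, o: Self) -> Self { F64x2(self.0 + o.0, self.1 + o.1) }
}

impl Mul for F64x2 {
    type Output = Self;
    fn mul(self, o: Self) -> Self { F64x2(self.0 * o.0, self.1 * o.1) }
}

/// Dot product, two lanes at a time, with a single final reduction.
fn dot(xs: &[F64x2], ys: &[F64x2]) -> f64 {
    let mut acc = F64x2(0.0, 0.0);
    for (&x, &y) in xs.iter().zip(ys) {
        acc = acc + x * y;
    }
    acc.0 + acc.1
}
```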
Naturally, once we have a SIMD implementation of n-body, we can adapt this code, too.
Update: leonardo pointed out that an `#[inline(never)]` is no longer needed. Removing it roughly doubled the performance on my system.
Update 2: The `#[inline(never)]` actually was needed to get LLVM to vectorize the hot method on the Core 2; I have reinstated it and also added a manual SIMD version, which is now tied with the GCC one for the fastest benchmark.
Single Instruction, Multiple Pfannkuchen
For the `fannkuch_redux` benchmark, Cristi Cobzarenco’s rayon-powered version was the one to beat, and this time I couldn’t rely on autovectorization, as the current LLVM optimizer doesn’t vectorize shuffles. So I again took the `packed_simd` version of the algorithm and implemented my own `U8x16` type underneath (I uppercased the `U` to silence the lowercase-types lint) using `stdsimd`’s `__m128i` type.
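A sketch of what such a wrapper might look like (assumed method bodies, not the exact benchmark code; the real entry also needs the target-feature guards discussed in the update below):

```rust
use std::arch::x86_64::*;

/// A 16-lane byte vector wrapping the 128-bit integer register type.
#[derive(Clone, Copy)]
struct U8x16(__m128i);

impl U8x16 {
    /// Load 16 bytes from the start of `s` (must hold at least 16).
    fn from_slice_unaligned(s: &[u8]) -> Self {
        assert!(s.len() >= 16);
        U8x16(unsafe { _mm_loadu_si128(s.as_ptr() as *const __m128i) })
    }

    /// Store the 16 lanes back to the start of `s`.
    fn write_to_slice_unaligned(self, s: &mut [u8]) {
        assert!(s.len() >= 16);
        unsafe { _mm_storeu_si128(s.as_mut_ptr() as *mut __m128i, self.0) }
    }

    /// Reorder our lanes by the byte indices in `mask` (SSSE3 pshufb).
    fn permute_dyn(self, mask: U8x16) -> Self {
        U8x16(unsafe { _mm_shuffle_epi8(self.0, mask.0) })
    }
}
```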
The result, unsurprisingly, was roughly double the per-core performance of Cristi’s version, but still slower on four cores. rayon is very good. So I needed to rayon-ify my SIMD version following Cristi’s example. However, the two algorithms diverged on a few counts, so I couldn’t simply merge in everything:
- The simple algorithm from `packed_simd` updates the checksum on every flip, whereas the rayon’d version keeps a `flip_count` and updates checksum and maxflips afterwards (see the sketch after this list). I find the latter version improves separation of concerns.
- The simple algorithm first generates masks for all rotations and flips.
- The simple algorithm uses a `State` struct with multiple methods that don’t easily correspond to the rayon’d version’s state. On the other hand, the latter uses `[u32; 16]` arrays instead of the smaller `U8x16` types used by the SIMD version.
- Despite having a good deal of comments, the rayon’d version put most things directly into the `fannkuch` function, which made it much harder to understand.
- Owing to the chunk splitting of the rayon’d algorithm, it needs to do stuff differently to the naive version, which I found out when trying to substitute the predefined SIMD shuffle masks for rotate/flip implementations in the rayon’d version.
In the end I decided to use Cristi’s version as a starting point and try to replace scalar operations with SIMD code one by one, keeping the old code around to be able to check for correctness.
The easiest way to get a speedup was to just use SIMD for flipping, replacing this:

```rust
temp[1..i].reverse();
```

with this:

```rust
U8x16::from_slice_unaligned(&temp).permute_dyn(flip_masks[first_value])
    .write_to_slice_unaligned(&mut temp);
```
Previous: 0m9,046s Real, 0m33,847s User, 0m0,017s System
SIMD-flip: 0m6,990s Real, 0m27,356s User, 0m0,005s System
Roughly double the user time the single-core version takes, but half the wall-clock time on four cores. Can we go faster? You bet. The old version had an optimization to only call `reverse` with more than two values, which required me to store and load the `U8x16`. As flipping our pancake was now SIMD-fast, this no longer gave us any gains, and ripping it out allowed me to stay in SIMD land in the hottest loop, which now looked like the following:
```rust
flip = flip.permute_dyn(flip_masks[flip.extract0()]);
flip_count += 1;
```
One interesting wrinkle here is that the `_mm_extract_epi8` intrinsic isn’t inlined, so the resulting assembly has a `callq core::coresimd::x86::sse41::_mm_extract_epi8`. However, after changing the loop to only do the extraction once, this accounts for less than 0.1% of the runtime, so there’s no need to worry about it.
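For reference, `extract0` would just wrap that intrinsic; a sketch reusing the `U8x16` wrapper from above, written against today’s `std::arch`, where the lane index is a const generic parameter:

```rust
impl U8x16 {
    /// Extract lane 0, for use as a shuffle-mask index (SSE4.1).
    fn extract0(self) -> usize {
        (unsafe { _mm_extract_epi8::<0>(self.0) }) as usize
    }
}
```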
SIMD extract once: 0m5,576s Real, 0m22,021s User, 0m0,006s System
Next step: moving the permutation to SIMD code. This allows us to keep the current permutation in `U8x16`, so the only non-SIMD shuffling that remains is the calculation of the initial permutation. I may have some ideas to optimize this, but for now I’ll leave it as is:
SIMD loop: 0m5,129s Real, 0m20,240s User, 0m0,003s System
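As an aside, the shuffle masks driving that loop are precomputed; a sketch of how a rotate-left-by-one mask for the first `n` pancakes might be built (assumed layout, not the exact benchmark code):

```rust
/// Mask that rotates the first `n` bytes left by one and leaves the
/// rest in place; load it as a `U8x16` and feed it to `permute_dyn`.
fn rotate_mask(n: usize) -> [u8; 16] {
    let mut mask = [0u8; 16];
    for i in 0..16 {
        mask[i] = if i < n { ((i + 1) % n) as u8 } else { i as u8 };
    }
    mask
}
```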
It’s going to be interesting to see how this version stacks up on the benchmarksgame server.
Update: My first version used an SSE4.1 instruction which the benchmarksgame server didn’t like. Also, following Andrew Gallant’s advice, I added proper `#[cfg(..)]`s, and now the inlining issue went away, improving performance even more:
cfg’d: 0m4,640s Real, 0m18,324s User, 0m0,008s System
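A sketch of the kind of guard this means (assumed function name, reusing the `U8x16` sketch from above; the real entry pairs it with a scalar twin for other targets):

```rust
// Only compile the SIMD path where the needed features exist; a
// scalar fallback with the same signature covers everything else.
#[cfg(all(target_arch = "x86_64", target_feature = "ssse3"))]
fn flip_simd(temp: &mut [u8; 16], mask: U8x16) {
    U8x16::from_slice_unaligned(&temp[..])
        .permute_dyn(mask)
        .write_to_slice_unaligned(&mut temp[..]);
}
```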
Update 2: The maintainer removed my SIMD `fannkuch_redux` entry to avoid a SIMD arms race. I understand the reasoning, and have submitted a scalar version that uses the knowledge I gained from writing the SIMD one. We’ll see how fast it is.