Llogiq on stuff

Benchmarking in Rust

Recently on r/rust, I had an interesting discussion regarding some benchmarks where different map implementations led to wildly different results.

As performance-related programming is one of my main job-related activities, a look into Rust’s facilities to that effect should be useful both to me and others. So I set out to craft a benchmark that could lead to an even faster implementation.

Rust’s libtest has a benchmark harness: if a program gets compiled with --test, the resulting binary can simply be called with --bench to run all contained benchmarks, where any #[bench] fn ..(b: &mut test::Bencher) function constitutes a benchmark.

Note: This needs a nightly Rust. If you intend to put benchmarks next to your crate’s code, I suggest putting them in the benches/ subfolder, so nightly Rust will pick it up while stable and/or beta will stay undisturbed.

So without further ado, here’s my first benchmark:

#![feature(test)] // libtest's Bencher requires this nightly feature

extern crate test;
extern crate rand;

use test::Bencher;
use rand::Rng;
use std::mem::replace;

#[bench]
fn empty(b: &mut Bencher) {
    b.iter(|| 1)
}

#[bench]
fn setup_random_hashmap(b: &mut Bencher) {
    let mut val : u32 = 0;
    let mut rng = rand::IsaacRng::new_unseeded();
    let mut map = std::collections::HashMap::new();

    b.iter(|| { map.insert(rng.gen::<u8>() as usize, val); val += 1; })
}

#[bench]
fn setup_random_vecmap(b: &mut Bencher) {
    let mut val : u32 = 0;
    let mut rng = rand::IsaacRng::new_unseeded();
    let mut map = std::collections::VecMap::new();

    b.iter(|| { map.insert(rng.gen::<u8>() as usize, val); val += 1; })
}

#[bench]
fn setup_random_vecmap_cap(b: &mut Bencher) {
    let mut val : u32 = 0;
    let mut rng = rand::IsaacRng::new_unseeded();
    let mut map = std::collections::VecMap::with_capacity(256);

    b.iter(|| { map.insert(rng.gen::<u8>() as usize, val); val += 1; })
}

The empty benchmark is there as a baseline. An anecdote: In my first compilation of the benchmark, I forgot to add -O to the rustc command line, and wound up with a few ns/iter on an empty benchmark. Thus, I now always have an empty benchmark in my list, to make sure I benchmark an optimized version.

One thing you notice is that there is a lot of code duplication. Normally I’d try to avoid that, but in this simple case, I’m OK with it.
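For those who do want to factor the duplication out, here is a sketch of my own (not from the original code): the benchmarks differ only in the map constructor, so a macro can stamp out the body. To keep the sketch self-contained and compilable on stable, it generates a plain function and uses a fixed key sequence in place of the RNG:

```rust
use std::collections::HashMap;

// Hypothetical dedup macro: each generated function differs only in `$map`.
macro_rules! insert_run {
    ($name:ident, $map:expr) => {
        fn $name(n: u32) -> usize {
            let mut map = $map;
            for val in 0..n {
                // stand-in for rng.gen::<u8>() as usize
                map.insert((val % 256) as usize, val);
            }
            map.len()
        }
    };
}

insert_run!(run_hashmap, HashMap::new());
insert_run!(run_hashmap_cap, HashMap::with_capacity(256));

fn main() {
    // 1000 inserts over 256 distinct keys leave 256 entries in either map
    assert_eq!(run_hashmap(1000), 256);
    assert_eq!(run_hashmap_cap(1000), 256);
}
```

The same pattern works with #[bench] functions on nightly; the macro body would then wrap the insert in b.iter as above.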

Running this results in the following:

rustc -C opt-level=3 bench.rs --test && ./bench --bench

running 4 tests
test empty                    ... bench:           0 ns/iter (+/- 0)
test setup_random_hashmap     ... bench:          47 ns/iter (+/- 4)
test setup_random_vecmap      ... bench:           9 ns/iter (+/- 0)
test setup_random_vecmap_cap  ... bench:           9 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

Here we see that VecMap really blows HashMap away. However, VecMap internally uses a Vec<Option<T>>, which I believe should be quite inefficient. Can we do better?
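To make the suspected overhead concrete (my own addition, not part of the original post): a plain u32 has no spare bit pattern, so Option<u32> has to pay for a discriminant plus padding, while types with a "niche" get the Option for free. On current rustc:

```rust
use std::mem::size_of;

fn main() {
    // no niche: Option<u32> needs a separate tag, doubling the slot size
    assert_eq!(size_of::<u32>(), 4);
    assert_eq!(size_of::<Option<u32>>(), 8);
    // NonZeroU32 forbids zero, so Option can reuse that bit pattern as None
    assert_eq!(size_of::<Option<std::num::NonZeroU32>>(), 4);
}
```

So a Vec<Option<u32>> takes twice the memory of a Vec<u32>, which is exactly the overhead the next experiment tries to shed.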

First I tried a DefaultVecMap, which gets one default value that takes the place of Option::None, so we can store the values directly without the overhead of Option. Note that this is not a complete implementation; since I only measure insert performance here, insert is all I implement:

#[derive(Clone)]
pub struct DefaultVecMap<V: PartialEq<V>> {
    default: V,
    v: Vec<V>,
}

impl<V: PartialEq<V> + Copy> DefaultVecMap<V> {
    pub fn new(default: V) -> DefaultVecMap<V> {
        DefaultVecMap { default: default, v: vec![], }
    }

    pub fn with_capacity(default: V, capacity: usize) -> DefaultVecMap<V> {
       let dc = default.clone();
       DefaultVecMap { default: default, v: vec![dc; capacity], }
    }

    pub fn insert(&mut self, key: usize, value: V) -> Option<V> {
        // the default doubles as the "empty" marker, so it must never be stored
        debug_assert!(&value != &self.default);
        let len = self.v.len();
        if len <= key {
            let default = &self.default;
            // grow the backing Vec geometrically until it covers `key`
            let mut l = if len == 0 { 1 } else { len };
            while l <= key && l > 0 { l = l << 2; }
            self.v.extend((0..l - len + 1).map(|_| default));
            self.v[key] = value;
            None
        } else {
            // the slot exists; swap in the new value and report the old one
            let old_value = replace(&mut self.v[key], value);
            if old_value == self.default { None } else { Some(old_value) }
        }
    }
}
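Since insert is the only method, here is a quick standalone sanity check of its semantics (my own addition; the struct is condensed from the code above so the snippet compiles on its own):

```rust
use std::mem::replace;

// Condensed copy of DefaultVecMap from above, for a standalone check.
pub struct DefaultVecMap<V: PartialEq + Copy> {
    default: V,
    v: Vec<V>,
}

impl<V: PartialEq + Copy> DefaultVecMap<V> {
    pub fn new(default: V) -> DefaultVecMap<V> {
        DefaultVecMap { default: default, v: vec![] }
    }

    pub fn insert(&mut self, key: usize, value: V) -> Option<V> {
        debug_assert!(value != self.default);
        let len = self.v.len();
        if len <= key {
            let default = self.default;
            let mut l = if len == 0 { 1 } else { len };
            while l <= key && l > 0 { l = l << 2; }
            self.v.extend((0..l - len + 1).map(|_| default));
            self.v[key] = value;
            None
        } else {
            let old = replace(&mut self.v[key], value);
            if old == self.default { None } else { Some(old) }
        }
    }
}

fn main() {
    let mut m = DefaultVecMap::new(u32::MAX);
    assert_eq!(m.insert(3, 7), None);     // empty slot filled
    assert_eq!(m.insert(3, 8), Some(7));  // old value handed back
    assert_eq!(m.insert(100, 1), None);   // triggers the geometric growth
}
```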

Benchmarking this gets us:

test setup_default_vecmap     ... bench:           9 ns/iter (+/- 0)
test setup_default_vecmap_cap ... bench:           9 ns/iter (+/- 0)

This is equal to VecMap’s performance. Oh, well. Noting that the benchmark that started this endeavor never stores u32::MAX, so that value is free to act as the “empty” marker, I created the following:

pub struct ByteNonMaxU32Map {
    v: Vec<u32>
}

impl ByteNonMaxU32Map {
    pub fn new() -> ByteNonMaxU32Map {
        ByteNonMaxU32Map { v: vec![], }
    }
    
    pub fn insert(&mut self, key: usize, value: u32) -> Option<u32> {
        let len = self.v.len();
        if len <= key {
            // fill the gap with u32::MAX, which serves as the "empty" marker
            self.v.extend((0..key - len + 1).map(|_| std::u32::MAX));
            self.v[key] = value;
            None
        } else {
            let old_value = replace(&mut self.v[key], value);
            if old_value == std::u32::MAX { None } else { Some(old_value) }
        }
    }
}

which gives us a nice 33% speedup on insert:

test setup_nonzero_u32_map    ... bench:           6 ns/iter (+/- 0)

Nice! Specialization for the win!
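As a standalone sanity check of the sentinel semantics (my own addition; the struct is repeated from above so the snippet runs on its own):

```rust
use std::mem::replace;

// Condensed copy of ByteNonMaxU32Map from above, for a standalone check.
pub struct ByteNonMaxU32Map { v: Vec<u32> }

impl ByteNonMaxU32Map {
    pub fn new() -> ByteNonMaxU32Map { ByteNonMaxU32Map { v: vec![] } }

    pub fn insert(&mut self, key: usize, value: u32) -> Option<u32> {
        let len = self.v.len();
        if len <= key {
            // u32::MAX marks empty slots
            self.v.extend((0..key - len + 1).map(|_| u32::MAX));
            self.v[key] = value;
            None
        } else {
            let old = replace(&mut self.v[key], value);
            if old == u32::MAX { None } else { Some(old) }
        }
    }
}

fn main() {
    let mut m = ByteNonMaxU32Map::new();
    assert_eq!(m.insert(5, 42), None);      // fresh slot
    assert_eq!(m.insert(5, 43), Some(42));  // old value returned
    assert_eq!(m.insert(2, 1), None);       // slot held the sentinel
}
```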

Bonus round: Just when I had finished writing this, along came someone on the Rust users forum asking why while loops are faster than for-range loops, which set off alarm bells in my head.

I’ll spare you the gory details, but the benchmark misused test::black_box to keep LLVM from inlining the whole range construction. Changing the benchmark to let LLVM do its thing (while being reasonably sure that it wasn’t optimizing away the code I wanted to measure) required adding another mutable variable.

With this, the for loop even performed a little better than the while loop, because LLVM actually unrolled it a bit to make use of wider instructions. You can see the details in the users thread.
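The general rule this incident illustrates: feed the *inputs* through black_box so LLVM cannot constant-fold the measured work away, while leaving the loop body itself free to optimize. A minimal sketch of my own, using the std::hint::black_box that is nowadays stable (it plays the same role test::black_box did here):

```rust
use std::hint::black_box;

// Without black_box on `n`, LLVM may compute the entire sum at compile time,
// and a benchmark around this function would measure nothing.
fn sum_to(n: u64) -> u64 {
    let mut total = 0u64;
    for i in 0..black_box(n) {
        total += i;
    }
    total
}

fn main() {
    assert_eq!(sum_to(10), 45); // 0 + 1 + … + 9
}
```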

Moral of the story: #[bench] makes it easy to write benchmarks. However, writing good benchmarks is still hard, and making sure we measure the right thing is a science unto itself, where being able to read assembly is essential. We have all these shiny abstractions, but to make them go fast consistently, it is sometimes vital to look beneath them.