
Benchmarks vs. The World

Most people who care even a bit about performance have at some point stumbled over “The Computer Language Benchmarks Game”. I have the utmost respect for Isaac Gouy, the maintainer of that site, who has steadily improved both the site and the rules while keeping abreast of all the trolls pushing one agenda or another. Kudos!

Since it’s probably the site people go to for benchmark comparisons between programming language implementations, the game creates an incentive for language creators to make their implementations fast on the given microbenchmarks (unless they opt out of the rat race, as some scripting languages do). The maintainer of the site is aware of this and prefaces the results with a helping of salt:

“We are profoundly uninterested in claims that these measurements, of a few tiny programs, somehow define the relative performance of programming languages.”

But as we all know, people don’t read (unless it’s really funny, or something), so the incentive to be perceived as fast (especially for performance-targeted languages like C/C++, Rust or Swift) is undisturbed by the disclaimer.

Now I want to focus on one problem with the k-nucleotide microbenchmark, which basically measures built-in hash map performance with a custom hash function. Hash maps are an awesome data structure with O(1) lookup and insertion (at least in theory). However, in practice, there are quite a lot of differences between implementations that can worsen performance considerably (depending on usage).

Notably, hash maps depend on the hash function not being easily guessable. Otherwise, a malicious user can craft entries for our hash table that all produce collisions, an attack known as HashDoS.

  • Before Java 8, java.util.HashMap used a linked list per bucket to handle collisions, which reduced insertion/lookup performance to O(N) in the worst case. Adding a list of carefully chosen entries could thus turn into a quadratic-time ordeal, bringing Java applications to a halt. The current hash map handles heavily colliding buckets by storing their elements in a balanced binary tree.

  • Both C++’s unordered_map and Rust’s std::collections::HashMap rely on open addressing. However, unordered_map sizes its backing store using prime numbers, whereas Rust uses powers of two. The former uses more bits of the computed hash, thus masking the effect of bad hash functions, whereas the latter is faster with a good hash function (a small sketch of the difference follows after this list). Also, Rust uses a unique salt per HashMap to make hash prediction infeasible, at the cost of a modest performance loss.

Update: I was wrong, unordered_map uses chaining. It also has a lower load factor, trading memory for a reduced risk of collisions. Reality, as usual, isn’t so simple.
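To make that sizing difference concrete, here is a minimal sketch in Rust of how the two strategies turn a 64-bit hash into a bucket index. Neither function is taken from libstdc++ or Rust’s standard library, and the bucket counts are made up; the point is only that the power-of-two mask looks at nothing but the low bits of the hash, while the prime modulo folds in all of them.

/// Prime-sized table: the modulo mixes in every bit of the hash.
fn bucket_prime(hash: u64, prime_bucket_count: u64) -> u64 {
    hash % prime_bucket_count
}

/// Power-of-two-sized table: the mask keeps only the low bits.
fn bucket_pow2(hash: u64, bucket_count: u64) -> u64 {
    debug_assert!(bucket_count.is_power_of_two());
    hash & (bucket_count - 1)
}

fn main() {
    // Two hashes that differ only in their high bits:
    let a = 0x0000_0001_0000_0040u64;
    let b = 0x0000_0002_0000_0040u64;
    // With 1024 buckets, the power-of-two table cannot tell them apart...
    assert_eq!(bucket_pow2(a, 1024), bucket_pow2(b, 1024));
    // ...while a prime-sized table (1031 buckets here) keeps them separate.
    assert_ne!(bucket_prime(a, 1031), bucket_prime(b, 1031));
    println!("pow2:  {} vs {}", bucket_pow2(a, 1024), bucket_pow2(b, 1024));
    println!("prime: {} vs {}", bucket_prime(a, 1031), bucket_prime(b, 1031));
}

So a power-of-two-sized table is only as good as the low bits of its hash function, no matter how well the high bits are mixed.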

The “winning” hash functions submitted to the k-nucleotide benchmark are laughably bad. Here’s the one from the winning C entry:

#define CUSTOM_HASH_FUNCTION(key) (khint32_t)((key) ^ (key)>>7)

This spreads each input bit across at most two output bit positions, which is kind of OK if your hash map uses all the bits you give it. Even worse, look at the C++ entry:

struct hash{
   uint64_t operator()(const T& t)const{ return t.data; }
};

This is just an identity function! You could only do worse by throwing away bits. If you used such a hash in a real-world program where any kind of untrusted input ends up in the keys of your map, your code would be vulnerable to HashDoS: all a malicious user needs to do is find out the prime bucket count that unordered_map uses and submit keys that “hash” to multiples of it, and both insertion and lookup become O(N).
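To see what that failure mode looks like, here is a rough sketch in Rust rather than C++ (so it targets the power-of-two bucket index instead of a prime one): a hypothetical IdentityHasher in the spirit of the struct above, fed keys that all share their low bits. It is a demonstration of the principle, not a benchmark, and it assumes the table derives its bucket index from the low bits of the hash, as power-of-two-sized open-addressing tables do.

use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::time::Instant;

/// A deliberately terrible hasher in the spirit of the C++ struct above:
/// the "hash" of an integer key is just the key itself.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-integer keys: fold the bytes in without mixing.
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ u64::from(b);
        }
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

type BadMap = HashMap<u64, u64, BuildHasherDefault<IdentityHasher>>;

fn main() {
    const N: u64 = 20_000;

    // Benign keys 0..N spread out fine even with an identity hash.
    let start = Instant::now();
    let mut benign = BadMap::default();
    for k in 0..N {
        benign.insert(k, k);
    }
    println!("sequential keys: {:?}", start.elapsed());

    // Malicious keys that all share their low 20 bits land in the same
    // bucket, so every insert has to probe past all of its predecessors.
    let start = Instant::now();
    let mut attacked = BadMap::default();
    for k in 0..N {
        attacked.insert(k << 20, k);
    }
    println!("colliding keys:  {:?}", start.elapsed());
}

Under those assumptions the second loop degrades to quadratic time, while the first stays fast despite using the very same terrible hasher.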

In contrast, Rust’s entry, which uses the FNV hash, is three times slower due to the more expensive hash calculation. Rust’s standard HashMap additionally uses a salt per map to make the hash function even harder to predict; the design aims for the best performance with good hash functions (the kind you should be using in the real world).
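For comparison, swapping in a real (if non-cryptographic) hash function takes only a few lines in Rust. The following is a from-scratch sketch of FNV-1a wired into the standard HashMap via BuildHasherDefault; the actual benchmark entry pulls its hash in from a crate, so treat this as illustrative rather than the submitted code.

use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// FNV-1a, 64-bit: xor in each byte, then multiply by the FNV prime.
/// Cheap and reasonably well mixing, but unsalted, so not HashDoS-proof.
struct FnvHasher(u64);

impl Default for FnvHasher {
    fn default() -> Self {
        FnvHasher(0xcbf2_9ce4_8422_2325) // FNV-1a offset basis
    }
}

impl Hasher for FnvHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
}

/// A HashMap that trades the default salted SipHash for FNV-1a.
type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<FnvHasher>>;

fn main() {
    let mut counts: FnvMap<&str, u32> = FnvMap::default();
    for fragment in ["GGT", "GGTA", "GGT"] {
        *counts.entry(fragment).or_insert(0) += 1;
    }
    println!("{:?}", counts);
}

FNV-1a at least runs every input byte through a multiply, which is exactly why it costs more per key than the identity “hashes” above.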

Now the problem here is that the microbenchmark has only one known set of input data, and the laughably bad hash functions happen to work quite well on it. The real world has arbitrary (possibly maliciously chosen) sets of input data. Optimizing for the benchmark case means pessimizing for the real-world case.

I hope that I have succeeded in convincing everyone that the discrepancy between the benchmark case and the real-world case is a bad thing, because the former will likely be optimized for at the expense of the latter. Now you could say that not all HashMaps contain user-facing keys, and you would be right. I even created a weakly hashed map for my job, because we had some control over the keys and could use the extra performance.

But look at all the code you’ve ever written, and I’d wager that the majority of HashMaps you use contain keys that are to some extent controlled by the users of your application. Now you could say, “But those were not in performance-critical code!” – the problem with HashDoS is that it kills your application no matter whether it’s hit once or a thousand times. So now that I’ve hopefully convinced you that HashMaps should use strong hash functions, how could Mr. Gouy change the rules to better align the benchmark case with the preferable solution to the real-world case?

  • Mandate a known-good hash function for all implementations to use. This has the downside that some languages don’t easily allow overriding the hash function, and some languages may be better at implementing hash functions than others, which counters the intent of the benchmark. It also opens up the benchmark to discussions about everyone’s favorite hash function.

  • Require the submitted hash functions to pass some statistical tests (at least require the avalanche property; a rough sketch of such a test follows after this list). This at least sets a lower bound on hash function strength, while keeping implementations simple enough not to swamp the profile with hash calculations, so the actual map operations still show up.

  • Change the input data to HashDoS every implementation susceptible to it, until there are no more susceptible implementations 😁. I like this solution because it’s sneaky and requires no explanation; however, calculating the required input data will take a considerable amount of time and effort, and will only penalize one language per change.
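To flesh out the second option, here is a rough sketch of the avalanche check I have in mind: flip a single input bit, count how many output bits change, and average over many inputs. A good 64-bit mixer should flip about 32 output bits on average. The sample count and the use of the splitmix64 finalizer as a known-good contrast are arbitrary choices of mine.

/// Average number of output bits that flip when one input bit flips.
fn avalanche_score(hash: impl Fn(u64) -> u64, samples: u64) -> f64 {
    let mut flipped = 0u64;
    for i in 0..samples {
        // Crude input coverage: a Weyl sequence over the u64 range.
        let x = i.wrapping_mul(0x9e37_79b9_7f4a_7c15);
        for bit in 0..64 {
            let diff = hash(x) ^ hash(x ^ (1u64 << bit));
            flipped += u64::from(diff.count_ones());
        }
    }
    flipped as f64 / (samples * 64) as f64
}

fn main() {
    // The C entry's "hash": each input bit reaches at most two output bits.
    let weak = |x: u64| x ^ (x >> 7);
    // The splitmix64 finalizer, used here only as a known-good contrast.
    let strong = |x: u64| {
        let mut z = x;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
        z ^ (z >> 31)
    };
    println!("weak:   {:.1} bits flipped", avalanche_score(weak, 1_000));
    println!("strong: {:.1} bits flipped", avalanche_score(strong, 1_000));
}

The C entry’s function scores just under two bits per flipped input bit, far from the ideal of about 32, so it would fail any such rule immediately.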

P.S.: Since the fastest Java solution uses the third-party FastUtil library, another way for Rust to level the playing field would be to port the C hash map to Rust and use the resulting crate as a dependency. I’d personally consider that cheating, but hey, I don’t have any say in this anyway.