Llogiq on stuff

Rust Mutation Testing

Not least since last JavaLand I am a fan of mutation testing. However, there is no working crate to do it in Rust yet (There is mutant, but it’s both outdated and far from finished). So because I may cobble the time together to implement it, I’m going to blog the journey to order my thoughts. This post will give a broad overview on the mechanics of mutation testing and the problems we’ll need to overcome to get it in Rust.

First, let’s motivate the whole thing: Why would I want to mutate my code? The question this attempts to answer is: “Am I testing the right thing?” Note that this is different to the question “Am I testing enough?”; we already have line coverage and branch coverage to answer the latter. The problem with those is that they can mislead; just because a line or branch is covered does not mean it is really tested, as a cursory look into mutation testing will show.

With mutation testing, I start with running my tests as usual, and then I mutate the code and run the tests again. If at least one fails, the mutant is “killed”. The mutation is rolled back and another one tried. The resulting metric is how many mutants survived divided by how many mutations were tried. Usually, mutation testing tools show the surviving mutations.

The idea is that if you can put errors in your code without making your tests fail, your tests didn’t sufficiently guard against those errors. Now what are typical mutations:

  • Changing && to || and vice versa
  • Deleting statements
  • Duplicating statements
  • Inserting other statements (e.g. return)
  • Replacing boolean subexpressions with true or false
  • Replacing comparisons, e.g. ==!= or >>=
  • Replacing variables with others from same scope of compatible types

As you can imagine, this isn’t exactly easy to do in Rust, especially if we want to do it fast enough to be useful. We probably do not want the full compiler within our tool (or even everything down to MIR), because that would mean forking rustc. So going with the abstract syntax tree, how far can we go?

A Plan

Turns out there’s not too much we can do. The boolean operator switcheroo is A-OK, also inserting returns at random points for unit-returning functions can be a fun way to disrupt control flow. Perhaps we could add default returns for other return types via some kind of whitelist. The boolean subexpression replacement also looks good (we can identify those as expressions in if expressions or while loops or as subexpressions thereof). Also comparison operators seem pretty safe. We may be able to do that whitelist dance for other operations like addition/subtraction (we’ll need to make sure the types have the corresponding std::ops::* implementations). If we’re very adventurous, we might try to clone as much of rust’s typeck as needed to do trait lookup.

My idea is that for every method under test, we can add another argument that will just be the mutation count. 0 is the original code and every other value refers to a mutation (up to LAST_MUTATION, whatever that is). We also store some metadata about the mutations somewhere so we can refer to them in the resulting report.

The code is then not mutated directly, but conditioned on the mutation parameter. So for example, we would change some expressions to depend on mc (our “mutation counter”):

Original Expression Mutated Expression
x == y mu::eq(x, y, mc, 1)
x != y mu::ne(x, y, mc, 4)
x > y mu::gt(x, y, mc, 7)
.. if mc == 12 { return; } ..
x && y mu::and(x, y, mc, 13)
x || y mu::or(x, y, mc, 16)

The mu::*.. functions would change the operations to true, false or something else, depending on mc. This means we only compile the mutated code once and run it once per mutation. Our test runner knows how many mutations exist and can run them all, possibly in parallel, keeping track of which tests failed. We will want to know two things:

  1. Which mutations survived – those will be reported directly
  2. Which tests killed what mutations. This will be useful to identify tests that can be removed because all the mutations they cover are already covered by other tests. This is the classical minimum set cover problem, which is NP-hard, but good solutions can be found in polynomial time nonetheless (using Linear Programming Relaxation). I am not aware of any tool that does this at the moment, but it would be really neat to have.

Note that the mutation numbers must be different for all methods under test – we only ever introduce one error at a time.

Now my first step will be to look into the current test crate to see how the tests are actually run. Also I’m looking into syntex to see if I can find the tests.