
Counting Newlines Really Fast

Looking into xi editor’s rope crate, I found a method with a nice nerd-snipey (obligatory XKCD) comment, which I will reproduce here in full so you don’t have to chase the link:

// TODO: explore ways to make this faster - SIMD would be a big win
fn count_newlines(s: &str) -> usize {
    s.as_bytes().iter().filter(|&&c| c == b'\n').count()
}

I remembered Bit Twiddling Hacks’ zero byte detection and thought it would be nice to try to use it to optimize this method.

// one 1 in each byte
const LO : usize = ::std::usize::MAX / 255;
// one 128 in each byte
const HI : usize = LO * 128;

// bit twiddling ahoy: x.wrapping_sub(LO) sets the high bit of every byte
// that borrowed (i.e. was zero), & !x discards bytes whose high bit was
// already set, and & HI keeps just those marker bits
fn count_zero_bytes(x: usize) -> usize {
    ((x.wrapping_sub(LO)) & !x & HI) / 128 % 255
}

The modulo operation actually does the counting here: since 256 modulo 255 is one, every power of 256 collapses to one, so taking the remainder adds up the individual bytes. Some examples:

X                     X in hex           X % 255
0                     0                  0
1                     1                  1
256                   100                1
257                   101                2
65536                 10000              1
65792                 10100              2
65793                 10101              3
16843009              1010101            4
72340172838076673     101010101010101    8
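For example, on a 64-bit target where a usize holds eight bytes (a quick check of my own, not from the original post):

// two of the eight bytes are zero
assert_eq!(count_zero_bytes(0x0011_2233_4400_5566), 2);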

To use this, we only need to XOR the input words with the newline byte repeated in each byte position before counting. Perf looked good (about twice as fast at 10k chars), so I pushed a PR to xi.
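Putting the pieces together looks roughly like the following sketch (my reconstruction, not the actual PR code; alignment and tail handling of the byte slice are omitted):

// 0x0A repeated in every byte position
const REP_NEWLINE: usize = LO * b'\n' as usize;

// XOR turns newline bytes into zero bytes, which count_zero_bytes
// then tallies word by word
fn count_newlines_words(words: &[usize]) -> usize {
    words.iter().map(|&w| count_zero_bytes(w ^ REP_NEWLINE)).sum()
}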

Andrew Gallant left a comment outlining his and bluss’ implementation of the same algorithm, using the .count_ones() method to great effect on systems that have the POPCNT instruction.

fn count_zero_bytes(x: usize) -> usize {
    ((x.wrapping_sub(LO)) & !x & HI).count_ones()
}

Unfortunately there are some really slow POPCNT implementations out there, so it’s not a clear win. In any case, POPCNT counts all sixty-four bits, whereas our mask sets at most eight of them (one per byte). Before we pessimize the code for machines that need a more complex shim, there is one possible optimization using a shift and a multiply:

// number of bytes in a usize (8 on 64-bit targets)
const USIZE_BYTES: usize = ::std::mem::size_of::<usize>();

fn count_zero_bytes(x: usize) -> usize {
    (((x.wrapping_sub(LO)) & !x & HI) / 128).wrapping_mul(LO)
        >> ((USIZE_BYTES - 1) * 8)
}

Remember that LO has a 1 in each byte, so after multiplying by it, the highest byte of the result contains the sum of all bytes of the original value. The shift then extracts that highest byte.
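A quick worked example (the helper name is mine): with three bytes set to 1, the multiply accumulates their sum, 3, into the highest byte, and the shift pulls it back down:

// sum the bytes of x into the top byte, then extract it
fn horizontal_byte_sum(x: usize) -> usize {
    x.wrapping_mul(LO) >> ((USIZE_BYTES - 1) * 8)
}

// bytes (high to low): 00 01 01 00 00 00 01 00
assert_eq!(horizontal_byte_sum(0x0001_0100_0000_0100), 3);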

This improves matters for machines that don’t have POPCNT, but is slower on current (e.g. 6th or 7th gen) hardware, which has a fast implementation, so it’s not a clear win either. But remember that POPCNT counts all bits. Could we use more of the 64 bits to amortize the bit counting better?

Of course we can. I started by pulling the .count_ones() out of the method and using shifts to distribute the mask bits of our two usizes into disjoint positions, so that one count operation suffices (see the sketch below). A nice speedup so far.
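A sketch of the two-word version, assuming a mask_zero_bytes helper (the count_zero_bytes mask before the division and modulo, shifted down so each zero byte yields a 1; my reconstruction, not the exact PR code):

// 1 in each byte of x that was zero, 0 elsewhere
fn mask_zero_bytes(x: usize) -> usize {
    ((x.wrapping_sub(LO)) & !x & HI) >> 7
}

// the mask only occupies bit 0 of each byte, so the second word's mask,
// shifted left by one, lands in disjoint bit positions and both words
// share a single count_ones() call
fn count_newlines_in_two(x0: usize, x1: usize) -> u32 {
    (mask_zero_bytes(x0 ^ REP_NEWLINE)
        | mask_zero_bytes(x1 ^ REP_NEWLINE) << 1)
        .count_ones()
}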

Next, I doubled the number of words we read per loop iteration (four instead of two), again shifting the bits to different positions so they would count independently. This sped up the code even more. So why stop there? I doubled the word count again (we now read 64 bytes per iteration) and, lo and behold, shaved off another 25%. I also added some code to handle single usizes (because this is still faster than looking at each byte).
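The eight-word step then distributes the masks over shifts 0 through 7, filling all eight bit positions within each byte, so a single POPCNT covers 64 bytes (again a sketch under the same assumptions):

// each word's mask gets its own bit position within every byte
let combined = mask_zero_bytes(x0 ^ REP_NEWLINE)
    | mask_zero_bytes(x1 ^ REP_NEWLINE) << 1
    | mask_zero_bytes(x2 ^ REP_NEWLINE) << 2
    | mask_zero_bytes(x3 ^ REP_NEWLINE) << 3
    | mask_zero_bytes(x4 ^ REP_NEWLINE) << 4
    | mask_zero_bytes(x5 ^ REP_NEWLINE) << 5
    | mask_zero_bytes(x6 ^ REP_NEWLINE) << 6
    | mask_zero_bytes(x7 ^ REP_NEWLINE) << 7;
count += combined.count_ones() as usize;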

But can we do without POPCNT and still be faster? It turns out we can. Our custom operation (.wrapping_mul(LO) >> ((USIZE_BYTES - 1) * 8), in case you forgot) adds up the individual bytes of a word. Since the maximum value of this addition is 64 (we can never see more than 64 newlines in eight 64-bit words), we can add the newline-indicating words together first and only then sum the bytes of the result with our trick.

This looks roughly like:

count += (mask_zero_bytes(x0 ^ REP_NEWLINE)
        + mask_zero_bytes(x1 ^ REP_NEWLINE)
        + mask_zero_bytes(x2 ^ REP_NEWLINE)
        + mask_zero_bytes(x3 ^ REP_NEWLINE)
        + mask_zero_bytes(x4 ^ REP_NEWLINE)
        + mask_zero_bytes(x5 ^ REP_NEWLINE)
        + mask_zero_bytes(x6 ^ REP_NEWLINE)
        + mask_zero_bytes(x7 ^ REP_NEWLINE)
    ).wrapping_mul(LO) >> ((USIZE_BYTES - 1) * 8);

However (and I was a bit surprised when I found this out), we can get it a bit faster still, because Rust appears not to reassociate the additions optimally (despite integer addition being associative and commutative). So if we just add some parentheses, we can eke out another ~5% speedup on top of the 15% we already get (presumably because we got rid of the intermediate bit shifting):

count += ((mask_zero_bytes(x0 ^ REP_NEWLINE)
         + mask_zero_bytes(x1 ^ REP_NEWLINE)
         + mask_zero_bytes(x2 ^ REP_NEWLINE)
         + mask_zero_bytes(x3 ^ REP_NEWLINE))
        + (mask_zero_bytes(x4 ^ REP_NEWLINE)
         + mask_zero_bytes(x5 ^ REP_NEWLINE)
         + mask_zero_bytes(x6 ^ REP_NEWLINE)
         + mask_zero_bytes(x7 ^ REP_NEWLINE))
    ).wrapping_mul(LO) >> ((USIZE_BYTES - 1) * 8);

Interestingly, adding the same parentheses to the POPCNT-based version had no effect on performance on any machine I tested.

The benchmarks speak for themselves. Note that I’m not using a single SIMD intrinsic here, relying instead on LLVM autovectorization to do the right thing – which it apparently does in this case.

(If you happen to run the benchmark on your machine, please file an issue at that repo containing your findings – I’ll be interested to see how this runs on different hardware!)

I updated the PR to reflect the improvements. I also sent another PR to Andrew’s ripgrep repo, so it can benefit from the optimization, too.


PS.: Note that this operation is likely not responsible for much of the runtime of either xi or ripgrep (Update: I was assured by ripgrep’s author that newline counting is in fact a sizable portion of runtime), but it was fun to pull off anyway. 😁

PPS.: Bonus round!

Do you have an optimization story to share? Or want to discuss other cool bit hacks? Feel free to do so on both r/rust and rust-users!