Even quicker byte count

27 September 2016

Last time, I thought I’d sped up line counting (which both xi-rope and ripgrep use) pretty well. I had two versions: one using POPCNT that is faster on newer machines, and one using old-fashioned bit hacks that runs faster on older ones. Along comes the phenomenal Veedrac and sends a PR with a completely unsafe version that nonetheless not only works, but smokes every benchmark competitor by a factor of at least two.

I’ve never been so happy about being beaten in a performance contest.

Veedrac’s code is a tour de force. Pointer arithmetic, array reduction, bit manipulation, you name it. Like my fastest version, it reduces the count in steps, but the highest step increment is a whopping 8K! Not only that, it manages to squeeze the intermediate count for those 8192 bytes into 4 usizes.

Now, how does that work? There’s actually a simple trick to it. Like my solution, it uses multiple usizes to keep intermediate counts, but it keeps them in 16-bit words instead of bytes. That allows each counter to add up far more than 255 values, so the counts only need to be combined once the ~8K bytes have been screamed through. Hyper-screaming indeed.

Apart from using 4 counters in parallel (which complicates matters just a bit), the algorithm is as follows:

Per usize, get the one-bits (via subtraction/masking, as my count did)
Collect up to 255×8 one-bit values by simple addition
Combine each two bytes into one short word by shift + mask + addition (this is what it looks like on 64-bit machines; on 32-bit machines, only half the size is used):

byte	0	1	2	3	4	5	6	7
bytecounts	c1	c2	c3	c4	c5	c6	c7	c8
hi masked >> 8	0	c1	0	c3	0	c5	0	c7
lo masked	0	c2	0	c4	0	c6	0	c8
added	c1+c2		c3+c4		c5+c6		c7+c8

Now we have four 16-bit entities per 64-bit word that carry numbers. Note that each byte count is at most 255, so each of these values is at most 2 × 255 = 510, which means we can add two of them in a 16-bit half-word without overflow. This is exactly what the algorithm does (if the slice is large enough).
Finally, the counts are reduced by our multiply + shift trick from last time, except that we multiply by a value that has one bit set per 16 bits instead of per byte (it looks like 0x1000100010001 in hex on 64-bit machines, and we get that number portably by computing std::usize::MAX / 0xFFFF):

half-word	b01	b23	b45	b67
multiplier	1	1	1	1
multiplied	b01+b23+b45+b67	b23+b45+b67	b45+b67	b67
shift >> 48	0	0	0	b01+b23+b45+b67

(for 32-bit architectures, we obviously only shift by 16)
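
The steps above can be sketched in safe Rust. This is only an illustration under a few assumptions, not Veedrac’s code: it works on fixed u64 words instead of usize, uses a single counter word rather than four in parallel, and the helper names (match_bits, reduce, count) as well as the particular zero-byte test in match_bits are made up for this example.

```rust
const LO: u64 = u64::MAX / 0xFF; // 0x0101_0101_0101_0101, one bit per byte
const LOW7: u64 = 0x7F7F_7F7F_7F7F_7F7F; // the low seven bits of every byte
const EVERY_OTHER: u64 = 0x00FF_00FF_00FF_00FF; // the low byte of each 16-bit lane
const LO16: u64 = u64::MAX / 0xFFFF; // 0x0001_0001_0001_0001, one bit per 16 bits

/// Returns 0x01 in every byte lane of `word` that equals `needle`, 0x00 elsewhere.
fn match_bits(word: u64, needle: u8) -> u64 {
    // XOR with the repeated needle turns matching bytes into zero bytes.
    let x = word ^ (LO * needle as u64);
    // Per byte: the high bit ends up set exactly when the byte is non-zero,
    // and no carry ever crosses a byte boundary.
    let nonzero = ((x & LOW7) + LOW7) | x | LOW7;
    (!nonzero) >> 7
}

/// Sums the eight byte lanes (each at most 255) into one number.
fn reduce(lanes: u64) -> u64 {
    // Pair neighbouring bytes into four 16-bit lanes (each at most 510).
    let pairs = (lanes & EVERY_OTHER) + ((lanes >> 8) & EVERY_OTHER);
    // The multiplication adds all four lanes into the top lane; bits that
    // fall off the top are discarded, so wrapping_mul is intended here.
    pairs.wrapping_mul(LO16) >> 48
}

/// Counts occurrences of `needle` in `haystack`.
pub fn count(haystack: &[u8], needle: u8) -> usize {
    let mut total = 0usize;
    // 255 words of 8 bytes is the most a byte lane can absorb without overflow.
    for block in haystack.chunks(255 * 8) {
        let mut lanes = 0u64;
        let mut words = block.chunks_exact(8);
        for w in &mut words {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(w);
            lanes += match_bits(u64::from_le_bytes(buf), needle);
        }
        total += reduce(lanes) as usize;
        // Leftover bytes that do not fill a whole word are counted one by one.
        total += words.remainder().iter().filter(|&&b| b == needle).count();
    }
    total
}
```

Calling count(b"a\nb\nc\n", b'\n') walks a single short block and counts the bytes one by one, while larger inputs go through the word-wise path. The real implementation gains its speed from running several such counters in parallel and from pointer arithmetic, which this sketch deliberately leaves out.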

Remember we do this with four 64-bit words at once, so each loop iteration counts roughly eight kilobytes (255 × 8 × 4 = 8160 bytes). We could in theory save even more by doing the multiply-reduction only every 128 loops (because after that the counts could overflow), but the savings are probably negligible.
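
For those who want to check the numbers, here is a small sketch of the arithmetic behind the block size and the 128-loop bound. The constant and function names are invented for this example, and it assumes 64-bit words as in the tables above.

```rust
const LANE_MAX: u64 = 255; // largest value one byte lane can hold
const BYTES_PER_WORD: u64 = 8; // 64-bit words
const PARALLEL_WORDS: u64 = 4; // four counters running side by side

/// Bytes consumed before the byte lanes have to be folded into 16-bit lanes.
fn bytes_per_iteration() -> u64 {
    LANE_MAX * BYTES_PER_WORD * PARALLEL_WORDS
}

/// How many folded pair sums (each at most 2 * 255) fit into one 16-bit lane.
fn max_deferred_folds() -> u64 {
    u64::from(u16::MAX) / (2 * LANE_MAX)
}
```

bytes_per_iteration() comes out at 8160, just shy of 8192, and max_deferred_folds() at 128, which is where the "every 128 loops" figure comes from.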

Since not all slices are that big, we do another round for 1 kilobyte, sixty-four bytes and finally single usizes. (Note that on 32-bit machines, we need to halve the numbers.)

In conclusion: a clever way to use the available bits that I had not seen before allowed Veedrac to amortize the count reductions over many more bytes counted. No wonder the code is faster.

PS.: I tried reproducing this with much less unsafe code, but couldn’t even get close. So the unsafe code stays for now. I have reviewed and tested the code and am confident it will run perfectly fine.