Matching Puzzle Pieces and Disappointing Benchmarks
I recently had a piece of code that used .to_lowercase() to sort some text. Which takes a bit of memory. On the plus side, the code used .sort_by_cached_key, which is pretty cool. But I wondered whether doing the .to_lowercase() log(n) times instead of n times would be slower than allocating a String for each entry, given that for many strings, even the first few charactes would be different.
First, case insensitively comparing two &strs in Rust is possible, if a bit convoluted. The solution here is to iterate over all chars, then calling char::to_lowercase on that, which returns another iterator over char (because some chars can correspond to multiple chars in lowercase), which we can flat_map. The second piece of the puzzle is that Iterator has a cmp method but does not implement Ord because it is not idempotent: If you call cmp, you exhaust the iterator. Still, with sort_by, we can interleave the lowercase conversion and comparison.
For good measure, I also added the unicase crate to the benchmarks.
Being the curious person that I am, I naturally wrote a benchmark, which is short enough to reproduce here (if you aren’t interested, scroll down for the conclusion):
use fake::faker::name::raw::Name;
use fake::{locales::EN, Fake};
fn setup(num: usize) -> Vec<String> {
(0..num).map(|_| Name(EN).fake()).collect::<Vec<String>>()
}
#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn sort_by_cached_lowercase(bencher: divan::Bencher, size: usize) {
let names = setup(size);
bencher.counter(size).bench_local(|| {
let mut sorted = names.clone();
sorted.sort_by_cached_key(|name| name.to_lowercase());
sorted
})
}
#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn sort_by_iter_lowercase(bencher: divan::Bencher, size: usize) {
let names = setup(size);
bencher.counter(size).bench_local(|| {
let mut sorted = names.clone();
fn caseless(s: &String) -> impl Iterator<Item = char> + '_ {
s.chars().flat_map(char::to_lowercase)
}
sorted.sort_by(|s1, s2| caseless(s1).cmp(caseless(s2)));
sorted
})
}
#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn sort_by_unicase(bencher: divan::Bencher, size: usize) {
let names = setup(size);
bencher.counter(size).bench_local(|| {
let mut sorted = names.clone();
sorted.sort_by(|s1, s2| unicase::UniCase::new(s1).cmp(&unicase::UniCase::new(s2)));
sorted
})
}
fn main() {
// Run registered benchmarks.
divan::main();
}
The result on my M2-MAX MacBook Pro:
Timer precision: 41 ns
low fastest │ slowest │ median │ mean │ samples │ iters
├─ sort_by_cached_lowercase │ │ │ │ │
│ ├─ 1 16.68 ns │ 18.14 ns │ 17.49 ns │ 17.49 ns │ 100 │ 25600
│ │ 59.94 Mitem/s │ 55.1 Mitem/s │ 57.15 Mitem/s │ 57.15 Mitem/s │ │
│ ├─ 5 212.9 ns │ 265 ns │ 215.5 ns │ 219.2 ns │ 100 │ 3200
│ │ 23.47 Mitem/s │ 18.86 Mitem/s │ 23.19 Mitem/s │ 22.8 Mitem/s │ │
│ ├─ 10 452.5 ns │ 567.1 ns │ 457.7 ns │ 462.2 ns │ 100 │ 1600
│ │ 22.09 Mitem/s │ 17.63 Mitem/s │ 21.84 Mitem/s │ 21.63 Mitem/s │ │
│ ├─ 100 5.207 µs │ 11.33 µs │ 5.291 µs │ 5.455 µs │ 100 │ 100
│ │ 19.2 Mitem/s │ 8.824 Mitem/s │ 18.89 Mitem/s │ 18.32 Mitem/s │ │
│ ├─ 1000 73.62 µs │ 110.9 µs │ 75.99 µs │ 78.89 µs │ 100 │ 100
│ │ 13.58 Mitem/s │ 9.009 Mitem/s │ 13.15 Mitem/s │ 12.67 Mitem/s │ │
│ ╰─ 10000 853.7 µs │ 1.053 ms │ 864.8 µs │ 886.3 µs │ 100 │ 100
│ 11.71 Mitem/s │ 9.495 Mitem/s │ 11.56 Mitem/s │ 11.28 Mitem/s │ │
├─ sort_by_iter_lowercase │ │ │ │ │
│ ├─ 1 13.91 ns │ 23.35 ns │ 14.89 ns │ 15.68 ns │ 100 │ 25600
│ │ 71.87 Mitem/s │ 42.81 Mitem/s │ 67.15 Mitem/s │ 63.76 Mitem/s │ │
│ ├─ 5 134.8 ns │ 196 ns │ 137.4 ns │ 148.6 ns │ 100 │ 3200
│ │ 37.08 Mitem/s │ 25.5 Mitem/s │ 36.37 Mitem/s │ 33.64 Mitem/s │ │
│ ├─ 10 442.1 ns │ 1.03 µs │ 483.8 ns │ 519.1 ns │ 100 │ 800
│ │ 22.61 Mitem/s │ 9.702 Mitem/s │ 20.66 Mitem/s │ 19.26 Mitem/s │ │
│ ├─ 100 17.33 µs │ 34.12 µs │ 18.16 µs │ 19.66 µs │ 100 │ 100
│ │ 5.769 Mitem/s │ 2.93 Mitem/s │ 5.504 Mitem/s │ 5.086 Mitem/s │ │
│ ├─ 1000 337.6 µs │ 433 µs │ 352.1 µs │ 355.4 µs │ 100 │ 100
│ │ 2.961 Mitem/s │ 2.309 Mitem/s │ 2.839 Mitem/s │ 2.813 Mitem/s │ │
│ ╰─ 10000 5.543 ms │ 6.579 ms │ 5.572 ms │ 5.602 ms │ 100 │ 100
│ 1.803 Mitem/s │ 1.519 Mitem/s │ 1.794 Mitem/s │ 1.784 Mitem/s │ │
╰─ sort_by_unicase │ │ │ │ │
├─ 1 15.05 ns │ 22.05 ns │ 15.87 ns │ 16.78 ns │ 100 │ 25600
│ 66.42 Mitem/s │ 45.34 Mitem/s │ 63 Mitem/s │ 59.59 Mitem/s │ │
├─ 5 86.66 ns │ 125 ns │ 91.86 ns │ 97.79 ns │ 100 │ 6400
│ 57.69 Mitem/s │ 39.97 Mitem/s │ 54.42 Mitem/s │ 51.12 Mitem/s │ │
├─ 10 202.5 ns │ 470.8 ns │ 207.7 ns │ 230.9 ns │ 100 │ 1600
│ 49.36 Mitem/s │ 21.24 Mitem/s │ 48.13 Mitem/s │ 43.3 Mitem/s │ │
├─ 100 4.749 µs │ 18.33 µs │ 4.833 µs │ 5.201 µs │ 100 │ 100
│ 21.05 Mitem/s │ 5.454 Mitem/s │ 20.68 Mitem/s │ 19.22 Mitem/s │ │
├─ 1000 107.2 µs │ 158 µs │ 107.5 µs │ 111.4 µs │ 100 │ 100
│ 9.327 Mitem/s │ 6.325 Mitem/s │ 9.298 Mitem/s │ 8.973 Mitem/s │ │
╰─ 10000 1.753 ms │ 1.919 ms │ 1.757 ms │ 1.772 ms │ 100 │ 100
5.702 Mitem/s │ 5.208 Mitem/s │ 5.688 Mitem/s │ 5.641 Mitem/s │ │
So unless you only have one element (which is the trivial case), sort_by_cached_key is worth it, and iterating over UTF-8 characters to do case conversion is a lot slower than I would have thought, drowning out the benefit of not needing to allocate. The real surprise is that Unicase can often be faster, despite making the comparison more complex.