Llogiq on stuff

Matching Puzzle Pieces and Disappointing Benchmarks

I recently had a piece of code that used .to_lowercase() to sort some text. Which takes a bit of memory. On the plus side, the code used .sort_by_cached_key, which is pretty cool. But I wondered whether doing the .to_lowercase() log(n) times instead of n times would be slower than allocating a String for each entry, given that for many strings, even the first few charactes would be different.

First, case insensitively comparing two &strs in Rust is possible, if a bit convoluted. The solution here is to iterate over all chars, then calling char::to_lowercase on that, which returns another iterator over char (because some chars can correspond to multiple chars in lowercase), which we can flat_map. The second piece of the puzzle is that Iterator has a cmp method but does not implement Ord because it is not idempotent: If you call cmp, you exhaust the iterator. Still, with sort_by, we can interleave the lowercase conversion and comparison.

For good measure, I also added the unicase crate to the benchmarks.

Being the curious person that I am, I naturally wrote a benchmark, which is short enough to reproduce here (if you aren’t interested, scroll down for the conclusion):

use fake::faker::name::raw::Name;
use fake::{locales::EN, Fake};

fn setup(num: usize) -> Vec<String> {
    (0..num).map(|_| Name(EN).fake()).collect::<Vec<String>>()
}

#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn sort_by_cached_lowercase(bencher: divan::Bencher, size: usize) {
    let names = setup(size);
    bencher.counter(size).bench_local(|| {
        let mut sorted = names.clone();
        sorted.sort_by_cached_key(|name| name.to_lowercase());
        sorted
    })
}

#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn sort_by_iter_lowercase(bencher: divan::Bencher, size: usize) {
    let names = setup(size);
    bencher.counter(size).bench_local(|| {
        let mut sorted = names.clone();
        fn caseless(s: &String) -> impl Iterator<Item = char> + '_ {
            s.chars().flat_map(char::to_lowercase)
        }
        sorted.sort_by(|s1, s2| caseless(s1).cmp(caseless(s2)));
        sorted
    })
}

#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn sort_by_unicase(bencher: divan::Bencher, size: usize) {
    let names = setup(size);
    bencher.counter(size).bench_local(|| {
        let mut sorted = names.clone();
        sorted.sort_by(|s1, s2| unicase::UniCase::new(s1).cmp(&unicase::UniCase::new(s2)));
        sorted
    })
}

fn main() {
    // Run registered benchmarks.
    divan::main();
}

The result on my M2-MAX MacBook Pro:

Timer precision: 41 ns
low                          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ sort_by_cached_lowercase                │               │               │               │         │
│  ├─ 1                      16.68 ns      │ 18.14 ns      │ 17.49 ns      │ 17.49 ns      │ 100     │ 25600
│  │                         59.94 Mitem/s │ 55.1 Mitem/s  │ 57.15 Mitem/s │ 57.15 Mitem/s │         │
│  ├─ 5                      212.9 ns      │ 265 ns        │ 215.5 ns      │ 219.2 ns      │ 100     │ 3200
│  │                         23.47 Mitem/s │ 18.86 Mitem/s │ 23.19 Mitem/s │ 22.8 Mitem/s  │         │
│  ├─ 10                     452.5 ns      │ 567.1 ns      │ 457.7 ns      │ 462.2 ns      │ 100     │ 1600
│  │                         22.09 Mitem/s │ 17.63 Mitem/s │ 21.84 Mitem/s │ 21.63 Mitem/s │         │
│  ├─ 100                    5.207 µs      │ 11.33 µs      │ 5.291 µs      │ 5.455 µs      │ 100     │ 100
│  │                         19.2 Mitem/s  │ 8.824 Mitem/s │ 18.89 Mitem/s │ 18.32 Mitem/s │         │
│  ├─ 1000                   73.62 µs      │ 110.9 µs      │ 75.99 µs      │ 78.89 µs      │ 100     │ 100
│  │                         13.58 Mitem/s │ 9.009 Mitem/s │ 13.15 Mitem/s │ 12.67 Mitem/s │         │
│  ╰─ 10000                  853.7 µs      │ 1.053 ms      │ 864.8 µs      │ 886.3 µs      │ 100     │ 100
│                            11.71 Mitem/s │ 9.495 Mitem/s │ 11.56 Mitem/s │ 11.28 Mitem/s │         │
├─ sort_by_iter_lowercase                  │               │               │               │         │
│  ├─ 1                      13.91 ns      │ 23.35 ns      │ 14.89 ns      │ 15.68 ns      │ 100     │ 25600
│  │                         71.87 Mitem/s │ 42.81 Mitem/s │ 67.15 Mitem/s │ 63.76 Mitem/s │         │
│  ├─ 5                      134.8 ns      │ 196 ns        │ 137.4 ns      │ 148.6 ns      │ 100     │ 3200
│  │                         37.08 Mitem/s │ 25.5 Mitem/s  │ 36.37 Mitem/s │ 33.64 Mitem/s │         │
│  ├─ 10                     442.1 ns      │ 1.03 µs       │ 483.8 ns      │ 519.1 ns      │ 100     │ 800
│  │                         22.61 Mitem/s │ 9.702 Mitem/s │ 20.66 Mitem/s │ 19.26 Mitem/s │         │
│  ├─ 100                    17.33 µs      │ 34.12 µs      │ 18.16 µs      │ 19.66 µs      │ 100     │ 100
│  │                         5.769 Mitem/s │ 2.93 Mitem/s  │ 5.504 Mitem/s │ 5.086 Mitem/s │         │
│  ├─ 1000                   337.6 µs      │ 433 µs        │ 352.1 µs      │ 355.4 µs      │ 100     │ 100
│  │                         2.961 Mitem/s │ 2.309 Mitem/s │ 2.839 Mitem/s │ 2.813 Mitem/s │         │
│  ╰─ 10000                  5.543 ms      │ 6.579 ms      │ 5.572 ms      │ 5.602 ms      │ 100     │ 100
│                            1.803 Mitem/s │ 1.519 Mitem/s │ 1.794 Mitem/s │ 1.784 Mitem/s │         │
╰─ sort_by_unicase                         │               │               │               │         │
   ├─ 1                      15.05 ns      │ 22.05 ns      │ 15.87 ns      │ 16.78 ns      │ 100     │ 25600
   │                         66.42 Mitem/s │ 45.34 Mitem/s │ 63 Mitem/s    │ 59.59 Mitem/s │         │
   ├─ 5                      86.66 ns      │ 125 ns        │ 91.86 ns      │ 97.79 ns      │ 100     │ 6400
   │                         57.69 Mitem/s │ 39.97 Mitem/s │ 54.42 Mitem/s │ 51.12 Mitem/s │         │
   ├─ 10                     202.5 ns      │ 470.8 ns      │ 207.7 ns      │ 230.9 ns      │ 100     │ 1600
   │                         49.36 Mitem/s │ 21.24 Mitem/s │ 48.13 Mitem/s │ 43.3 Mitem/s  │         │
   ├─ 100                    4.749 µs      │ 18.33 µs      │ 4.833 µs      │ 5.201 µs      │ 100     │ 100
   │                         21.05 Mitem/s │ 5.454 Mitem/s │ 20.68 Mitem/s │ 19.22 Mitem/s │         │
   ├─ 1000                   107.2 µs      │ 158 µs        │ 107.5 µs      │ 111.4 µs      │ 100     │ 100
   │                         9.327 Mitem/s │ 6.325 Mitem/s │ 9.298 Mitem/s │ 8.973 Mitem/s │         │
   ╰─ 10000                  1.753 ms      │ 1.919 ms      │ 1.757 ms      │ 1.772 ms      │ 100     │ 100
                             5.702 Mitem/s │ 5.208 Mitem/s │ 5.688 Mitem/s │ 5.641 Mitem/s │         │

So unless you only have one element (which is the trivial case), sort_by_cached_key is worth it, and iterating over UTF-8 characters to do case conversion is a lot slower than I would have thought, drowning out the benefit of not needing to allocate. The real surprise is that Unicase can often be faster, despite making the comparison more complex.