Arenas vs. Indices

06 April 2019

When optimizing code, one thing I’m always looking for is memory layout and access patterns. One such pattern is an arena: Reserve some sufficiently large space to put your objects in, then allocate by incrementing a pointer. If your objects are of a uniform type, you can basically simplify this to a Vec of that type.

This also means that the indices into that Vec become quasi-references. And because our objects are evenly sized, we can (in theory) index up to isize::MAX elements. This also means that if we restrict ourselves to storing not more than u32::MAX elements, we can get away with using a u32 as index even on 64 bit systems. If we restrict our arena to 64K elements we can use a u16, and with a maximum of 256 elements, a u8 is sufficient.

Why is that of interest? This brings me back to memory layout: If we have multiple references, each will take 8 bytes on a 64 bit system. Reducing those to 4 bytes for a u32 will reduce size of those types and thus reduce memory usage, which can help staying in cache. With less objects in the arena, we may be able to reduce pointer size even further.

As an aside, this is similar to what the JVM does with the compressed oops optimization, the difference is a) we use object size as a multiplier instead of alignment, thus possibly extending the “address space” and b) we restrict the effect to objects within our arena, whereas the JVM can only use it for the whole heap or not.

A minimum viable program would have the following definition:

struct MVPArena<T>(Vec<T>);

impl<T> MVPArena<T> {
    fn add(&mut self, value: T) -> Idx { .. }
}

type Idx = u32;

impl<T> Index<Idx> for MVPArena<T> {
    type Output = T;
    fn index(&self, idx: Idx) -> &T {
        &self[idx as usize]
    }
}

// similar `IndexMut` left out for brevity

Objects could then be referenced by arena[idx]. This is acceptable when there is only one arena, but becomes problematic if we have more of them. We could take an Idx returned from one arena and use it with another, resulting in either the wrong data or an out-of-bounds error.

Extremely On-brand

Luckily, there’s a way around this. It’s called “branding” and has been described in a paper by Hongwei Xu and Frank Pfenning and converted to Rust by Gankra. We extend both the Arena and the Idx type with a phantom data containing the invariant lifetime:

struct Idx<'a>(u32, PhantomData<*mut &'a ()>);

Invariant here means that the borrow checker may not contract or extend the lifetime. We do this with a mutable pointer here because it means that the compiler treats the data as if it’s being both read and written to. The former precludes shortening the lifetime, the latter inhibits extensions.

To use this we need to extend our Arena to store the same PhantomData, and change add and index to unify the lifetimes:

// ..
fn add(&mut self, value: T) -> Idx<'a> {
    // ..
    Idx(index, self.tag)
}

impl<'a, T> Index<Idx<'a>> for MVPArena<'a, T> {
    type Output = T;
    fn index(&self, idx: Idx<'a>) -> &'a T {
        &self[idx.0 as usize]
    }
}

This way, the compiler will call us out whenever we try to mix up arenas and indices.

There is one problem with this setup: The lifetime cannot leave the scope. Why is this a problem? Let’s see a very simple example one would use an arena for: Build a tree.

struct Tree<'i>(Option<(Idx<'i>, Idx<'i>)>);

fn build_tree(arena: &mut Arena<'i, Tree<'i>>, depth: usize) {
    if depth == 0 {
        arena.add(Tree(None))
    else {
        arena.add(Tree(Some((build_tree(arena, depth - 1),
                             build_tree(arena, depth - 1)))))
    }
}

in_arena(|arena| { build_tree(arena, 3); });

This fails with a very unhelpful lifetime error: We cannot get the lifetime out of the scope, and notably, we cannot store it within the arena, despite the whole thing being teared down at the end.

Not My Type

So we want to ensure that the indices are only ever given out by their respective arena and not mixed between arenas. Since we cannot use invariant lifetimes, we need another way to set up distinct types for each Arena. Can we do that at all? Yes, with the help of proc macros.

There are a few ingredients here: First, the arena needs a constructor that takes a distinctly-typed argument. Unfortunately, we need to make it public because public macros cannot call private functions. So we make it unsafe instead.

Next, we need a macro that creates a new type on each call. I’ll spare you the details, but it creates code like the following:

{
    struct Tag; // Some distinct type

    let mut __arena = unsafe { Arena::new(Tag) };
    let arena = &mut __arena;
    // .. do something with arena here
}

This will keep the user from mixing up indices with multiple arenas, because macro hygiene means each part of the code gets its own Tag type. At least I hope it works. I still have a nagging doubt that this whole idea might have a hole somewhere through which all the unsoundness might come back.

Another downside is that our indices and arenas can no longer be Send nor Sync. If we could share an arena between threads, nothing would keep us from having two threads instantiate arenas from the same place in the code, which means they could mix up indices.

The upside is that we can store our indices within our Arena. I consider this a win, if only for the fact that I see no way to make it sound with multiple threads, and many algorithm will work fine with a per-thread arena anyway.

Indexing? Where we’re going we need no…Indexing.

But is it really an arena if we need to call it to get a reference to a contained object? Ideally, we could just deref Idx to &T. But this means we’d need to keep the arena hidden somewhere to use for the Deref.

If we only have one Arena, we could in theory use a static ARENA: Arena<T, B>. But this would mean fixing the type, plus there can be only one arena throughout the program, limiting the usefulness of this approach considerably.

Another option is to augment our proc_macro to extend all derefs with our arena pointer. For that to work, we’d need the following Deref implementations:

// blanket impl to allow dereferencing non-arena values
impl<B, D, T> Deref for (D, SmallArena<T, B>) where D: Deref<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &*self.0
    }
}

// this allows dereffing into the arena
impl<T, B> Deref for (Idx32<B>, &mut SmallArena<T, B>) {
    type Target = T;
    fn deref(&self) -> &T {
        self.1[self.0]
    }
}

The DerefMut impls are quite similar, so I won’t show them here.

This allows us to remove the arena from the equation, but we have to make all derefs into our arena explicit (as in *t). Also there is still the problem of working within macros, especially if we have multiple nested Arenas. For now we can work around this by parsing the token stream of inner in_arena calls. I’m unsure if we need additional Deref implementations if two macro invocations leave the tuples in the wrong order, e.g. (arena1, (arena2, idx1)).

We could in theory work with arbitrary code, but that would need some sort of data flow analysis, and I’ll leave that as an exercise for the reader.

Other Memory Layouts

One benefit of the arena as designed is that the memory layout is continuous, so indexing is a simple lookup. The downside is that the arena needs to reallocate when growing and possibly copy around the memory. Depending on the workload this may or may not matter, but it makes one wonder if there is another way to set up the arena so that we can keep the memory we’ve allocated?

All problems in computer science can be solved by another level of indirection

– David Wheeler

Other implementations use a Vec<Vec<_>> to store their data (those don’t try for efficient indexing, but we could do something similar). One possible implementation would split the index into the high and low u16 and use the former to index the outer Vec and the latter for the inner one. For example toolshed does this.

The benefit is that a simple shift + mask is enough to split the index, so indexing requires a shift, mask and two lookups. On the other hand, there are downsides: a) the memory allocation overhead scales linearly with the number of elements. With our simple Vec-based solution, our allocation overhead scales logarithmically (at least if we can realloc efficiently). Also the memory overhead of the outer Vec grows linearly with the number of elements. We can do better, at the price of some complexity.

Let’s say we have an array of Vecs, the first one allocated with 64K, the next one also at 64K, the next at 128K, the next double that and so on. With some fairly simple math, we can split the index into an outer array index and an index into the Vec. Also for efficiency reasons, we don’t actually need a Vec (we know the capacity and length of each Vec by our definition), so we can just store an array of (possibly null) raw pointers and allocate each as needed. As an aside, this is what Java streams do when a StreamBuilder needs to reserve memory to store the stream elements.

This way, we keep our allocation load logarithmic without needing realloc. On the other hand, indexing becomes more expensive, with one leading_zeros, a shift and three masks. I still think that for many workloads, the increased locality and reduced allocation overhead will make up for the more expensive indexing.

I have some experimental code to that effect, but I’m not yet ready to push it, also I’m not quite sure if it’s a good fit for this crate, as indexing becomes much more expensive in terms of CPU than working with plain references.

The Silver Lining

There’s a lot of space to cover here, and we’ve only just begun. If you like hacking on this stuff, come join our project! You can find the code in the compact_arena crate.