Arenas vs. Indices
When optimizing code, one thing I’m always looking for is memory layout and
access patterns. One such pattern is an arena: Reserve some sufficiently large
space to put your objects in, then allocate by incrementing a pointer. If your
objects are of a uniform type, you can basically simplify this to a Vec
of
that type.
This also means that the indices into that Vec
become quasi-references. And
because our objects are evenly sized, we can (in theory) index up to
isize::MAX
elements. This also means that if we restrict ourselves to
storing not more than u32::MAX
elements, we can get away with using a
u32
as index even on 64 bit systems. If we restrict our arena to 64K elements
we can use a u16
, and with a maximum of 256 elements, a u8
is sufficient.
Why is that of interest? This brings me back to memory layout: If we have multiple references, each will take 8 bytes on a 64 bit system. Reducing those to 4 bytes for a u32 will reduce size of those types and thus reduce memory usage, which can help staying in cache. With less objects in the arena, we may be able to reduce pointer size even further.
As an aside, this is similar to what the JVM does with the compressed oops optimization, the difference is a) we use object size as a multiplier instead of alignment, thus possibly extending the “address space” and b) we restrict the effect to objects within our arena, whereas the JVM can only use it for the whole heap or not.
A minimum viable program would have the following definition:
struct MVPArena<T>(Vec<T>);
impl<T> MVPArena<T> {
fn add(&mut self, value: T) -> Idx { .. }
}
type Idx = u32;
impl<T> Index<Idx> for MVPArena<T> {
type Output = T;
fn index(&self, idx: Idx) -> &T {
&self[idx as usize]
}
}
// similar `IndexMut` left out for brevity
Objects could then be referenced by arena[idx]
. This is acceptable when there
is only one arena, but becomes problematic if we have more of them. We could
take an Idx
returned from one arena and use it with another, resulting in
either the wrong data or an out-of-bounds error.
Extremely On-brand
Luckily, there’s a way around this. It’s called “branding” and has been
described in a paper by Hongwei Xu and Frank Pfenning and converted to Rust
by Gankra. We extend both the Arena
and the Idx
type with a phantom data
containing the invariant lifetime:
struct Idx<'a>(u32, PhantomData<*mut &'a ()>);
Invariant here means that the borrow checker may not contract or extend the lifetime. We do this with a mutable pointer here because it means that the compiler treats the data as if it’s being both read and written to. The former precludes shortening the lifetime, the latter inhibits extensions.
To use this we need to extend our Arena to store the same PhantomData, and
change add
and index
to unify the lifetimes:
// ..
fn add(&mut self, value: T) -> Idx<'a> {
// ..
Idx(index, self.tag)
}
impl<'a, T> Index<Idx<'a>> for MVPArena<'a, T> {
type Output = T;
fn index(&self, idx: Idx<'a>) -> &'a T {
&self[idx.0 as usize]
}
}
This way, the compiler will call us out whenever we try to mix up arenas and indices.
There is one problem with this setup: The lifetime cannot leave the scope. Why is this a problem? Let’s see a very simple example one would use an arena for: Build a tree.
struct Tree<'i>(Option<(Idx<'i>, Idx<'i>)>);
fn build_tree(arena: &mut Arena<'i, Tree<'i>>, depth: usize) {
if depth == 0 {
arena.add(Tree(None))
else {
arena.add(Tree(Some((build_tree(arena, depth - 1),
build_tree(arena, depth - 1)))))
}
}
in_arena(|arena| { build_tree(arena, 3); });
This fails with a very unhelpful lifetime error: We cannot get the lifetime out of the scope, and notably, we cannot store it within the arena, despite the whole thing being teared down at the end.
Not My Type
So we want to ensure that the indices are only ever given out by their respective arena and not mixed between arenas. Since we cannot use invariant lifetimes, we need another way to set up distinct types for each Arena. Can we do that at all? Yes, with the help of proc macros.
There are a few ingredients here: First, the arena needs a constructor that
takes a distinctly-typed argument. Unfortunately, we need to make it public
because public macros cannot call private functions. So we make it unsafe
instead.
Next, we need a macro that creates a new type on each call. I’ll spare you the details, but it creates code like the following:
{
struct Tag; // Some distinct type
let mut __arena = unsafe { Arena::new(Tag) };
let arena = &mut __arena;
// .. do something with arena here
}
This will keep the user from mixing up indices with multiple arenas, because
macro hygiene means each part of the code gets its own Tag
type. At least I
hope it works. I still have a nagging doubt that this whole idea might have a
hole somewhere through which all the unsoundness might come back.
Another downside is that our indices and arenas can no longer be Send
nor
Sync
. If we could share an arena between threads, nothing would keep us from
having two threads instantiate arenas from the same place in the code, which
means they could mix up indices.
The upside is that we can store our indices within our Arena. I consider this a win, if only for the fact that I see no way to make it sound with multiple threads, and many algorithm will work fine with a per-thread arena anyway.
Indexing? Where we’re going we need no…Indexing.
But is it really an arena if we need to call it to get a reference to a
contained object? Ideally, we could just deref Idx
to &T
. But this means
we’d need to keep the arena hidden somewhere to use for the Deref
.
If we only have one Arena, we could in theory use a static ARENA: Arena<T, B>
.
But this would mean fixing the type, plus there can be only one arena
throughout the program, limiting the usefulness of this approach considerably.
Another option is to augment our proc_macro
to extend all derefs with our
arena pointer. For that to work, we’d need the following Deref
implementations:
// blanket impl to allow dereferencing non-arena values
impl<B, D, T> Deref for (D, SmallArena<T, B>) where D: Deref<T> {
type Target = T;
fn deref(&self) -> &T {
&*self.0
}
}
// this allows dereffing into the arena
impl<T, B> Deref for (Idx32<B>, &mut SmallArena<T, B>) {
type Target = T;
fn deref(&self) -> &T {
self.1[self.0]
}
}
The DerefMut
impls are quite similar, so I won’t show them here.
This allows us to remove the arena from the equation, but we have to make all
derefs into our arena explicit (as in *t
). Also there is still the problem of
working within macros, especially if we have multiple nested Arenas. For now we
can work around this by parsing the token stream of inner in_arena
calls. I’m
unsure if we need additional Deref
implementations if two macro invocations
leave the tuples in the wrong order, e.g. (arena1, (arena2, idx1))
.
We could in theory work with arbitrary code, but that would need some sort of data flow analysis, and I’ll leave that as an exercise for the reader.
Other Memory Layouts
One benefit of the arena as designed is that the memory layout is continuous, so indexing is a simple lookup. The downside is that the arena needs to reallocate when growing and possibly copy around the memory. Depending on the workload this may or may not matter, but it makes one wonder if there is another way to set up the arena so that we can keep the memory we’ve allocated?
All problems in computer science can be solved by another level of indirection
– David Wheeler
Other implementations use a Vec<Vec<_>>
to store their data (those don’t try
for efficient indexing, but we could do something similar). One possible
implementation would split the index into the high and low u16
and use the
former to index the outer Vec
and the latter for the inner one. For example
toolshed does this.
The benefit is that a simple shift + mask is enough to split the index, so
indexing requires a shift, mask and two lookups. On the other hand, there are
downsides: a) the memory allocation overhead scales linearly with the number of
elements. With our simple Vec
-based solution, our allocation overhead scales
logarithmically (at least if we can realloc
efficiently). Also the memory
overhead of the outer Vec
grows linearly with the number of elements. We can
do better, at the price of some complexity.
Let’s say we have an array of Vec
s, the first one allocated with 64K, the
next one also at 64K, the next at 128K, the next double that and so on. With
some fairly simple math, we can split the index into an outer array index and
an index into the Vec
. Also for efficiency reasons, we don’t actually need a
Vec
(we know the capacity and length of each Vec
by our definition), so we
can just store an array of (possibly null) raw pointers and allocate each as
needed. As an aside, this is what Java streams do when a StreamBuilder
needs to reserve memory to store the stream elements.
This way, we keep our allocation load logarithmic without needing realloc
. On
the other hand, indexing becomes more expensive, with one leading_zeros
, a
shift and three masks. I still think that for many workloads, the increased
locality and reduced allocation overhead will make up for the more expensive
indexing.
I have some experimental code to that effect, but I’m not yet ready to push it, also I’m not quite sure if it’s a good fit for this crate, as indexing becomes much more expensive in terms of CPU than working with plain references.
The Silver Lining
There’s a lot of space to cover here, and we’ve only just begun. If you like hacking on this stuff, come join our project! You can find the code in the compact_arena crate.