March 2019

I want lists in Lisp to be fast to read. Reads are more common than writes. What's the best way to design the list?

Lisp lists have traditionally been linked lists, where each node points to the memory address of the next. Visiting nodes this way has limitations: (a) jumps across different memory areas cause cache misses, (b) building the linked list calls malloc for each node, which is slow, and (c) indexed access is O(n) instead of O(1).

Direct indexed access may be the most important feature of a list. Programs often loop over arrays. A list can be an array. So before the list gets designed around some exotic data structure, I'd like to first see the list work like an array. A list that works as an array is faster than a linked list.

A simple way to make the list work like an array is to give it a single, contiguous memory area that saves in its metadata the number of list elements. This works well if elements are appended: save the element and increment the element count. When the list gets full, it gets dynamically resized, which takes amortized constant time. With some work the list can prepend elements fast too. I don't know how bad it'd get to insert something in the middle or to pluck something out, but that's not the common case. [1]
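
Roughly, in C, that layout could look like this. It's a sketch, not any particular implementation; the names (List, list_new, list_append) are made up, and the elements are word-sized slots for now.

    #include <stdlib.h>

    typedef struct {
        size_t count;     /* number of elements saved in the list */
        size_t capacity;  /* number of slots currently allocated  */
        void  *items[];   /* contiguous element area              */
    } List;

    static List *list_new(size_t capacity) {
        List *l = malloc(sizeof(List) + capacity * sizeof(void *));
        l->count = 0;
        l->capacity = capacity;
        return l;
    }

    /* Append: save the element and increment the count.  When the list
       is full, double the allocation; doubling is what keeps appends
       amortized O(1). */
    static List *list_append(List *l, void *item) {
        if (l->count == l->capacity) {
            l->capacity *= 2;
            l = realloc(l, sizeof(List) + l->capacity * sizeof(void *));
        }
        l->items[l->count++] = item;
        return l;  /* the list may have moved; keep the new address */
    }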

What happens when the list gets resized? If the list was allocated on the stack and a new element is added, the new memory address of the list needs to be saved back on the stack. It looks like this affects only the implementation of join, which is the primitive used by push. So far, a list getting resized isn't a problem. [2]
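
To make that concrete with the sketch above: join returns the (possibly new) address of the list, and push writes that address back into whatever holds the list, here a local variable on the stack. Again, made-up names, not a real implementation.

    static List *join(List *l, void *item) {
        return list_append(l, item);   /* may reallocate and move the list */
    }

    #define push(list_var, item) ((list_var) = join((list_var), (item)))

    void example(void) {
        List *l = list_new(1);
        push(l, "a");
        push(l, "b");   /* resizes here; l now holds the new address */
        free(l);
    }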

How will the elements be saved in the list? If all elements are pointers to values, reads get slower because of the dereference, and writes get slower because elements are allocated with malloc.

Can some elements not be pointers to values? Integers, decimals, and booleans are small enough that they could be saved right in the list. So could some other object types, depending on the Lisp implementation, like streams, binary code, and VMs (closures, not functions; there is a difference). These object types may need padding to match the size of references, so that all list elements have the same size and index offsets can be calculated, but the gain is O(1) indexed access.
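
One simple way to sketch this in C is a tagged cell: a tag next to a union of the inline values and the reference. That pads every cell to the same (larger) size; real implementations often squeeze the tag into the value itself with pointer tagging or NaN boxing to keep each element at one machine word. The names below are made up.

    #include <stdint.h>
    #include <stddef.h>

    typedef enum { T_INT, T_DECIMAL, T_BOOL, T_REF } Tag;

    typedef struct {
        Tag tag;
        union {
            int64_t i;    /* integer saved inline              */
            double  d;    /* decimal saved inline              */
            int     b;    /* boolean saved inline, padded out  */
            void   *ref;  /* list, hash, function: a reference */
        } as;
    } Cell;               /* every Cell has the same size */

    /* Indexing is a fixed-offset load; small values need no pointer chase. */
    static Cell cell_at(const Cell *items, size_t i) { return items[i]; }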

How much padding? For integers and decimals it's a byte or less, since they're already about the size of a reference. The big memory hit would be for booleans. Most languages address the overhead of booleans with bitmaps, and Lisp may want to do the same.
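
A bitmap here means one bit per boolean instead of a full padded cell, the way specialized boolean arrays work elsewhere (Common Lisp's bit vectors, C++'s vector<bool>). A minimal sketch, with made-up names:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        size_t   count;  /* number of booleans               */
        uint8_t *bits;   /* (count + 7) / 8 bytes of storage */
    } BoolList;

    static void bool_set(BoolList *b, size_t i, int value) {
        if (value) b->bits[i / 8] |=  (uint8_t)(1u << (i % 8));
        else       b->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
    }

    static int bool_get(const BoolList *b, size_t i) {
        return (b->bits[i / 8] >> (i % 8)) & 1;
    }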

As for elements that can't be this small, use a reference. This is a small group: list, hash, and function. These elements are often added to a list by reference anyway when they're created on the heap. The odd case is when they're created in the data section or on the stack (inlined, not as a reference to a heap object). The solution is to copy them to the heap. [3]
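
The copy itself is an allocation and a memcpy, assuming the object's size is known and it holds no pointers back into the stack; if it does, those need fixing up too.

    #include <stdlib.h>
    #include <string.h>

    /* Copy an object that lives on the stack or in the data section to
       the heap, so the list can hold a reference that outlives the
       current stack frame. */
    static void *copy_to_heap(const void *obj, size_t size) {
        void *heap_obj = malloc(size);
        memcpy(heap_obj, obj, size);
        return heap_obj;
    }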

Search that must be fast isn't solved by a linked list alone. Hashing is an essential part of fast search. So it seems better to ignore fast search for linked lists and focus on fast search for hash tables. The simplest way I've seen to make lists fast is to turn them into arrays.









Notes:

[1]  This type of list could stay fast when an element is removed: leave a hole in the list and save the index of each hole at the end. This makes indexed access O(holes) instead of O(1). It may stay fast while the number of holes is low, and past some threshold the list can be resized to remove the holes.
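
One way the O(holes) lookup could work, assuming the hole indices are kept sorted; the function name is made up:

    #include <stddef.h>

    /* Map a logical index to a physical slot: every hole at or before the
       current physical position shifts it one slot to the right. */
    static size_t physical_index(size_t logical, const size_t *holes,
                                 size_t nholes) {
        size_t phys = logical;
        for (size_t h = 0; h < nholes; h++)
            if (holes[h] <= phys)
                phys++;
        return phys;
    }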

[2]  I saw the list become a problem in an early version where setting a new element in the list led to creating a new list. The problem wasn't that the list got resized, but that there was no clean way to tell the intermediate representation, which referred to the old memory address of the list, that the list had a new memory address.

I'm glad there was a dirty one: walk over the IR at runtime to find the right register.

[3]  Copying an object from the data section to the heap doesn't add overhead, because copying is the only way to get it onto the heap in the first place. The more interesting case is copying an object from the stack to the heap.