March 2017, rev. February 2019

I was mistaken about the performance of this approach. Simulating the stack ran ~2x-2.5x slower on my processor.

There is a performance optimization that as far as I know no compiler to date has done. That is to convert recursion to iteration when the recursive function call is not at the end of the function. [1]

The reason to want to do this conversion is that iteration is faster than recursion. Iteration avoids the overhead of creating a new stack frame, while still letting the programmer think in recursion. [2]

Some compilers do a similar conversion called tail-call optimization. When the recursive function is at the end of the function (the tail of the function) the compiler resets the local variables and jumps (using a goto) to the beginning of the function. [3] This has the effect we are after: it avoids creating a new stack frame. But it doesn't work if there's more to do after the call returns. Since the local variables have been wiped out, the program would corrupt its data if it were to reuse them when unwinding from the recursion.

The conversion I'm suggesting is different than tail-call optimization. Although it bypasses the stack, it doesn't throw it out completely. And it applies to the stack of all functions called by a function, rather than just one function called recursively. A good name for it might be: stack unrolling.

The conversion may look familiar. It's a twist to what the processor already does:

 For every function call within a function, create a local array to simulate the callee function's stack and set a pointer to the beginning of this array. Before the function call, increment this pointer (like a push) and go to (with a jump or goto) an inline copy of the callee function code. When the callee function code finishes, decrement this pointer (like a pop) and goto the line after the function was called.

This has the same effect as function calls but the stack is simulated. There's a stack per function instead of one stack for all functions. Simulating the stack is faster than making a function call.

All code is inlined in a single function and executed with gotos. With this conversion a whole program can run in a single function. [4]

I can imagine a downside though. If this is applied as a source to source conversion the end result will look like spaghetti code. That's probably why this conversion is rarely done by hand.

All these years we may have been using the stack badly, but we're only one optimization away from not having to use it at all.

How to remove the heap

There is another performance optimization that as far as I know no compiler to date has done. That is to convert automatically a program that builds a control flow graph using pointers into a program that builds a control flow graph using an array. For example, in the recursive descent parser of a compiler.

A good name for this conversion might be: pointer-to-array unrolling.

The reason to want to do this conversion is to avoid thrashing the cache. Programs written with linked lists don't take advantage of the cache, because pointers are often allocated in non-contiguous memory addresses that cause cache misses. Whereas programs written with arrays traversed sequentially benefit not only from cache hits but also from cache prefetching.

When I benchmarked building a parse tree with an array, it was 80% faster than building it with malloc.

The main reason programs are written with linked lists to begin with is to allow for an unknown list size. But I wouldn't be surprised if in most cases this unknown size is small enough to fit in a fixed-size array or an array that's resized dynamically when full. A second reason is to be able to insert elements anywhere in the middle of the list; a fixed-size array can't do this fast. A third reason is it helps to think of the graph as a graph instead of as an array.

Impact

These two conversions touch two different areas of a program. Stack unrolling applies to local variables; these are always on the stack. Pointer-to-array unrolling applies to data structures built on the heap; these could be saved on a single stack.

If as Harold Abelson said "programs must be written for humans to read, and only incidentally for machines to execute," compilers that want to produce faster code for programs written for humans to read must implement these two optimizations.

Notes:

 [1] Here is one example of this conversion that trades the stack for the heap in an application instead of a compiler. Here is another that uses an interpreter and requires a scheduler function. [2] I got distracted by this problem while trying to write a fast parser. Parsers are written recursively, where one function may call several others. So this conversion would help when writing the parsers of interpeters and compilers and databases. [3] The name "tail-call" is slightly misleading, because the function call doesn't have to be at the end of the function. It could be in the middle of a switch statement, or in an if clause, or in an expression. It's more accurate to say that a tail call is the last piece of code executed in the function. A better name might be "last-call". [4] A similar example is a feature that never made it into ANSI C: running with jump tables. For gcc to go through the trouble of adding special support for this rare feature there must have been something powerful to gain. The gain is that inlining all code in a single function and running with jump tables runs fast.