March 2017, rev. February 2019
I was mistaken about the performance of this approach. Simulating the stack ran ~2x-2.5x slower on my processor.
There is a performance optimization that as far as I know no compiler to date has done. That is to convert recursion to iteration when the recursive function call is not at the end of the function. 
The reason to want to do this conversion is that iteration is faster than recursion. Iteration avoids the overhead of creating a new stack frame, while still letting the programmer think in recursion. 
Some compilers do a similar conversion called tail-call optimization. When the recursive function is at the end of the function (the tail of the function) the compiler resets the local variables and jumps (using a goto) to the beginning of the function.  This has the effect we are after: it avoids creating a new stack frame. But it doesn't work if there's more to do after the call returns. Since the local variables have been wiped out, the program would corrupt its data if it were to reuse them when unwinding from the recursion.
The conversion I'm suggesting is different than tail-call optimization. Although it bypasses the stack, it doesn't throw it out completely. And it applies to the stack of all functions called by a function, rather than just one function called recursively. A good name for it might be: stack unrolling.
The conversion may look familiar. It's a twist to what the processor already does:
This has the same effect as function calls but the stack is simulated. There's a stack per function instead of one stack for all functions.
All code is inlined in a single function and executed with gotos. With this conversion a whole program can run in a single function. 
I can imagine a downside though. If this is applied as a source to source conversion the end result will look like spaghetti code. That's probably why this conversion is rarely done by hand.
All these years we may have been using the stack badly, but we're only one optimization away from not having to use it at all.
How to remove the heap
There is another performance optimization that as far as I know no compiler to date has done. That is to convert automatically a program that builds a control flow graph using pointers into a program that builds a control flow graph using an array. For example, in the recursive descent parser of a compiler.
A good name for this conversion might be: pointer-to-array unrolling.
The reason to want to do this conversion is to avoid thrashing the cache. Programs written with linked lists don't take advantage of the cache, because pointers are often allocated in non-contiguous memory addresses that cause cache misses. Whereas programs written with arrays traversed sequentially benefit not only from cache hits but also from cache prefetching.
When I benchmarked building a parse tree with an array, it was 80% faster than building it with malloc.
The main reason programs are written with linked lists to begin with is to allow for an unknown list size. But I wouldn't be surprised if in most cases this unknown size is small enough to fit in a fixed-size array or an array that's resized dynamically when full. A second reason is to be able to insert elements anywhere in the middle of the list; a fixed-size array can't do this fast. A third reason is it helps to think of the graph as a graph instead of as an array.
These two conversions touch two different areas of a program. Stack unrolling applies to local variables; these are always on the stack. Pointer-to-array unrolling applies to data structures built on the heap; these could be saved on a single stack.
If as Harold Abelson said "programs must be written for humans to read, and only incidentally for machines to execute," compilers that want to produce faster code for programs written for humans to read must implement these two optimizations.