November 2018

I'm not ready to claim that garbage collection is garbage based strictly on how backwards it is, but I wonder if the bigger part of the problem is concurrency, not memory management.

Hypothesis: if concurrency is more important, GC is irrelevant.

There are two main ways to read a value from memory: (a) to access the value directly and guarantee the value is available until no longer needed, and (b) to make a copy of the value. Both are useful. The way they work together affects the design of runtime systems, like programming languages and databases.

The first way is to access the variable directly. Imagine a local variable saved on the stack with (= a 5). Writing the value happens directly: the memory area on the stack for the variable is overwritten with the value 5. Reading the value back also happens directly. Just read the value from the memory area. This is a simple way to process the variable because the current thread is the only reader and writer.

Reading and writing gets tricky when there's more than one reader and writer. If the variable is global, any thread could access the variable. If one thread runs (= a 5) and another (= a 6) the value will be overwritten by whichever thread runs last. This can lead to one of the threads reading the out-of-date value 5, but both reading and writing the value directly doesn't crash the program.

This was the easy case. The reason there's no crash is that there's enough space in the representation of the old value to save the new value. Both numbers need the same amount of space: 8 bytes. (Or they both take 9 bytes, depending on how the runtime represents numbers.) And the reason the old and new values are read without being corrupted is that the processor can read 8 bytes atomically in a single instruction.

The harder case is when there isn't enough space in the old value to save the new value. One example is saving the string "bigger" over the fixed-size string "tiny". Another is when the new value is of a different type. What's common in both is that the value processed isn't always an object with a fixed representation in memory. Saving the new value must change the representation. [1]

If the old value was number 5 (which takes 8-9 bytes) and the new value will be a string (which may take at least 11 bytes), then the representation of the value in memory must be able to support both types. This often means changing a pointer at runtime from the old value to the new value, with the side effect that accessing the value directly is no longer carefree. It now needs a guarantee: if one thread read the old value 5 and another thread set it to a string, the old thread shouldn't crash. The old thread should still have access to the old value until it no longer needs it.

So if the representation of a value changes under multiple threads, the runtime system must guarantee that the old value is available until no longer needed. Guaranteeing this must run more code. Reading and writing with many threads will be slower, whereas with a single thread it will be faster.

The way to guarantee the old value is available until no longer needed is:
  1. When changing a variable on the stack, change it directly. Nothing more must be done to guarantee the old value is available until no longer needed, because the old value is never needed by other threads: on the stack there's only one thread. [2]

  2. When changing a variable on the heap, either restrict access to the change by forcing all threads to use the same reference to the object (eg by locking, which can be slow) or allow multiple copies of the variable (eg one per thread that used the variable).
Both ways of accessing the variable work together. Sometimes the value is accessed directly (eg on the stack), sometimes the value is copied (eg on the heap).

In programming languages, using the same reference until no longer needed is called garbage collection (GC). It keeps one copy, shared by many objects. In databases this same idea in a more general form is called multi-version concurrency control (MVCC) and optimistic concurrency control (OCC). It makes many copies, one per object. I used to think static programming languages would always outperform MVCC and OCC because static languages don't make copies of data, but now I see the real constraint isn't the typing of the language: for high concurrency, to copy is unavoidable. [3]

The same approach followed by databases for concurrency can be followed by programming languages. An iterator over a large list on the heap can use a reference instead of make a copy of the list. A change in the list while it's being read can copy-on-write the change, letting the iterator access the older version while the change is applied in the new version, similar to MVCC. When the iterator closes, the old copy is freed. This isn't the same as GC throughout the whole language because here it's needed only for the iterator.

High concurrency needs copies. GC keeps only one. Since copying is unavoidable for high concurrency, GC is the weaker solution. And if like many other features in programming languages GC is a mistake, this is a good mistake to reference but not copy.


[1]  This affects both static and dynamic typing systems. The root of the problem is that the representation in memory can change. It can change under both typing systems. A static typing system protects from this at compile time if possible by throwing an error, or it lets the program crash at runtime. A dynamic typing system protects from this at runtime and doesn't let the program crash.

[2]  Numbers, strings and lists can all be allocated on the stack. If a stack-bound string is set to a shorter string, it simply shrinks. Same for a list that has an element deleted.

But if a stack-bound string or list gets enlarged it can be migrated from the stack to the heap, and therefore needs the same guarantee as global variables allocated on the heap.

[3]  Should big data be copied too? Databases say yes and no. Yes, databases copy data from disk into memory, because memory is faster and there's less memory than data. And yes dbs keep copies of changed rows during transactions. But no, after data are loaded in memory (into the database buffer cache; same as the heap) there can be a single reference to the data.

The important part may be to break data into fine-enough granularity so that the copies are smaller and made faster.