August 2017

I looked at a few lock-free hash tables and none can be used as the global environment of a programming language.

Most make an incomplete assumption: that what a language wants to save in the hash table is an integer key and an integer value.

Well, how about data other than ints? Like keys and values that are longs, or strings, or pointers to lists and hash tables.

A lock-free hash table can be changed to save bigger keys and values, but not without hurting its existing optimizations. Comparing strings needs more work to become as fast as comparing integers, for example. And since a pointer is twice as big as an integer, saving pointers halves the number of elements saved in a cache line, which slows down reads.

(Is there no way to compress pointers to 2 bytes — half the size of an int? Can some pointers be offsets to a pool of a small size?) [1]

Another thing hash tables ignored is how a language handles datatypes. Imagine a hash table decides to save pointers instead of integers to support more datatypes. [2] Now integer values that were allocated on the stack need to be copied on the heap with malloc, to save in the hash table the pointer to the heap. That's an extra call to malloc. It slows down saving. The call to malloc can be avoided if the hash table can save not just the integer but also the datatype.

(Malloc ignores datatypes too. Is there room for a type-aware malloc?)

Starting with an incomplete assumption doesn't only hurt performance though. It also limits what the hash table can be used for.

One of the more gripping ideas I bumped into is that a programming language that wants to solve the most general case of concurrent data access to variables must eventually become a database. If this is what's best for a language, a lock-free hash table must become a database, not merely a key-value store.

Once again this changes the design. If for example the hash table decides to use multi-version concurrency control to become a database, the existing design of the lock-free hash table isn't enough to support it. It must change to save multiple versions.

It's a good idea to study a lock-free hash table in its simplest form first to understand the key idea behind it, but the hash table doesn't exist in a vacuum. The natural progression is to integrate it into something else and this changes the design.

A language in search of a better lock-free data structure for its global environment would do well not to stay at incomplete assumptions.









Notes:

[1]  I didn't know this at the time but it turns out they can.

[2]  I noticed a property of hash keys. When the hash table is used as the global environment, the keys don't need to be saved on the heap with malloc because keys don't change. This opens up more room to optimize.

If the hash table is designed so that the keys must be saved on the heap with malloc, a second optimization is to malloc only when adding a new key instead of always. When changing the value of an existing key, there's no need to malloc memory again for the second copy of the key and then free it.