Hardware support for concurrency

Inspired by this article, starting a general thread because this is a fun topic to think about:

The key thing I’ve been thinking about: if programming with lightweight threads became substantially easier and more commonplace (while still using a procedural model), how would you structure a hardware architecture differently?

What if the CPU were built for asynchronous programming, and a task yield to the scheduler were a hardware instruction that could be pipelined? In that case, instead of trying to predict a branch, could the CPU fill the dead space in the pipeline with instructions from other async tasks, avoiding branch prediction completely in favor of lightweight parallelism for the sake of throughput?
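To make the idea a bit more concrete, here is a toy sketch in Python of a single-issue core with no branch predictor. Everything in it is an assumption for illustration, not a model of any real CPU: the `run` function, the `'o'`/`'b'` instruction encoding, and the 3-cycle `BRANCH_LATENCY` are all made up. A task that issues a branch simply cannot issue again until the branch resolves, and the core fills those cycles from other ready tasks in round-robin order.

```python
from collections import deque

BRANCH_LATENCY = 3  # assumed cycles before a branch outcome is known


def run(tasks, branch_latency=BRANCH_LATENCY):
    """Return the cycles a single-issue, predictor-free core needs to issue all tasks.

    Each task is a string of instructions: 'o' is a plain op, 'b' is a branch.
    After a task issues a branch it is not ready again until the branch
    resolves, `branch_latency` cycles later. Each cycle the core issues one
    instruction from the first ready task in round-robin order, or idles.
    """
    pcs = [0] * len(tasks)        # next instruction index per task
    ready_at = [0] * len(tasks)   # first cycle each task may issue again
    order = deque(range(len(tasks)))
    total = sum(len(t) for t in tasks)
    cycle = 0
    issued = 0
    while issued < total:
        for _ in range(len(order)):
            t = order[0]
            order.rotate(-1)      # move the inspected task to the back
            if pcs[t] < len(tasks[t]) and ready_at[t] <= cycle:
                instr = tasks[t][pcs[t]]
                pcs[t] += 1
                issued += 1
                if instr == "b":
                    # Hypothetical hardware yield: park the task until resolved.
                    ready_at[t] = cycle + 1 + branch_latency
                break
        cycle += 1
    return cycle


if __name__ == "__main__":
    task = "oob" * 4  # two ops, then a branch, repeated four times
    print("1 task: ", run([task]), "cycles for", len(task), "instructions")
    print("4 tasks:", run([task] * 4), "cycles for", 4 * len(task), "instructions")
```

In this toy model a single task spends three cycles idle after every branch, while four identical tasks keep the issue slot busy every cycle. It ignores everything that makes this hard in practice (register file size per task, cache pressure, scheduling cost), so treat it as a picture of the idea rather than evidence for it.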

Could structured concurrency with type system support help compiler optimizations here?


This is an interesting topic to me.

Since cache misses seem to be the main performance hit today, I guess adding L4 cache chips directly connected to each CPU core would be very effective, maybe even the most effective move right now. Multiple CPU cores currently contend on a shared bus to access RAM, which induces high latency, while the cores already pack in about as many transistors as reasonable thermal limits allow, so external but local (that is, directly connected) cache chips sound reasonable.
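As a rough way to reason about this, the standard average memory access time (AMAT) formula can be sketched in a few lines of Python. The `amat` helper and every latency and miss rate below are invented for illustration; they are not measurements of any real chip.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    `levels` is a list of (hit_latency_cycles, miss_rate) pairs ordered from
    the cache closest to the core outward. An access pays the latency of
    every level it reaches, and main memory latency if it misses them all.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_latency, miss_rate in levels:
        total += reach * hit_latency
        reach *= miss_rate
    return total + reach * memory_latency


if __name__ == "__main__":
    # Assumed numbers: one cache level with a 10% miss rate, 200-cycle RAM.
    print("without extra cache:", amat([(4, 0.1)], 200))
    # Same, plus a hypothetical local cache: 40-cycle hit, 50% miss rate.
    print("with extra cache:   ", amat([(4, 0.1), (40, 0.5)], 200))
```

With these made-up numbers the extra level helps, but the result depends entirely on how often the new cache hits and how much faster it is than going to RAM.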

But I’m no expert on any of the above; this is just imagination and speculation.