Hardware support for concurrency

Inspired by this article, starting a general thread because this is a fun topic to think about:

The key question I’ve been thinking about: if programming with lightweight threads were substantially easier and more commonplace (while still using a procedural model), how would you structure a hardware architecture differently?

What if the CPU were built for asynchronous programming, and a task's yield to the scheduler were a hardware instruction that could be pipelined? In that case, instead of trying to predict a branch, the CPU could fill the dead space in the pipeline with instructions from other async tasks. Could it then avoid branch prediction completely, relying on lightweight parallelism for throughput?
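
To make the pipeline idea a bit more concrete, here is a toy sketch in Python. It is purely illustrative and not a model of any real CPU: it assumes a single issue slot per cycle, no branch prediction at all, and a made-up `BRANCH_LATENCY` during which a task cannot issue its next instruction. The `run` function and the `'o'`/`'b'` instruction encoding are hypothetical names invented for this example. The scheme it simulates resembles what is usually called fine-grained (or barrel) multithreading.

```python
BRANCH_LATENCY = 4  # assumed cycles from issuing a branch to the task's next issue


def run(tasks, branch_latency=BRANCH_LATENCY):
    """Count cycles until every instruction of every task has been issued.

    tasks: list of strings; each character is one instruction,
           'o' for a plain op, 'b' for a branch.
    The core issues at most one instruction per cycle and never predicts:
    after a branch, that task simply waits until the branch resolves.
    """
    pc = [0] * len(tasks)        # next instruction index per task
    ready_at = [0] * len(tasks)  # first cycle each task may issue again
    remaining = sum(len(t) for t in tasks)
    cycle = 0
    nxt = 0                      # round-robin starting point
    while remaining:
        # Look for a ready task, starting after the one that issued last.
        for i in range(len(tasks)):
            t = (nxt + i) % len(tasks)
            if pc[t] < len(tasks[t]) and ready_at[t] <= cycle:
                instr = tasks[t][pc[t]]
                pc[t] += 1
                remaining -= 1
                if instr == "b":
                    # No prediction: this task stalls until the branch resolves.
                    ready_at[t] = cycle + branch_latency
                nxt = (t + 1) % len(tasks)
                break
        # If no task was ready, this cycle is a pipeline bubble.
        cycle += 1
    return cycle


task = "oob" * 4  # 12 instructions, every third one a branch

alone = run([task])            # one task: bubbles after every branch
interleaved = run([task] * 4)  # four tasks sharing the issue slot
```

Under these assumptions, a single task needs 21 cycles to issue its 12 instructions, while four such tasks interleaved need 48 cycles for 48 instructions, because another task is always ready to fill the slot while a branch resolves. The catch the sketch also shows: each individual task gets no faster, so the trade is single-task latency for overall throughput, and it only works if enough ready tasks exist.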

Could structured concurrency with type system support help compiler optimizations here?