You don’t need explicit compiler help. At least with C this can be done entirely with a library. A task_switch() call can conform to the standard C ABI, requiring no compiler support, and do the switching (in assembly), without Duff’s device. This is, for example, how kernels written in C do their task switching.
Likely the same can be said for Rust and nearly any other language, since they all have ABIs that task_switch() can conform to, so that it looks like a normal function call.
Oh, I have written my own share of userspace C context switching libraries, I know all the gory details :). For example see my minimalist [1] stackful coroutine library: the full context switching logic is three inline asm instructions (99% of the complexity in that code is to transparently support throwing exceptions across coroutine boundaries with no overhead in the happy path).
You need compiler help for the custom calling convention support and possibly to optimize away the context switching overhead for stackful coroutines, which is something that compilers can already do for stackless coroutines.
Duff’s device is just a way to simulate stackless coroutines (i.e. async/await or whatever) in plain C, in a way that the compiler can still optimize quite well.
> the full context switching logic is three inline asm instructions
You tell the compiler that you clobber the caller-saved registers (GPD_CLOBBERS), so in terms of cost it’s not just three asm instructions. Since these are caller-saved registers they will be live at every point in your code, even if your task switch routine is inlined. You have to consider the code the compiler generates to preserve the caller-saved registers before invoking your instruction sequence when evaluating total cost. This is an additional cost that is not necessary in callback style.
Caller-saved registers (aka. "volatile registers") are only saved when they are live in the caller at the point of a function call, and they are not always live. Code generation tends to prefer callee-saved registers instead at these points, precisely so they don't need to be saved there. Whether callee-saved registers are live at the inline task switch depends on whether they have been saved already in the function prologue, and if they are in use as temporaries. Not many registers are live at every point, typically just the stack and frame pointers.
Both types of code (async and stackful) have to save live state, whatever it is, across context switches, whether that's spilling registers to the stack in the stackful case, or into the future object across "await" in the async case. However, typically the code generator has more leeway to decide which spills are optimal in the stackful case, and the spills are to the hot stack, so low cost. Unless integrated with the code generator, async/await spills tend to be unconditional stores and loads, thus on average more expensive.
You're right about potentially redundant saves at a stackful context switch. (Though if you control the compiler, as you should when comparing the best implementations of both kinds, you can avoid truly redundant saves.)
However, in practice few of the callee-saved register saves are really redundant. If the caller doesn't use them, its caller or some ancestor further up the chain usually does. If any do, they are genuine live state rather than redundant saves. You can construct cases where no ancestor uses a register, or uses one when it would be better not to, so that in theory it would be better not to use it and not to save it on context switch. But I think this is rare in real code.
You must compare this against the various extra state storage, and memory allocations, in async/await: for example, storing results in future objects in some implementations, spilling live state to an object when a stackful compiler would have used a register or the hot stack, and the way async/await implementations tend to allocate, fill, and later free a separate future object for each level of await in the call stack. All that extra storing is not free. Also, when comparing against the best stackful implementations: how many await implementations compile to pure continuation jumps, without a return to an event-loop function just to call the next async handler, and how many allow await results to be transferred directly in registers from generator to consumer, without being stored in the allocated future object?
I would summarise the difference between async/await and stackful-cooperative like this: the former has considerable memory-op overheads, but they are diffused throughout the code, so the context switch itself looks simple. It's an illusion, though, just like the "small asm" stackful context switch is an illusion due to clobbered live registers. The overhead is still there either way, and I think it's usually slightly higher in the async/await version. But async/await does have the advantage of not needing a fixed-size "large enough for anything" stack to be preallocated per context; it replaces that with multiple, ongoing smaller allocations and frees per context instead.
It would be interesting to see an async/await transform applied to the Linux kernel, to see if it ended up faster or slower.
I see the confusion now, I intended all of my arguments regarding caller-saved to actually refer to callee-saved registers. Hopefully you understand that you can never avoid preserving callee-saved registers with a cooperative task switching system and that this is not a necessary cost in an async/callback system.
> For example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack,
Regarding spill of live registers to heap state in the async case vs. stack state in the sync case: contemporary compilers are very good at keeping working state in registers and eliding redundant loads/stores to memory, as long as they can prove that that memory is not referenced outside of that control flow. This is true whether the source of truth is on the stack or the heap. This is due in part to the C11 memory model, which was somewhat implicit before it was standardized. In other words, an optimizing compiler does not treat all loads/stores as “volatile *”. Furthermore, the heap state relevant to the current task would be as “hot” as the stack (in terms of being in cache) since it’s likely recently used. Given that, I am skeptical of the argument that using heap state is slower than stack state due to increased register spillage. Again, just to be clear, this entire case is a separate discussion from the unavoidable added cost of preserving callee-saved registers in the synchronous case.
Where heap state does have a cost is in allocation. Allocating a stack variable is essentially free, while allocating heap memory can be expensive due to fragmentation and handling concurrent allocations. This cost is usually amortized since allocation is usually done once per task instantiation. Contrast this with register preservation, which is done on every task switch.
This only adds to the confusion. Which one is the caller and the callee? The executor and the coroutine? Or the coroutine and the task-switch code?
In my implementation, at the task switching point, there are no callee-saved registers: all registers are clobbered and any live register must be saved (also there is no difference between coroutine and non-coroutine code; control transfer is completely symmetrical), so calling from executor into coroutine and from coroutine into executor (or even between coroutines) runs the same code.
At the limit, both async (i.e. stackless coroutines, i.e. non-first-class continuations) and fibers (i.e. stackful coroutines, i.e. first-class continuations) can be translated mechanically into CPS form (purely tail-calling, never-returning functions). The only difference is the number of stack frames that can be live at a given time: exactly one for stackless coroutines, unbounded for stackful. So any optimization that can be done on the former can also be done on the latter, and a stackful coroutine that always yields from top level would compile down to exactly the same machine code as a stackless coroutine.
Thanks for the great contribution to the thread. Pretty much my thoughts.
I do believe that having to allocate a large stack is a downside, but again, with compiler help it should be possible, at least in theory, to compile stackful coroutines whose stack usage is known and bounded (i.e. all task switches happen at top level or in non-recursive inlined functions) down to exactly the same stack usage as stackless coroutines.
Sibling has an amazing long response, but tl;dr, the live registers that are clobbered are exactly the same as those that would need to be saved in the async case.
I see the confusion now. I wrote caller-saved when I meant callee-saved. In general you don’t need to preserve callee-saved registers if you don’t use them during normal control flow, but in the “sync” case you always have to save callee-saved registers on task switch. In the async case, you simply return to the executor.
But even without explicit compiler help, you can go a long way, say, with gcc extended inline asm.