Routing in a MoE model might fit.

zozbot234 · 2026-04-04T12:33:00 1775305980

You want routing to be as quick as possible, because there are dependent loads of expert MoE weights (at least from CPU in most setups, potentially from storage) downstream of it. So that ultimately depends on what the bottleneck on that part of the model is: compute, memory throughput or both? If it's throughput, the NPU might be a bad fit.