Btw, shouldn't it in theory be possible to run Mixtral's MoE by loading each expert sequentially, storing its outputs, and then continuing with the rest of the forward pass? That would make it easier to run on machines that can't fit the whole model in memory.
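For the record, the idea sketches roughly like this. This is a toy numpy version, not Mixtral's actual implementation: the expert count, top-2 routing, `expert_store`, and `load_expert` are all made-up stand-ins, with a dict playing the role of weights on disk. The point is just that only one expert's weights are resident at a time while the gated outputs accumulate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 experts, top-2 routing per token (Mixtral-style gating).
NUM_EXPERTS, TOP_K, D = 4, 2, 8
tokens = rng.standard_normal((5, D))  # 5 token hidden states

# Stand-in for expert weights kept on disk; a real loader would read
# each expert's tensors from a checkpoint shard instead.
expert_store = {e: rng.standard_normal((D, D)) for e in range(NUM_EXPERTS)}

def load_expert(e):
    """Pretend to load one expert's weights into RAM."""
    return expert_store[e]

# Router: pick the top-k experts per token, with softmax gate weights.
logits = rng.standard_normal((tokens.shape[0], NUM_EXPERTS))
topk = np.argsort(logits, axis=1)[:, -TOP_K:]
gates = np.exp(np.take_along_axis(logits, topk, axis=1))
gates /= gates.sum(axis=1, keepdims=True)

# Sequential pass: load one expert at a time, run only the tokens
# routed to it, accumulate the gated outputs, then free the weights.
out = np.zeros_like(tokens)
for e in range(NUM_EXPERTS):
    tok_idx, slot = np.nonzero(topk == e)
    if tok_idx.size == 0:
        continue
    W = load_expert(e)  # only this expert's weights in memory now
    out[tok_idx] += gates[tok_idx, slot, None] * (tokens[tok_idx] @ W)
    del W               # release before loading the next expert

print(out.shape)  # (5, 8)
```

The trade-off is exactly the latency concern raised below: peak memory drops to one expert's weights, but every MoE layer now pays a load from disk per active expert instead of a lookup in RAM.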
Yeah, I imagine sequential inference would be slower. How long does it take to load those weights on a personal PC? I haven't tried those setups so far.