You might be disappointed. It seems to have been weakly confirmed by Geohot that this sort of mixed system is already GPT-4's 'secret sauce'. [1] It's something I've also been speculating about for months. Ctrl+F for "220 billion in each". His phrasing, numbers, and details are suggestive of a leak unless he's just completely blowing smoke - and I don't see any real reason to think he is.
It looks like they were pumping the model size up, started getting diminishing returns, and so turned to a mixture of 8 expert models bolted together - to try to keep squeezing out just a bit more juice. I think Geohot offered an incredibly insightful quote: "...whenever a company is secretive, it's because they're hiding something that's not that cool. And people have this wrong idea over and over again that they think they're hiding it because it's really cool."
It also goes a long way toward explaining their recent, ultimately unsuccessful, gambit in Congress. If they don't see any way of overcoming the diminishing returns, then they're like the guy who got a head start in a race where distance = time ^ (1/2).
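To make the "mixture of 8 experts" idea concrete, here is a minimal, purely illustrative sketch of a top-2-of-8 MoE layer in plain numpy. The sizes, routing rule, and variable names are assumptions chosen for readability, not a description of GPT-4's actual implementation.

```python
# Illustrative top-2-of-8 mixture-of-experts layer (toy sizes, plain numpy).
# All dimensions and the routing scheme are assumptions, not GPT-4 specs.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 8   # rumored expert count
TOP_K = 2       # experts activated per token (assumption)
D_MODEL = 16    # toy hidden size
D_FF = 64       # toy feed-forward size

# Each "expert" is just an independent feed-forward block.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x):
    """Route one token vector x through its top-k experts and mix the outputs."""
    logits = x @ router_w                # one router score per expert
    top = np.argsort(logits)[-TOP_K:]    # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the chosen experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # ReLU FFN per expert
    return out

token = rng.standard_normal(D_MODEL)
print(moe_layer(token).shape)  # (16,) -- same output shape as a dense FFN
```

The key point of the sketch: the other 6 experts' weights are never touched for this token, which is where the later arguments about inference cost come from.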
Turning to MoE now that we know it doesn't have to underperform dense counterparts has nothing to do with diminishing returns. It's simply more economically viable.
With a multi-expert system you end up making way more inferences per query, and depending on exactly how it's built, you may even end up requiring more training as well. There could be some extremely domain-specific way they're saving money, somehow, but it seems generally unlikely. It's all about diminishing returns. OpenAI themselves have also publicly acknowledged they're hitting diminishing returns on model size. [1]
In any case, this happens in literally every domain that uses neural networks. You make extremely rapid, unbelievable progress, which leads one to start looking forward to where this is heading and the seemingly infinite possibilities. But then at some point each percentage point of improvement starts costing more and more. Initially a bit more compute can help you get over these hurdles, but then it becomes clear you're headed toward an asymptote that's far from where you want to be. It's why genuine fully self-driving vehicles look much further away today than they did ~8 years ago.
They're saving money for the simple reason that sparse models are much less expensive to train and run inference on than dense equivalents.
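A quick back-of-the-envelope makes this concrete, assuming the rumored figures from the linked interview (8 experts of ~220B parameters each) and typical top-2 routing; the routing fraction is an assumption, not a confirmed spec. Per-token inference cost tracks the active parameters, not the total:

```python
# Speculative arithmetic based on the rumored "220 billion in each" figure [1].
# Top-2 routing is an assumption; none of these numbers are confirmed specs.
params_per_expert = 220e9
n_experts = 8
active_experts = 2  # assumed top-2 routing

dense_equivalent = params_per_expert * n_experts       # if it were one dense model
active_per_token = params_per_expert * active_experts  # weights actually used per token

print(f"total parameters:     {dense_equivalent / 1e12:.2f}T")
print(f"active per token:     {active_per_token / 1e9:.0f}B")
print(f"inference cost ratio: {active_per_token / dense_equivalent:.2f}x of dense")
```

Under those assumptions, a query touches roughly a quarter of the weights a same-sized dense model would, which is the whole economic argument for sparsity.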
>OpenAI themselves have also publicly acknowledged they're hitting diminishing returns on model size. [1]
Do people even properly read this? Altman acknowledged they were hitting an economic wall with scaling, not that they thought they couldn't get better performance by scaling further.
Ilya believes there is still plenty of performance left to squeeze.
There was even an interview specifically correcting this misinformation.
[1] - https://www.latent.space/p/geohot#details