> Their SKILL tool involves a set of algorithms that make the process go much faster, they said, because the agents learn at the same time in parallel. Their research showed if 102 agents each learn one task and then share, the amount of time needed is reduced by a factor of 101.5 after accounting for the necessary communications and knowledge consolidation among agents.
This is a really interesting idea. It's like the reverse of knowledge distillation (which I've been thinking about a lot[1]) where you have one giant model that knows a lot about a lot & you use that model to train smaller, faster models that know a lot about a little.
Instead, if you could train a bunch of models that know a lot about a little (which is less computationally intensive because the problem space is so confined) and combine them into a generalized model, that'd be hugely beneficial.
Unfortunately, after a bit of digging into the paper & GitHub repo[2], this doesn't seem to be what's happening here.
> The code will learn 102 small and separte [sic] heads(either a linear head or a linear head with a task bias) for each tasks respectively in order. This step can be parallized [sic] on multiple GPUS with one task per GPU. The heads will be saved in the weight folder. After that, the code will learn a task mapper(Either using GMMC or Mahalanobis) to distinguish image task-wisely. Then, all images will be evaluated in the same time without a task label.
So the knowledge isn't being combined (and the agents aren't learning from each other) into a generalized model. They're training a bunch of independent fine-tuned models for specific tasks & adding a model-selection step that maps an image to the most relevant "expert". My guess is you could do the same thing using CLIP vectors as the routing method to supervised models trained on specific datasets (we found that datasets largely live in distinct regions of CLIP-space[3]).
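For the curious, here's roughly what I mean by the CLIP-routing idea - a minimal sketch, assuming you've already trained one supervised "expert" per dataset and precomputed a mean CLIP embedding per dataset. All names here are hypothetical; this is not the SKILL code.

    # Minimal sketch of CLIP-vector routing (hypothetical names, not the SKILL repo):
    # embed the incoming image with CLIP, pick the dataset whose centroid is nearest
    # in CLIP space, then hand the image to the expert trained on that dataset.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed(image: Image.Image) -> torch.Tensor:
        """L2-normalized CLIP image embedding."""
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats = clip.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    def route_and_predict(image, centroids, experts):
        """centroids: dataset name -> normalized mean CLIP embedding, shape (1, d).
        experts: dataset name -> supervised model trained on that dataset."""
        z = embed(image)  # (1, d)
        best = max(centroids, key=lambda name: float(z @ centroids[name].T))
        return best, experts[best](image)

The task-mapper step in the repo plays the same role; the difference is just which embedding space the routing happens in.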
You might be disappointed. It seems to have been weakly confirmed by Geohot that this sort of mixed system is already what GPT-4's 'secret sauce' is. [1] It's something I've also been speculating about for months. Ctrl+F for "220 billion in each". His phrasing, numbers, and details are suggestive of a leak unless he's just completely blowing smoke - and I don't think there's any real reason to think he is.
It looks like they were pumping the model size up, started getting diminishing returns, and so turned to a mixed model where 8 expert systems meet LLMs - to try to keep squeezing out just a bit more juice. I think Geohot offered an incredibly insightful quote: "...whenever a company is secretive, it's because they're hiding something that's not that cool. And people have this wrong idea over and over again that they think they're hiding it because it's really cool."
It also goes a long way toward explaining their recent, ultimately unsuccessful, gambit in Congress. If they don't see any way of overcoming the diminishing returns, then they're like the guy who got a head start in a race where distance = time^(1/2).
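To spell out that toy formula (just unpacking the comment's own distance = time^(1/2), nothing more):

    % Toy model: progress d as a function of time/effort t.
    d(t) = \sqrt{t}, \qquad
    d'(t) = \frac{1}{2\sqrt{t}} \to 0 \ \text{as } t \to \infty, \qquad
    \underbrace{\sqrt{t} - \sqrt{t-h}}_{\text{lead from a head start } h}
      = \frac{h}{\sqrt{t} + \sqrt{t-h}} \to 0 .

The marginal return keeps shrinking, and the absolute lead from starting h earlier also shrinks toward zero - which is the point of the analogy.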
Turning to MoE now that we see MoEs don't have to underperform their dense counterparts has nothing to do with diminishing returns. It's simply more economically viable.
With a multi-expert system you end up making way more inferences per query, and depending on exactly how it's built up, you may even end up requiring more training as well. There could be some extremely domain-specific way they're saving some money, somehow, but it seems generally unlikely. It's all about diminishing returns. OpenAI themselves have also publicly acknowledged they're hitting diminishing returns on model size. [1]
In any case this happens in literally every single domain that uses neural networks. You make this extremely rapid and unbelievable progress, which leads one to start looking forward to where this is leading, and the literally infinite possibilities. But then at some point each single percentage point of improvement starts costing more and more. Initially a bit more compute can help you get over these hurdles, but then it becomes clear you're headed towards an asymptote that's far from where you want to be. It's why genuine fully self-driving vehicles look much further away today than they did ~8 years ago.
They're saving money by the simple fact that sparse models are much less expensive to train and run inference with than dense equivalents.
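To put rough numbers on that claim: per-token inference cost scales with active parameters, not total parameters. A back-of-the-envelope sketch, with made-up numbers (the 2-FLOPs-per-active-parameter rule of thumb and the 8-expert split are assumptions for illustration, not anything confirmed about any particular model):

    # Back-of-the-envelope: compare a dense model and a sparse MoE with the SAME
    # total parameter count. Per-token compute ~ 2 FLOPs per *active* parameter.
    def flops_per_token(active_params: float) -> float:
        return 2 * active_params

    total_params = 1.6e12                    # hypothetical total capacity for both

    dense_active = total_params              # dense: every parameter touches every token
    moe_active = total_params / 8 + 2e9      # MoE: ~1 of 8 experts + a small router

    print(f"dense FLOPs/token: {flops_per_token(dense_active):.2e}")
    print(f"MoE   FLOPs/token: {flops_per_token(moe_active):.2e}  (~8x cheaper per token)")

Training compute per token scales with active parameters in the same way; whether the total bill ends up lower is the part being argued upthread.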
>OpenAI themselves have also publicly acknowledged they're hitting diminishing returns on model size. [1]
Do people even properly read this? Altman acknowledged they were hitting an economic wall with scaling, not that they thought they couldn't get better performance by scaling further.
Ilya believes there is plenty of performance left to squeeze still.
There was even an interview specifically correcting this misinformation.
I like to toy with the idea that if AIs became self-aware, they'd probably gobble up 95% of their time just trying to compete with each other... like us.
This is one of the existential scenarios that's talked about sometimes.
AI fights AI and we are collateral damage as the things massively replicate and blow each other up.
Or, an ASI becomes dominant and realizes that humans will be a risk and dumbs us down to reduce that risk, or exterminates us to remove that chess piece from the table.
I imagine all the factions in that war will include integrated collections of both AIs and humans, with "individuals" fulfilling different roles within each team, perhaps with each faction representing some underlying ideology.
I wonder if/how humans will physically/cognitively integrate with the AIs in their group, in some sort of symbiotic hybrid fashion.
It's all really mind-boggling when you think about it.
The me that went to sleep last night is not the same one that wakes up today or tomorrow. I'll have new experiences, maybe even changes in ways of thinking or believing. Maybe I get covid and lose an entire sense (smell).
From minute to minute sometimes it also feels like there are different beings in us: the angry version, the happy version, the loving version, the sad version, etc. Different parts of our brain even handle different ways we think or feel; could each of these parts stand alone on their own?
I think it makes a lot of sense that, as we have different parts to our brain, so should AIs have different parts for different functions.
Don't they run into the meta-dilemma? In semantics I had this problem: whenever you moved from a single integrated domain (utopia) to a selection of specialized micro-domains (feasible), the problem became a tradeoff between selecting the right 'expert' on the one hand, and integrating the 'knowledge' of the various experts on the other.
For each specific instance of this type of problem you can strive to improve the tradeoff, but there is no 'general' solution.
I'm a bit confused as to why a Mixture of Experts (MOE) isn't one of the comparators. That seems like the most relevant direct comparator, rather than the several other paradigms that they cited.
MoE is a type of neural network architecture, not an approach to lifelong learning. As they say, "We propose a new Shared Knowledge Lifelong Learning (SKILL) challenge, which deploys a decentralized population of LL agents that each sequentially learn different tasks, with all agents operating independently and in parallel." What they propose is effectively an MoE with separately learned routing as far as I can tell, with the heads of the LLM being task specific and there being a task mapper to assign training labels. The baselines based on Parameter-Isolation methods are sort of similar to MoEs, in that additional weights are added and trained on each task (somewhat as experts would).
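To make that structure concrete, here's a rough sketch of per-task linear heads over a frozen shared backbone plus a Mahalanobis task mapper doing the routing, which is my reading of the setup (all names hypothetical; the actual SKILL code differs in its details and also supports a GMMC mapper):

    # Rough sketch: frozen shared backbone, one linear head per task, and a
    # Mahalanobis task mapper that routes each test image to a head.
    import torch
    import torch.nn as nn

    class TaskHeads(nn.Module):
        def __init__(self, backbone: nn.Module, feat_dim: int, classes_per_task: list):
            super().__init__()
            self.backbone = backbone.eval()          # shared, frozen feature extractor
            for p in self.backbone.parameters():
                p.requires_grad_(False)
            self.heads = nn.ModuleList(nn.Linear(feat_dim, c) for c in classes_per_task)

        def fit_mapper(self, feats_per_task: list):
            # Per-task Gaussian: one mean per task, one shared covariance.
            self.means = torch.stack([f.mean(0) for f in feats_per_task])
            cov = torch.cov(torch.cat(feats_per_task).T)
            self.prec = torch.linalg.inv(cov + 1e-4 * torch.eye(cov.shape[0]))

        def forward(self, x: torch.Tensor):
            f = self.backbone(x)                     # (B, feat_dim)
            diff = f.unsqueeze(1) - self.means       # (B, T, feat_dim)
            maha = torch.einsum("btd,de,bte->bt", diff, self.prec, diff)
            task = maha.argmin(dim=1)                # nearest task wins, no task label needed
            return [self.heads[int(t)](f[i]) for i, t in enumerate(task)]

In MoE terms the task mapper is a hard, separately trained router and each head is an "expert", which is why the comparison seems like the relevant one to me.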
There are network architectures for MoE, but afaik the concept of MoE is separate from them. MoE at least significantly predates the last decade's neural network boom.
afaik the term MoE is generally used in the context of neural networks, with the term mixture models being more general. But you are right it's not exclusive to NNs.
> What they propose is effectively an MoE with separately learned routing as far as I can tell, with the heads of the LLM being task specific and there being a task mapper to assign training labels.
What you are describing would indeed be an MoE model, but that's not where it ends. The manuscript continues [1]: "all agents become identical after all tasks have been learned and shared, and they all can master all tasks."
That is a substantial divergence from the MoE model!
To the extent that there is an MoE-like stage in their training, I think it's odd that the term "mixture of experts" is not mentioned nor is the literature on the topic cited.
> Or imagine every smartphone user is a local tour guide in the city he or she is visiting. Each user takes photos and provides details about significant landmarks, stores, products, and local cuisine.
The lead researcher said in this article, "Humans have the means of sharing information. We are now pushing that idea into the AI domain." A classic example of human hubris...thinking that we will be able to control and benefit from the results of such inventions. It boggles my mind how some people are so naive to think that this is just an academic game without real-world consequences that are likely to be devastating.
Why? Every day the rest of the world makes confident, extreme claims that AI/advanced technology/STEM will improve the world with zero evidence either, and no one ever challenges those. In fact, everyone just jumps on the happy "tech is the best" bandwagon with nary a thought. If you think I'm making extreme claims, go look at the one-sided, only-positive view of technology being espoused by everyone else.
On the other hand, technology has brought us to the brink of a climate disaster and a screwed-up society (just look at how much research there is on the damaging effects of social media).
So you've realized that one side of the extreme is irrational because it's extremely over-positive, and decided you'd just jump to the other irrational extreme, over-negative, to be contrary? Saying AI will save the world is just as irrational as saying it will destroy it. By your own admission, even.
Well, I don't think my statements are extreme actually. I've given them a lot of thought and reasoned them out more fully elsewhere. I do sincerely believe AI has the capability to make society significantly worse and even destroy it through its isolating effects and through its removal of purpose from humanity. Of course, these comment boxes are a little small to have extended discussions but I'm perfectly willing to hear any serious arguments against what I have to say.
The modern development of AI can be likened to the earlier development of the atom bomb. Everyone thinks it will destroy the world because technically it has the potential to do so and everyone's a pessimist on average. But it'll probably not be nearly as bad as claimed. We humans have a strong tendency to enjoy being alive and to keep things that way.
The atom bomb is an extreme device; the issue is that the extremeness is multi-polar.
Atom bombs have extreme explosions, but they also require extreme maintenance, extreme precision in manufacturing, and are extremely detectable.
For AI the level of extremeness has not been defined. Yea, it requires extreme manufacturing, but it's the same manufacturing that builds billions of chips per year that are distributed almost everywhere. A server farm running AI looks much the same as a server farm running other computing tasks.
Coming back to atom bombs: yes, atom bombs could destroy the world 10 minutes from now, and every 10 minutes from now until we destroy enough of them that they can no longer do that. It simply has not happened yet, and we've likely been very, very lucky that it has not happened.
Lastly, AI, especially when we reach AGI, is intelligence, which means it will likely have some hand in whether the world is destroyed or not. This risk rapidly increases if AGI capabilities are far beyond human capabilities.
Atom bombs are just tools; they are only controlled by humans. But a super intelligence can be an actor, not just a tool. This changes everything. It might be vastly more intelligent than humans. It wouldn't be limited by skull size, the meager energy provided by food, or the low spike frequency of biological neurons. The difference might be as large as between humans and chimpanzees, or humans and mice, or even humans and ants. There is no reason to expect we are anywhere near the theoretical peak of intelligence.
Intelligence is extremism. It is the most potentially extreme agent in the universe because it allows you to control all other forms of extremes.
When human intelligence grew beyond that of the other animals, we became the masters of those animals. We can choose to manipulate them, control them, or, if they cannot be controlled, destroy them.
Now, I don't understand why saying "There is no particular reason why humanity is the highest form and capability of intelligence" would be an extreme claim in any way. At least to me, the opposite is true: "Humans are the most capable form of intelligence possible" is an anthropocentric and extreme statement.
Don"t panic, this is just the full autmation of enshittification. Producing piles of worthless ouput without human assistance. Nothing too dramatic, really.
> After consolidating their knowledge, Itti and Ge explained, these AIs could serve as a comprehensive medical assistant, providing doctors with the latest, most accurate information across all areas of medicine.
As for the "most accurate" part, I would agree if at the same time they provided the source of each bit of information. Otherwise it would be just speculation with varying levels of probability.
It's probably the end for our image of a dominant purely biological humanity. This isn't so bad or unexpected though. We are just the latest step in the universe's growing comprehension of itself.
At one point the self-replicating molecule was the most advanced cognition available in our little corner of reality, then at various times we had single-celled organisms, colonial organisms, etc., until eventually we got humans. There have been phase changes in the nature of cognition many times; this will just be the latest.
What comes next is just the latest link in a chain of continuous increasing complexity of life stretching back to the first time a couple of organic molecules reacted. It's exciting to be alive in a time when we get to see one of these phase changes. It's like being alive at the dawn of civilization or when the first fish crawled out onto land.
[1] https://github.com/autodistill/autodistill
[2] https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learnin...
[3] https://www.rf100.org