Humans take in a tremendously high bitrate of data via other senses and are able to connect those to the much lower amount of language input such that the language can go much further.
GPT-3 is learning everything it knows about the entire universe just from text.
Imagine we received a 1TB information dump from a civilization that lives in an alternate universe with entirely different physics. How much could we learn just from this information dump?
And from our point of view, it could be absurdly exotic. Maybe their universe doesn't have gravity or electromagnetic radiation. Maybe the life forms in that universe spontaneously merge consciousnesses with other life forms and separate randomly, so whatever writing we have received is in a style that assumes the reader can effortlessly deduce that the author is actually a froth of many consciousnesses. And in the grand spectrum of how weird things could get, this "exotic" universe I have described is really basically identical to our own, because my imagination is limited.
Learning about a whole exotic universe from just an info dump is the task of GPT-3. For instance, tons of our writing takes for granted that solid objects don't pass through each other. I dropped the book. Where is the book? On the floor. Very few bits of GPT-3's training set include statements like "a book is a solid object", "the floor is a solid object", "solid objects don't pass through each other", but it can infer this principle and others like it.
From this point of view, its shortcomings make a lot of sense. Some things GPT fails at are obvious to us having grown up in this universe. I imagine we're going to see an explosion of intelligence once researchers figure out how to feed AI systems large swaths of YouTube and such, because then they will have a much higher bandwidth way to learn about the universe and how things interact, connecting language to physical reality.
This is a fantastically good point. I think things will get even more interesting once the ML tools have access to more than just text, audio and image/video information. They will be able to draw inferences that humans will generally be unaware of. For example, maybe something happens in the infrared range that humans are generally oblivious to, or maybe inferences can be drawn based on how radio waves bounce around an object.
"The universe" according to most human experience misses SO much information and it will be interesting to see what happens once we have agents that can use all this extra stuff in realtime and "see" things we cannot.
As far as I know, all sensory evolution prior to this point has been driven by incremental gains in fitness within a changing environment.
True vision requires motive and an embodied self. I’m ignorant about the state of the art here, but I’m way more terrified of what these things don’t see than interested in what they could show us. It seems to me that the only human motives accessible to machines are extremely superficial and behavior-based.
Knowledge is not some disconnected map of symbols that results in easily measurable behavior, it has a deep and fundamental relation to conscious and unconscious human motivation.
I don’t see any possible way to give a machine that same set of motives without having it go through our same evolutionary and cultural history, and strongly believe most of our true motives are under many protective layers of behavioral feints and tests and require voluntary connection and deep introspection to fractionally expose to our conscious selves, let alone others, let alone a computer.
These models seem to be amazingly good at combining maps of already travelled territory. Trying to use them to create maps for territory that is new to us seems incredibly dangerous.
Am I missing something here, or is it not true that AI models operate purely on bias? What we choose to measure and train the model on seems to predetermine the outcome. It’s not actually empirical, because the model can’t evaluate whether its predictions make sense outside of itself. At some point it’s always dependent on a human saying “success/fail”, which makes it seem more like an incredibly complicated kaleidoscope. Maybe these models can cause humans to see patterns we didn’t see before, but I don’t think they could actually make new discoveries on their own.
I think your point is more interesting, but the problem is bootstrapping knowledge from a tabula rasa. A human isn't born knowing about quantum mechanics, Christoffel symbols, or what pushforward measures are. If there were a method to learn facts from scratch as cheaply as brilliant humans do, it would be amazing. Even counting from the elementary school years, humans still end up spending several orders of magnitude less energy.
Transformers are far more effective than n-gram models or non-contextual word vectors. I imagine there is something that will be to Transformers what Transformers were to word2vec.
Google's Imagen was trained on roughly as many images as a six-year-old would have seen over their lifetime at 24fps, plus a whole lot more text. It can draw a lot better and probably has a better visual vocabulary, but is also way outclassed in many other ways.
Poverty of the stimulus is a real problem, and may mean our starting-point architecture from genetics has a lot of learning built in, rather than being just a bunch of uninitialized weights randomly connected. In many species, a newborn animal can get up and walk right away.
Definitely. I do think video is much more important than images, because video implicitly encodes physics, which is a huge deal.
And, as you say, there are probably some structural/architectural improvements to be made in the neural network as well. The mammalian brain has had a few hundred million years to evolve such a structure.
It also remains unclear how important learning causal influence is. These networks are essentially "locked in" from inception. They can only take the world in. Whereas animals actively probe and influence their world to learn causality.
The mammalian brain has had a few hundred million years to evolve neural plasticity [1], which is the key function missing in AI. The brain’s structure isn’t set in stone but develops over one’s lifetime, and can even carry out major restructuring on a short time scale in some cases of massive brain damage.
Neural plasticity is the algorithm running on top of our neural networks that optimizes their structure as we learn so not only do we get more data, but our brains get better tailored to handle that kind of data. This process continues from birth to death and physical experimentation in youth is a key part of that development, as is social experimentation in social animals.
I think it “remains unclear” only to the ML field. From the perspective of neuroscientists, current neural networks aren’t even superficially at the complexity of axon-dendrite connections with ion channels and threshold potentials, let alone the whole system.
A family member’s doctoral thesis was on the potentiation of signals, and based on my understanding of it, every neuron takes part in the process with its own “memory” of sorts; the potentiation she studied was just one tiny piece of the neural plasticity story. We’d need to turn every component in the hidden layers of a neural network into its own massive NN with its own memory to even begin to approach that kind of complexity.
> our starting point architecture from genetics has a lot of learning built in
I don't doubt that evolution provided us with great priors to help us be fast learners, but there are two more things to consider.
One is scale: the brain is still roughly 10,000x more complex than large language models. We know that smaller models need more training data, so a brain many orders of magnitude larger than GPT-3 naturally learns faster.
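A rough back-of-envelope check on that scale claim (both figures below are loose, commonly cited estimates, not measurements):

```python
# Back-of-envelope scale comparison (both numbers are rough,
# commonly cited estimates, not precise measurements).
gpt3_params = 175e9        # GPT-3 parameter count
brain_synapses = 1.5e15    # human brain synapse estimates range ~1e14-1e15

ratio = brain_synapses / gpt3_params
print(f"brain/GPT-3 ratio: ~{ratio:,.0f}x")  # on the order of the 10,000x above
```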
The second is social embedding: we are not isolated; our environment is made of human beings. Similarly, an AI would need to be trained as part of human society, or even as part of an AI society, but not alone.
> Google's Imagen was trained on about as many images as a 6 year old would have seen over their lifetime at 24fps
The six year old has the advantage of being immersed in a persistent world where images have continuity and don’t jump around randomly. For example infants learn very quickly that most objects stay put even when they aren’t being observed. In contrast a dataset of images on the internet doesn’t really demonstrate how the world works.
Drawing involves taking a mental image and converting it into a sequence of actions that replicate the image on a physical surface. Imagen does not do that. I think the images it generates are more analogous to the image a person creates in their mind before drawing something.
I was too loose with that. There is CLIPDraw and others that operate at the stroke/action level but haven't been trained on as much data. Still impressive at the time:
One of the more interesting things I have seen recently is the combination of different domains in models / datasets. The top network of Stable Diffusion combines text-based descriptions with image-based descriptions, where the model learns to represent either text or images in the same embedding; a picture, or a caption for that picture, lead to similar embeddings.
Effectively, this can broaden the context the network can learn. There are relationships readily apparent to something that has learned from images that might not be apparent to something trained only on text, or vice versa.
It will be interesting to see where that goes. Will it be possible to make a singular multi-domain encoder that can take a wide range of inputs and create an embedding (a "mental model" of the input), and have this one model be usable as the input for a wide variety of tasks? Can something trained on multiple domains learn new concepts faster than a network that is single-domain?
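A toy sketch of that shared-embedding idea, with random linear maps standing in for trained encoders (all shapes and names here are made up for illustration, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8  # shared embedding dimension
# Hypothetical stand-ins for trained encoders: in a real CLIP-style model
# these would be deep networks, trained so that a caption and its picture
# land near each other in the shared space.
text_encoder = rng.normal(size=(d, 16))   # maps 16-d text features -> d
image_encoder = rng.normal(size=(d, 32))  # maps 32-d image features -> d

def embed(encoder, features):
    v = encoder @ features
    return v / np.linalg.norm(v)  # unit-normalize, CLIP-style

text_vec = embed(text_encoder, rng.normal(size=16))
image_vec = embed(image_encoder, rng.normal(size=32))

# Cosine similarity across modalities: trained models would score a
# matching caption/picture pair near 1.0; these random maps give noise.
print(float(text_vec @ image_vec))
```

The point of the sketch is only the shape of the setup: two different input spaces, one shared output space, and a similarity score that is meaningful after training.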
They haven't even figured out basic math, so not sure what you would expect to find there. They aren't smart enough to generate structure that doesn't already exist.
Depends on the method. Evolutionary methods can absolutely find structure that we missed, and they often go hand in hand with learning. Like AlphaGo move 37.
AlphaGo had a lot of driver code involved to make it tick; it wasn't just a big network deciding what to do. You would need something similar here, and without someone figuring out that driver code you aren't revolutionizing anything with today's neural networks.
Yes, since Go is a very simple game. Making a proper driver for much more complex domains like engineering blueprints is not something we know how to do today.
Edit: Also you are missing the Go engine in that comment, it can't train without a Go engine to train against that evaluates the results of each move. That Go engine is a part of the training algorithm and thus is also a part of the driver code, you would need to produce something similar to train a similar AI for other domains. We don't know how to write similar blueprint engines or text evaluation engines, so we can't expect such AI models to produce similar results.
The hypothesis that you can't learn some things from text alone (you need real-life experience) is intuitive, and I used to think it was true. But there are interesting results from just a few days ago suggesting that text by itself is also enough:
> We test a stronger hypothesis: that the conceptual representations learned by text only models are functionally equivalent (up to a linear transformation) to those learned by models trained on vision tasks. Specifically, we show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection.
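Shape-wise, the mechanism in that quote is tiny. Here is a minimal sketch with made-up dimensions (the real models' sizes differ), where the single projection matrix is the only trainable part:

```python
import numpy as np

d_vis, d_lm = 512, 768       # made-up feature sizes for illustration
W = np.zeros((d_lm, d_vis))  # the single linear projection: the ONLY trained part

image_features = np.random.randn(d_vis)  # output of a frozen vision model
visual_token = W @ image_features        # one continuous "prompt token"

# The frozen LM then sees the projected visual token prepended to its
# ordinary text-token embeddings, with none of its own weights updated.
text_tokens = np.random.randn(4, d_lm)
prompt = np.vstack([visual_token, text_tokens])
print(prompt.shape)  # (5, 768)
```

If representations really are equivalent up to a linear transformation, this one matrix is all the "translation" a frozen LM needs.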
The claim isn’t that you can’t learn it from text, but rather that this is why models require so much text to train on - because they’re learning the stuff that humans learn from video.
The key issue is learning effort (such as energy versus time). Congenitally deaf-blind humans, with no accompanying mental disabilities from a shared cause, can learn just fine as children without any video or sound, from comparatively low-bandwidth channels like proprioception and touch.
Another issue is that what we really care about is scientific reasoning, and there, if anything, nature has given us an anti-bias, at least at the level of interfacing with facts. People aren't born biased towards learning metric tensors and Christoffel symbols, but it takes only a few years at a handful of hours a day, using a small number of joules, for many humans to get it (I'm counting from all grade-school prerequisites, versus GPU watts x time). Much fewer for genius children.
I'm testing this argument out, but doesn't this apply to all tasks, not just language? I can learn to paint from scratch in what, like 300 attempts? 1,000 attempts? It takes far more examples to train a guided diffusion model, and I'd struggle to believe that our brains are hardwired for painting.
> Humans take in a tremendously high bitrate of data via other senses and are able to connect those to the much lower amount of language input such that the language can go much further.
They don't. Human bitrates are quite low, all things considered. The eyes, which by far produce the most information, only have a bitrate equivalent to ~2kbps:
The rest of the input nerves don't bring us over 20kbps.
The average image recognition system has access to more data and can tell the difference between a cat and a banana. A human has somewhat more capability than that.
I think the link says a single synapse does 2kbps, not the whole visual cortex. There are about 6 trillion (6x10^12) synapses (3 trillion per hemisphere) in the visual cortex according to https://pubmed.ncbi.nlm.nih.gov/7244322/
If we play "how many ping-pong balls fit in the bus" with that information: the cortex is a 3D structure, so if you assume it is a perfect cube that feeds to something right behind it, you get (10^12)^(2/3) = 10^8 surface channels at 2kbps each, i.e. about 25GB/s. That is less than an order of magnitude off from the estimate you would get for two eyes at 8000x8000 resolution, True Color, 24fps: about 9GB/s.
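The arithmetic of that estimate, written out (all inputs are the rough figures from the comment itself):

```python
# Ping-pong-ball estimate from the comment above, written out.
synapses = 1e12                 # order-of-magnitude synapse count used above
channels = synapses ** (2 / 3)  # ~1e8 "channels" on one face of a cube
kbps_per_channel = 2e3          # ~2 kbps per synapse

cortex_GBps = channels * kbps_per_channel / 8 / 1e9
print(f"cortex estimate: ~{cortex_GBps:.0f} GB/s")  # ~25 GB/s

# Naive video comparison: two eyes, 8000x8000 px, 24-bit color, 24 fps.
video_GBps = 2 * 8000 * 8000 * 3 * 24 / 1e9
print(f"video estimate:  ~{video_GBps:.1f} GB/s")   # ~9.2 GB/s
```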