Computer scientists prove why bigger neural networks do better

SpaceManNabs · on Feb 10, 2022

> Right now, we are routinely creating neural networks that have a number of parameters more than the number of training samples. This says that the books have to be rewritten.

Confused by this statement. Double descent with overparameterization is exhibited in "classical settings" too and mentioned in older books.

> In their new proof, the pair show that overparameterization is necessary for a network to be robust.

What is important to note here is that many of papers this paper cites prove or show this result in certain network architectures. This paper adds universality.

> The proof is very elementary — no heavy math, and it says something very general

The most elementary part was clever use of Hoeffding's inequality. Some people are really fast readers haha.

I don't even know how you pick up the fact that isoperimetry holds in manifold settings with positive curvature while also playing with all those norms and inequalities. A few years ago I mentioned on here all the maths that I knew or wanted to know to read more papers, and others critiqued that the list was too long. Well, this is why!

aray · on Feb 10, 2022

> Double descent with overparameterization is exhibited in "classical settings" too and mentioned in older books.

I’m curious for references or citations to this. When I was going over double descent I tried to find citations like this (just in a couple places like ML/stats textbooks).

tomrod · on Feb 10, 2022

There are a handful of papers in the 90s that show this, but it wasn't recognized for what it is. Double descent is REALLY crazy to me, coming from a classical background.

pishpash · on Feb 10, 2022

Over-parameterization for regularization is really old. The pseudoinverse min-norm solution for under-determined linear systems even has that flavor.

tomrod · on Feb 10, 2022

Sure, but that's identification approaches in econometrics and matrix analysis contexts. Using that for neural networks is new-ish in the zeitgeist, which did not exist in the 1990s as it does today.

moyix · on Feb 10, 2022

Here's one that lists some older references: https://arxiv.org/abs/2004.04328

hikarudo · on Feb 11, 2022

Here's a very good visual explanation of double descent:

https://mlu-explain.github.io/double-descent/

oxff · on Feb 11, 2022

> ... "the model perfectly memorizes ... the training data."

Can you call this learning?

carbocation · on Feb 11, 2022

Daniela Witten has a nice twitter thread on the topic of double descent in classical settings: https://twitter.com/daniela_witten/status/129229310485515878...

jfmc · on Feb 11, 2022

> Right now, we are routinely creating neural networks that have a number of parameters more than the number of training samples. This says that the books have to be rewritten.

I also believe that this statement is weird. I have a very shallow knowledge of ML, but I can imagine that in a convolutional neural network a training sample interacts with lots of parameters. This 'one training sample <-> one parameter' correspondence seems wrong to me.

asfarley · on Feb 12, 2022

One CNN parameter would interact with the total number of training samples, but that doesn’t tell you about the ratio of total training samples to total CNN parameters.

stared · on Feb 10, 2022

I am surprised that the paper does not even cite the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635, https://eng.uber.com/deconstructing-lottery-tickets/).

In the LTH paper (IMHO the most fundamental deep learning publication in the last few years), the number of tickets goes as layer_size^n_layers.

gwern · on Feb 10, 2022

I don't see how lottery tickets yield the isoperimetry result, even in a heuristic or handwavy sort of way. Yes, a larger network is more likely to have good-scoring subnetworks; sure. But that's all it says. What does that tell me about how efficiently I can construct an adversarial example? For that, I need something else, like, say a geometric argument about what sort of network will interpolate between high-dimensional datapoints with properties like "not changing much in response to small input changes"...

paraph1n · on Feb 11, 2022

In the linked Uber paper, can someone explain the plots with the ellipses? It's a 2d plot of initial vs final weight, yes? But where does the ellipse shape come from and why?

renewiltord · on Feb 10, 2022

Considering the subject, it is at least somewhat amusing that you double posted this.

rackjack · on Feb 10, 2022

Silly thought: if bigger NN's are better, shouldn't more neurons be better? Why aren't elephants smarter than us, despite having more neurons?

https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...

https://pubmed.ncbi.nlm.nih.gov/24971054/

gwern · on Feb 10, 2022

Density is also important. If we look at other things - some recent studies have been done on number-counting (https://royalsocietypublishing.org/doi/10.1098/rstb.2020.052...) or bird brains (https://www.gwern.net/docs/psychology/neuroscience/2020-herc...) - density jumps out as a major predictor. African elephants may have some more neurons, but the density isn't as great as a human where it counts, so they are remarkably intelligent (like ravens and crows), but still not human-level. There are diminishing returns in both directions. We have more neurons than any bird as much or more dense, and we have more density than any elephant with as many or more neurons. Put that together, and we squeak across the finish line to being just smart enough to create civilization.

An analogy: what's the difference between a supercomputer, and the same number of CPUs scattered across a few datacenters? It's that in a supercomputer, those CPUs are packed physically as close as possible with expensive interconnects to allow them to communicate as fast as possible. (For many applications, the supercomputer will finish long before the spread out nodes ever finish communicating and idling.) But you need to improve both or else your new super-fast CPUs will spend all their time waiting on Infiniband to chug through, or your fancy new Infiniband will be underutilized and you should've bought more CPUs.

throwaway5486nv · on Feb 11, 2022

My understanding is more than the density configuration of the neurons matters most. The reason is in some cases neural network with drop out's perform better than fully connected neural network. This proves less dense networks can be more intelligent.

thesz · on Feb 13, 2022

>This proves less dense networks can be more intelligent.

Case in point: https://openai.com/blog/block-sparse-gpu-kernels/

More-nitors · on Feb 11, 2022

maybe this is like 20nm vs 10nm vs 5nm?

user90349032 · on Feb 10, 2022

And yet, no animal except humans is self aware. Really makes you wonder why that is.

moomin · on Feb 10, 2022

Probably that you don’t know how to measure what you’re describing.

Plenty of animals recognise themselves in the mirror, for instance.

988747 · on Feb 10, 2022

I wouldn't say "plenty" - few primates, dolphins, orcas, elephants, and, strangely, magpies. But the grounds for that claim are shaky for some of them, the only 3 species we are 100% sure about are chimpanzees, orangutans, and humans. Magpies, for example, require "a training" (whatever that means).

rjzzleep · on Feb 11, 2022

Probably cause we've only tested a few, not that it matters though. Humans take a pretty long time to recognize themselves in the mirror. I wonder if the mirror test would change if we would expose the animals for a almost a year before doing the test, just like humans.

That said even ants pass the test, i.e. they were recently(2015) tested.

But the whole thing can be characterized as: "Let me make up a random test, according to my personal opinion of what defines cognition and then see if a random animal I choose passes it".

Every couple of years we have requests of slews of psychology papers requested to be invalidated because they're unreproducible.

user_7832 · on Feb 11, 2022

> But the whole thing can be characterized as: "Let me make up a random test, according to my personal opinion of what defines cognition and then see if a random animal I choose passes it".

But of course! How do we know that humans are, indeed, the smartest? What if we've been failing every single test that mice have been throwing at us over the past millennia, and they wonder why we are so dumb?

(This is a reference to The Hitchhiker's Guide to the Galaxy in case you're wondering if I've gone mad. Not that one wouldn't presuppose the other :)

kaba0 · on Feb 11, 2022

The mirror-test is biased to vision, though.

Dogs pass the scent-based mirror test for example, their eyes are just simply not the primary way of interacting with the world.

Swizec · on Feb 10, 2022

There are lots of self aware non human animals.

Dolphins and elephants are famous examples, most primates as well. Even many birds show levels of self awareness and theory of mind (they know the difference between what they know and what others know)

attemptone · on Feb 10, 2022

There are also lots of non self-aware human animals :P

visarga · on Feb 10, 2022

Seems like being a social animal is necessary for self awareness.

Swizec · on Feb 10, 2022

In fact there is a popular theory[1] that bird intelligence evolved because of the way their social structures work. Birds mate for life but they cheat. Every bird wants their partner to be loyal and itself to sex as many other birds as possible.

This means birds have to keep track of who can and can't see them cheat, who knows and who doesn't. There's even evidence that they rat each other out (2nd degree info) if they think there's a reward to be had. All of this requires immense intelligence, which happens to prove useful in other contexts.

There's also a bird species who does this with food caches. Easier to steal from others than to build their own so a plethora of deceptive tactics developed to ensure others can't see where you're storing those delicious nuts. Complete with fake caches, lying, and espionage.

[1] I learned about it in The Genius of Birds

saltminer · on Feb 11, 2022

Thievery in nature is really interesting, squirrels also do it.

https://www.labroots.com/trending/plants-and-animals/15629/s...

Ardon · on Feb 10, 2022

You might be interested in the theories on the evolution of human intelligence: https://en.wikipedia.org/wiki/Evolution_of_human_intelligenc...

This is exactly the question the field is about, and I find it fascinating to read about

tpoacher · on Feb 10, 2022

depends on your definition of self aware.

Most cats and dogs I've seen correctly recognise themselves in the mirror after the novelty of seeing one for the first time wears off.

t-3 · on Feb 11, 2022

My cats use reflections to spy on and stalk each other, which seems to me a way more significant sign of intelligence than looking at themselves.

dr_dshiv · on Feb 10, 2022

Self awareness is social self awareness. Viewing oneself as a social actor.

stjohnswarts · on Feb 10, 2022

that is simply incorrect bonobos, orcas, elephants, dolphins, chimpanzees, etc have all shown degrees of self awareness.

stjohnswarts · on Feb 11, 2022

Haha downvoted for stating the obvious and something that can be found in seconds on any google search. Never change HN.

0ldskool · on Feb 10, 2022

Is that something that can/has been proven? From my understanding there are many other smart animals that are self aware.

stjohnswarts · on Feb 11, 2022

There is proof, and you can find lots of studies and papers on it. HN is a bit cultish and likes to think that humans are far above every other creature on the planet in importance.

q845712 · on Feb 10, 2022

are you sure? https://en.wikipedia.org/wiki/Theory_of_mind_in_animals

alexpotato · on Feb 10, 2022

Because, IIRC, a lot of neurons are dedicated to motion/sensing.

Bigger animals may require more neurons to handle moving larger and/or more complicated muscle groups.

Interesting related point there is the encephalization quotient which is related to the predicted ratio of brain size to body mass. On the wikipedia page [0] they list the EQ for various animals. Humans are the highest but dolphins and ravens are not far behind.

molticrystal · on Feb 10, 2022

To further emphasize that having neural material focused on the appropriate functions is more important vs how much you have, here is a story about a guy whose brain is mostly hollow and filled with fluid, it probably did cause his IQ to be 75 and causes him weakness in his legs, but otherwise he lives a normal adult life more or less.

https://www.newscientist.com/article/dn12301-man-with-tiny-b...

gwern · on Feb 10, 2022

Volume != neurons. In any case, 75 is awful and is usually considered borderline retarded. (If you're tempted to respond with other cases of higher IQ, note that they are often retracted or unconfirmed and likely fraudulent in some way; see https://www.gwern.net/Hydrocephalus .)

willmw101 · on Feb 10, 2022

>Volume != neurons

Exactly. Most of the newer research on this topic suggests that it's neural connection complexity, and specifically frontal lobe volume, rather than overall brain size that determines intelligence or brain power.

https://neuroscience.stanford.edu/news/ask-neuroscientist-do...

>Luckily, there is much more to a brain when you look at it under a microscope, and most neuroscientists now believe that the complexity of cellular and molecular organization of neural connections, or synapses, is what truly determines a brain’s computational capacity. This view is supported by findings that intelligence is more correlated with frontal lobe volume and volume of gray matter, which is dense in neural cell bodies and synapses, than sheer brain size. Other research comparing proteins at synapses between different species suggests that what makes up synapses at the molecular level has had a huge impact on intelligence throughout evolutionary history. So, although having a big brain is somewhat predictive of having big smarts, intelligence probably depends much more on how efficiently different parts of your brain communicate with each other.

mattkrause · on Feb 10, 2022

As a counterpoint, rats without a cortex can do...basically everything normal rats can do--except trim their toenails. The classic reference for this is Whitslaw's 1990 chapter "The decorticate rat".

This thread has links to a copy, plus a bunch of related studies in humans and animals. https://twitter.com/markdhumphries/status/107105276276554137...

heavenlyblue · on Feb 11, 2022

Have you read the PDF given?

The whole nail part is basically a single sentence in the paper.

For example, decorticate rats are unable to escape narrow alleyways because they can not turn around due to their tonsils touching the walls and them being unable to ignore that feeling.

Another example is that they take a few seconds vs (!) 5 minutes to groom themselves on average.

mattkrause · on Feb 11, 2022

Yes! I got interested in this when my colleague worked with a person who had an entire hemisphere resected as a teenage.

FWIW, the nail thing is a bit of a neuroscience meme. I heard--and stole--this quip from multiple people in several different situations. There's also a really striking figure in that chapter (p. 7 or 8).

No argument that the rats' behaviors are affected. I suppose whether you find the slowness of their grooming expected (because of brain damage) or impressive (because it happens at all) is a matter of taste. Glass^W Skull half-empty or half-full, if you will.

heavenlyblue · on Feb 11, 2022

> No argument that the rats' behaviors are affected. I suppose whether you find the slowness of their grooming expected

They are not slow, they just stop grooming themselves well enough.

gwern · on Feb 11, 2022

Stick that rat outside in a wild rat colony, and we'll see how well "everything normal rats can do except trim their toenails" works out.

mattkrause · on Feb 11, 2022

The cool thing about Whitslaw's work is that it focused on natural behaviors (rather than like...a rotorod test).

I don't think he released them into the wild (would be a tough experiment with 80s tech), but there are a bunch of studies of their interactions with conspecifics. They can mate[0], though less successfully than controls, but playfight a bit better than they do[1].

[0] https://psycnet.apa.org/record/1983-29790-001 [1] https://doi.org/10.1159/000114124

acchow · on Feb 10, 2022

Doesn't this demonstrate the opposite of what you were claiming?

lacksconfidence · on Feb 10, 2022

I feel like the quotes agree with parent:

> “If something happens very slowly over quite some time, maybe over decades, the different parts of the brain take up functions that would normally be done by the part that is pushed to the side,” adds Muenke, who was not involved in the case.

Ajedi32 · on Feb 10, 2022

Did you see the scans? The dude's head is practically empty (brain 55-75% smaller than normal) and nobody even noticed until he was 44 years old and got an MRI.

divbzero · on Feb 10, 2022

I think it’s that a priori you would expect a hollow brain to have a far more drastic effect and not allow for a mostly normal adult life.

pishpash · on Feb 10, 2022

Why would you expect that, when a tiny insect can do pretty intelligent things? What "unexpected" things humans can do are probably all in the >75 IQ range.

cloogshicer · on Feb 10, 2022

You probably already know this (since you wrote "silly thought"), but real-life neurons are ridiculously more complex than simulated "neurons" in an NN. So the analogy doesn't really hold.

pishpash · on Feb 10, 2022

They're more complex in biological construction and in signaling mechanism, but no proof that they are more complex in function.

mattkrause · on Feb 10, 2022

An individual biological neuron can compute a variety of functions, including max and xor, that a single perceptron can't (e.g., https://www.science.org/doi/10.1126/science.aax6239 ). In general, one needs a fairly elaborate ANN to approximate the behavior of a single biological neuron.

OTOH, a three-layer network is a universal function approximator and RNNs are universal dynamical systems approximators, so they are sort of trivially equivalent.

csomar · on Feb 10, 2022

Do you think they have accumulated complexity for no reason. Because usually evolution kicks out things that aren't needed.

pishpash · on Feb 11, 2022

For one they need to engage in metabolism and reproduction, but I'd like to see some argument of how neurons are more complex without those needs, e.g. do they compute some radically different class of functions than typical ANN's do, or require entirely different interconnections, etc.

XnoiVeX · on Feb 10, 2022

I think a lot of people on this thread are missing this critical insight.

visarga · on Feb 10, 2022

You can simulate the data processing of a real neuron with 1000 digital ones, a small neural net.

I think we read too much into the complexity of biological neurons. Remember they need to do much more than compute signals. They need to self assemble, self replicate and pass through various stages of growth. They need to function for 80-100 years. Many of those neurons and synapses exist only for redundancy and other biological constraints.

A digital neuron doesn't care about its physical substrate and can be millions of times faster. They can be copied identically for no cost and cheaply fine-tuned for new tasks. Their architecture and data can evolve much faster than ours, and the physical implementation can remain the same during this process.

btilly · on Feb 10, 2022

How do you measure intelligence? Elephants have much better memories than we do!

https://www.scientificamerican.com/article/elephants-never-f...

Ajedi32 · on Feb 10, 2022

That article doesn't seem to support your claim. All of the feats mentioned would be entirely unremarkable in your average human.

btilly · on Feb 10, 2022

Really? You'd immediately recognize someone you knew for a few weeks over 20 years ago? You wouldn't need a bit to try to figure out who they are?

If so, then your memory is unusually good. I know that this is well beyond my capabilities. Nor do I have the ability to visit a place that I lived 40 years earlier and find my way around.

Spooky23 · on Feb 10, 2022

I think it depends on the intensity of the experience.

I recently found myself in a hotel that I stayed in as a 7-8 year old in the 80s for a particularly memorable vacation with my extended family. It was funny that I still remembered the I unusual aspects of the layout and could spot many of the changes that had been made over the years.

But if you asked me to describe someone I met for a few days in a business context in 2020, I’d have a hard time remembering detail.

Someone · on Feb 10, 2022

How many other elephants did these elephants see in those 20+ years? It wouldn’t surprise me if that was fewer than 100. How many did they spend a few weeks or more with? It wouldn’t surprise me if that were less than 20.

There also, AFAIK, isn’t evidence they remember _all_ other elephants they’ve shared time with for at least few weeks (I certainly do not rule that out, either, given the low number they likely will meet in their life)

tshaddox · on Feb 10, 2022

> You'd immediately recognize someone you knew for a few weeks over 20 years ago?

Yeah? Maybe not if they were a kid 20 years ago or their appearance had otherwise changed significantly, but otherwise I don't see why not.

joebob42 · on Feb 10, 2022

Aside from other points, more neurons might be better "all else equal", but there are differences between our brain and an elephant's beyond just neuron count.

It's like how just getting a bigger faster computer can help with your problem, but its less powerful than a new more efficient algorithm on the same computer.

juancn · on Feb 10, 2022

Well, biological neurons are much more complex than CS neurons (https://www.quantamagazine.org/how-computationally-complex-i...).

Also, you're working under the assumption that they are equivalent between mammals which as far as we can tell it's not the case (https://www.medicalnewstoday.com/articles/why-are-human-neur...).

So my guess is that the comparison is much more complex than just number of neurons.

fabiospampinato · on Feb 10, 2022

I read on wikipedia [0] the other day a fairly disturbing statistic related to this, apparently human men have on average a ~10% bigger brain than women. It'd be interesting to know if that translates to a higher neuron count or the difference in volume is due to something else.

[0]: https://en.wikipedia.org/wiki/Brain_size#:~:text=In%20men%20....

andrewflnr · on Feb 10, 2022

Probably just a consequence of overall physical size being larger. AFAIK there continues to be no evidence of a sex difference in overall intelligence, so slight difference in brain size is probably a red herring.

World_Peace · on Feb 10, 2022

Elephants very likely could be more intelligent than us, it just seems that intelligence is a difficult thing to measure quantitatively.

bee_rider · on Feb 10, 2022

In particular, a given elephant might be "more intelligent" than a human -- we just happen to have evolved from a particular niche that has rendered us bizarrely good at abstracting knowledge and combining it with the knowledge of other humans.

notahacker · on Feb 10, 2022

What is "more intelligent" if not "more capable of abstracting, synthesizing and sharing knowledge"?

bee_rider · on Feb 10, 2022

How about the ability to solve novel problems?

We have very good problem solving ability of course, but a superpowered ability to ask others how they solved the problem. If we wanted to somehow define a kind of 'brain horsepower' type intelligence, it seems to me that the former is closer to it than the latter, and it doesn't seem obvious to me that humans would necessarily take the top spot. Or that there's a reasonable/ethical way to test it -- let's take a human, elephant, crow, and dolphin, raise them in total isolation from the any community to get a measure of their untrained intelligence... we might get some interesting results on intelligence, but mostly we will learn something about ballistics as some ethics review board launches us unto the Sun.

jayd16 · on Feb 10, 2022

You'd also need the desire for such things.

tshaddox · on Feb 10, 2022

It may be hard to measure and even define precisely, but I think it's pretty clear that if we did agree on a definition in the context of this conversation it would be defined in such a way that humans are more intelligent than elephants.

lariati · on Feb 10, 2022

I have listened to François Chollet say that all intelligence is specialized intelligence.

I suspect the question really doesn't make sense if that is true.

We just have this bias/mind projection fallacy that intelligence is a general physical property of the brain that can be measured. I just suspect this is not true.

Like athletic ability doesn't generalize well. Of course, someone not athletic at all is never going to be a great athlete in anything but it makes no sense to compare Lance Armstrong to Patrick Mahomes in some general athletic context. Putting a number on a general athletic ability index between the two would just be total nonsense.

tshaddox · on Feb 10, 2022

For one thing, when the article says "bigger" it means "more parameters," not "more neurons."

urthor · on Feb 11, 2022

If neural networks have shown us anything, it's that not all neurons are born the same. After all, even in nature there are multiple types of neurons.

I imagine it's 100% dependent on the cardinal rule of neural networks:

"Choice of training data is 10 times more important than the actual model."

What we have over elephants are opposable thumbs, excellent eyes, and vocal cords. And crucially, we're generally speaking pretty slow, weak and useless.

Except for our elaborate methods of I/O.

Our entire success is based on a feedback loop. "If human uses their IO this way, human will get more food."

Thus, we become ever more sophisticated at this. We are nothing if not a vehicle for using our high dexterity, low gross force, opposable thumbs in inventive ways to get food.

Plus, we have a biological imperative to pass these techniques on as knowledge.

A baby elephant can probably feed itself by eating green stuff at 1 year old (I know nothing about elephants).

A human child realistically cannot independently scrounge up enough solid food to sustain themselves, until they're what, twelve? Twenty-two? Certainly no younger than eight.

We have, almost certainly, the most useless progeny in the animal kingdom.

Hence we invest an enormous amount of time and energy in education to make them able to feed themselves.

So the tl;dr is

1) Human brains are pretty similar to the animal kingdom's. 2) Human opposable thumbs are world class. Pretty close to as good as it gets. Sight is also top notch, many animals have useless eyeballs. 3) Most human food is obtained by doing creative things with thumbs. This is very complex, and takes a lot of practice. 4) Human birth the most useless children in the entire animal kingdom. These children take decades to fully grow, hence we invest an enormous amount of time educating them in opposable thumbs. 5) Over time our education system gets better and better, and our list of clever things we can do with opposable thumbs get longer and longer.

Essentially what we have over the other animals isn't neurons.

What we have over the other animals is a data collection/cleaning/utilization cycle.

kemiller · on Feb 10, 2022

IANANS but my understanding is that neurons/body mass is more indicative. Large animals have more neurons because large bodies need more.

salty_biscuits · on Feb 10, 2022

They talk about the encephalization quotient, which is to the 2/3 power

https://en.wikipedia.org/wiki/Encephalization_quotient

raxxorrax · on Feb 11, 2022

But only 1/3 as much in their cerebral cortex. Maybe having a multi purpose trunk does bind a lot of capacity?

gfody · on Feb 10, 2022

are we certain they're not? i'm not sure we know how to measure smartness

peterburkimsher · on Feb 10, 2022

Dumbo is smarto!

Elephants have bodies built like a tank (and used as such by Hannibal), but humans have better I/O ports.

{reading, writing, listening, speaking, singing, typing, doing, going}

Without opposable thumbs, an elephant is probably quite envious of human writing & typing. Let's use the privilege wisely to encourage one another, teach and learn from each other, from Donald Tusk, and give a helping hand.

Someone · on Feb 10, 2022

But African elephants have quite versatile opposable finger-like extensions at the tip of their trunks (Asian elephants have only one such thing)

beebeepka · on Feb 10, 2022

I only stopped saying my cat is smarter than the vast majority of people I've met because she is no longer with us.

I did, and still do, believe this to be true. Would love to befriend a bird

wizzwizz4 · on Feb 10, 2022

You can befriend corvids. Teaching them symbolic language is tricky, but they can trade and socialise and solve puzzles (if you manage to explain the puzzle).

prideout · on Feb 10, 2022

> The proof relies on a curious fact about high-dimensional geometry, which is that randomly distributed points placed on the surface of a sphere are almost all a full diameter away from each other.

What theorem is this referring to? Sounds like something I should already be familiar with, but I'm not.

leto_ii · on Feb 10, 2022

I think it's something related to the curse of dimensionality [1] [2], basically just a property of high dimensional spaces (perhaps only certain kinds of spaces though).

[1] https://en.wikipedia.org/wiki/Curse_of_dimensionality

[2] http://kops.uni-konstanz.de/bitstream/handle/123456789/5715/...

bo1024 · on Feb 10, 2022

Yes.

Even though almost every all pairs of points are almost a full diameter away from each other, they are also almost all almost orthogonal (i.e. the angle they make with the center of the sphere is very close to 90 degrees).

hedora · on Feb 10, 2022

The intrinsic dimensionality of a dataset is also relevant here.

The M-Tree is one of my favorite indexes. It works with data that's embedded in infinite dimensional spaces (sometimes; it's bumping up against an impossibility result that's sketched in a sibling comment).

grungegun · on Feb 10, 2022

For reference, see the book High Dimensional Probability by Vershynin. It's free online. See Theorem 3.1.1. It proves that a sub-gaussian random vector is in some sense close in norm to sqrt(n) where n is the number of dimensions. Most of these results are true up to multiplying by some unknown constant.

bick_nyers · on Feb 10, 2022

My initial intuition is telling me that it would be diameter/2, from the perspective of a single point, the closest points would be near zero distance away, and the furthest points would be on the opposite side, a full diameter away, and I am assuming that there are a lot of points in a uniform distribution.

What I have just thought about though, is what points would be exactly diameter/2 distance away from that point? If you have a circle, you might think it would be the points that form a 90 degree triangle, but that is not the case, those points would be sqrt(2)*radius distance away.

So while it is obvious to me that it is not diameter/2, it is not obvious to me why it would be diameter either, or how larger n converges it closer to the diameter or some other fixed number.

dan-robertson · on Feb 10, 2022

If you consider a point on the sphere it means choosing a bunch of xi such that:

  x1^2 + x2^2 + … + xn^2 = 1.

Suppose wlog you pick (1,0,0,…,0). Then the distance from your point to a random point is:

  D = (x1-1)^2 + x2^2 + … + xn^2

And from the first equation we know:

  x1^2 = 1 - x2^2 - x3^2 - … - xn^2

Intuitionistically, your point will be far from a random point if x1 is close to zero, and x1 will be close to zero because everything is close to zero.

But we can be more mathematical about it. Our (very reasonable) assumption is that the volume of a n-dimensional disk is proportional to the nth power of its radius. The third equation shows that x1 is going to be big (meaning the distance to the chosen point above is not so close to the diameter) if a corresponding[1] point on the n-disk is close to the middle. But the distance from the origin, R, of a random point in the n-disk is distributed with pdf proportional to p(r) = r^n for r in [0,1]. So the cdf is just r^(n+1) and E[x1^2] = 1 - E[R] = 1 - (n+1)/(n+2), which tends to 0 as n grows.

Therefore we get E[D] = E[(1-x1)^2] + 1 - E[x1^2] which tends to 2 as n grows large.

[1] the correspondence is that if I give you a point on a disk, you can turn it into a point on a sphere by flipping a coin to decide if it goes in the upper or lower hemisphere and then projecting up or down perpendicular to the disk from the point onto the sphere. But thinking a little more, I’m not sure this preserves the metric as it favours points on the sphere that correspond to the middle parts of the disk. So I think the actual expected value of x1 should be smaller.

WithinReason · on Feb 10, 2022

Let me hijack your explanation starting from this point:

    D = (x₁-1)² + (x₂² + … + xₙ²)

Since all the xₙ² sum to 1, as the dimensionality grows (∑xₙ²→1 as n→∞) each individual xₙ will converge towards 0. Since x₁ is almost 0, therefore the (x₁-1)² term will be almost 1.

Since we know that ∑xₙ²=1, and that x₁² is almost 0, then we also know that ∑xₙ² - x₁² is almost 1, which is the 2nd half of the above expression for D. So the average distance converges to "almost 1 + almost 1", which "almost 2", which is the diameter.

Retric · on Feb 10, 2022

Close, the distance formula is square root of (X1^2 + X2^2 …).

So exactly 1 gives a distance of 1, but almost 1 + almost 1 gives a distance of almost square root of 2.

WithinReason · on Feb 10, 2022

Good point!

akomtu · on Feb 10, 2022

"each individual xₙ will converge towards 0"

I'm not sure it will. x1 is chosen randomly in the -1..1 interval. I dont see how the million other dimensions would force it to stick to 0. Those N other dimensions shrink the stddev(xi) by sqrt(N), though.

WithinReason · on Feb 10, 2022

Then try normalizing a random 1000-element vector. The average of the vector elements is around 0.027.

akomtu · on Feb 11, 2022

The fact that average of D is sqrt(2) is the easy part. But average doesnt mean that random D values would concentrate in that spot. D may be evenly distributed. So the question is what's the variance of D? Using your formulas above:

D^2 = (1 - x1)^2 + (1 - x1^2) = 2 - 2x1

We know that x1 is 0 on average, but its distribution is restricted by the fact that we choose a random point on the n-sphere. So what's the probability of that random point falling into a thin stripe where abs(x1) < eps? And how does this probability behave for large n?

This eps-stripe is basically the (n-1)-sphere of width eps, so the stripe's area A(n)=eps•S(n-1), and so our probability p=eps•S(n-1)/S(n).

The "magical" property of n-spheres is that their area grows at first, reaches maximum in 7 dimensions and then falls off rapidly to zero. Using formulas from wikipedia, I get: p = eps•(n/4)^(n/2) for large n.

In other words, the distribution of x1 approaches the look of the delta function at the n^n pace and for all practical matters, D=sqrt(2) with high precision for n > 10.

Where

adgjlsfhk1 · on Feb 10, 2022

I think the most intuitive way of thinking about this is sphere packing. Asking what percent of points are within distance d of an n-sphere of radius 1 is equivalent to asking what the ratio of volumes is. For d<1, the n-volume of a radius d sphere tends to 0 as n goes towards infinity, so that means almost all of the points are as far away as possible.

bjourne · on Feb 10, 2022

It's just another way to state the https://en.wikipedia.org/wiki/Curse_of_dimensionality

aix1 · on Feb 10, 2022

Not my area of expertise, but the quoted "fact" seems at best incompletely stated: surely for it to hold there must be some constraints on the number of points (likely as a function of the diameter)?

Retric · on Feb 10, 2022

It’s just wrong as stated, there is only one point a full diameter away from each point on a high dimensional sphere. Aka (1,0,0,0,0, …) maps to (-1,0,0,0,0, …) and nothing else. Just as (1,0) maps to (-1,0) on a unit circle and (1,0,0) maps to (-1,0,0) on a unit sphere.

On a high dimensional sphere they should generally be close to square root of 2 radius away from each other.

dan-robertson · on Feb 10, 2022

The fact should say that the expected distance between two random points tends to the diameter as the dimension increases. The intuition is that to be close you need to be close in a large number of coordinates and the law of large numbers (though coordinates aren’t independent) suggests that is unlikely. If you fix one point on a sphere (say (1,0,…,0)) then, for a high dimension, most points will not have any extreme values in coordinates and will look like (~0,~0,…,~0) where ~0 means something close to zero. But if we sum the squares of everything apart from the first we get 1 - (~0)^2 ~= 1, so the distance from our fixed point is (1 - ~0)^2 + sum_2^n (0 - ~0)^2 ~= 1 + 1 = 2.

Retric · on Feb 10, 2022

You forgot the square root on distance formula. Distance = square root of ((X1 - X2) ^ 2 + (Y1 - Y2) ^2 + …).

Consider the origin (0,0,0, …) to a random point on the sphere (~0, ~0, ~0, …). So Distance from origin = square root of ((~0-0)^2 + (~0-0)^2 + (~0-0)^2 + … ), which sums to 1 by definition of the unit high dimensional sphere.

Then plug in 1 vs 0 in the first place because we care about (1,0,0,0 …) and you get the correct answer = square root of ((~0-1)^2 + (~0-0)^2 + (~0-0)^2 + … ) ~= square root of 2.

Edited to fix typo and add clarity.

dan-robertson · on Feb 10, 2022

Wow. Can’t believe I missed that.

hedora · on Feb 10, 2022

If the data points are in the space [0,1]^n, and your metric function is:

d(x,y) = 0 if x == y; 1 otherwise

Then all points are distance one apart. It's been proven that, as dimensionality increases, normal euclidian distance over uniform point clouds rapidly converges to have the same behavior as the equality metric.

The proof relies on the information gained by performing pairwise distance calculations.

In the example distance function I gave, there is zero information gained if you plug in two points that are known to be non-equal.

The information gained from evaluating the Euclidian distance function converges to zero as the dimensionality of the data set increases.

(Note: This does not hold for low dimensional data that's been embedded in a higher dimensional space.)

Edit: Misread your comment. Yes, everything ends up being the same distance apart. More precisely, the ratio of mean distance / stddev distance tends to infinity. The intrinsic dimensionality of the data is monotonic w.r.t. that ratio.

bick_nyers · on Feb 10, 2022

Euclidean distance calculations change based on number of dimensions, for example, in 3 dimensions it is sqrt(a^2+b^2+c^2).

Retric · on Feb 10, 2022

Yes, that’s why it’s square root of 2. Consider the origin (0,0,0, …) to a random point on the sphere (~0, ~0, ~0, …).

Distance = square root of ((X1 - X2) ^ 2 + (Y1 - Y2) ^2 + …). So D = square root of ((~0-0)^2 + (~0-0)^2 + (~0-0)^2 + … ), which is equal to 1 by definition of the unit high dimensional sphere.

So distance from (1,0,0,0 …) to (~0, ~0, ~0, …) = square root of ((~0-1)^2 + (~0-0)^2 + (~0-0)^2 + … ) ~= square root of 2.

bick_nyers · on Feb 10, 2022

Ahh ok, for some reason I was thinking (1,1,1) would be a valid point in this case

ravi-delia · on Feb 10, 2022

It should be almost all points are almost a full diameter away. However it's still very striking, and an unintuitive fact about very high dimensional spheres.

pfortuny · on Feb 10, 2022

It is for VERY HUGE n, as siblings explain.

6gvONxR4sf7o · on Feb 10, 2022

Off topic, but I love that they make it trivial to find a link to the original paper. I know not everyone loves quanta, but stuff like this is really refreshing.

lordgrenville · on Feb 10, 2022

What do people not like about Quanta?

lnyan · on Feb 11, 2022

Reviews and authors' responses of this paper: https://openreview.net/forum?id=z71OSKqTFh7

dchftcs · on Feb 11, 2022

The result is mildly interesting but doesn't strike me as remarkable. We've always known that overparameterization with regularization can improve performance in certain classes of problems. The comparison with using n-order polynomial to fit n points is also non-sequitur.

So there's a lower bound on the number of parameters required to produce a good interpolation of a broad class of "smooth" functions, and that's larger than the data size. Ok. I'm guessing one could find an even larger lower bound to approximate a more exotic class of functions that are still "smooth" in some intuitive sense, and it wouldn't make me any more excited.

The main problem is, what does it say about how well a neural network can approximate specific types of functions, compared to anything else, with the same or more parameters, with the same huge amount of data, which has always been the real mystery here?

Can we analytically define a class of functions that can be fit well by neural networks with n data points, using < parameters, than any other known method, and map these functions to real world applications? Or, in other words, given a class of functions representing general real world data sets that neural networks tend to do well on (dogs and cats), can we characterize a tractable algorithm to compute nd parameters needed from the n sample points where there's high probability that it could produce a good interpolation?

If not, then this paper doesn't explain anything better for neural networks than for say kernel methods with polynomial kernels or manifold reconstruction using restricted classes of spline functions. The sensationalist headline, along with the "books need to be written" rhetoric, wants to make us think the result or the techniques it presented could get us closer to answering the above questions, but it seems to me there's zero truth in that.

555watch · on Feb 11, 2022

Maybe someone that knows the topic well could elaborate a little bit on how do the authors arrive to their conclusion? From what I'm gathering, by Theorem 2 they show, that with n approaching infinity, the probability that the function $f$ exists approaches zero very quickly. I understand that, for this reason, they require that the variables $d$ and $k$ also approach infinity reasonably fast (or faster than $n$) in order to bound that probability with anything other than zero.

We can already stop here and question, why does an upper bound on any similar probability interest us? If we want to show that the function $f$ exists with high probability, we should also consider lower bounds, not only the upper bounds as is done in the paper (clearly, I can bound any probability by 1 and would not be wrong).

But even leaving this question aside and going back to Theorem 2, they essentially show that for any sample of size $n$, they can find (and overfit) a smooth function $f$, given a sufficiently large model space $d$ and $k$. Assume I launch such a model in production, and continue generating further observations $n$. It follows from the theorem 2 that $f$ will quickly become unsuitable and require retraining on a larger training space $d$ and $k$.

However, from a statistical standing point, unless the data generating process is non-differentiable at every point, we should be able to assume that there exists N_epsilon, such, that for every n > n_epsilon, we should be able to find $f$ such that the errors would be controllably small (< epsilon) against the true data generating process f + sigma. So, beyond a certain $n$, further increasing of observations should not affect the initial fit of the model.

This is not at all what follows from the Theorem 2, suggesting that any such $f$ is still fitting on the errors, not necessarily the true process.

What am I missing (or assuming incorrectly)? Would be very interested to discuss this paper further!

kd5bjo · on Feb 10, 2022

Is there a corresponding result that gives the number of examples needed to provide a sufficient training set for a given physical phenomenon? I’m imagining a high-dimensional equivalent of Nyquist’s sampling theorem.

Coupled with this result, we’d then have a reasonable estimator of the network size required for particular tasks before even starting the data collection.

pishpash · on Feb 10, 2022

VC dimension?

samwisedum · on Feb 10, 2022

Let's add more nodes so we can overfit even better!

zwaps · on Feb 10, 2022

Can someone speak to the generality of assuming c-isoperimetry for the distribution of features?

Without knowing anything about this in particular, this seems to be a rather pertinent restriction of the result related to things like sampling assumptions and the like.

woopwoop · on Feb 10, 2022

It really depends on what we assume about "natural" data. If it looks "positively curved", e.g. the uniform measure on the boundary of a convex body, or the a gaussian, or something, this holds. But if the distribution exhibits a strong hierarchical structure that's not so good. I think it's a plausible if not obviously true assumption.

yalogin · on Feb 10, 2022

Can someone ELI5 how one increases the size of an NN? If I take the handwritten digit classifier that people use as ML 101, is it just a matter of increasing the size of the hidden layers?

teebs · on Feb 11, 2022

You can just do that by adding large hidden layers or more hidden layers, up to a point. But eventually the signal that's coming from the input data is so diluted through all the neurons and layers that your models stop performing better as you add more neurons. Many of the advances in NNs come from structuring the neurons in particular ways - for example, in computer vision, convolutional neural networks. These are kind of a small neural network that looks at each part of the image and gives an output - basically, shrinking and summarizing the image - which is then fed into another small neural network that shrinks & summarizes some more, and so on until you have only one result, like "is this a cat" or something like that. Transformers, which are mostly used for natural language processing, have small neural network layers that let the network figure out which other words in a text are relevant for understanding a given word (or, token, really). These help simplify the problem for the NN so it doesn't have to sort through all the data it has at once, which is a bottleneck on scaling NNs.

Not sure that explains it like you're 5, but hopefully it addresses your question

joshvm · on Feb 11, 2022

EfficientNet is a nice simple example of this. It's family of networks with different sizes, controlled by a single scaling parameter. That parameter controls depth, width and resolution - see here for more details, the figure is easy to understand: https://paperswithcode.com/method/efficientnet

malka · on Feb 11, 2022

Yes, or add more layers

raymondh · on Feb 13, 2022

> The proof relies on a curious fact about high-dimensional geometry, which is that randomly distributed points placed on the surface of a sphere are almost all a full diameter away from each other.

That should be √2 × radius.

Looking out from the North Pole of hypersphere, almost all points lie near the equator, not the South Pole.

feketegy · on Feb 10, 2022

This looks interesting, I bookmarked it.

My biggest blocker is the "statistics" part of M/L, knowing what algorithms to choose for various cases.

qorrect · on Feb 10, 2022

This book was a big help for me and is very well written, https://xcelab.net/rm/statistical-rethinking/ . You can find it free online ( along with video course ). The printed version is a very nice high quality book.

stevofolife · on Feb 10, 2022

Do you know if there are any online classes that use this book as a reference? Or more generally, what type of courses teaches this subject?

lariati · on Feb 10, 2022

Thanks so much. That is an amazing level of choice in the example code. I need this right now as a type of statistical strength training.

kache_ · on Feb 10, 2022

check out introduction to statistical learning

VirusNewbie · on Feb 10, 2022

So, the smoother the function, the easier it is to approximate using SGD? Does that mean there's a relationship between training time and parameters and beyond a certain parameter count we can discover an equivalent approximation by trading one for the other?

mensetmanusman · on Feb 11, 2022

Size = N x D

D is the dimensionality, and they give a picture example with D = X.pixels * Y.pixels

How does D change when you introduce video? (Picture arrays)

E.g. can a NN recognize something more by having access to an array of approximately the same image as it sweeps through space/time?

viraptor · on Feb 11, 2022

You'd expect so, because there are already projects which do sub-pixel image recovery from video frame. So putting that before recognition should be doable.

ogogmad · on Feb 10, 2022

Does this help against adversarial examples? The article seems to suggest so.

AmericanBlarney · on Feb 10, 2022

This conclusion feels like saying more CPU and memory are better. Seems obvious that more moves allows matching to have more nuance, but I guess cool that someone proved it.

nazgul17 · on Feb 10, 2022

From what I understand, it says that more parameters are good. This wasn't obvious before this paper: you can fit a polynomial instead of a neutral net, but adding parameters wouldn't help with robustness in that case: the polynomial would become more and more jagged.

ska · on Feb 10, 2022

> Seems obvious that more moves allows matching to have more nuance,

This really has to be balance against overfitting. The key problem in ML is generalization, and lots of things improve training performance while making that worse.

amelius · on Feb 10, 2022

Asymptotically better? Or practically better?

ravi-delia · on Feb 10, 2022

We know from reality that they get practically better, but theoretic intuition suggests you shouldn't see an effect after some point. This paper shows that this intuition is wrong if you want your networks to be robust. It doesn't guarantee large networks will be though.

antishatter · on Feb 11, 2022

More great news from the university of Texas