> Right now, we are routinely creating neural networks that have a number of parameters more than the number of training samples. This says that the books have to be rewritten.
Confused by this statement. Double descent with overparameterization is exhibited in "classical settings" too and mentioned in older books.
> In their new proof, the pair show that overparameterization is necessary for a network to be robust.
What is important to note here is that many of the papers this paper cites prove or show this result for certain network architectures. This paper adds universality.
> The proof is very elementary — no heavy math, and it says something very general
The most elementary part was clever use of Hoeffding's inequality. Some people are really fast readers haha.
I don't even know how you pick up the fact that isoperimetry holds in manifold settings with positive curvature while also playing with all those norms and inequalities. A few years ago I mentioned on here all the maths that I knew or wanted to know to read more papers, and others critiqued that the list was too long. Well, this is why!
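For anyone curious what Hoeffding's inequality actually buys you, here is a tiny numpy check I put together (my own toy illustration, not anything from the paper): compare the empirical tail probability of a sample mean of bounded variables against the 2·exp(-2nt^2) bound.

```python
import numpy as np

# Hoeffding: for i.i.d. X_i in [0, 1], P(|mean - E[X]| >= t) <= 2 * exp(-2 * n * t^2).
rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100_000

samples = (rng.random((trials, n)) < 0.5).astype(float)  # Bernoulli(0.5), bounded in [0, 1]
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = (deviations >= t).mean()
bound = 2 * np.exp(-2 * n * t**2)
print(f"empirical tail probability: {empirical:.3f}")  # around 0.04 for this setup
print(f"Hoeffding bound:            {bound:.3f}")      # about 0.271, comfortably above
```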
> Double descent with overparameterization is exhibited in "classical settings" too and mentioned in older books.
I’m curious about references or citations for this. When I was going over double descent I tried to find citations like this (I only looked in a couple of places, like ML/stats textbooks).
There are a handful of papers in the 90s that show this, but it wasn't recognized for what it is. Double descent is REALLY crazy to me, coming from a classical background.
Sure, but those are identification approaches in econometrics and matrix-analysis contexts. Using them for neural networks is new-ish in the zeitgeist, which did not exist in the 1990s the way it does today.
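Either way, if anyone wants to see a "classical" double descent without a deep net, here is a toy numpy sketch (my own illustration, not from any of those 90s papers): minimum-norm least squares on random ReLU features. The test error typically spikes near p ≈ n and comes back down as p keeps growing; the exact numbers depend on the seed and the noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(x, W, b):
    # Random ReLU feature map: phi(x)_k = max(0, W_k * x + b_k)
    return np.maximum(0.0, np.outer(x, W) + b)

# Toy 1-D regression problem
n_train, n_test, noise = 30, 500, 0.1
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)
y_train = np.sin(2 * np.pi * x_train) + noise * rng.standard_normal(n_train)
y_test = np.sin(2 * np.pi * x_test)

for p in [5, 15, 30, 60, 300, 3000]:  # number of random features (parameters)
    W = rng.standard_normal(p)
    b = rng.uniform(-1, 1, p)
    Phi_train = relu_features(x_train, W, b)
    Phi_test = relu_features(x_test, W, b)
    # lstsq returns the minimum-norm solution once p > n (overparameterized regime)
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"p = {p:5d}  test MSE = {test_mse:.3f}")
# Expect the error to peak around p ~ n_train (= 30) and then drop again for large p.
```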
> Right now, we are routinely creating neural networks that have a number of parameters more than the number of training samples. This says that the books have to be rewritten.
I also believe that this statement is weird. I have a very shallow knowledge of ML, but I can imagine that in a convolutional neural network a training sample interacts with lots of parameters. This 'one training sample <-> one parameter' correspondence seems wrong to me.
One CNN parameter would interact with the total number of training samples, but that doesn’t tell you about the ratio of total training samples to total CNN parameters.
I don't see how lottery tickets yield the isoperimetry result, even in a heuristic or handwavy sort of way. Yes, a larger network is more likely to have good-scoring subnetworks; sure. But that's all it says. What does that tell me about how efficiently I can construct an adversarial example? For that, I need something else, like, say a geometric argument about what sort of network will interpolate between high-dimensional datapoints with properties like "not changing much in response to small input changes"...
In the linked Uber paper, can someone explain the plots with the ellipses? It's a 2d plot of initial vs final weight, yes? But where does the ellipse shape come from and why?
Density is also important. If we look at other things - some recent studies have been done on number-counting (https://royalsocietypublishing.org/doi/10.1098/rstb.2020.052...) or bird brains (https://www.gwern.net/docs/psychology/neuroscience/2020-herc...) - density jumps out as a major predictor. African elephants may have somewhat more neurons, but their density isn't as great as a human's where it counts, so they are remarkably intelligent (like ravens and crows), but still not human-level. There are diminishing returns in both directions. We have more neurons than any bird that is as dense or denser, and more density than any elephant with as many or more neurons. Put that together, and we squeak across the finish line to being just smart enough to create civilization.
An analogy: what's the difference between a supercomputer, and the same number of CPUs scattered across a few datacenters? It's that in a supercomputer, those CPUs are packed physically as close as possible with expensive interconnects to allow them to communicate as fast as possible. (For many applications, the supercomputer will finish long before the spread out nodes ever finish communicating and idling.) But you need to improve both or else your new super-fast CPUs will spend all their time waiting on Infiniband to chug through, or your fancy new Infiniband will be underutilized and you should've bought more CPUs.
My understanding is that, more than density, the configuration of the neurons matters most. The reason is that in some cases neural networks with dropout perform better than fully connected neural networks. This suggests less dense networks can be more intelligent.
I wouldn't say "plenty" - a few primates, dolphins, orcas, elephants, and, strangely, magpies. But the grounds for that claim are shaky for some of them; the only three species we are 100% sure about are chimpanzees, orangutans, and humans. Magpies, for example, require "training" (whatever that means).
Probably because we've only tested a few, not that it matters much though. Humans take a pretty long time to recognize themselves in the mirror. I wonder if the mirror test results would change if we exposed the animals to mirrors for almost a year before doing the test, just like humans.
That said, even ants pass the test; they were tested fairly recently (2015).
But the whole thing can be characterized as: "Let me make up a random test, according to my personal opinion of what defines cognition and then see if a random animal I choose passes it".
Every couple of years there are requests for slews of psychology papers to be invalidated because they're unreproducible.
> But the whole thing can be characterized as: "Let me make up a random test, according to my personal opinion of what defines cognition and then see if a random animal I choose passes it".
But of course! How do we know that humans are, indeed, the smartest? What if we've been failing every single test that mice have been throwing at us over the past millennia, and they wonder why we are so dumb?
(This is a reference to The Hitchhiker's Guide to the Galaxy in case you're wondering if I've gone mad. Not that one wouldn't presuppose the other :)
Dolphins and elephants are famous examples, and most primates as well. Even many birds show levels of self-awareness and theory of mind (they know the difference between what they know and what others know).
In fact there is a popular theory[1] that bird intelligence evolved because of the way their social structures work. Birds mate for life, but they cheat. Every bird wants its partner to be loyal while itself mating with as many other birds as possible.
This means birds have to keep track of who can and can't see them cheat, who knows and who doesn't. There's even evidence that they rat each other out (2nd degree info) if they think there's a reward to be had. All of this requires immense intelligence, which happens to prove useful in other contexts.
There's also a bird species that does this with food caches. It's easier to steal from others than to build your own, so a plethora of deceptive tactics developed to ensure others can't see where you're storing those delicious nuts. Complete with fake caches, lying, and espionage.
There is proof, and you can find lots of studies and papers on it. HN is a bit cultish and likes to think that humans are far above every other creature on the planet in importance.
Because, IIRC, a lot of neurons are dedicated to motion/sensing.
Bigger animals may require more neurons to handle moving larger and/or more complicated muscle groups.
An interesting related point is the encephalization quotient (EQ), which compares actual brain size to the brain size predicted from body mass. The Wikipedia page [0] lists the EQ for various animals. Humans are the highest, but dolphins and ravens are not far behind.
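For the curious, Jerison's classic version of the quotient is just observed brain mass divided by the mass predicted from body mass, roughly 0.12 × (body mass in grams)^(2/3). A back-of-the-envelope sketch with ballpark masses (my own rough numbers, not the figures from the Wikipedia table):

```python
# Encephalization quotient (Jerison): EQ = brain_mass / (0.12 * body_mass ** (2/3)),
# with both masses in grams. The masses below are ballpark illustrative values only.
def eq(brain_g: float, body_g: float) -> float:
    return brain_g / (0.12 * body_g ** (2 / 3))

print(f"human    ~{eq(1_350, 65_000):.1f}")     # roughly 7
print(f"dolphin  ~{eq(1_600, 180_000):.1f}")    # roughly 4
print(f"elephant ~{eq(5_000, 4_500_000):.1f}")  # roughly 1.5
```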
To further emphasize that having neural material focused on the appropriate functions matters more than how much you have, here is a story about a guy whose brain is mostly hollow and filled with fluid. It probably did cause his IQ of 75 and the weakness in his legs, but otherwise he lives a more or less normal adult life.
Volume != neurons. In any case, 75 is awful and is usually considered borderline retarded. (If you're tempted to respond with other cases of higher IQ, note that they are often retracted or unconfirmed and likely fraudulent in some way; see https://www.gwern.net/Hydrocephalus .)
Exactly. Most of the newer research on this topic suggests that it's neural connection complexity, and specifically frontal lobe volume, rather than overall brain size that determines intelligence or brain power.
>Luckily, there is much more to a brain when you look at it under a microscope, and most neuroscientists now believe that the complexity of cellular and molecular organization of neural connections, or synapses, is what truly determines a brain’s computational capacity. This view is supported by findings that intelligence is more correlated with frontal lobe volume and volume of gray matter, which is dense in neural cell bodies and synapses, than sheer brain size. Other research comparing proteins at synapses between different species suggests that what makes up synapses at the molecular level has had a huge impact on intelligence throughout evolutionary history. So, although having a big brain is somewhat predictive of having big smarts, intelligence probably depends much more on how efficiently different parts of your brain communicate with each other.
As a counterpoint, rats without a cortex can do...basically everything normal rats can do--except trim their toenails. The classic reference for this is Whishaw's 1990 chapter "The decorticate rat".
The whole nail part is basically a single sentence in the paper.
For example, decorticate rats are unable to escape narrow alleyways because they cannot turn around: their tonsils touch the walls and they are unable to ignore that feeling.
Another example is that they take around 5 minutes (!) to groom themselves on average, versus a few seconds for normal rats.
Yes! I got interested in this when my colleague worked with a person who had an entire hemisphere resected as a teenager.
FWIW, the nail thing is a bit of a neuroscience meme. I heard--and stole--this quip from multiple people in several different situations. There's also a really striking figure in that chapter (p. 7 or 8).
No argument that the rats' behaviors are affected. I suppose whether you find the slowness of their grooming expected (because of brain damage) or impressive (because it happens at all) is a matter of taste. Glass^W Skull half-empty or half-full, if you will.
The cool thing about Whishaw's work is that it focused on natural behaviors (rather than like...a rotarod test).
I don't think he released them into the wild (would be a tough experiment with 80s tech), but there are a bunch of studies of their interactions with conspecifics. They can mate[0], though less successfully than controls, but playfight a bit better than they do[1].
> “If something happens very slowly over quite some time, maybe over decades, the different parts of the brain take up functions that would normally be done by the part that is pushed to the side,” adds Muenke, who was not involved in the case.
Did you see the scans? The dude's head is practically empty (brain 55-75% smaller than normal) and nobody even noticed until he was 44 years old and got an MRI.
Why would you expect that, when a tiny insect can do pretty intelligent things? The "unexpected" things humans can do are probably all in the >75 IQ range.
You probably already know this (since you wrote "silly thought"), but real-life neurons are ridiculously more complex than simulated "neurons" in an NN. So the analogy doesn't really hold.
An individual biological neuron can compute a variety of functions, including max and xor, that a single perceptron can't (e.g., https://www.science.org/doi/10.1126/science.aax6239 ). In general, one needs a fairly elaborate ANN to approximate the behavior of a single biological neuron.
OTOH, a three-layer network is a universal function approximator and RNNs are universal dynamical systems approximators, so they are sort of trivially equivalent.
For one, they need to engage in metabolism and reproduction, but I'd like to see some argument for how neurons are more complex once you set those needs aside, e.g. do they compute some radically different class of functions than typical ANNs do, or require entirely different interconnections, etc.
You can simulate the data processing of a real neuron with 1000 digital ones, a small neural net.
I think we read too much into the complexity of biological neurons. Remember they need to do much more than compute signals. They need to self assemble, self replicate and pass through various stages of growth. They need to function for 80-100 years. Many of those neurons and synapses exist only for redundancy and other biological constraints.
A digital neuron doesn't care about its physical substrate and can be millions of times faster. They can be copied identically for no cost and cheaply fine-tuned for new tasks. Their architecture and data can evolve much faster than ours, and the physical implementation can remain the same during this process.
Really? You'd immediately recognize someone you knew for a few weeks over 20 years ago? You wouldn't need a bit to try to figure out who they are?
If so, then your memory is unusually good. I know that this is well beyond my capabilities. Nor do I have the ability to visit a place that I lived 40 years earlier and find my way around.
I think it depends on the intensity of the experience.
I recently found myself in a hotel that I stayed in as a 7-8 year old in the 80s for a particularly memorable vacation with my extended family. It was funny that I still remembered the unusual aspects of the layout and could spot many of the changes that had been made over the years.
But if you asked me to describe someone I met for a few days in a business context in 2020, I’d have a hard time remembering detail.
How many other elephants did these elephants see in those 20+ years? It wouldn’t surprise me if that was fewer than 100. How many did they spend a few weeks or more with? It wouldn’t surprise me if that were less than 20.
There also, AFAIK, isn’t evidence they remember _all_ other elephants they’ve shared time with for at least a few weeks (I certainly do not rule that out, either, given the low number they likely will meet in their life).
Aside from other points, more neurons might be better "all else equal", but there are differences between our brain and an elephant's beyond just neuron count.
It's like how just getting a bigger, faster computer can help with your problem, but it's less powerful than a new, more efficient algorithm on the same computer.
I read on Wikipedia [0] the other day a fairly disturbing statistic related to this: apparently human men have, on average, a ~10% bigger brain than women. It'd be interesting to know if that translates to a higher neuron count or whether the difference in volume is due to something else.
Probably just a consequence of overall physical size being larger. AFAIK there continues to be no evidence of a sex difference in overall intelligence, so slight difference in brain size is probably a red herring.
In particular, a given elephant might be "more intelligent" than a human -- we just happen to have evolved from a particular niche that has rendered us bizarrely good at abstracting knowledge and combining it with the knowledge of other humans.
We have very good problem solving ability of course, but a superpowered ability to ask others how they solved the problem. If we wanted to somehow define a kind of 'brain horsepower' type intelligence, it seems to me that the former is closer to it than the latter, and it doesn't seem obvious to me that humans would necessarily take the top spot. Or that there's a reasonable/ethical way to test it -- let's take a human, elephant, crow, and dolphin, raise them in total isolation from any community to get a measure of their untrained intelligence... we might get some interesting results on intelligence, but mostly we will learn something about ballistics as some ethics review board launches us into the Sun.
It may be hard to measure and even define precisely, but I think it's pretty clear that if we did agree on a definition in the context of this conversation it would be defined in such a way that humans are more intelligent than elephants.
I have listened to François Chollet say that all intelligence is specialized intelligence.
I suspect the question really doesn't make sense if that is true.
We just have this bias/mind projection fallacy that intelligence is a general physical property of the brain that can be measured. I just suspect this is not true.
Like athletic ability doesn't generalize well. Of course, someone not athletic at all is never going to be a great athlete in anything but it makes no sense to compare Lance Armstrong to Patrick Mahomes in some general athletic context. Putting a number on a general athletic ability index between the two would just be total nonsense.
If neural networks have shown us anything, it's that not all neurons are born the same. After all, even in nature there are multiple types of neurons.
I imagine it's 100% dependent on the cardinal rule of neural networks:
"Choice of training data is 10 times more important than the actual model."
What we have over elephants are opposable thumbs, excellent eyes, and vocal cords. And crucially, we're generally speaking pretty slow, weak and useless.
Except for our elaborate methods of I/O.
Our entire success is based on a feedback loop. "If human uses their IO this way, human will get more food."
Thus, we become ever more sophisticated at this. We are nothing if not a vehicle for using our high dexterity, low gross force, opposable thumbs in inventive ways to get food.
Plus, we have a biological imperative to pass these techniques on as knowledge.
A baby elephant can probably feed itself by eating green stuff at 1 year old (I know nothing about elephants).
A human child realistically cannot independently scrounge up enough solid food to sustain themselves, until they're what, twelve? Twenty-two? Certainly no younger than eight.
We have, almost certainly, the most useless progeny in the animal kingdom.
Hence we invest an enormous amount of time and energy in education to make them able to feed themselves.
So the tl;dr is
1) Human brains are pretty similar to the animal kingdom's.
2) Human opposable thumbs are world class. Pretty close to as good as it gets. Sight is also top notch, many animals have useless eyeballs.
3) Most human food is obtained by doing creative things with thumbs. This is very complex, and takes a lot of practice.
4) Humans birth the most useless children in the entire animal kingdom. These children take decades to fully grow, hence we invest an enormous amount of time educating them in the use of opposable thumbs.
5) Over time our education system gets better and better, and our list of clever things we can do with opposable thumbs get longer and longer.
Essentially what we have over the other animals isn't neurons.
What we have over the other animals is a data collection/cleaning/utilization cycle.
Without opposable thumbs, an elephant is probably quite envious of human writing & typing. Let's use the privilege wisely to encourage one another, teach and learn from each other, from Donald Tusk, and give a helping hand.
You can befriend corvids. Teaching them symbolic language is tricky, but they can trade and socialise and solve puzzles (if you manage to explain the puzzle).
> The proof relies on a curious fact about high-dimensional geometry, which is that randomly distributed points placed on the surface of a sphere are almost all a full diameter away from each other.
What theorem is this referring to? Sounds like something I should already be familiar with, but I'm not.
I think it's something related to the curse of dimensionality [1] [2], basically just a property of high dimensional spaces (perhaps only certain kinds of spaces though).
Even though almost all pairs of points are almost a full diameter away from each other, they are also almost all nearly orthogonal (i.e. the angle they make at the center of the sphere is very close to 90 degrees).
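Both halves of that are easy to check numerically; here is a quick sketch of mine (normalizing Gaussian vectors gives uniform points on the sphere). Note the distances come out near sqrt(2) times the radius rather than a full diameter, which is the correction other commenters make further down.

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_points(num, dim):
    # Normalizing i.i.d. Gaussian vectors gives uniform points on the unit sphere.
    g = rng.standard_normal((num, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

for dim in [3, 30, 300, 3000]:
    x, y = sphere_points(2000, dim), sphere_points(2000, dim)
    dists = np.linalg.norm(x - y, axis=1)    # distances between paired random points
    cosines = np.sum(x * y, axis=1)          # cos(angle) between paired points
    print(f"dim {dim:5d}: mean distance {dists.mean():.3f} (sqrt(2) = 1.414), "
          f"mean |cos| {np.abs(cosines).mean():.3f}")
# Distances concentrate near sqrt(2) and the angles near 90 degrees as dim grows.
```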
The intrinsic dimensionality of a dataset is also relevant here.
The M-Tree is one of my favorite indexes. It works with data that's embedded in infinite dimensional spaces (sometimes; it's bumping up against an impossibility result that's sketched in a sibling comment).
For reference, see the book High Dimensional Probability by Vershynin. It's free online. See Theorem 3.1.1. It proves that a sub-gaussian random vector is in some sense close in norm to sqrt(n) where n is the number of dimensions. Most of these results are true up to multiplying by some unknown constant.
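Here is roughly what that concentration looks like numerically, with a toy check of mine using standard Gaussian coordinates (which are sub-gaussian); this is the flavor of the theorem, not its exact statement.

```python
import numpy as np

rng = np.random.default_rng(0)

# ||X||_2 for X with i.i.d. standard normal coordinates concentrates around sqrt(n):
# the mean tracks sqrt(n) while the fluctuations stay O(1).
for n in [10, 100, 1_000, 10_000]:
    norms = np.linalg.norm(rng.standard_normal((5_000, n)), axis=1)
    print(f"n = {n:6d}: mean norm = {norms.mean():8.2f}, sqrt(n) = {np.sqrt(n):8.2f}, "
          f"std = {norms.std():.2f}")
```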
My initial intuition tells me that it would be diameter/2: from the perspective of a single point, the closest points would be near zero distance away and the furthest points would be on the opposite side, a full diameter away (assuming there are a lot of points in a uniform distribution).
What I have just thought about, though, is which points would be exactly diameter/2 away from that point. If you have a circle, you might think it would be the points that form a 90 degree angle at the center, but that is not the case; those points would be sqrt(2)*radius away.
So while it is obvious to me that it is not diameter/2, it is not obvious to me why it would be diameter either, or how larger n converges it closer to the diameter or some other fixed number.
If you consider a point on the sphere it means choosing a bunch of xi such that:
x1^2 + x2^2 + … + xn^2 = 1.
Suppose wlog you pick (1,0,0,…,0). Then the squared distance from your point to a random point is:
D = (x1-1)^2 + x2^2 + … + xn^2
And from the first equation we know:
x1^2 = 1 - x2^2 - x3^2 - … - xn^2
Intuitively, your point will be far from a random point if x1 is close to zero, and x1 will be close to zero because every coordinate is close to zero.
But we can be more mathematical about it. Our (very reasonable) assumption is that the volume of an n-dimensional disk is proportional to the nth power of its radius. The third equation shows that x1 is going to be big (meaning the distance to the chosen point above is not as large as it could be) if a corresponding[1] point on the n-disk is close to the middle. But the distance from the origin, R, of a random point in the n-disk is distributed with pdf proportional to p(r) = r^(n-1) for r in [0,1]. So the cdf is just r^n and E[x1^2] = 1 - E[R] = 1 - n/(n+1), which tends to 0 as n grows.
Therefore we get E[D] = E[(1-x1)^2] + 1 - E[x1^2], which tends to 2 as n grows large, i.e. the distance itself tends to sqrt(2).
[1] the correspondence is that if I give you a point on a disk, you can turn it into a point on a sphere by flipping a coin to decide if it goes in the upper or lower hemisphere and then projecting up or down perpendicular to the disk from the point onto the sphere. But thinking a little more, I’m not sure this preserves the metric as it favours points on the sphere that correspond to the middle parts of the disk. So I think the actual expected value of x1 should be smaller.
Let me hijack your explanation starting from this point:
D = (x₁-1)² + (x₂² + … + xₙ²)
Since all the xₙ² sum to exactly 1, as the dimensionality grows each individual xₙ² must (typically) shrink towards 0. Since x₁ is almost 0, the (x₁-1)² term will be almost 1.
Since we know that ∑xₙ²=1, and that x₁² is almost 0, we also know that ∑xₙ² - x₁² is almost 1, which is the second half of the above expression for D. So the average squared distance converges to "almost 1 + almost 1", which is "almost 2"; that is, the distance itself is almost √2 times the radius.
I'm not sure it will. x1 is chosen randomly in the -1..1 interval. I don't see how the million other dimensions would force it to stick to 0. Those N other dimensions shrink stddev(xi) by sqrt(N), though.
The fact that the average of D is sqrt(2) is the easy part. But the average doesn't mean that random D values concentrate in that spot; D might be spread out evenly. So the question is: what's the variance of D? Using your formulas above:
D^2 = (1 - x1)^2 + (1 - x1^2) = 2 - 2x1
We know that x1 is 0 on average, but its distribution is restricted by the fact that we choose a random point on the n-sphere. So what's the probability of that random point falling into a thin stripe where abs(x1) < eps? And how does this probability behave for large n?
This eps-stripe is basically the (n-1)-sphere of width eps, so the stripe's area A(n)=eps•S(n-1), and so our probability p=eps•S(n-1)/S(n).
The "magical" property of n-spheres is that their area grows at first, reaches maximum in 7 dimensions and then falls off rapidly to zero. Using formulas from wikipedia, I get: p = eps•(n/4)^(n/2) for large n.
In other words, the distribution of x1 approaches the look of the delta function at the n^n pace and for all practical matters, D=sqrt(2) with high precision for n > 10.
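A quick empirical check of that concentration (my own sketch): sample uniform points on the sphere, then look at the spread of x1 and of D = sqrt(2 - 2*x1).

```python
import numpy as np

rng = np.random.default_rng(0)

for n in [3, 10, 100, 1_000]:
    x = rng.standard_normal((100_000, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform points on the unit sphere
    x1 = x[:, 0]
    d = np.sqrt(2 - 2 * x1)                          # distance to the fixed point (1, 0, ..., 0)
    print(f"n = {n:5d}: std(x1) = {x1.std():.4f} (1/sqrt(n) = {1/np.sqrt(n):.4f}), "
          f"mean D = {d.mean():.3f}, std(D) = {d.std():.3f}")
# x1 concentrates around 0 like 1/sqrt(n), and D piles up near sqrt(2) ~ 1.414.
```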
I think the most intuitive way of thinking about this is sphere packing. Asking what percent of points are within distance d of a given point on an n-sphere of radius 1 is roughly equivalent to asking for a ratio of volumes. For d < 1, the n-volume of a radius-d ball tends to 0 relative to the whole as n goes towards infinity, so almost all of the points are nearly as far away as possible.
Not my area of expertise, but the quoted "fact" seems at best incompletely stated: surely for it to hold there must be some constraints on the number of points (likely as a function of the diameter)?
It’s just wrong as stated, there is only one point a full diameter away from each point on a high dimensional sphere. Aka (1,0,0,0,0, …) maps to (-1,0,0,0,0, …) and nothing else. Just as (1,0) maps to (-1,0) on a unit circle and (1,0,0) maps to (-1,0,0) on a unit sphere.
On a high dimensional sphere they should generally be close to square root of 2 radius away from each other.
The fact should say that the expected distance between two random points tends to the diameter as the dimension increases. The intuition is that to be close you need to be close in a large number of coordinates and the law of large numbers (though coordinates aren’t independent) suggests that is unlikely. If you fix one point on a sphere (say (1,0,…,0)) then, for a high dimension, most points will not have any extreme values in coordinates and will look like (~0,~0,…,~0) where ~0 means something close to zero. But if we sum the squares of everything apart from the first we get 1 - (~0)^2 ~= 1, so the distance from our fixed point is (1 - ~0)^2 + sum_2^n (0 - ~0)^2 ~= 1 + 1 = 2.
You forgot the square root on distance formula. Distance = square root of ((X1 - X2) ^ 2 + (Y1 - Y2) ^2 + …).
Consider the origin (0,0,0, …) to a random point on the sphere (~0, ~0, ~0, …). So Distance from origin = square root of ((~0-0)^2 + (~0-0)^2 + (~0-0)^2 + … ), which sums to 1 by definition of the unit high dimensional sphere.
Then plug in 1 vs 0 in the first place because we care about (1,0,0,0 …) and you get the correct answer = square root of ((~0-1)^2 + (~0-0)^2 + (~0-0)^2 + … ) ~= square root of 2.
If the data points are in the space [0,1]^n, and your metric function is:
d(x,y) = 0 if x == y; 1 otherwise
Then all distinct points are distance one apart. It's been proven that, as dimensionality increases, the normal Euclidean distance over uniform point clouds rapidly converges to the same behavior as the equality metric.
The proof relies on the information gained by performing pairwise distance calculations.
In the example distance function I gave, there is zero information gained if you plug in two points that are known to be non-equal.
The information gained from evaluating the Euclidian distance function converges to zero as the dimensionality of the data set increases.
(Note: This does not hold for low dimensional data that's been embedded in a higher dimensional space.)
Edit: Misread your comment. Yes, everything ends up being the same distance apart. More precisely, the ratio of mean distance / stddev distance tends to infinity. The intrinsic dimensionality of the data is monotonic w.r.t. that ratio.
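That ratio is easy to watch grow; here is a small sketch of mine, just to illustrate the point, with uniform point clouds in [0,1]^n:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in [2, 10, 100, 1_000, 10_000]:
    x = rng.random((2_000, n))
    y = rng.random((2_000, n))
    d = np.linalg.norm(x - y, axis=1)
    print(f"n = {n:6d}: mean = {d.mean():7.2f}, std = {d.std():.3f}, "
          f"mean/std = {d.mean() / d.std():7.1f}")
# The mean pairwise distance grows like sqrt(n/6) while the spread stays roughly
# constant, so relative to the mean every pair ends up "the same distance apart".
```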
Yes, that’s why it’s square root of 2. Consider the origin (0,0,0, …) to a random point on the sphere (~0, ~0, ~0, …).
Distance = square root of ((X1 - X2) ^ 2 + (Y1 - Y2) ^2 + …). So D = square root of ((~0-0)^2 + (~0-0)^2 + (~0-0)^2 + … ), which is equal to 1 by definition of the unit high dimensional sphere.
So distance from (1,0,0,0 …) to (~0, ~0, ~0, …) = square root of ((~0-1)^2 + (~0-0)^2 + (~0-0)^2 + … ) ~= square root of 2.
It should be almost all points are almost a full diameter away. However it's still very striking, and an unintuitive fact about very high dimensional spheres.
Off topic, but I love that they make it trivial to find a link to the original paper. I know not everyone loves quanta, but stuff like this is really refreshing.
The result is mildly interesting but doesn't strike me as remarkable. We've always known that overparameterization with regularization can improve performance in certain classes of problems. The comparison with using an n-th order polynomial to fit n points is also a non sequitur.
So there's a lower bound on the number of parameters required to produce a good interpolation of a broad class of "smooth" functions, and that's larger than the data size. Ok. I'm guessing one could find an even larger lower bound to approximate a more exotic class of functions that are still "smooth" in some intuitive sense, and it wouldn't make me any more excited.
The main problem is, what does it say about how well a neural network can approximate specific types of functions, compared to anything else, with the same or more parameters, with the same huge amount of data, which has always been the real mystery here?
Can we analytically define a class of functions that can be fit well by neural networks from n data points, using fewer parameters than any other known method, and map these functions to real-world applications? Or, in other words, given a class of functions representing general real-world data sets that neural networks tend to do well on (dogs and cats), can we characterize a tractable algorithm to compute the nd parameters needed from the n sample points, with high probability of producing a good interpolation?
If not, then this paper doesn't explain anything better for neural networks than for, say, kernel methods with polynomial kernels or manifold reconstruction using restricted classes of spline functions. The sensationalist headline, along with the "books have to be rewritten" rhetoric, wants to make us think the result or the techniques it presents could get us closer to answering the above questions, but it seems to me there's zero truth in that.
Maybe someone who knows the topic well could elaborate a little on how the authors arrive at their conclusion? From what I'm gathering, in Theorem 2 they show that, with $n$ approaching infinity, the probability that the function $f$ exists approaches zero very quickly. I understand that, for this reason, they require that the variables $d$ and $k$ also approach infinity reasonably fast (or faster than $n$) in order to bound that probability by anything other than zero.
We can already stop here and ask: why does an upper bound on such a probability interest us? If we want to show that the function $f$ exists with high probability, we should also consider lower bounds, not only the upper bounds as is done in the paper (clearly, I can bound any probability by 1 and not be wrong).
But even leaving this question aside and going back to Theorem 2, they essentially show that for any sample of size $n$, they can find (and overfit) a smooth function $f$, given a sufficiently large model space $d$ and $k$. Assume I launch such a model in production and continue generating further observations (growing $n$). It follows from Theorem 2 that $f$ will quickly become unsuitable and require retraining on a larger model space $d$ and $k$.
However, from a statistical standpoint, unless the data-generating process is non-differentiable at every point, we should be able to assume that there exists $n_\epsilon$ such that for every $n > n_\epsilon$, we can find $f$ such that the errors are controllably small (< $\epsilon$) against the true data-generating process $f + \sigma$. So, beyond a certain $n$, further increasing the number of observations should not affect the initial fit of the model.
This is not at all what follows from Theorem 2, which suggests that any such $f$ is still fitting the errors, not necessarily the true process.
What am I missing (or assuming incorrectly)? Would be very interested to discuss this paper further!
Is there a corresponding result that gives the number of examples needed to provide a sufficient training set for a given physical phenomenon? I’m imagining a high-dimensional equivalent of Nyquist’s sampling theorem.
Coupled with this result, we’d then have a reasonable estimator of the network size required for particular tasks before even starting the data collection.
Can someone speak to the generality of assuming c-isoperimetry for the distribution of features?
Without knowing anything about this in particular, this seems to be a rather pertinent restriction of the result related to things like sampling assumptions and the like.
It really depends on what we assume about "natural" data. If it looks "positively curved", e.g. the uniform measure on the boundary of a convex body, or a Gaussian, or something, this holds. But if the distribution exhibits a strong hierarchical structure, that's not so good. I think it's a plausible, if not obviously true, assumption.
Can someone ELI5 how one increases the size of an NN? If I take the handwritten digit classifier that people use as ML 101, is it just a matter of increasing the size of the hidden layers?
You can just do that by adding large hidden layers or more hidden layers, up to a point. But eventually the signal that's coming from the input data is so diluted through all the neurons and layers that your models stop performing better as you add more neurons. Many of the advances in NNs come from structuring the neurons in particular ways - for example, in computer vision, convolutional neural networks. These are kind of a small neural network that looks at each part of the image and gives an output - basically, shrinking and summarizing the image - which is then fed into another small neural network that shrinks & summarizes some more, and so on until you have only one result, like "is this a cat" or something like that. Transformers, which are mostly used for natural language processing, have small neural network layers that let the network figure out which other words in a text are relevant for understanding a given word (or, token, really). These help simplify the problem for the NN so it doesn't have to sort through all the data it has at once, which is a bottleneck on scaling NNs.
Not sure that explains it like you're 5, but hopefully it addresses your question
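To make that concrete for the MNIST 101 case: yes, in the simplest setting "bigger" just means wider and/or deeper hidden layers. A minimal PyTorch-style sketch (hypothetical helper name, not any particular tutorial's code):

```python
import torch.nn as nn

def make_mlp(width: int = 128, depth: int = 2, n_classes: int = 10) -> nn.Sequential:
    """Simple MNIST-style classifier; crank `width` and/or `depth` to add parameters."""
    layers = [nn.Flatten(), nn.Linear(28 * 28, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

small = make_mlp(width=64, depth=2)    # roughly 55k parameters
big = make_mlp(width=1024, depth=4)    # roughly 4M parameters
print(sum(p.numel() for p in small.parameters()),
      sum(p.numel() for p in big.parameters()))
```

As the parent comment says, past a point this naive scaling stops helping and architectural structure (convolutions, attention) matters more than raw width.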
EfficientNet is a nice simple example of this. It's a family of networks of different sizes, controlled by a single scaling parameter. That parameter controls depth, width and resolution - see here for more details, the figure is easy to understand: https://paperswithcode.com/method/efficientnet
> The proof relies on a curious fact about high-dimensional geometry, which is that randomly distributed points placed on the surface of a sphere are almost all a full diameter away from each other.
That should be √2 × radius.
Looking out from the North Pole of a hypersphere, almost all points lie near the equator, not at the South Pole.
This book was a big help for me and is very well written, https://xcelab.net/rm/statistical-rethinking/ . You can find it free online ( along with video course ). The printed version is a very nice high quality book.
So, the smoother the function, the easier it is to approximate using SGD? Does that mean there's a relationship between training time and parameters and beyond a certain parameter count we can discover an equivalent approximation by trading one for the other?
You'd expect so, because there are already projects that do sub-pixel image recovery from video frames. So putting that before recognition should be doable.
This conclusion feels like saying more CPU and memory are better. Seems obvious that more moves allows matching to have more nuance, but I guess cool that someone proved it.
From what I understand, it says that more parameters are good. This wasn't obvious before this paper: you can fit a polynomial instead of a neural net, but adding parameters wouldn't help with robustness in that case; the polynomial would become more and more jagged.
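The polynomial contrast is easy to see numerically. A toy sketch of mine using the classic Runge setup: exact interpolation through equispaced samples gets wilder between the points as you add parameters, which is the opposite of what the paper proves for wide networks.

```python
import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

grid = np.linspace(-1, 1, 2001)

for n_points in [7, 11, 15, 19]:
    x = np.linspace(-1, 1, n_points)
    # Exact interpolation: polynomial degree = number of points - 1
    coeffs = np.polyfit(x, runge(x), n_points - 1)
    max_err = np.abs(np.polyval(coeffs, grid) - runge(grid)).max()
    print(f"{n_points:2d} points, degree {n_points - 1:2d}: max error between samples = {max_err:6.2f}")
# The error between sample points grows with the degree (Runge's phenomenon):
# more polynomial parameters make the interpolant more jagged, not smoother.
```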
> Seems obvious that more moves allows matching to have more nuance,
This really has to be balanced against overfitting. The key problem in ML is generalization, and lots of things improve training performance while making generalization worse.
We know from reality that they get practically better, but theoretic intuition suggests you shouldn't see an effect after some point. This paper shows that this intuition is wrong if you want your networks to be robust. It doesn't guarantee large networks will be though.