Reflecting a bit more on the article, I think the key lies close to this notion, quoted from it:
> Since kerΠ can be described as the orthogonal complement to the set {K_{t_i}}, the orthogonal complement to kerΠ is exactly the closure of the span of the vectors K_{t_i}.
{K_{t_i}} is going to be very large in the overparametrized case, so the orthogonal complement will be small. Note also this part:
> Because v is chosen with minimal norm [in the context of the corresponding RKHS], it cannot be made smaller by adjusting it by an element of kerΠ...
So it sounds like all the "capacity" is taken up by representing the function itself and, seemingly paradoxically, the parameters λ_i are more constrained by the implicit regularization imposed by gradient descent (hypothetically enforcing the minimal-norm constraint). So the parameter space of functions that can possibly fit is tiny. The rub in practical applications is that many combinations of NN parameters can correspond to one set of parameters in this kernel space, so the connection between p and λ (via f?) seems key to understanding the core of the issue.
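
To make the minimal-norm point from the second quote concrete, here is a toy finite-dimensional sketch. It is my own analogy, not something from the article: I am assuming a plain matrix `A` can stand in for the evaluation map Π, with its rows playing the role of the K_{t_i} and its null space playing the role of kerΠ. All names (`A`, `v`, `k`, `w`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the evaluation map: more parameters than data points.
n_points, n_params = 5, 50
A = rng.standard_normal((n_points, n_params))  # rows act like the K_{t_i}
y = rng.standard_normal(n_points)              # values to interpolate

# Minimal-norm interpolant: the pseudoinverse solution lies in the row space of A.
v = np.linalg.pinv(A) @ y

# Build an element of ker(A) by removing the row-space part of a random vector.
z = rng.standard_normal(n_params)
k = z - np.linalg.pinv(A) @ (A @ z)

# Adjusting v by a kernel element gives another interpolant of the same data.
w = v + k

fit_v = np.allclose(A @ v, y)
fit_w = np.allclose(A @ w, y)
orthogonal = abs(v @ k) < 1e-8
norm_v = np.linalg.norm(v)
norm_w = np.linalg.norm(w)

print("v fits the data:", fit_v)
print("w = v + k fits the data:", fit_w)
print("v is orthogonal to k:", orthogonal)
print("||v|| < ||w||:", norm_v < norm_w)
```

Both `v` and `w` fit the data exactly, but because `v` is orthogonal to the kernel element, adding `k` can only increase the norm. That is the finite-dimensional version of "it cannot be made smaller by adjusting it by an element of kerΠ". Whether gradient descent on an actual network lands on this solution is a separate question that this sketch does not address.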