Maybe someone who knows the topic well could elaborate a bit on how the authors arrive at their conclusion? From what I'm gathering, in Theorem 2 they show that, as $n$ approaches infinity, the probability that the function $f$ exists approaches zero very quickly. I understand that, for this reason, they require the variables $d$ and $k$ to also approach infinity reasonably fast (or faster than $n$), so that the probability can be bounded away from zero.
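In symbols, my (possibly wrong) reading of that statement has roughly this shape, where $\mathcal{F}_{d,k}$ is just my shorthand for the model class indexed by $d$ and $k$, not the paper's notation:

$$\Pr\big[\exists\, f \in \mathcal{F}_{d,k} \text{ fitting the } n \text{ observations}\big] \longrightarrow 0 \quad \text{as } n \to \infty \text{ with } d, k \text{ fixed},$$

so the probability can only be kept away from zero by letting $d$ and $k$ grow with $n$.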
We can already stop here and ask: why does an upper bound on such a probability interest us at all? If we want to show that the function $f$ exists with high probability, we should also consider lower bounds, not only the upper bounds given in the paper (clearly, I can bound any probability from above by 1 and not be wrong).
But even leaving this question aside and going back to Theorem 2, they essentially show that for any sample of size $n$ they can find (and overfit) a smooth function $f$, given a sufficiently large model space ($d$ and $k$). Suppose I deploy such a model in production and keep generating further observations, so $n$ grows. It follows from Theorem 2 that $f$ will quickly become unsuitable and require refitting over a larger model space (larger $d$ and $k$).
However, from a statistical standpoint, unless the data-generating process is non-differentiable at every point, we should be able to assume that there exists an $N_\epsilon$ such that for every $n > N_\epsilon$ we can find an $f$ whose error against the true data-generating process $f + \sigma$ is controllably small ($< \epsilon$). So beyond a certain $n$, further observations should not affect the initial fit of the model.
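Written out, the kind of guarantee I'd expect from a standard consistency argument (my notation: $f^*$ for the true signal, $\hat{f}_n$ for the function fitted on $n$ observations) is something like

$$\forall\, \epsilon > 0 \;\; \exists\, N_\epsilon : \quad n > N_\epsilon \;\Longrightarrow\; \mathbb{E}\big[(\hat{f}_n(X) - f^*(X))^2\big] < \epsilon,$$

i.e. past some sample size the fit tracks the true process up to the noise level, rather than degrading as $n$ grows.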
This is not at all what follows from Theorem 2, which suggests that any such $f$ is still fitting the errors, not necessarily the true process.
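To make the "fitting the errors" point concrete, here is a toy simulation, entirely my own sketch and nothing to do with the paper's setup: a degree-$(n-1)$ polynomial interpolates $n$ noisy samples of a smooth function exactly, so its training error is essentially zero, yet its error against the true function stays well above the noise level, because the extra capacity is spent on the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Smooth "true" signal (illustrative choice only).
    return np.sin(2 * np.pi * x)

n, sigma = 12, 0.3
x_train = np.linspace(-1, 1, n)
y_train = true_f(x_train) + sigma * rng.normal(size=n)

# Overfit: a degree-(n-1) polynomial passes through all n noisy points.
coeffs = np.polyfit(x_train, y_train, deg=n - 1)

x_test = np.linspace(-1, 1, 1000)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2)

print(f"noise variance sigma^2:       {sigma**2:.2e}")
print(f"train MSE vs noisy labels:    {train_mse:.2e}")  # ~0: the noise is fit exactly
print(f"test MSE vs the true process: {test_mse:.2e}")   # well above sigma^2: noise amplified
```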
What am I missing (or assuming incorrectly)? Would be very interested to discuss this paper further!