Out of curiosity, what sort of things have you seen it do that better fit 'autoresearch' than 'autotune' thus far? Optimizations it made that wouldn't have been surfaced by an autotune system, I suppose.
The most recent round of autoresearch (round 2), which decreased "time to GPT-2" from 1.8 hours to 1.65 hours, had some examples. I adjusted the program.md to "look at the modded-nanogpt project and draw inspiration from there for things to try" and it came back with a bunch of tuning, but it also tried and implemented new architecture changes, some of which actually helped, including the smear gate and the backout skip connection. These are not just hyperparameters; they are new PyTorch code. I'm now working on a more general system that can work through a queue of ideas sourced from arXiv papers, GitHub repos, etc.
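For a sense of what "new PyTorch code" means here: the smear gate is roughly a learned gate that mixes each token's representation with the previous token's. The sketch below is my paraphrase of the idea, not the exact diff the run produced, and the module and parameter names are made up.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Illustrative sketch of a 'smear gate': each position's embedding is
    mixed with the previous position's through a learned per-channel gate.
    Paraphrase of the idea only, not the generated diff."""

    def __init__(self, dim: int):
        super().__init__()
        # start near zero smear so the module begins close to the unmodified model
        self.gate = nn.Parameter(torch.full((dim,), -4.0))  # sigmoid(-4) ~ 0.02

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right so position t sees position t-1
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate)  # per-channel mixing weight in (0, 1)
        return x + g * prev
```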
Do you have a sense of whether these validation loss improvements are leading to generalized performance uplifts? From afar I can't tell whether these are broadly useful new ideas or just industrialized overfitting on a particular (model, dataset, hardware) tuple.
Did you consider providing the LLM with a framework for automatic hyperparameter tuning? This would free up its capacity to focus on the more important architectural decisions.
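For example, I'm imagining something like an Optuna objective it could hand a search space to, so the actual sweep runs outside its context window. A rough sketch, not a claim about your setup; `train_and_eval` is a placeholder for whatever training entry point already exists:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # the agent only fills in the search space; the training loop stays fixed
    cfg = {
        "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        "warmup_steps": trial.suggest_int("warmup_steps", 100, 2000),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.2),
    }
    return train_and_eval(cfg)  # placeholder: returns validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```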
I see this critique of autoresearch often online, but I think it's misplaced.
Here's a use case that may illuminate the difference, from my own work at Nvidia. I'm currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as AuxK, which I can certainly include and tune the relevant params as you describe. However, I have several other ideas that are quite different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I'm using. Moreover, drawing on my ideas and the existing literature, Claude can try a number of further approaches, each potentially involving more code changes.
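To make that concrete, AuxK is roughly an auxiliary loss that lets the top-k currently-dead latents reconstruct the residual the main reconstruction missed, so they keep receiving gradient. A minimal single-GPU sketch with illustrative names and coefficient, and none of the parallelism or init changes I mentioned:

```python
import torch
import torch.nn.functional as F

def auxk_loss(pre_acts, residual, W_dec, dead_mask, k_aux=256, coef=1 / 32):
    """AuxK-style auxiliary loss (illustrative paraphrase, not my real code).

    pre_acts:  (batch, n_latents) encoder pre-activations
    residual:  (batch, d_model)   input minus the main reconstruction
    W_dec:     (n_latents, d_model) decoder weights
    dead_mask: (n_latents,) bool, True where a latent hasn't fired recently
    """
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return pre_acts.new_zeros(())
    k = min(k_aux, n_dead)
    # only dead latents are allowed to win the top-k
    masked = pre_acts.masked_fill(~dead_mask, float("-inf"))
    top = torch.topk(masked, k, dim=-1)
    acts = torch.zeros_like(pre_acts).scatter_(-1, top.indices, F.relu(top.values))
    aux_recon = acts @ W_dec
    return coef * F.mse_loss(aux_recon, residual)
```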
This automated run-and-discover process is far beyond what’s possible with hyperparam search.
It wasn't meant as a critique; I'm legitimately interested in knowing more about where it can push boundaries and where it struggles. I agree that in general it's a truism that "Claude can try a number of new ideas" etc., but the question remains as to where in particular it actually takes advantage of this to push the envelope in a way other tools don't, since that informs when it makes sense to use something like this.