Fascinating, thanks for sharing. Are there any specific kinds of problems you find this helps with?
I've found that LLMs can handle some tasks very well and some not at all. For the ones they can handle well, I optimize for the smallest, fastest, cheapest model that can handle it. (e.g. using Gemini Flash gave me a much better experience than Gemini Pro due to the iteration speed.)
This "pushing the frontier" stuff would seem to help mostly with tasks that are "doable but hard/inconsistent" for LLMs, and I'm wondering what those tasks are.
It shines on hard problems that have a definite answer.
Google's IMO gold model used parallel reasoning. I don't know exactly what theirs looks like, but their Mind Evolution paper described an approach similar to my llm-consortium. The main difference is that theirs carries on isolated reasoning, while mine, in its default mode, shares the synthesized answer back to the models. I don't have pockets deep enough to run benchmarks on a consortium, but I did try the example problems from that paper, and my method also solved them using gemini-1.5. Those were path-finding problems, like finding the optimal schedule for a trip given multiple people's calendars, locations and transport options.
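To make the "shares the synthesized answer back" part concrete, here is a rough sketch of that loop. It is not the actual llm-consortium code: `run_consortium`, the `share_synthesis` flag, the prompt wording, and the three stub "models" are all made up for illustration, and a simple majority vote stands in for whatever does the synthesis in a real setup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

# Hypothetical: a "model" is any function from prompt text to answer text.
Model = Callable[[str], str]

MARKER = "Previous synthesized answer: "  # assumed prompt convention


def majority_vote(answers: List[str]) -> str:
    """Stand-in synthesizer: pick the most common answer."""
    return Counter(answers).most_common(1)[0][0]


def run_consortium(prompt: str, models: List[Model], max_rounds: int = 3,
                   share_synthesis: bool = True) -> Tuple[str, int]:
    """Query all models in parallel, synthesize, and optionally feed the
    synthesis back into the next round. Returns (answer, rounds_used)."""
    synthesis: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        round_prompt = prompt
        if share_synthesis and synthesis is not None:
            round_prompt = f"{prompt}\n\n{MARKER}{synthesis}\nRevise or confirm."
        # Each model reasons on its own within a round.
        with ThreadPoolExecutor() as pool:
            answers = list(pool.map(lambda m: m(round_prompt), models))
        synthesis = majority_vote(answers)
        if len(set(answers)) == 1:  # stop once every model agrees
            break
    return synthesis, round_no


# Stub models so the sketch runs without any API keys.
def model_a(prompt: str) -> str:
    return "4"


def model_b(prompt: str) -> str:
    # Wrong on its own, but adopts the shared synthesis when it sees one.
    if MARKER in prompt:
        return prompt.split(MARKER, 1)[1].splitlines()[0]
    return "5"


def model_c(prompt: str) -> str:
    return "4"


if __name__ == "__main__":
    stubs = [model_a, model_b, model_c]
    print(run_consortium("What is 2 + 2?", stubs))
    print(run_consortium("What is 2 + 2?", stubs, share_synthesis=False))
```

With `share_synthesis=False` the models never see each other's output, which is closer to the isolated style; with the default, the dissenting stub converges in the second round.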
And it obviously works for code and math problems. My first test was to give the llm-consortium code to a consortium to look for bugs. It identified a serious bug which only one of the three models detected. So in that case it saved me time, as using the models on their own would have either missed the bug or required multiple attempts.