"Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."
Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100 pairs, I would say it is amazing!
Trained with 300 raw pairs directly from the ARC training set without using any data augmentation process, such as generating many more pairs with some kind of ARC generator? That's amazing.
Fascinating. Language is a type of action evolved for information exchange, which maps latent "video", "audio" and "thoughts" into "sentences" and vice versa.
Asking models to do math is an effective way to measure their capabilities, especially in reasoning and abstraction, which are quite important for problem solving.
You don't need reasoning or abstraction to do basic calculation. ChatGPT will, however, happily give you some decent answers about not-too-hard math that requires reasoning. It just won't operate on digits.
I think the last paragraph makes a lot of sense. It seems "true" that some kind of reasoning capability emerges as LLMs get bigger, which makes those LLMs quite useful and blew a lot of people's minds at first. But, I think, essentially, the fundamental training goal of LLMs--guessing what the next word should be--pushes the model into a kind of reasonable nonsense generator, and the reasoning capability emerges because it helps the model make stuff up. Therefore, we should be cautious about the results generated by these LLMs. They might be reasonable, but making up the next word is their real top priority.