Funny how AI now seems to go through developmental phases similar to those we see in young children, if in a weird, convoluted way. Strawberry spelling and car wash aren't particularly intuitive as cognitive developmental stages.
E.g. the well-known mirror test [1], passed by kids from around age 1.5-2.
Or object permanence [2]: by age 2, children know that things out of sight haven't disappeared from existence.
Also, strawberry spelling isn't a real test for current LLMs, because they have no concept of letters: they work on tokens, each of which may span several characters, including punctuation and numerals. To have any hope of answering that question correctly, either the tokens would need the granularity of individual letters, which would massively balloon model size and training time, or the LLM would need to call out to an external tool that returns the result (and there would need to be sufficient examples in the training data to prime that trigger to fire).
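To make the granularity point concrete, here's a toy sketch (not a real LLM tokenizer; the vocabulary and token IDs are made up) of why counting letters is hard at the token level:

```python
# Hypothetical subword vocabulary; real tokenizers (BPE etc.) work similarly
# in spirit, but these splits and IDs are purely illustrative.
toy_vocab = {"str": 1012, "aw": 287, "berry": 19772}

word = "strawberry"
tokens = ["str", "aw", "berry"]  # one plausible subword split
token_ids = [toy_vocab[t] for t in tokens]

# The model only ever sees the integer IDs -- the letters are gone
# by the time the input reaches it.
print(token_ids)

# Counting the 'r's requires access to the characters, not the IDs:
r_count = sum(t.count("r") for t in tokens)
print(r_count)  # 3
```

The point of the sketch: once text is mapped to IDs, "how many r's are in 19772?" is not a question the model's input representation can answer directly; it can only succeed by memorized spellings or an external tool.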
While that's true, the tokenizer is only half the problem. The more important fault demonstrated is that the model doesn't _know_ it can't see the letters, and won't say so unless it has been trained or instructed to. A sentence like "I can't see letters through the tokenizer" never appears in a corpus of human writing.
[1] https://en.wikipedia.org/wiki/Mirror_test
[2] https://en.wikipedia.org/wiki/Object_permanence