From the post, it's clear he's referring to text input as well:
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting, or at least wondering, whether images should be the only input to the LLM, with the vision component doing the reading. Pure text input would go through a rasterization step to produce an image.
In other words, you don't need to draw a picture; you just render the text to a raster and feed that image to the vision model.
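To make the rasterization step concrete, here's a minimal sketch using Pillow. It just renders plain text onto a white canvas; the resulting PNG, rather than the raw string, would be what gets handed to a multimodal model. The function name, layout constants, and the idea of saving to `prompt.png` are my own illustration, not anything from the post.

```python
# Sketch: render plain text to an image so a vision model can "read" it.
# Assumes Pillow is installed; font/line-height choices are arbitrary.
from PIL import Image, ImageDraw, ImageFont

def rasterize_text(text: str, width: int = 1024, padding: int = 16) -> Image.Image:
    font = ImageFont.load_default()   # swap in a real TTF for better rendering
    lines = text.splitlines() or [""]
    line_height = 14                  # rough height for the default bitmap font
    height = padding * 2 + line_height * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    y = padding
    for line in lines:
        draw.text((padding, y), line, fill="black", font=font)
        y += line_height
    return img

img = rasterize_text("Even pure text input could be rendered\nand fed to the model as pixels.")
img.save("prompt.png")  # this image, not the raw string, becomes the model input
```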
I’m normally skeptical of claims like this, but looking at the examples it seems that Sora is reproducing some of its training data verbatim. I guess it’s a case of overfitting? The Civ example in particular looks like an almost direct copy.
I agree with the sentiment, but I want to point out that a car is not essential for most people living in SF, despite what many people outside the city assume. Around 35% of households don’t have a car: https://www.sfmta.com/sites/default/files/reports-and-docume...
Today I learned! I remember it being sold to a Chinese company a few years ago; I didn’t know it went back to US ownership. Ironically, I just deleted it because even with an ad blocker it’s still unusable with all the bots.