
If a 7-second video consumed 1k tokens, I'd assume the token budget must be insane to process such a prompt.
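Rough back-of-envelope (my own numbers, just extrapolating linearly from the ~1k tokens per 7 seconds above, which a real tokenizer won't follow exactly):

    # Assumes token count scales linearly with video length (an assumption,
    # not something the thread confirms).
    TOKENS_PER_SECOND = 1000 / 7  # ~143 tokens per second of video

    for label, seconds in [("1 minute", 60), ("1 hour", 3600), ("8-hour workday", 8 * 3600)]:
        print(f"{label}: ~{seconds * TOKENS_PER_SECOND / 1000:,.0f}k tokens")
    # 1 minute: ~9k tokens
    # 1 hour: ~514k tokens
    # 8-hour workday: ~4,114k tokens

Even a single hour of raw video at that rate blows well past today's typical context windows.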


Yeah, not feasible with today's methods and RAG/LoRA shenanigans, but the way the field is moving I wouldn't be surprised if new decoder paradigms made it possible.

Saw this yesterday, 1M context window, but I haven't had any time to look into it; just one example of the new developments happening every week:

https://www.reddit.com/r/LocalLLaMA/comments/1as36v9/anyone_...


That's a 7-second video from an HD camera. When recording a screen, you only really need to consider what's changing on the screen.


That's not true. Which parts of the screen count as important context can change depending on what just changed.


The point is you can do massive compression. It’s more like a sequence of sparse images than video.
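The "sequence of sparse images" framing is easy to sketch. A minimal sketch, assuming grayscale screen captures as numpy arrays and a made-up change threshold (both my own choices, not from the thread), keeping only frames that meaningfully differ from the last kept one:

    import numpy as np

    def sparse_frames(frames, threshold=0.01):
        # frames: iterable of 2-D numpy arrays (grayscale screen captures).
        # threshold: fraction of pixels that must differ before a frame is kept.
        kept = []
        last = None
        for frame in frames:
            if last is None:
                kept.append(frame)            # always keep the first frame
                last = frame
                continue
            changed = np.mean(frame != last)  # fraction of pixels that differ
            if changed > threshold:
                kept.append(frame)
                last = frame
        return kept

A 7-second recording at 30 fps is 210 frames, but a mostly static screen with a couple of UI changes collapses to a handful of kept frames.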


Unlikely to be a prompt. It would need to be some form of fine-tuning, like LoRA.
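For anyone unfamiliar: LoRA just trains a small low-rank update on top of frozen weights. A bare-bones PyTorch sketch of the idea (class name, rank, and scaling here are illustrative, not any particular library's API):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Wraps a frozen nn.Linear and adds a trainable low-rank update: W x + (B A) x.
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)        # freeze the original weights
            self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Base output plus the low-rank correction; only lora_a / lora_b get gradients.
            return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

Only the two small matrices are trained, which is why it's so much cheaper than full fine-tuning.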



