I find it wild that this model has no explicit 3D prior, yet learns to generate videos with such 3D consistency that you can directly train a 3D representation (NeRF-like) from them: https://twitter.com/BenMildenhall/status/1758224827788468722
I was similarly astonished at this adaptation of Stable Diffusion to make HDR spherical environment maps from existing images: https://diffusionlight.github.io/
The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image, so that it reflects what is behind the camera! The model interprets the context and dreams up what is plausibly in the rest of the environment.
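The reason a chrome ball works as a light probe is basic mirror geometry: each point on the ball reflects a different direction of the surrounding environment, and the ball's center reflects straight back at the camera. A tiny numpy sketch (my own toy helper, not code from the DiffusionLight project; it assumes an orthographic camera looking down -z at a unit-disk ball):

```python
import numpy as np

def mirror_ball_reflection(x, y):
    """Environment direction reflected toward the camera by a chrome
    sphere at image point (x, y), sphere filling the unit disk.
    Hypothetical helper for illustration only."""
    n = np.array([x, y, np.sqrt(1.0 - x * x - y * y)])  # surface normal
    d = np.array([0.0, 0.0, -1.0])                      # viewing ray
    return d - 2.0 * np.dot(d, n) * n                   # mirror reflection

print(mirror_ball_reflection(0.0, 0.0))  # center of the ball reflects
                                         # (0, 0, 1): directly behind the camera
```

Near the ball's silhouette the reflected direction swings around to point forward, so a single ball image samples nearly the whole sphere of incoming light, which is exactly what an HDR environment map needs.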
You aren't looking carefully enough. I find so many inconsistencies in these examples. Perspectives that are completely wrong when the camera rotates. Windows that shift perspective, patios that are suddenly deep or shallow. Shadows that appear and disappear as the camera shifts. In other examples, paths, objects, and people suddenly appear or vanish out of nowhere. A stone turns into a person. A horse suddenly has a second head, then becomes a separate horse with only two legs.
It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long-term temporal, spatial, and causal consistency). I'm hardly more impressed than I was by Google's DeepDream, which is 10 years old.
You can literally run 3D algorithms like NeRF or COLMAP on those videos (check the tweet I sent). It's not just my opinion: the videos are sufficiently 3D-consistent that you can extract 3D geometry from them.
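To make the point concrete: what COLMAP-style reconstruction does is triangulate 3D points from their 2D positions in multiple frames, and that only works when frames are geometrically consistent. A toy numpy version of that triangulation step (my own sketch with made-up cameras, not COLMAP's actual code):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its
    projections x1, x2 under two 3x4 projection matrices P1, P2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X_h = np.linalg.svd(A)[2][-1]  # null vector of A (least squares)
    return X_h[:3] / X_h[3]        # dehomogenize

# Two hypothetical frames: normalized intrinsics, camera 2 shifted 1 unit in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X = np.array([0.2, -0.1, 5.0])               # ground-truth 3D point
x1 = P1 @ np.append(X, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X, 1.0); x2 = x2[:2] / x2[2]

print(triangulate(P1, P2, x1, x2))           # recovers [0.2, -0.1, 5.0]
```

If the generated frames weren't 3D-consistent, these per-frame observations would contradict each other and the reconstruction would fall apart, so the fact that COLMAP and NeRF succeed on these videos is real evidence of learned 3D structure.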
It's certainly not perfect, but this was not the case for previous video generation algorithms.
Yeah, it seems to have a hard time with lens distortion in particular, which gives a very weird quality. It's actually bending things, or trying to fill in the gaps, instead of distorting the image in the "correct" way.