I find it wild that this model has no explicit 3D prior, yet learns to generate videos with such 3D consistency that you can directly train a 3D representation (NeRF-like) from them: https://twitter.com/BenMildenhall/status/1758224827788468722
I was similarly astonished at this adaptation of Stable Diffusion to make HDR spherical environment maps from existing images: https://diffusionlight.github.io/
The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image, so that it reflects what is behind the camera! The model interprets the context and dreams up what is plausibly in the rest of the environment.
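The reason a chrome ball works as a light probe is basic mirror geometry: each point on the ball reflects a different direction of the surrounding environment, and the ball's center reflects straight back at the camera. A tiny numpy sketch (my own toy helper, not code from the DiffusionLight project; it assumes an orthographic camera looking down -z at a unit-disk ball):

```python
import numpy as np

def mirror_ball_reflection(x, y):
    """Environment direction reflected toward the camera by a chrome
    sphere at image point (x, y), sphere filling the unit disk.
    Hypothetical helper for illustration only."""
    n = np.array([x, y, np.sqrt(1.0 - x * x - y * y)])  # surface normal
    d = np.array([0.0, 0.0, -1.0])                      # viewing ray
    return d - 2.0 * np.dot(d, n) * n                   # mirror reflection

print(mirror_ball_reflection(0.0, 0.0))  # center of the ball reflects
                                         # (0, 0, 1): directly behind the camera
```

Near the ball's silhouette the reflected direction swings around to point forward, so a single ball image samples nearly the whole sphere of incoming light, which is exactly what an HDR environment map needs.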
You aren't looking carefully enough. I find so many inconsistencies in these examples. Perspectives that are completely wrong when the camera rotates. Windows that shift perspective, patios that are suddenly deep or shallow. Shadows that appear and disappear as the camera shifts. In other examples, paths, objects, and people suddenly appear or vanish out of nowhere. A stone turns into a person. A horse suddenly has a second head, then becomes a separate horse with only two legs.
It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long-term temporal, spatial, and causal consistency). I'm hardly more impressed than I was by Google's DeepDream, which is 10 years old.
You can literally run 3D algorithms like NeRF or COLMAP on those videos (check the tweet I sent). It's not just my opinion: the videos are sufficiently 3D-consistent that you can extract 3D geometry from them.
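To make the point concrete: what COLMAP-style reconstruction does is triangulate 3D points from their 2D positions in multiple frames, and that only works when frames are geometrically consistent. A toy numpy version of that triangulation step (my own sketch with made-up cameras, not COLMAP's actual code):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its
    projections x1, x2 under two 3x4 projection matrices P1, P2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X_h = np.linalg.svd(A)[2][-1]  # null vector of A (least squares)
    return X_h[:3] / X_h[3]        # dehomogenize

# Two hypothetical frames: normalized intrinsics, camera 2 shifted 1 unit in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X = np.array([0.2, -0.1, 5.0])               # ground-truth 3D point
x1 = P1 @ np.append(X, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X, 1.0); x2 = x2[:2] / x2[2]

print(triangulate(P1, P2, x1, x2))           # recovers [0.2, -0.1, 5.0]
```

If the generated frames weren't 3D-consistent, these per-frame observations would contradict each other and the reconstruction would fall apart, so the fact that COLMAP and NeRF succeed on these videos is real evidence of learned 3D structure.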
It's certainly not perfect, but this was not the case for previous video generation algorithms.
Yeah, it seems to have a hard time with lens distortion in particular, which gives a very weird quality. It's actually bending things, or trying to fill in the gaps, instead of distorting the image in the "correct" way.