
I find it wild that this model has no explicit 3D prior, yet it learns to generate videos with such 3D consistency that you can directly train a 3D representation (NeRF-like) from them: https://twitter.com/BenMildenhall/status/1758224827788468722


I was similarly astonished at this adaptation of Stable Diffusion to make HDR spherical environment maps from existing images: https://diffusionlight.github.io/

The crazy thing is that they do it by prompting the model to inpaint a chrome sphere into the middle of the image, reflecting what is behind the camera! The model can interpret the context and dream up what is plausibly in the whole environment.
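The geometry behind the trick can be sketched with a few lines of vector math (this is my own illustration, not code from the DiffusionLight project). Assuming an orthographic camera looking down the -z axis at a unit chrome ball, the reflected ray at each visible point follows the mirror-reflection formula R = D - 2(D·N)N. The center of the ball reflects straight back past the camera, which is why the inpainted sphere reveals the environment behind the photographer:

```python
import math

def reflect(d, n):
    """Reflect incoming ray direction d about surface normal n (both unit vectors)."""
    dot = sum(di * ni for di, ni in zip(d, n))
    return tuple(di - 2 * dot * ni for di, ni in zip(d, n))

def sphere_reflection_dir(x, y):
    """For a point (x, y) on a unit chrome ball as seen in the image
    (orthographic camera looking along -z), return the world direction
    the ball reflects toward the viewer at that pixel."""
    nz = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    n = (x, y, nz)           # outward surface normal at that point
    d = (0.0, 0.0, -1.0)     # viewing ray travels into the scene
    return reflect(d, n)

# Ball center reflects straight back, i.e. what is *behind* the camera:
print(sphere_reflection_dir(0.0, 0.0))  # (0.0, 0.0, 1.0)
# Ball edge reflects forward into the scene:
print(sphere_reflection_dir(1.0, 0.0))  # (0.0, 0.0, -1.0)
```

Between center and edge the reflected directions sweep through nearly the full sphere of directions, which is why a single chrome ball approximates a whole environment map.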


Yeah, we were surprised by that. Video models are great 3D priors, and image models are great video-model priors.


You aren't looking carefully enough. I find so many inconsistencies in these examples. Perspectives that are completely wrong when the camera rotates. Windows that shift perspective, patios that are suddenly deep/shallow. Shadows that appear/disappear as the camera shifts. In other examples: paths, objects, and people suddenly appearing or disappearing out of nowhere. A stone turning into a person. A horse that suddenly has a second head, then becomes a separate horse with only two legs.

It is impressive at a glance, but if you pay attention, it is more like dreaming than realism (images conjured out of other images, without attention to long-term temporal, spatial, and causal consistency). I'm hardly more impressed than I was by Google's Deep Dream, which is 10 years old.


You can literally run 3D reconstruction algorithms like NeRF or COLMAP on those videos (check the tweet I linked). It's not just my opinion: those videos are sufficiently 3D-consistent that you can extract 3D geometry from them.

Sure, it's not perfect, but this was not the case for previous video generation algorithms.


Yeah, it seems to have a hard time reproducing lens distortion in particular, which gives the output a very weird quality. It's actually bending things, or trying to fill in the gaps, instead of distorting the image in the "correct" way.
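For reference, the "correct" way is well understood: real lenses are conventionally modeled with a polynomial radial term (the Brown-Conrady model), where every point moves along the ray from the image center by an amount that depends only on its radius. A minimal sketch of that model (my own illustration, not anything from the paper):

```python
def radial_distort(x, y, k1, k2=0.0):
    """Apply the classic polynomial radial-distortion model to normalized
    image coordinates (x, y):  r' = r * (1 + k1*r^2 + k2*r^4).
    Negative k1 gives barrel distortion, positive k1 gives pincushion."""
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale

# The image center is unaffected; points farther out move more:
print(radial_distort(0.0, 0.0, k1=-0.2))  # (0.0, 0.0)
print(radial_distort(1.0, 0.0, k1=-0.2))  # roughly (0.8, 0.0) - pulled inward
```

The point is that distortion is a smooth warp of straight-line geometry, whereas the model appears to hallucinate genuinely bent structures instead.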


That leaves me wondering if it'd be possible to get some variant of the model to directly output 3D meshes and camera animation instead of an image.


This is also true for 2D diffusion models [1]. I suppose they need to understand how 3D works for things like lighting, shadows, object occlusion, etc.

[1] https://dreamfusion3d.github.io/


I wonder how much it'd improve if trained on stereo image data.


A moving camera is just stereo.
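To spell that out: a camera that translates between frames gives you a stereo baseline for free, and depth falls out of the standard pinhole similar-triangles relation Z = f * B / d. A tiny sketch (my own numbers, purely illustrative):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo: a camera that translates sideways by baseline_m
    between two frames sees a point shift by disparity_px pixels;
    by similar triangles the point's depth is Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("zero disparity: point at infinity")
    return focal_px * baseline_m / disparity_px

# f = 1000 px, camera moved 10 cm, point shifted 20 px between frames:
print(depth_from_disparity(1000.0, 0.1, 20.0))  # 5.0 metres
```

So any temporally consistent video with camera motion already encodes stereo depth cues, which is presumably part of why NeRF/COLMAP pipelines work on these generated clips.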




