What I find interesting is that b/c we have so much video data, we have this thi...

uoaei · on Feb 16, 2024

That's what estimation and simulation are for. Obviously that's not what's happening in TFA, but it's perfectly plausible today.

Not sure how people are concluding that realistic physics is feasible when operating solely in pixel space, because obviously it isn't, and anyone with any experience training such models would instantly recognize the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model, born of human cognitive faculties, and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent with the inherent dynamics at play.

pzo · on Feb 16, 2024

Depth estimation has improved a lot as well, e.g. with Depth-Anything [0]. But those models mostly predict relative depth rather than metric depth. Even when converted to metric, the resulting point clouds still seem to have a lot of points at the edges that have to be pruned, as is visible in this blog [1]. It looks like those models were trained on lidar or stereo depth maps that have these limitations. I think we don't have enough clean training data for 3D unless we train on synthetic data (then we can have plenty: generate realistic scenes in Unreal Engine 5 and train on the rendered 2D frames).

[0] https://github.com/LiheYoung/Depth-Anything

[1] https://medium.com/@patriciogv/the-state-of-the-art-of-depth...
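
To make the relative-vs-metric point concrete, here is a minimal NumPy sketch of the two steps mentioned above: fitting a scale and shift that map a relative depth map onto a few metric reference values, and pruning points that sit on depth discontinuities before building a point cloud. Everything here is an assumption for illustration, not Depth-Anything's API: the function names, the 5% gradient threshold, and the pinhole intrinsics `fx, fy, cx, cy` are all hypothetical. It also assumes the prediction is related to metric depth by an unknown scale and shift; for models that predict inverse depth, the same fit would be done in inverse-depth space.

```python
import numpy as np

def fit_scale_shift(relative, metric, mask):
    """Least-squares fit of metric ~ s * relative + t over pixels where mask is True."""
    x = relative[mask].ravel()
    y = metric[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def smooth_mask(depth, rel_threshold=0.05):
    """True where depth is locally smooth, False at depth discontinuities.

    The threshold (gradient larger than 5% of the local depth) is an
    arbitrary illustrative choice, not a value taken from the thread.
    """
    gy, gx = np.gradient(depth)
    return np.hypot(gx, gy) < rel_threshold * depth

def backproject(depth, fx, fy, cx, cy, keep):
    """Pinhole back-projection of the kept pixels to an (N, 3) point cloud."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth[keep]
    x = (u[keep] - cx) * z / fx
    y = (v[keep] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

A typical use would be `s, t = fit_scale_shift(pred, sparse_metric, valid)`, then `depth = s * pred + t`, then `backproject(depth, fx, fy, cx, cy, smooth_mask(depth))`. Pixels straddling an object boundary get interpolated depths that belong to neither surface, which is why they show up as stray points and are dropped here.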

samus · on Feb 16, 2024

There are also models that are trained to generate 3D models from a picture. One could use them on videos, and also train them on output generated by video games.

mgoetzke · on Feb 16, 2024

Imagine it going a few dimensions further: what will happen when I tell this person 'this'? How will this affect the social graph and my world state? :)