What I find interesting is that because we have so much video data, we have this thing that can project the future in 2D pixel space.
Projecting into the future in 3D world space is actually the endgame for robotics, and I imagine that, depending on how complex that 3D world model is, a working model for projecting into 3D space could be waaaaaay smaller.
It's just that the equivalent data is not as easily available on the internet :)
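To make the size argument concrete, here is a toy sketch (not a learned model, and not anything from TFA): a hand-written rollout of a point mass under gravity, where the whole "world state" is 6 floats. The names `step` and `rollout`, the time step, and the 256x256 frame size used for comparison are all assumptions picked for illustration.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # m/s^2, z axis pointing up (assumed convention)

def step(state, dt):
    """Advance a point mass [x, y, z, vx, vy, vz] by dt seconds (semi-implicit Euler)."""
    pos, vel = state[:3], state[3:]
    vel = vel + GRAVITY * dt   # update velocity first
    pos = pos + vel * dt       # then position, using the new velocity
    return np.concatenate([pos, vel])

def rollout(state, dt, steps):
    """Project the state `steps` steps into the future; returns (steps + 1, 6)."""
    out = [state]
    for _ in range(steps):
        out.append(step(out[-1], dt))
    return np.stack(out)

state0 = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # 1 m up, moving at 1 m/s along x
traj = rollout(state0, dt=0.01, steps=10)

floats_per_state = state0.size    # 6 numbers describe the whole scene
floats_per_frame = 256 * 256 * 3  # one small RGB frame in pixel space
```

A real scene obviously needs far more than 6 numbers, but the gap between a structured 3D state and a grid of pixels is the point being made above.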
That's what estimation and simulation are for. Obviously that's not what's happening in TFA, but it's perfectly plausible today.
Not sure how people are concluding that realistic physics is feasible operating solely in pixel space, because obviously it isn't, and anyone with any experience training such models would instantly recognize the local optimum these demos represent. The point of inductive bias is to make the loss function as convex as possible by inducing a parametrization that is "natural" to the system being modeled. Physics is exactly the attempt to formalize such a model, born of human cognitive faculties, and it's hard to imagine that you can do better with less fidelity by just throwing more parameters and data at the problem, especially when the parametrization is so incongruent with the inherent dynamics at play.
Depth estimation has improved a lot as well, e.g. with Depth-Anything [0]. But those models mostly produce relative depth instead of metric depth. Also, even when converted to metric, they still seem to produce a lot of stray points at object edges that have to be pruned from the point cloud, as is visible in this blog [1]. It looks like those models were trained on lidar or stereo depth maps that have these limitations. I think we don't have enough clean training data for 3D unless we maybe train on synthetic data (then we can have plenty: generate realistic scenes in Unreal Engine 5 and train on rendered 2D frames).
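A rough sketch of those two steps, in case it helps: fitting a scale and shift to turn relative depth into metric depth, then dropping points that sit on depth discontinuities before back-projecting. This is not Depth-Anything's API; `fit_scale_shift`, `edge_free_mask`, `backproject`, the `rel_jump` threshold, and the pinhole intrinsics `fx, fy, cx, cy` are all hypothetical names and values. It also assumes the relative map is affinely related to metric depth; some models predict inverse depth, in which case the fit would have to be done in inverse-depth space.

```python
import numpy as np

def fit_scale_shift(relative, metric, mask):
    """Least-squares fit of metric ~= s * relative + t over the pixels in mask."""
    x = relative[mask].ravel()
    y = metric[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def edge_free_mask(depth, rel_jump=0.05):
    """True where a pixel is NOT on a depth discontinuity (threshold is a guess)."""
    gy, gx = np.gradient(depth)
    jump = np.hypot(gx, gy) / np.maximum(depth, 1e-6)  # depth change relative to depth
    return jump < rel_jump

def backproject(depth, fx, fy, cx, cy, keep):
    """Pinhole back-projection of the kept pixels to an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth[keep]
    x = (u[keep] - cx) * z / fx
    y = (v[keep] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Synthetic 8x8 scene: a near surface (1 m) next to a far one (3 m).
metric_gt = np.ones((8, 8))
metric_gt[:, 4:] = 3.0
relative = (metric_gt - 0.5) / 2.0            # pretend model output, up to scale/shift
s, t = fit_scale_shift(relative, metric_gt, np.ones((8, 8), dtype=bool))
metric = s * relative + t
keep = edge_free_mask(metric)                 # drops the two columns at the depth step
points = backproject(metric, fx=10.0, fy=10.0, cx=4.0, cy=4.0, keep=keep)
```

In practice the reference depths for the fit would come from a few sparse measurements (lidar hits, SfM points) rather than a full ground-truth map, and the threshold needs tuning per scene.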
Imagine it going a few dimensions further: what will happen when I tell this person 'this'? How will this affect the social graph and my world state? :)