Tesla's approach works chiefly on video scenes with static objects, like parked cars.
They train a DepthCNN to infer depth from monocular images (with lidar or stereo for supervision) and enforce temporal consistency by warping pixels from the previous and next frames according to the camera motion predicted by a PoseCNN. https://arxiv.org/abs/1704.07813
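For intuition, here's a minimal sketch of that view-synthesis consistency check in PyTorch, assuming a pinhole camera with known intrinsics K. The function names are mine, and the DepthCNN/PoseCNN outputs (`tgt_depth`, `pose`) are taken as given rather than computed:

```python
import torch
import torch.nn.functional as F

def warp_to_target(src_img, tgt_depth, pose, K):
    """Inverse-warp a source frame into the target view using the
    target's predicted depth and the relative camera pose (B,4,4)."""
    B, _, H, W = src_img.shape
    dev = src_img.device
    # Homogeneous pixel grid, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=dev, dtype=torch.float32),
                            torch.arange(W, device=dev, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)
    # Back-project target pixels to 3D, move them into the source
    # camera's frame, then re-project with the intrinsics K.
    rays = torch.linalg.inv(K) @ pix                        # (3, H*W)
    pts = rays.unsqueeze(0) * tgt_depth.reshape(B, 1, -1)   # (B, 3, H*W)
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W, device=dev)], dim=1)
    src_pts = (pose @ pts_h)[:, :3]                         # (B, 3, H*W)
    src_pix = K @ src_pts
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # grid_sample wants coordinates normalized to [-1, 1].
    gx = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    gy = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss(tgt_img, src_imgs, tgt_depth, poses, K):
    """L1 photometric error between the target frame and each source
    frame (previous and next) warped into the target view; gradients
    flow back into both the DepthCNN and the PoseCNN."""
    return sum((warp_to_target(s, tgt_depth, p, K) - tgt_img).abs().mean()
               for s, p in zip(src_imgs, poses))
```

The loss only makes sense if the warped source frame should look identical to the target frame, which is exactly why the scene has to be static.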
The guys at Google use optical flow (only from the previous frame) to make a model trained on static-scene video sequences work when the scene is dynamic, by masking out a specific object class (humans, in this case).
They do have to make sure nothing but humans is dynamic in the scene.
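Roughly, the mask just zeroes those pixels out of the loss, so the static-world assumption holds for everything the loss actually sees. A sketch reusing `warp_to_target` from above, with `human_mask` assumed to come from some off-the-shelf segmentation network:

```python
def masked_photometric_loss(tgt_img, src_img, tgt_depth, pose, K, human_mask):
    """Photometric error over static pixels only: `human_mask` is a
    (B,1,H,W) tensor that is 1 wherever a person is segmented."""
    warped = warp_to_target(src_img, tgt_depth, pose, K)
    static = 1.0 - human_mask                       # keep only static pixels
    err = (warped - tgt_img).abs() * static
    return err.sum() / static.sum().clamp(min=1.0)  # mean over static pixels
```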
That was my thought too, particularly because Tesla described extracting distance and velocity for moving objects by processing video frames two at a time on the upcoming hardware.
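None of this is confirmed Tesla internals, but the two-frame idea is easy to picture: once you have an object's 3D position in each of two consecutive frames, velocity is just the position difference over the frame interval, after removing ego-motion so the car's own movement doesn't masquerade as object motion. A toy sketch:

```python
import numpy as np

def object_velocity(pos_t0, pos_t1, ego_motion, dt):
    """pos_*: object position (x, y, z) in the camera frame at each
    timestep; ego_motion: 4x4 transform from the t0 to the t1 frame."""
    # Express the t0 position in the t1 camera frame first.
    p0_in_t1 = (ego_motion @ np.append(pos_t0, 1.0))[:3]
    return (pos_t1 - p0_in_t1) / dt  # m/s in the current camera frame

# e.g. a car 20 m ahead closing at ~2 m/s over a 50 ms frame gap:
v = object_velocity(np.array([0.0, 0.0, 20.1]),
                    np.array([0.0, 0.0, 20.0]),
                    np.eye(4), dt=0.05)  # -> [0, 0, -2.0]
```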