Tesla's approach works chiefly on video scenes with static objects, like parked cars.
They train a DepthCNN to infer depth from monocular images (with lidar or stereo for supervision) and enforce temporal consistency by warping pixels from the previous and next frames according to the camera motion predicted by a PoseCNN. https://arxiv.org/abs/1704.07813
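For intuition, here's a minimal sketch of that view-synthesis consistency check in PyTorch, assuming a pinhole camera with known intrinsics K. The function names are mine, and the DepthCNN/PoseCNN outputs (`tgt_depth`, `pose`) are taken as given rather than computed:

```python
import torch
import torch.nn.functional as F

def warp_to_target(src_img, tgt_depth, pose, K):
    """Inverse-warp a source frame into the target view using the
    target's predicted depth and the relative camera pose (B,4,4)."""
    B, _, H, W = src_img.shape
    dev = src_img.device
    # Homogeneous pixel grid, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=dev, dtype=torch.float32),
                            torch.arange(W, device=dev, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)
    # Back-project target pixels to 3D, move them into the source
    # camera's frame, then re-project with the intrinsics K.
    rays = torch.linalg.inv(K) @ pix                        # (3, H*W)
    pts = rays.unsqueeze(0) * tgt_depth.reshape(B, 1, -1)   # (B, 3, H*W)
    pts_h = torch.cat([pts, torch.ones(B, 1, H * W, device=dev)], dim=1)
    src_pts = (pose @ pts_h)[:, :3]                         # (B, 3, H*W)
    src_pix = K @ src_pts
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # grid_sample wants coordinates normalized to [-1, 1].
    gx = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    gy = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss(tgt_img, src_imgs, tgt_depth, poses, K):
    """L1 photometric error between the target frame and each source
    frame (previous and next) warped into the target view; gradients
    flow back into both the DepthCNN and the PoseCNN."""
    return sum((warp_to_target(s, tgt_depth, p, K) - tgt_img).abs().mean()
               for s, p in zip(src_imgs, poses))
```

The loss only makes sense if the warped source frame should look identical to the target frame, which is exactly why the scene has to be static.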
The guys at Google use optical flow (only from the previous frame) to make a model trained on static-scene video sequences work when the scene is dynamic, by masking out a specific object class (humans, in this case).
They do have to make sure nothing but humans is dynamic in the scene.
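Roughly, the mask just zeroes those pixels out of the loss, so the static-world assumption holds for everything the loss actually sees. A sketch reusing `warp_to_target` from above, with `human_mask` assumed to come from some off-the-shelf segmentation network:

```python
def masked_photometric_loss(tgt_img, src_img, tgt_depth, pose, K, human_mask):
    """Photometric error over static pixels only: `human_mask` is a
    (B,1,H,W) tensor that is 1 wherever a person is segmented."""
    warped = warp_to_target(src_img, tgt_depth, pose, K)
    static = 1.0 - human_mask                       # keep only static pixels
    err = (warped - tgt_img).abs() * static
    return err.sum() / static.sum().clamp(min=1.0)  # mean over static pixels
```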
That was my thought too, particularly because Tesla described extracting distance and velocity for moving objects by processing video frames two at a time on the upcoming hardware.
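None of this is confirmed Tesla internals, but the two-frame idea is easy to picture: once you have an object's 3D position in each of two consecutive frames, velocity is just the position difference over the frame interval, after removing ego-motion so the car's own movement doesn't masquerade as object motion. A toy sketch:

```python
import numpy as np

def object_velocity(pos_t0, pos_t1, ego_motion, dt):
    """pos_*: object position (x, y, z) in the camera frame at each
    timestep; ego_motion: 4x4 transform from the t0 to the t1 frame."""
    # Express the t0 position in the t1 camera frame first.
    p0_in_t1 = (ego_motion @ np.append(pos_t0, 1.0))[:3]
    return (pos_t1 - p0_in_t1) / dt  # m/s in the current camera frame

# e.g. a car 20 m ahead closing at ~2 m/s over a 50 ms frame gap:
v = object_velocity(np.array([0.0, 0.0, 20.1]),
                    np.array([0.0, 0.0, 20.0]),
                    np.eye(4), dt=0.05)  # -> [0, 0, -2.0]
```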