Nice paper. I particularly like how they talk through the ideas they tried that didn’t work, and the process they used to land on the final results. A lot of ML papers present the finished result as if it appeared from nowhere without trial and error, perhaps with some ablations in the appendix. I wish more papers followed this one in talking about the dead ends along the way.
I’m aware. I left the academic world in no small part because I refused to write papers that weren’t worth reading. A high-quality but short CV is a career ender these days. I’m happier now, though!
It seems better than current work both overall and per parameter, on both relative and absolute depth measurements.
Is there any research people are aware of that provides sub-mm level models? For 3D modeling purposes? Or is "classic" photogrammetry still the best option there?
In grad school I was using stereo video cameras to measure fish. I wonder if a model like this could do it accurately from frame grabs from a single feed now. And of course an AI to identify fish, even if it was just flagging which sections of video had or did not have fish, not even doing the species-level ID, would have saved a ton of time.
We had a whole workshop on various monitoring technologies and the take home from the various video tools is that having highly trained grad students and/or techs watch and analyze the video is extremely slow and expensive.
I haven't worked with video in a while now, but I wonder if any labs are doing more automated identification these days. It feels like the kind of problem that is probably completely solvable if the right tech gets applied.
Definitely not with this model, because it’s impossible to tell from the image alone. Is the fish 34cm away and 34cm long, or 30cm away and 30cm long? The fish is floating in a transparent medium, so reference points aren’t even useful as calibration.
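To make the ambiguity concrete, here's a tiny pinhole-projection sketch (the focal length is an arbitrary made-up value): both fish project to exactly the same image size, so no single-image analysis can separate them without an external scale reference.

```python
# Pinhole camera: image size (px) ~= focal_length * object_size / distance.
# Two fish with the same size-to-distance ratio are indistinguishable.
f = 1000  # focal length in pixels (arbitrary assumed value)

for size_cm, dist_cm in [(34, 34), (30, 30)]:
    image_size_px = f * size_cm / dist_cm
    print(f"{size_cm}cm fish at {dist_cm}cm -> {image_size_px:.1f} px")
# Both print 1000.0 px: identical in the image, very different in reality.
```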
Are the fish always the same color, and is their color distinct from the background?
I work at an industrial plant where we have been able to measure a lot of things simply by analyzing the pixels in the video. For example, in one application we have a camera pointed down at a conveyor belt. The conveyor belt is one color and objects on the belt are a distinctly different color.
We just count how many pixels in a given frame are a specific color/brightness. Then you can easily work out how much of the conveyor belt has material on it in any given frame.
So if you are trying to work out which sections of a video have fish in them, you could count how many pixels are a different color from the normal background color, as in the sketch below.
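A minimal sketch of that kind of pixel counting with OpenCV and NumPy; the background colour range, the video file name, and the threshold are made-up values you'd tune for your own footage:

```python
import cv2
import numpy as np

# Hypothetical background colour range in HSV (tune for your own footage).
BG_LOW = np.array([90, 40, 40])     # e.g. blue-green water / belt colour
BG_HIGH = np.array([130, 255, 255])

def fraction_not_background(frame_bgr):
    """Return the fraction of pixels that do NOT match the background colour."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    bg_mask = cv2.inRange(hsv, BG_LOW, BG_HIGH)  # 255 where pixel looks like background
    return 1.0 - (bg_mask > 0).mean()

cap = cv2.VideoCapture("fish_survey.mp4")  # hypothetical file name
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if fraction_not_background(frame) > 0.02:  # >2% "foreign" pixels; threshold is a guess
        print(f"possible fish around frame {frame_idx}")
    frame_idx += 1
cap.release()
```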
You can definitely train a model to identify fish. To be honest, you don't really have to train a whole model: there are tons of models pretrained on millions of images, so you can just extract the embeddings from those models and train a single matrix to project them onto the different classes, and it will work very well.
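Something like the linear-probe sketch below, with a frozen pretrained backbone and only the final projection trained; the torchvision ResNet backbone, the folder of labelled frames, and the two-class setup are my own assumptions, not from the comment:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms, datasets

# Frozen pretrained backbone; only the final linear layer ("single matrix") is trained.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()           # expose the 2048-d embedding instead of ImageNet logits
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 2                       # e.g. "fish" vs "no fish"
probe = nn.Linear(2048, num_classes)  # the projection from embeddings to classes

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
dataset = datasets.ImageFolder("frames/", transform=preprocess)  # hypothetical labelled folder
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):
    for images, labels in loader:
        with torch.no_grad():
            emb = backbone(images)    # embeddings from the frozen pretrained model
        loss = loss_fn(probe(emb), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```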
Very likely. Tbh, I think there are a lot of domain tasks where, if you added a machine learning expert to the team, success and progress would come a lot faster. But to be fair, there are a lot of people who can do ML and not a lot of people who have a deep understanding of it. The difference matters for real-world tasks, where the gap between dataset performance and generalization performance matters. And it's all too common that SOTA work is harder to generalize, though this is high variance.
They explain in the paper that they used 1.5 million images with known depth maps (labels) to train a teacher model, and then used the teacher model to create pseudolabels (inferred depth maps) for the full dataset. Then they trained a student model to recover those pseudolabels from distorted versions of the original images.
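Roughly this kind of teacher-student pseudolabel loop, sketched in PyTorch; the tiny placeholder network, the distortions, the L1 loss, and the dummy batch are stand-ins I've chosen for illustration, not the paper's actual architecture or training setup:

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

class TinyDepthNet(nn.Module):
    """Stand-in depth network: RGB image in, one-channel depth map out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

teacher = TinyDepthNet()   # imagine this was trained on the 1.5M labelled images
student = TinyDepthNet()   # trained only against the teacher's pseudolabels
teacher.eval()

distort = T.Compose([T.ColorJitter(0.4, 0.4, 0.4), T.GaussianBlur(5)])
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

# Dummy batch standing in for the full unlabelled dataset.
images = torch.rand(4, 3, 64, 64)

with torch.no_grad():
    pseudo_depth = teacher(images)        # pseudolabels from the clean images
pred_depth = student(distort(images))     # student only ever sees distorted inputs
loss = nn.functional.l1_loss(pred_depth, pseudo_depth)
opt.zero_grad()
loss.backward()
opt.step()
```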
>Does your monocular depth model get fooled by a billboard that's just a photo?
This is actually a pretty clever example. I tried a few billboards on the online demo, and since these models are regressive (they output the mean of the possible outputs), the model is sometimes perplexed: it doesn't seem to know whether to output something completely flat or something with actual depth, and by being perplexed it outputs something in between.
AGI is a pretty fuzzy term whose goalposts will shift just like AI's have. You can define it that way tautologically, but I can easily see a world where we have self-driving cars but standalone AI scientists don't exist. Does that mean we have AGI because we have self-driving cars, or not, because it isn't general in that it can't also tackle other human endeavors?
How well would a monocular approach handle a path with headlights moving toward it at night? How about in rain, snow, or fog?
I'm not saying LiDAR is the only way, but I don't see a reason to use this as a solution.
I'm not saying this isn't valuable. I used to work in the 3D/metaverse space, and having depth from a single photo and being able to recreate a 3D scene from it is very valuable, and is the future.