How close are we to solving vision? (piekniewski.info)
65 points by avivo on Nov 13, 2016 | hide | past | favorite | 25 comments


Here's an interesting test case for "context-aware" machine perception: the UK's Hazard Perception test for new drivers. You sit at a computer and are shown 15 recordings of everyday road scenes, from the driver's view, and you have to click when a "developing hazard" appears. It's quite an annoying test, since what counts as a "developing hazard" is not clearly defined, but the intention is that practicing for the test trains your subconscious to scan for minor cues -- e.g., a car rapidly approaching from a side road (which will likely pull out without stopping), or two pedestrians finishing their conversation and one turning to face the street (they will likely step out into the road).

https://www.gov.uk/theory-test/hazard-perception-test

https://www.youtube.com/watch?v=SdQRkmdhwJs


The visual cortex in the brain (which is the physical analogue of the conv nets) is not the whole brain; it is really just a feature extractor for the rest of the brain. So saying 'vision is not solved' and 'so much for superhuman performance' is not really news: the performance is superhuman on that particular dataset, and that is all. You still need the rest of the brain to reason about the inputs and correct weird errors. These results are a stepping stone to better results on harder datasets. I have a feeling people will be saying 'vision is not solved' until general AI is realised (which in a sense is true, but discounts the very real progress being made today).


I would go further and say that the visual cortex and the rest of the brain are really just feature extractors for the body to use. And without a lifetime of embodied experience, neither can "vision" be solved nor can "general AI" be realized.

It's impossible to really see something, or to understand it, unless you have the opportunity to interact with it (or things like it). Without interaction there can be no intelligence. The best you can do is have an intelligent being (the human researcher) train a computer to mimic intelligent action in a constrained environment. But there is no path from there to general intelligence without interaction.

And there is no interaction without a body of some shape. The body can be virtual, and needn't look like a human body, but there needs to be a set of actuators which cause realtime changes in the input space.


> Without interaction there can be no intelligence. And there is no interaction without a body of some shape.

I disagree fully. Can a person born quadriplegic not be considered intelligent simply because they can't interact with things? They can interact by communicating, same as a computer can.


A quadriplegic has a body: they can move their head, mouth, eyes, etc., and those things can influence the world. That constitutes a perception-action loop, which is a basic requirement for intelligence.

An AI without any actuators is more like a fully paralyzed person.

A baby born totally paralyzed would in fact be profoundly cognitively impaired; they would be unable to communicate or to form anything resembling intelligence. See http://io9.gizmodo.com/the-seriously-creepy-two-kitten-exper...

An AI with vocal cords and a microphone in the real world has a body.


>You still need the rest of the brain to reason about the inputs

Specifically, you need the rest of the brain to know what to look for and where, so "vision" is not really a separable facility, agreed.


This is the same kind of question as: "are computers better than humans at math?" They are obviously better at some things related to scale: they can compute things much faster and derive solutions to equations much more easily. But the issue is that they don't really "understand" what they do. And that is why computers are still not better than humans at discovering things independently, even though a lot of proofs are now machine-assisted.

Similarly, machines are becoming better than us at recognizing some specific instances within categories of objects because they can know more of them, i.e. they have larger "databases". But they are still bad at learning new concepts on their own, even though there has been much progress on that front in recent years.

In general, I don't think it is a good idea to consider limitations in computer vision as "vision" issues; instead we should consider them as wider AI issues. Basically, ask ourselves: "could a blind human solve this problem with the information our current vision algorithms have?"

I wrote a more detailed response along those lines over three years ago on Quora, when I was still working in CV and the field was making its switch to neural nets and deep learning. I still think it is mostly relevant today.

https://www.quora.com/What-are-the-major-open-problems-in-co...


It's not obvious that humans really "understand" math. Or at least, only a very tiny minority of humans understand math well enough to improvise with it.

Most humans are only able to learn a small handful of "cookbook" math practices.

This is a standard trope in AI - AIs are compared with the sum total skill of human culture as a whole, not with the relatively weak skills of individual humans. (We have individuals with stand-out skills in specific domains, but there are no - at least virtually no - individuals with stand-out skills in many domains.)

Perhaps future approaches to AI will be collective. Instead of a single smart all-powerful monoAI we'll build evolving problem-solving polyAI cultures, and skim off the skills and insights they develop.

So "solving vision" isn't a useful measure. AI vision is getting close to classifying photos with human-like levels of consistency. 3D vision is still a problem, but will probably come with time.

But then what? Non-blind humans can all recognise familiar people, pick out strangers as strangers, identify a standard selection of objects, make educated guesses about non-familiar objects, and so on.

But humans can also appreciate art, identify memes and find them amusing, respond to font choices and colours, describe and label spatial relationships and views, and point to the location of objects/places that are not currently in view.

Trained artists and architects can identify and name specific proportions and identify cultural references.

Etc. How many of these are necessary to "solve vision"?


Could these problems be solved with bigger networks, or do you really need to improve the algorithm beyond that?


No, not really. The "structure of the world" such nets learn is based on bottom-up processing of data -- moving from basic features such as orientations and colours to more complex features. As a result they make famously absurd predictions, like mistaking a spotty fur coat for an actual leopard: the net has no "model" of a leopard in the sense that lets people reason "this is an absurd place for a leopard to sit, so it's most certainly a fur coat rather than an animal". Or, to use a more technical term, it has no prior probability of a leopard given the observed data. Hence a standard convnet (if there is such a thing) will massively overestimate the probabilities of such "adversarial" stimuli.
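The missing-prior point can be made concrete with a toy Bayes calculation (all numbers here are invented for illustration): even when the pixel evidence slightly favours "leopard", a realistic prior over scenes flips the decision.

```python
# Toy illustration (invented numbers): combining a likelihood with a
# scene prior, which a purely bottom-up classifier lacks.

# Likelihood of the observed spotty texture under each hypothesis.
likelihood = {"leopard": 0.6, "fur_coat": 0.4}

# Prior probability of each object appearing on a living-room sofa.
prior = {"leopard": 0.001, "fur_coat": 0.2}

# Unnormalised posterior: P(class | data) is proportional to
# P(data | class) * P(class).
posterior = {c: likelihood[c] * prior[c] for c in likelihood}
total = sum(posterior.values())
posterior = {c: p / total for c, p in posterior.items()}

best = max(posterior, key=posterior.get)
print(best)  # fur_coat wins despite the texture favouring leopard
```

The bottom-up net effectively stops at `likelihood`; the "absurd place for a leopard" intuition is the `prior` term.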


The progress that has been made in the past 5 years has been amazing. Five years ago, no one would have predicted superhuman performance on ImageNet by 2016. Hell, no one predicted it would even be close for much simpler datasets like CIFAR-10, which is just low-res images of 10 types of objects. This is amazing progress; don't let the AI Effect ruin it (https://en.wikipedia.org/wiki/AI_effect).

Second, you can't measure progress without a quantitative benchmark. Feeding a net random images from a different dataset and then just noticing it makes some mistakes is not scientific. Sure, I agree: ImageNet has been beaten and we need something better to compare with humans. We need bigger and harder datasets. We need more interesting tasks than classification. We need to work more on video than static images. And researchers are working on this. It's not going to happen overnight, but if the current rate of progress continues, it won't be that long.

Also, I question whether this focus on machine vision is actually that productive. Originally the developments in vision generalized to many other domains. But now they are increasingly focused on little tricks and optimizations that only apply to that specific task. I don't think it's contributing towards general AI any more.

The human brain has evolved to do vision well. It probably uses a huge number of tricks and optimizations to do as well as it does. NNs may eventually get that good, but it's interesting they can do so well without being so highly task specialized. This makes them very general and applicable to many other kinds of problems.

Lastly, half the problem is just computing resources. The biggest nets are still roughly comparable to insect brains (more synapses, but fewer neurons). It's really amazing that we can get such good results with such underpowered computers. Much better machine vision might be possible if we had more computing power to train with. Training on big datasets, high-resolution images, and especially video, can be really expensive.

>My point here is different: notice that the mistakes that those models make are completely ridiculous to humans. They are not off by some minor degree, they are just totally off.

I wonder if the algorithms think the mistakes humans make are equally ridiculous? Hinton once found some crazy errors NNs made, and then pointed out that the image actually does kind of look like that thing, if you squint.

>In my next post in a few days I will go deeper into the problems of deep nets and analyse the so called adversarial examples. These special stimuli reveal a lot about how convolutional nets work and what their limitations are.

This is a super overblown issue. It's been shown that every machine learning algorithm is vulnerable to adversarial examples, especially linear models; NNs are actually more resistant to them. We don't know that humans aren't vulnerable to them: no one's ever opened up a human brain and backpropagated to the inputs. And adversarial examples are astronomically unlikely to occur by chance.
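The linear-model case is easy to demonstrate. Here is a minimal numpy sketch (toy random weights, not any real dataset) of the fast-gradient-sign construction: each input coordinate moves by a tiny epsilon, but the effect on the score accumulates across dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear classifier: score = w . x, positive score => class 1.
dim = 1000
w = rng.standard_normal(dim)

# An input the model confidently places in class 0.
x = -0.1 * np.sign(w)

# Fast-gradient-sign perturbation: every coordinate shifts a small
# epsilon in the direction that raises the score. Per-coordinate the
# change is tiny; summed over 1000 dimensions it dominates.
eps = 0.2
x_adv = x + eps * np.sign(w)

print(w @ x, w @ x_adv)  # score flips from negative to positive
```

Note how the perturbation is imperceptibly small per coordinate (0.2) yet flips the decision, which is exactly why high-dimensional linear models are the easiest targets.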


I am not sure this can be called "adversarial examples" but we sure know how to make the human brain fail at vision tasks, with optical illusions and things like the famous Invisible Gorilla test.

https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...


>This is a super overblown issue. It's been shown that every machine learning algorithm is vulnerable to adversarial examples, especially linear models; NNs are actually more resistant to them. We don't know that humans aren't vulnerable to them: no one's ever opened up a human brain and backpropagated to the inputs. And adversarial examples are astronomically unlikely to occur by chance.

Generative models are only vulnerable to adversarial examples that are actually unlikely in the data distribution. They do not have patterns or filters that can be added to ordinary images to cause wild misclassifications.
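A toy version of that argument, with invented class-conditional Gaussians: a generative classifier models p(x) itself, so it can flag an off-manifold input as improbable, which a purely discriminative decision rule cannot.

```python
import numpy as np

# Two 1-D classes modelled generatively as unit-variance Gaussians
# (parameters invented for illustration).
means = {"cat": -2.0, "dog": 2.0}

def log_px_given_c(x, c):
    # Log density of a Gaussian with mean means[c], std 1.
    return -0.5 * (x - means[c]) ** 2 - 0.5 * np.log(2 * np.pi)

def log_px(x):
    # Equal class priors: p(x) = 0.5 p(x|cat) + 0.5 p(x|dog)
    return np.logaddexp(log_px_given_c(x, "cat"),
                        log_px_given_c(x, "dog")) + np.log(0.5)

x_normal, x_weird = 1.5, 30.0
# A discriminative rule would still assign x_weird a confident label,
# but the generative density exposes it as wildly off-manifold.
print(log_px(x_normal), log_px(x_weird))
```

The low log p(x) for the weird input is the "actually unlikely in the data distribution" check that discriminative nets never perform.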

The brain, as far as we know, uses generative modeling.

So yeah.


"Generative models are only vulnerable to adversarial examples that are actually unlikely in the data distribution. They do not have patterns or filters that can be added to ordinary images to cause wild misclassifications."

^ Actually, you are wrong on this. See this recent paper "Universal adversarial perturbations"

https://arxiv.org/pdf/1610.08401v1.pdf


That paper deals with non-stochastic deep neural networks, and its mathematical analysis deals with discriminative classification. It doesn't deal with generative models, which model the joint probability distribution of classes and data instances rather than just taking a maximum-a-posteriori estimate from the posterior.


We don't know that the brain uses generative models. Generative models are pretty inefficient.

Also, the original adversarial examples paper found that autoencoders were just as vulnerable. I don't see why generative models wouldn't be vulnerable.


>We don't know that the brain uses generative models.

We have some fair evidence, see: http://www.fil.ion.ucl.ac.uk/~karl/A%20free%20energy%20princ...


One thing computer vision is missing is inferring a depth map from a single 2D image. You can look at a photograph and describe it as a 3D scene. This will be important for many fields.


This problem has already been solved with decent success using deep learning.

See: https://homes.cs.washington.edu/~jxie/pdf/deep3d.pdf


That's a good start. I was thinking you could generate unlimited training data by using a game engine. You'd have the actual 3D model for every single frame.
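A toy numpy sketch of why engines give you this for free: the z-buffer used for rendering is itself a perfect per-pixel depth label. (The "scene" here is just two invented spheres, standing in for the engine's geometry.)

```python
import numpy as np

# Toy stand-in for an engine's z-buffer: rasterize a couple of circles
# and record, for every pixel, both a crude "rendered" intensity and
# the exact depth -- the ground truth a game engine produces anyway.
H, W = 64, 64
ys, xs = np.mgrid[0:H, 0:W]

depth = np.full((H, W), np.inf)   # z-buffer, inf = background
image = np.zeros((H, W))          # crude shaded render

# (cx, cy, radius, z) per object; arbitrary invented scene layout.
for cx, cy, r, z in [(20, 20, 10, 5.0), (40, 44, 14, 8.0)]:
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    closer = mask & (z < depth)   # the standard z-test
    depth[closer] = z
    image[closer] = 1.0 / z       # fake shading falls off with depth

# (image, depth) is one perfectly labelled training pair, for free.
print(np.isfinite(depth).sum(), "pixels with ground-truth depth")
```

A real pipeline would read the engine's actual depth buffer per frame, but the labelling principle is exactly this.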


Yup, the community is on it!

http://www.cv-foundation.org/openaccess/content_cvpr_2016/ht...

http://www.cv-foundation.org/openaccess/content_cvpr_2016/ht...

https://link.springer.com/chapter/10.1007/978-3-319-46475-6_...

And there's more every week... Blender, Unity Engine, Unreal Engine, you name it. (Disclaimer: am author on one of these papers)


I'm aware that certain automotive companies are already doing this.


Is each frame looked at separately? From what is shown, there seems to be no memory building up context and pruning the options. Is that really hard to add?


There's something called "attentional neural networks" that attempt to do this. They tend to do very well in reading natural language, IIRC, but I've also seen them applied to video.
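For reference, the core of most attention mechanisms is just a softmax-weighted average: each query position learns a distribution over the other positions and blends their features. A minimal numpy sketch with random toy inputs (one common dot-product form, not any specific paper's architecture):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax: each query gets a distribution over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 6, 8                       # e.g. 6 video frames, 8-dim features
Q = rng.standard_normal((seq, d))
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))

out, weights = attention(Q, K, V)
print(out.shape)                    # one context-blended feature per frame
```

For video, the queries could come from the current frame and the keys/values from past frames, which is one way to get the "memory" the parent comment asks about.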


Wouldn't some kind of recurrent network give better results for restricted fields of vision like this?



