How close are we to solving vision? (piekniewski.info)
65 points by avivo on Nov 13, 2016 | hide | past | favorite | 25 comments


Here's an interesting test case for "context-aware" machine perception: the UK's Hazard Perception test for new drivers. You sit at a computer and are shown 15 recordings of everyday road scenes, from the driver's view, and you have to click when a "developing hazard" appears. It's quite an annoying test, since what counts as a "developing hazard" is not clearly defined, but the intention is that practicing for the test trains your subconscious to scan for minor cues -- e.g., a car rapidly approaching from a side road (which will likely pull out without stopping), or two pedestrians finishing their conversation and one turning to face the street (they will likely step out into the road).

https://www.gov.uk/theory-test/hazard-perception-test

https://www.youtube.com/watch?v=SdQRkmdhwJs


The visual cortex in the brain (which is the physical analogue of the conv nets) is not the whole brain; it is really just a feature extractor for the rest of the brain. So saying 'vision is not solved' and 'so much for superhuman performance' is not really news: the performance is superhuman on that particular dataset, and that is all. You still need the rest of the brain to reason about the inputs and correct weird errors. These results are a stepping stone to better results on harder datasets. I have a feeling people will be saying 'vision is not solved' until general AI is realised (which in a sense is true, but discounts the very real progress being made today).


I would go further and say that the visual cortex and the rest of the brain are really just feature extractors for the body to use. And without a lifetime of embodied experience, neither can "vision" be solved nor can "general AI" be realized.

It's impossible to really see something, or to understand it, unless you have the opportunity to interact with it (or things like it). Without interaction there can be no intelligence. The best you can do is have an intelligent being (the human researcher) train a computer to mimic intelligent action in a constrained environment. But there is no path from there to general intelligence without interaction.

And there is no interaction without a body of some shape. The body can be virtual, and needn't look like a human body, but there needs to be a set of actuators which cause realtime changes in the input space.


> Without interaction there can be no intelligence. And there is no interaction without a body of some shape.

I disagree fully. Can a person born quadriplegic not be considered intelligent simply because they can't interact with things? They can interact by communicating, same as a computer can.


A quadriplegic has a body: they can move their head, mouth, eyes, etc., and those things can influence the world. That constitutes a perception-action loop, which is a basic requirement for intelligence.

An AI without any actuators is more like a fully paralyzed person.

A baby born totally paralyzed would in fact be profoundly cognitively impaired; they would be unable to communicate or to form anything resembling intelligence. See http://io9.gizmodo.com/the-seriously-creepy-two-kitten-exper...

An AI with vocal cords and a microphone in the real world has a body.


>You still need the rest of the brain to reason about the inputs

Specifically, you need the rest of the brain to know what to look for and where, so "vision" is not really a separable facility, agreed.


This is the same kind of question as: "are computers better than humans at math?" They are obviously better at some things related to scale: they can compute things much faster and derive solutions to equations much more easily. But the issue is that they don't really "understand" what they do. And that is why computers are still not better than humans at discovering things independently, even though a lot of proofs are now machine-assisted.

Similarly, machines are becoming better than us at recognizing some specific instances within categories of objects because they can know more of them, i.e. they have larger "databases". But they are still bad at learning new concepts on their own, even though there has been much progress on that front in recent years.

In general, I don't think it is a good idea to consider limitations in computer vision as "vision" issues; instead we should consider them as wider AI issues. Basically, ask ourselves: "could a blind human solve this problem with the information our current vision algorithms have?"

I wrote a more detailed response along those lines over three years ago on Quora, when I was still working in CV and the field was making its switch to neural nets and deep learning. I still think it is mostly relevant today.

https://www.quora.com/What-are-the-major-open-problems-in-co...


It's not obvious that humans really "understand" math. Or at least, only a very tiny minority of humans understand math well enough to improvise with it.

Most humans are only able to learn a small handful of "cookbook" math practices.

This is a standard trope in AI - AIs are compared with the sum total skill of human culture as a whole, not with the relatively weak skills of individual humans. (We have individuals with stand-out skills in specific domains, but there are no - at least virtually no - individuals with stand-out skills in many domains.)

Perhaps future approaches to AI will be collective. Instead of a single smart all-powerful monoAI we'll build evolving problem-solving polyAI cultures, and skim off the skills and insights they develop.

So "solving vision" isn't a useful measure. AI vision is getting close to classifying photos with human-like levels of consistency. 3D vision is still a problem, but will probably come with time.

But then what? Non-blind humans can all recognise familiar people, pick out strangers as strangers, identify a standard selection of objects, make educated guesses about non-familiar objects, and so on.

But humans can also appreciate art, identify memes and find them amusing, respond to font choices and colours, describe and label spatial relationships and views, and point to the location of objects/places that are not currently in view.

Trained artists and architects can identify and name specific proportions and identify cultural references.

Etc. How many of these are necessary to "solve vision"?


Could these problems be solved with bigger networks, or do you really need to improve the algorithm beyond that?


No, not really. The "structure of the world" such nets learn is based on bottom-up processing of data -- moving from basic features such as orientations and colours to more complex features. As a result they make famously absurd predictions, like mistaking a spotty fur coat for an actual leopard: the net has no "model" of a leopard in the sense that lets people reason "this is an absurd place for a leopard to sit, so it's most certainly a fur coat rather than an animal". Or, to use a more technical term, it has no prior probability of a leopard given the observed data. Hence a standard convnet (if there is such a thing) will massively overestimate the probabilities of such "adversarial" stimuli.
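The missing-prior point can be made concrete with a toy Bayes calculation (all numbers here are invented for illustration): even when the pixel evidence slightly favours "leopard", a realistic prior over scenes flips the decision.

```python
# Toy illustration (invented numbers): combining a likelihood with a
# scene prior, which a purely bottom-up classifier lacks.

# Likelihood of the observed spotty texture under each hypothesis.
likelihood = {"leopard": 0.6, "fur_coat": 0.4}

# Prior probability of each object appearing on a living-room sofa.
prior = {"leopard": 0.001, "fur_coat": 0.2}

# Unnormalised posterior: P(class | data) is proportional to
# P(data | class) * P(class).
posterior = {c: likelihood[c] * prior[c] for c in likelihood}
total = sum(posterior.values())
posterior = {c: p / total for c, p in posterior.items()}

best = max(posterior, key=posterior.get)
print(best)  # fur_coat wins despite the texture favouring leopard
```

The bottom-up net effectively stops at `likelihood`; the "absurd place for a leopard" intuition is the `prior` term.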


The progress that has been made in the past 5 years has been amazing. Five years ago, no one would have predicted superhuman performance on ImageNet by 2016. Hell, no one predicted it would even be close for much simpler datasets like CIFAR-10, which is just low-res images of 10 types of objects. This is amazing progress; don't let the AI Effect ruin it (https://en.wikipedia.org/wiki/AI_effect).

Second, you can't measure progress without a quantitative benchmark. Feeding a net random images from a different dataset and then just noticing it makes some mistakes is not scientific. Sure, I agree: ImageNet has been beaten and we need something better to compare with humans. We need bigger and harder datasets. We need more interesting tasks than classification. We need to work more on video than static images. And researchers are working on this. It's not going to happen overnight, but if the current rate of progress continues, it won't be that long.

Also, I question whether this focus on machine vision is actually that productive. Originally the developments in vision generalized to many other domains. But now they are increasingly focused on little tricks and optimizations that only apply to that specific task. I don't think it's contributing towards general AI any more.

The human brain has evolved to do vision well. It probably uses a huge number of tricks and optimizations to do as well as it does. NNs may eventually get that good, but it's interesting they can do so well without being so highly task specialized. This makes them very general and applicable to many other kinds of problems.

Lastly, half the problem is just computing resources. The biggest nets are still roughly comparable to insect brains (more synapses, but fewer neurons). It's really amazing that we can get such good results with such underpowered computers. Much better machine vision might be possible if we had more computing power to train with. Training on big datasets, high-resolution images, and especially video, can be really expensive.

>My point here is different: notice that the mistakes that those models make are completely ridiculous to humans. They are not off by some minor degree, they are just totally off.

I wonder if the algorithms think the mistakes humans make are equally ridiculous? Hinton once found some crazy errors NNs made, and then pointed out that the image actually does kind of look like that thing, if you squint.

>In my next post in a few days I will go deeper into the problems of deep nets and analyse the so called adversarial examples. These special stimuli reveal a lot about how convolutional nets work and what their limitations are.

This is a super overblown issue. It's been shown that every machine learning algorithm is vulnerable to adversarial examples, especially linear models; NNs are actually more resistant to them. We don't know that humans aren't vulnerable to them: no one's ever opened up a human brain and backpropagated to the inputs. And adversarial examples are astronomically unlikely to occur by chance.
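The linear-model case is easy to demonstrate. Here is a minimal numpy sketch (toy random weights, not any real dataset) of the fast-gradient-sign construction: each input coordinate moves by a tiny epsilon, but the effect on the score accumulates across dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear classifier: score = w . x, positive score => class 1.
dim = 1000
w = rng.standard_normal(dim)

# An input the model confidently places in class 0.
x = -0.1 * np.sign(w)

# Fast-gradient-sign perturbation: every coordinate shifts a small
# epsilon in the direction that raises the score. Per-coordinate the
# change is tiny; summed over 1000 dimensions it dominates.
eps = 0.2
x_adv = x + eps * np.sign(w)

print(w @ x, w @ x_adv)  # score flips from negative to positive
```

Note how the perturbation is imperceptibly small per coordinate (0.2) yet flips the decision, which is exactly why high-dimensional linear models are the easiest targets.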


I am not sure this can be called "adversarial examples" but we sure know how to make the human brain fail at vision tasks, with optical illusions and things like the famous Invisible Gorilla test.

https://en.wikipedia.org/wiki/Inattentional_blindness#Invisi...


>This is a super overblown issue. It's been shown that every machine learning algorithm is vulnerable to adversarial examples, especially linear models; NNs are actually more resistant to them. We don't know that humans aren't vulnerable to them: no one's ever opened up a human brain and backpropagated to the inputs. And adversarial examples are astronomically unlikely to occur by chance.

Generative models are only vulnerable to adversarial examples that are actually unlikely in the data distribution. They do not have patterns or filters that can be added to ordinary images to cause wild misclassifications.
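A toy version of that argument, with invented class-conditional Gaussians: a generative classifier models p(x) itself, so it can flag an off-manifold input as improbable, which a purely discriminative decision rule cannot.

```python
import numpy as np

# Two 1-D classes modelled generatively as unit-variance Gaussians
# (parameters invented for illustration).
means = {"cat": -2.0, "dog": 2.0}

def log_px_given_c(x, c):
    # Log density of a Gaussian with mean means[c], std 1.
    return -0.5 * (x - means[c]) ** 2 - 0.5 * np.log(2 * np.pi)

def log_px(x):
    # Equal class priors: p(x) = 0.5 p(x|cat) + 0.5 p(x|dog)
    return np.logaddexp(log_px_given_c(x, "cat"),
                        log_px_given_c(x, "dog")) + np.log(0.5)

x_normal, x_weird = 1.5, 30.0
# A discriminative rule would still assign x_weird a confident label,
# but the generative density exposes it as wildly off-manifold.
print(log_px(x_normal), log_px(x_weird))
```

The low log p(x) for the weird input is the "actually unlikely in the data distribution" check that discriminative nets never perform.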

The brain, as far as we know, uses generative modeling.

So yeah.


"Generative models are only vulnerable to adversarial examples that are actually unlikely in the data distribution. They do not have patterns or filters that can be added to ordinary images to cause wild misclassifications."

^ Actually, you are wrong on this. See this recent paper "Universal adversarial perturbations"

https://arxiv.org/pdf/1610.08401v1.pdf


That paper deals with non-stochastic deep neural networks, and its mathematical analysis deals with discriminative classification. It doesn't deal with generative models, which model the joint probability distribution of classes and data instances rather than just taking a maximum-a-posteriori estimate from the posterior.


We don't know that the brain uses generative models. Generative models are pretty inefficient.

Also, the original adversarial examples paper found that autoencoders were just as vulnerable. I don't see why generative models wouldn't be vulnerable.


>We don't know that the brain uses generative models.

We have some fair evidence, see: http://www.fil.ion.ucl.ac.uk/~karl/A%20free%20energy%20princ...


One thing computer vision is missing is inferring a depth map from a single 2D image. You can look at a photograph and describe it as a 3D scene. This will be important for many fields.


This problem has already been solved with decent success using deep learning.

See: https://homes.cs.washington.edu/~jxie/pdf/deep3d.pdf


That's a good start. I was thinking you could generate unlimited training data by using a game engine. You'd have the actual 3D model for every single frame.
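A toy numpy sketch of why engines give you this for free: the z-buffer used for rendering is itself a perfect per-pixel depth label. (The "scene" here is just two invented spheres, standing in for the engine's geometry.)

```python
import numpy as np

# Toy stand-in for an engine's z-buffer: rasterize a couple of circles
# and record, for every pixel, both a crude "rendered" intensity and
# the exact depth -- the ground truth a game engine produces anyway.
H, W = 64, 64
ys, xs = np.mgrid[0:H, 0:W]

depth = np.full((H, W), np.inf)   # z-buffer, inf = background
image = np.zeros((H, W))          # crude shaded render

# (cx, cy, radius, z) per object; arbitrary invented scene layout.
for cx, cy, r, z in [(20, 20, 10, 5.0), (40, 44, 14, 8.0)]:
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    closer = mask & (z < depth)   # the standard z-test
    depth[closer] = z
    image[closer] = 1.0 / z       # fake shading falls off with depth

# (image, depth) is one perfectly labelled training pair, for free.
print(np.isfinite(depth).sum(), "pixels with ground-truth depth")
```

A real pipeline would read the engine's actual depth buffer per frame, but the labelling principle is exactly this.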


Yup, the community is on it!

http://www.cv-foundation.org/openaccess/content_cvpr_2016/ht...

http://www.cv-foundation.org/openaccess/content_cvpr_2016/ht...

https://link.springer.com/chapter/10.1007/978-3-319-46475-6_...

And there's more every week... Blender, Unity Engine, Unreal Engine, you name it. (Disclaimer: am author on one of these papers)


I'm aware that certain automotive companies are already doing this.


Is each frame looked at separately? From what is shown, there seems to be no memory building up context and pruning the options. Is that really hard to add?


There's something called "attentional neural networks" that attempt to do this. They tend to do very well in reading natural language, IIRC, but I've also seen them applied to video.
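For reference, the core of most attention mechanisms is just a softmax-weighted average: each query position learns a distribution over the other positions and blends their features. A minimal numpy sketch with random toy inputs (one common dot-product form, not any specific paper's architecture):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax: each query gets a distribution over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 6, 8                       # e.g. 6 video frames, 8-dim features
Q = rng.standard_normal((seq, d))
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))

out, weights = attention(Q, K, V)
print(out.shape)                    # one context-blended feature per frame
```

For video, the queries could come from the current frame and the keys/values from past frames, which is one way to get the "memory" the parent comment asks about.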


Wouldn't some kind of recurrent network give better results for restricted fields of vision like this?



