This seems closely related to this http://www.youtube.com/watch?v=ZmNOAtZIgIk talk by Andrew Ng. It is a 40-minute talk, but he explains very simply how all this works for images and gives some examples for the audio case.
It is incredible how, using these deep learning techniques, we can teach these "neural networks" to recognize such complicated patterns. It is like reverse engineering the brain's algorithms.
BTW, I took his Coursera course on Machine Learning and it was great! I also recommend it A LOT for picking up basic ML knowledge.