For classification, I think of it simply as a nonlinear transformation + multivariable logistic regression, where the parameters are learned jointly. In particular, the nonlinear transformation is assumed to take the form of some number of affine transformations, each followed by a nonlinear component-wise mapping. I tend to intentionally avoid brain comparisons because: 1) there's more than enough of that already, and 2) I don't know enough of the neurophysiology to speculate. I'd like to see some mathematical analysis of which classes of functions are more efficiently represented (and/or learned) by networks with increasing numbers of layers.
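That view can be written down directly. Below is a minimal NumPy sketch (layer sizes, the ReLU choice of nonlinearity, and all names are made up for illustration): a "network" is just a stack of affine maps, each followed by a component-wise nonlinearity, with a softmax (multinomial logistic regression) on the transformed features at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

def affine(x, W, b):
    # One affine transformation: x -> xW + b
    return x @ W + b

def relu(x):
    # Component-wise nonlinear mapping (ReLU chosen arbitrarily here)
    return np.maximum(x, 0.0)

def softmax(z):
    # Multinomial logistic regression output: normalized class probabilities
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params):
    # The "nonlinear transformation": affine layers, each followed by a
    # component-wise nonlinearity...
    *hidden, (W_out, b_out) = params
    for W, b in hidden:
        x = relu(affine(x, W, b))
    # ...then logistic regression on the transformed features.
    return softmax(affine(x, W_out, b_out))

# Made-up sizes: 4 input features, one hidden layer of width 8, 3 classes.
params = [
    (rng.standard_normal((4, 8)), np.zeros(8)),
    (rng.standard_normal((8, 3)), np.zeros(3)),
]
probs = forward(rng.standard_normal((5, 4)), params)  # shape (5, 3)
```

"Learned jointly" then just means gradient descent on the cross-entropy loss with respect to every `W` and `b` at once, rather than fitting the transformation and the logistic regression separately.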

