I find explanations like these are great for people who understand mathematical notation, sigmoid functions, derivatives, etc. Those people, however, typically understand what's going on from a simple text description of the process.
For those without a math background, the notation is very opaque. A far better explanation is to explain it numerically with simple examples.
For example, take two bits of training data:
input -> output
1 -> 0
0 -> 1
And a simple network with zero hidden nodes and train it. By hand...
Then add another bit of training data:
0.5 -> 1.5
Notice that it is now impossible to fit the training data exactly, however many training iterations we do. Now add a hidden layer with one or two nodes. Now we can fit the data perfectly, but show that depending on the initial weights we might never get there through gradient descent. Now's the time to mention different types of optimizers, momentum, etc.
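To make the by-hand part concrete, here is a minimal sketch of that first exercise, assuming the simplest possible "network" (one weight and one bias, no activation), squared error, and plain gradient descent; the learning rate and iteration count are just illustrative:

    # Two training points from above: 1 -> 0 and 0 -> 1.
    data = [(1.0, 0.0), (0.0, 1.0)]   # (input, target)
    w, b = 0.5, 0.0                   # arbitrary starting weights
    lr = 0.1                          # learning rate (illustrative)

    for step in range(1000):
        for x, t in data:
            y = w * x + b             # forward pass
            err = y - t
            w -= lr * err * x         # dE/dw = (y - t) * x for E = 0.5*(y - t)^2
            b -= lr * err             # dE/db = (y - t)
    print(w, b)                       # converges near w = -1, b = 1: both points fit exactly

    # Add the third point (0.5 -> 1.5): no straight line passes through all
    # three, so the loss can never reach zero no matter how long we train.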
Just to add a slightly different perspective: I'm comfortable with the notation and calculus involved, but had not known how backpropagation worked until now.
I'm not sure if it's the same for others, but I don't find bare text descriptions with formulas particularly useful. Mathematical notation on a page is great for rote application of rules and computation, but by itself does not easily communicate an intuitive understanding of the system the math represents. I have to work very hard to build up mental pictures of systems described by just notation, and those mental pictures often have to move in complicated ways as well.
The relationship between maths on the page and the system it describes is like the relationship between seeing musical notation on a page and hearing a full orchestra. One is a dry accounting of the facts involved. The other is moving and powerful in its richness and immediacy, a living thing that defies easy communication beyond the experience itself.
Demonstrations like this show you the maths _and_ build up a picture for you at the same time. The result of that is that you can communicate a very powerful idea (e.g. backpropagation) very precisely, intuitively and quickly.
Very much worth a five minute scroll for me; YMMV!
Kind of tired of people in the programming community proudly complaining about not knowing simple math notation. Educate yourself. Everyone else in the engineering world knows these basic math notations.
Math notation is simple because it's heavily overloaded. For example, it does me no good to know about exponentials when a superscript is used in a different context. Reading any nontrivial math requires knowing what the notation means in the particular context of the work. IME mathematicians typically assume the reader is familiar already and usually don't explain.
Anyone targeting their work at non-experts should explain even what seems like trivial notation to them since they can't know what other meanings the reader may think the notation holds.
But the fact is that math notation is the most widely known notation for writing down equations (let's make things specific to the OP case and say summation equations and partial derivative equations).
Specifically for the superscript case, the vast vast vast majority of the cases, it will be obvious whether the superscript notation means exponentiation or indexing. When there are ambiguities, what the person explaining the equation should do is clarify the ambiguity.
It doesn't do the world any good to make the entire engineering world learn (relatively) obscure programming languages in order to be slightly more clear (and let's not forget, much more verbose) when writing down simple equations, when all one has to do is clarify a couple of ambiguities when writing down equations.
Let me make it very clear and say this: equations are meant to explain things. They are not standalone pieces of code that you can copy and paste into a REPL. They are tools to explain how something works. You should always have accompanying text that explains what all the variables in the equation mean and clarify unclear notations.
I fully agree with your last sentence. Sadly I don't often see that actually happen in practice. And when the notation isn't explained I personally tend to waste a lot of brain cycles trying to decide what the notation means because I'm reading material that I don't already know (of course - if I already knew the material I probably wouldn't be reading it).
I agree. The superscript example is a good one. In most contexts it refers to the exponent, but in the context of a cost function that minimizes a linear regression (for example), it indicates the index of the set.
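For instance, in one common way of writing the least-squares cost for linear regression (a standard textbook form, not taken from the linked demo), the parenthesized superscript indexes training examples while the outer 2 is a genuine exponent:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2

Here x^{(i)} and y^{(i)} are the i-th training input and target, and only the final squaring is exponentiation.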
Computer languages benefit from the fact that poorly designed syntax can be deprecated (not in all cases, e.g. C++) by introducing new features to the language.
Notation in math never advances in the same way for some reason.
That's the point I'm making... If you've mastered this "simple" notation, then you've probably already mastered simple neural networks like this, so you are not the target for this tutorial.
I've derived backpropagation by hand many times and diagrams like these often just confuse me more.
For me it depends on whether I'm in a passive or active state of learning.
If I'm sitting down on a Sunday afternoon reading the news, backpropagation is going to make zero sense to me.
But, if I'm actively working on a problem, it's much more useful to realize that this is no different than using gradient descent for linear regression or even minimizing a quadratic.
At that point, it becomes just a mechanical calculation (the fact that the resulting gradient looks more intimidating is irrelevant).
And for me, realizing that it's no different than taking the derivative of a quadratic actually makes it more digestible than these fancy animated tutorials.
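To illustrate what that mechanical calculation looks like on the simplest possible case, here is gradient descent minimizing a quadratic (a throwaway sketch, not tied to the article):

    # Gradient descent on f(x) = (x - 3)^2. The loop is the same one used to
    # train a network; only the gradient computation gets more elaborate.
    x = 0.0
    lr = 0.1
    for _ in range(100):
        grad = 2.0 * (x - 3.0)   # f'(x)
        x -= lr * grad
    print(x)                     # approaches 3, the minimizer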
It is different though. The derivative of the L2 loss function w.r.t. the linear regression parameters is a "flat" function that is easy to derive by hand. With neural networks you have deeply nested vector-valued functions. If you write down the chain rule, it suggests that you should compute the Jacobians of all the nested vector-valued functions and then multiply them together. This would be computationally expensive.
The key idea of backpropagation is that at each layer, you only ever need the derivatives of the loss function w.r.t. the parameters of the layer, and the Jacobian-vector product with the derivatives of the loss function w.r.t. the layer outputs. You never need to compute the Jacobians explicitly and you never need to do those high-dimensional matrix-matrix multiplications.
These are not complicated ideas but they involve a combination of software design, calculus, and linear algebra that would probably not be obvious to the average CS undergrad.
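Here is a minimal numpy sketch of that idea for a single dense layer with an elementwise activation, assuming we are handed the gradient g of the loss w.r.t. this layer's output from the layers above (names and signatures are illustrative, not from the linked post):

    import numpy as np

    def dense_backward(x, W, z, g, activation_grad):
        # x: layer input, W: weights, z = W @ x + b (pre-activation), g: dLoss/dOutput.
        # The activation's Jacobian is diagonal, so multiplying by it is just
        # an elementwise product -- no Jacobian matrix is ever built.
        gz = g * activation_grad(z)
        dW = np.outer(gz, x)   # dLoss/dW, used to update this layer's weights
        db = gz                # dLoss/db
        dx = W.T @ gz          # Jacobian-vector product, passed back to the previous layer
        return dW, db, dx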
Here's a very good explanation of the whole thing, that should be accessible to any average high school student who bothered to take a calculus course: http://neuralnetworksanddeeplearning.com/
In the US it varies by state and even city, but they did not teach calculus in high school in Philadelphia at least. Students could choose whether to take statistics or pre-calculus, and pre-calc was basically just trigonometry.
A picture of a function and its derivative, and the notion that one function gives the slopes of the other's tangents, is all that's needed; that should fit in a high-school curriculum, especially in physics.
But you are going to hear it anyway if you are going to study.
I didn't get taught derivatives, functions or summation in the UK at GCSE-level (16 year old).
I believe it was covered at A-level (18 year old) but you could only pick three or four subjects for A-Levels at the time, so you had to be selective about the subjects you picked depending on what you "want to be when you grow up", and what you thought you were good at so that you got good enough grades to go to a uni you liked.
Is it only me who gets frustrated by networks drawn upside down (i.e. with data flowing from bottom to top)? IMHO it is a poor convention, mindlessly repeated.
In English we read from top to bottom. Data flows (be it equations or flow charts) typically follow the same convention, so we can read articles in a coherent way. Even trees (both data structures and decision trees) grow from their roots downwards (so, against their original biological metaphor). At least most researchers draw neural networks from left to right, consistent with English.
I generally prefer to see neural networks depicted so that the input is on the left and the output is on the right. This is probably because I think of time on an axis as flowing left to right.
I wonder if it could be from 2D graphs, where the origin (0,0) sits at the bottom left. Or it could be that in the physical world we tend to start things at the bottom and work up (like a building, or stairs). Or maybe Microsoft Windows, where you literally start at the bottom of the screen and work your way up.
Except that the first thing I do with my new Windows images is drag that damn thing to the top of the screen so that menus drop -down- like they're supposed to dammit lol
There is also the convention of the pyramid, with greater order as you go up and chaos at the bottom. Not exactly parallel to this network, but I do picture the tip of the pyramid as the output.
Another way this "upside down" convention works for me is that this isn't water flowing downhill; it's being pushed up, with every layer of the network adding energy or input.
Finally there's the metaphor of the roots of a plant being under the fruit of the plant.
(I didn't read the reddit post, apologies if it's a duplicate or these examples are addressed there.)
While we're at it, I found the explanations by 3Blue1Brown to be very intuitive when it comes to neural networks, especially for folks who are new and don't necessarily grasp concepts when explained primarily through math.
Unfortunately there is no attribution, but this tool was created by Daniel Smilkov, who also built TensorFlow Playground and is a co-creator of TensorFlow.js.
Only one, I hope constructive, criticism: too many formulas without numbers. It would help the explanation to include numbers and show how the results are calculated. Not everybody is comfortable with using the chain rule to distribute the error across the individual weights.
Agree. Although I've been putting effort into learning more Calculus (former art student!), Linear Algebra, and the like, a version of this with actual numbers would go a long way.
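Along those lines, here is the kind of worked number a reader might want, a minimal sketch with made-up values for a single weight (not taken from the linked demo):

    x = 2.0                        # input
    w = 0.5                        # weight we want to update
    t = 3.0                        # target output

    y = w * x                      # forward pass: y = 1.0
    E = 0.5 * (y - t) ** 2         # squared error: 0.5 * (1.0 - 3.0)^2 = 2.0

    # Chain rule: dE/dw = dE/dy * dy/dw
    dE_dy = y - t                  # = -2.0
    dy_dw = x                      # =  2.0
    dE_dw = dE_dy * dy_dw          # = -4.0

    w = w - 0.1 * dE_dw            # gradient step: w becomes 0.9 and y moves toward t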
Nice demonstration, but it skips the bias value in the forward propagation step. While quite important, this value is often left out when demonstrating the propagation function, so having it in warrants a short description in my opinion.
It also skips over the bias value in the back propagation step.
The bias is just another input node whose value happens to be constant (granted, with full connectivity), right? So the motivating idea/derivation of backpropagation doesn't change.
The bias is often updated/corrected in the back propagation step as well. Its purpose is basically to shift the activation function along the x-axis, while the weights define the slope of the activation function.
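A small sketch of how the bias typically shows up in both passes for one sigmoid layer (illustrative only, not code from the linked demo):

    import numpy as np

    def forward(x, W, b):
        z = W @ x + b                      # the bias shifts the activation's input
        return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation

    def backward(x, W, b, y, g, lr=0.1):
        # g = dLoss/dy flowing back from the next layer; y = forward(x, W, b).
        gz = g * y * (1.0 - y)             # sigmoid derivative
        W -= lr * np.outer(gz, x)          # weight update
        b -= lr * gz                       # bias update: same rule, its "input" is a constant 1
        return W, b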
https://idyll-lang.org/ is built for this. You'll still have to write code for your custom graphics but it will help you get things up and running quickly.