Full Transcript

In this video we're going to derive the backpropagation algorithm, which performs a step of stochastic gradient descent for training a neural network. The algorithm for this update proceeds in three steps. First, we use a feed-forward pass to make a prediction on some data point. The predicted outputs determine the loss, and we next perform a backward pass to compute partial derivatives of that loss. A final pass through the network then uses those partial derivatives to update the weights and biases.

For much more information on the forward pass, see our previous video, where we talked in depth about how to compute all of the activations. We said in that video that we needed to store each of the activations we computed, because they would contribute to the partial derivative calculations coming later. So now I want to talk about how we compute the partial derivatives and how we use them to update the parameters. In the last video we defined a neural network loss function where, for each data point and each of the network's outputs, we computed a squared error. We then summed those squared errors over the output nodes and averaged across all of the points in the data set.
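As a reminder of that loss, here is a minimal sketch in Python. The function names `point_loss` and `dataset_loss` and the numbers are made up for illustration; the only thing taken from the description above is the shape of the computation (squared errors summed over outputs, then averaged over data points).

```python
def point_loss(targets, activations):
    """Sum of squared errors over the output nodes for one data point."""
    return sum((y - a) ** 2 for y, a in zip(targets, activations))


def dataset_loss(all_targets, all_activations):
    """Average of the per-point losses over the whole data set."""
    losses = [point_loss(y, a) for y, a in zip(all_targets, all_activations)]
    return sum(losses) / len(losses)


# Two data points, two output nodes each (made-up numbers).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
loss = dataset_loss(targets, outputs)
```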

Our task now, in performing gradient descent, is to determine how the weights of the network influence that loss. For computing those partial derivatives, the key step turns out to be computing a quantity that we'll call delta for each of the output and hidden layer nodes in the network. For any of our computing neurons, we define delta to be the partial derivative of the loss with respect to the weighted sum of inputs at that neuron. This x_i is what gets passed into the activation function for neuron i.
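Written out, the definition looks like this. The symbols w_{k,i} (weight from node k to node i), b_i (bias), and f (activation function) are notation assumed here for illustration, not taken from the video's slides.

```latex
x_i = \sum_{k} w_{k,i}\, a_k + b_i, \qquad
a_i = f(x_i), \qquad
\delta_i = \frac{\partial L}{\partial x_i}
```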

We compute this quantity on the backward pass because it will determine the weight updates we perform for every edge coming into that neuron. But since our loss consists of a sum over many data points, it will also help to think about the partial derivative of the loss on an individual data point. The loss on one data point and the loss on the whole data set are related simply: since the data set loss is an average over all of the data points, when we take a derivative of this sum we get a sum of derivatives.
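In symbols, with N data points and L^{(j)} the loss on data point j (notation assumed here), the relationship for any parameter w is:

```latex
L = \frac{1}{N} \sum_{j=1}^{N} L^{(j)}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial w} = \frac{1}{N} \sum_{j=1}^{N} \frac{\partial L^{(j)}}{\partial w}
```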

And so the derivative of the loss on the data set is just an average of the derivatives of the loss on all of the data points. When performing our backward pass on data point j, we therefore want to compute the delta for data point j on every hidden and output neuron. Beginning with a neuron in the output layer, we should think about how the weighted sum of inputs for this neuron affects the loss function. So here I am deriving the partial derivative of the loss with respect to x_12 for data point j.

Since the loss on data point j is a sum over the outputs, we'll move the derivative inside the sum and then apply the chain rule to the squared term. When we apply the chain rule here we get the derivative of the outside, so two times (target minus activation), times the derivative of the inside. This term here represents the derivative with respect to x_12 of the inside, that is, of the target minus the activation. When we think about this derivative with respect to x_12, the target is not affected by this variable, so it is a constant. The activation of the output node differs for the terms in the sum.

First it's a_11 and then it's a_12. For the other nodes, like 11, the input to node 12 has no effect on their activation. So that activation is also a constant, and the derivative goes to zero for all of the terms in the sum except for the term corresponding to our current output. For the term in this sum corresponding to node 12, the derivative of the inside will be minus the derivative of the activation with respect to x_12. I'll factor that minus sign out front.
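The steps so far can be written compactly as follows, assuming two output nodes numbered 11 and 12 and writing y_o for the target of output o (the symbol y is an assumption; the data point superscript j is dropped on the right for readability):

```latex
\frac{\partial L^{(j)}}{\partial x_{12}}
= \sum_{o \in \{11, 12\}} 2\,(y_o - a_o)\,\frac{\partial (y_o - a_o)}{\partial x_{12}}
= -2\,(y_{12} - a_{12})\,\frac{\partial a_{12}}{\partial x_{12}}
```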

Since we multiplied by zero for the terms in the sum corresponding to all the other outputs, we're left with only terms that depend on this node. We now have something that looks an awful lot like what we derived for one neuron: minus two times the error times the derivative of the activation function. In the last video we described the derivative of each of our activation functions in terms of the activation, and we stored the activation of each neuron on our forward pass, so we can evaluate the derivative of the activation function at the stored activation of this node on data point j. This same approach lets us calculate the deltas for all of the output layer neurons, but then we need to continue our backward pass and calculate deltas for hidden layer neurons. The key insight here is that a hidden layer neuron affects the output through each of the neurons at the next layer that it feeds into.
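Here is a small sketch of the output-layer delta, checked against a finite-difference estimate. It assumes a sigmoid activation, which is only one of the possible activation functions, and the names `output_delta` and `sigmoid_prime_from_activation` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime_from_activation(a):
    # Derivative of the sigmoid written in terms of its own output a = sigmoid(x).
    return a * (1.0 - a)


def output_delta(target, activation):
    # delta = dL/dx = -2 * (target - activation) * f'(x), using the stored activation.
    return -2.0 * (target - activation) * sigmoid_prime_from_activation(activation)


# Check against a finite-difference estimate for a single output node.
x12, y12 = 0.3, 1.0           # made-up weighted input and target
a12 = sigmoid(x12)            # stored on the forward pass
delta12 = output_delta(y12, a12)


def node_loss(x):
    return (y12 - sigmoid(x)) ** 2


eps = 1e-6
numeric = (node_loss(x12 + eps) - node_loss(x12 - eps)) / (2 * eps)
```

The analytic delta and the numeric estimate should agree to many decimal places, which is a handy sanity check when implementing this by hand.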

Here I have drawn the computational paths by which x_8 influences the weighted sum of inputs at the next layer. It passes through the activation function for neuron 8, and then it gets multiplied by weights before taking part in the sum of inputs to produce x_11 and x_12. If we proceed backwards through the network applying the chain rule, we can start from the partial derivatives we've already computed at the next layer, and the partial derivative of the loss with respect to x_8 will be a sum over its contribution through each of these paths. So here we are summing, over all of the nodes in the next layer, the contribution that node 8 makes through that path, and that contribution comes from applying the chain rule.

We already know the partial derivative of the loss with respect to x_11, and if we want to go further back we multiply the derivative of the outside times the derivative of the inside. So now we need to find the partial derivative of x_11 with respect to x_8, and we can break that down using our intermediate variable, which is the activation a_8. Since a_8 participates in a weighted sum of inputs to produce x_11, the derivative of x_11 with respect to a_8 treats all of the other terms in the sum as constants, and for the term corresponding to a_8 the derivative is just the weight from 8 to 11. With one final application of the chain rule we multiply by the partial derivative of a_8 with respect to x_8, and that simply comes from our known derivative of the activation function, evaluated on the activation for the current data point. Since we're working on data point j, really all of these variables for inputs and activations should have a superscript j. We now have a formula for the delta at a hidden layer node: it's a sum over the nodes in the next layer of the next-layer delta times the weight from the current node to that next-layer node, times the activation derivative for the hidden node.
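For neuron 8 feeding output nodes 11 and 12, that chain of derivatives can be written as follows. The symbols w_{8,l} (weight from node 8 to node l) and f' (derivative of the activation function) are notation assumed for illustration.

```latex
\delta_8^{(j)}
= \sum_{l \in \{11, 12\}}
  \frac{\partial L^{(j)}}{\partial x_l^{(j)}}\,
  \frac{\partial x_l^{(j)}}{\partial a_8^{(j)}}\,
  \frac{\partial a_8^{(j)}}{\partial x_8^{(j)}}
= \sum_{l \in \{11, 12\}} \delta_l^{(j)}\, w_{8,l}\, f'\!\left(x_8^{(j)}\right)
```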

And what's really great about this formula is that it wasn't specific to the second-to-last layer of the network. Everything I said about neuron 8 also applies to 9 and 10, but it similarly applies to neurons 5, 6, and 7, and this is why we are able to compute the deltas in a single backward pass. Once we know the deltas for the last layer, we can calculate the deltas for the layer before by a sum over the next-layer deltas times the weights, times the activation derivative of the current layer's nodes. That process proceeds backwards through the network until we have a delta for every hidden neuron.
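A minimal sketch of that single backward pass for one data point follows. It assumes sigmoid activations and a fully connected network stored as nested lists; the function name `backward_deltas` and the data layout are hypothetical choices made for this example.

```python
def sigmoid_prime_from_activation(a):
    # Sigmoid derivative expressed through the stored activation (an assumption).
    return a * (1.0 - a)


def backward_deltas(weights, activations, targets):
    """Return one list of deltas per computing layer, for a single data point.

    weights[m][k][l]  : weight from node k of layer m to node l of layer m + 1
    activations[m][k] : stored activation of node k in layer m (layer 0 = inputs)
    """
    last = len(activations) - 1
    deltas = [None] * (last + 1)

    # Output layer: -2 * (target - activation) * f'(x).
    deltas[last] = [
        -2.0 * (y - a) * sigmoid_prime_from_activation(a)
        for y, a in zip(targets, activations[last])
    ]

    # Hidden layers, from the second-to-last layer back to the first hidden layer.
    for m in range(last - 1, 0, -1):
        deltas[m] = [
            sum(deltas[m + 1][l] * weights[m][k][l] for l in range(len(deltas[m + 1])))
            * sigmoid_prime_from_activation(activations[m][k])
            for k in range(len(activations[m]))
        ]
    return deltas[1:]  # the input layer has no delta


# Tiny made-up network: 1 input, 2 hidden nodes, 1 output.
weights = [[[0.1, 0.2]], [[1.0], [2.0]]]
activations = [[1.0], [0.5, 0.5], [0.5]]
hidden_deltas, output_deltas = backward_deltas(weights, activations, targets=[1.0])
```

Note that the loop only ever reads deltas from layer m + 1, which is why one sweep from the output layer backwards is enough.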

Now that we know how to calculate the deltas for the output layer and for the hidden layers, we'd like to use them to actually produce the gradient vector for the parameters, that is, to find the partial derivatives for the weights and biases. We haven't calculated any of those partial derivatives yet, but it turns out that the deltas were the hard part, and from here getting the partial derivatives for the parameters is quite simple. If we think back to the partial derivatives that we came up with for a single neuron model, the derivative for each weight differed only by the activation we were multiplying by. The way we've set up the deltas, they summarize everything except for the activation we need to multiply by.

Using the delta for node 11 to get the derivative for the weight from 8 to 11, we need just one more application of the chain rule, and that is applied to this summation. Since the only part of this summation that depends on the weight from 8 to 11 is the term a_8 times the weight from 8 to 11, all of the other terms in the summation go away, and the derivative with respect to this weight is just the delta times this activation. So for data point j, the partial derivative for the weight from k to l is just the delta we calculated for l times the activation we saved for k. And just like in the single neuron model, the corresponding bias partial derivative drops the activation and is just equal to the delta.
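In code, this step is little more than an outer product of stored activations and deltas. The sketch below uses made-up values and a hypothetical function name `parameter_gradients`.

```python
def parameter_gradients(prev_activations, deltas):
    """Per-data-point partial derivatives for one layer of edges.

    prev_activations[k] : stored activation of node k feeding into this layer
    deltas[l]           : delta already computed for node l of this layer
    """
    # Weight from k to l: delta for l times the activation saved for k.
    weight_grads = [[a_k * d_l for d_l in deltas] for a_k in prev_activations]
    # Bias of node l: just the delta for l.
    bias_grads = list(deltas)
    return weight_grads, bias_grads


# Hypothetical stored values for nodes 8, 9, 10 feeding outputs 11 and 12.
a_hidden = [0.5, 0.2, 0.9]
delta_out = [-0.25, 0.1]
w_grads, b_grads = parameter_gradients(a_hidden, delta_out)
```

Here `w_grads[0][0]` plays the role of the derivative for the weight from 8 to 11.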

Now that we have formulas for the partial derivatives of the weights and biases on a single data point, we can get the weight partial derivative of the loss for the entire data set by simply averaging over all of the data points. Likewise, the bias partial derivative is an average over the deltas for all of the data points. So, in principle, we can compute the deltas and activations for every data point and use them to get the partial derivatives for each parameter. Then we can take a gradient descent step by subtracting a learning rate times the partial derivative from each parameter.
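A minimal sketch of that full-data-set update is shown here, with the parameters flattened into one list for simplicity. The flat layout, the function name `gradient_descent_step`, and the numbers are assumptions for illustration.

```python
def gradient_descent_step(params, per_point_grads, learning_rate):
    """Subtract learning_rate times the averaged partial derivative from each parameter.

    params          : flat list of parameter values (weights and biases)
    per_point_grads : one flat list of partial derivatives per data point
    """
    n = len(per_point_grads)
    avg = [sum(g[i] for g in per_point_grads) / n for i in range(len(params))]
    return [p - learning_rate * g for p, g in zip(params, avg)]


params = [0.5, -1.0]
grads = [[0.2, 0.0], [0.4, -2.0]]   # made-up derivatives for two data points
new_params = gradient_descent_step(params, grads, learning_rate=0.1)
```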

But if we have a very large data set, which is frequently the case in deep learning, it can take a very long time to calculate the deltas for every single data point and average them. So instead of performing exact gradient descent, what we'll often do is stochastic gradient descent. The idea is that if we randomly sample a subset of the data points, we can compute the average loss on that subset and use the gradient of the mean error on that subset as an approximation of the gradient of the loss on the entire data set. So if we average our partial derivatives not over the entire data set but over a random sample, we'll get an approximation of the gradient, and we can move in roughly the right direction and hopefully still reduce the error, but do so much more quickly.

And so we can perform a stochastic gradient descent step by computing our derivatives using a sample from the data set, and our update pass through the network will change each of the parameters by minus eta, the learning rate, times the average partial derivative computed on the sample. In summary, the backpropagation algorithm performs a stochastic gradient descent update by choosing a random sample from the data set, and a standard way to choose that random sample is to shuffle the data set and then group it into batches. Then, on each data point in the batch, we compute the activations with a feed-forward pass and the deltas with a backward pass. Using the activations and the deltas, we calculate the partial derivatives for each data point, and then we average over our batch to determine the update that we perform to each weight and each bias in the network. This gives us one step of stochastic gradient descent, so we'll need to do this many times, with many different batches sampled from the data set, in order to minimize the loss and train our neural network on the data set.
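The shuffle, batch, and update loop described above can be sketched as follows. To keep the example short and runnable, the per-data-point gradient is passed in as a function; in a real network that function would be the forward and backward passes, and here a one-weight model stands in for it. All names (`make_batches`, `sgd_epoch`, `point_grad`) and the data are hypothetical.

```python
import random


def make_batches(data, batch_size, rng):
    """Shuffle a copy of the data set and group it into batches."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


def sgd_epoch(params, data, point_grad, learning_rate, batch_size, rng):
    """One pass over the shuffled data: one parameter update per batch."""
    for batch in make_batches(data, batch_size, rng):
        grads = [point_grad(params, point) for point in batch]
        avg = [sum(g[i] for g in grads) / len(batch) for i in range(len(params))]
        params = [p - learning_rate * g for p, g in zip(params, avg)]
    return params


# Stand-in for backpropagation: gradient of (y - w*x)^2 for a one-weight model.
def point_grad(params, point):
    x, y = point
    return [-2.0 * (y - params[0] * x) * x]


rng = random.Random(0)
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]  # y = 3x, made up
params = [0.0]
for _ in range(200):
    params = sgd_epoch(params, data, point_grad, 0.1, batch_size=5, rng=rng)
```

After enough passes the single weight should settle close to 3, the slope used to generate the made-up data.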
