Backpropagation is an algorithm used to train neural networks, used along with an optimization routine such as gradient descent. Gradient descent requires access to the gradient of the loss function with respect to all the weights in the network to perform a weight update, in order to minimize the loss function. Backpropagation computes these gradients in a systematic way. Backpropagation along with Gradient descent is arguably the single most important algorithm for training Deep Neural Networks and could be said to be the driving force behind the recent emergence of Deep Learning.
Any layer of a neural network can be considered as an Affine Transformation followed by application of a non linear function. A vector is received as input and is multiplied with a matrix to produce an output , to which a bias vector may be added before passing the result through an activation function such as sigmoid.
Input=x,Output=f(Wx+b)
Consider a neural network with two hidden layer like this one. It has no bias units. We derive forward and backward pass equations in their matrix form.
The forward propagation equations are as follows:
Input=x0
Hidden Layer 1 Output
x1=f1(W1∗x0)
Hidden Layer 2 Output
x1=f2(W2∗x1)
Output
x3=f3(W3∗x2)
To train this neural network, you could either use Batch gradient descent or Stochastic gradient descent. Stochastic gradient descent uses a single instance of data to perform weight updates, whereas the Batch gradient descent uses a a complete batch of data.
For simplicity lets assume this is a multiple regression problem.
Stochastic update loss function: E=21∥z−t∥22
Batch update loss function :E=21∑i∈Batch∥zi−ti∥22
Here t is the ground truth for that instance.We will only consider the stochastic update loss function. All the results hold for the batch version as well.Let us look at the loss function from a different perspective. Given an input x0 output x3 is determined by W1,W2,W3. So the only tuneable parameters in E are W1,W2,W3 so reduce the value of the error function, we have to change these weights in the negative direction of the gradient of the loss function with respect to these weights.
w=w−αw∂w∂E for all the weights w.
Here αw is a scalar for this particular weight, called the learning rate. Its value is decided by the optimization technique used. I highly recommend reading An overview of gradient descent optimization algorithms for more information about various gradient decent techniques and learning rates.
Backpropagation equations can be derived by repeatedly applying the chain rule. First we derive these for the weights in W3 :
Here ∘ is the Hadamard product. Lets sanity check this by looking at the dimensionalities.∂W3∂E must have the same dimensions as W3. W3′s dimension are 2X3,Dimensions of (x3−t) is 2 x 1 and f′(W3x2) is also 2 x 1. x2 is 3 x 1, so dimensions of δ3x2T is 2 X 3, which is same as W3. Now for the weights of W2, So this checks out to be the same. Similarly for W1
Lets sanity check this too.W2′s dimensions are 3 x 5. δ3 is 2 x 1 and W3 is 2 x 3, so W3Tδ3 is 3 x 1, so δ2 is also 3 x 1. x_{1} is 5 x 1, so δ2x1T is 3 x 5.So this checks out to be the same.Similarly for W1:
∂W1∂E=[W3Tδ2∘f2′(W1x0)]x0T=δ1x0T
We can observe a recursive pattern emerging in the backpropagation equations. The Forward and Backward passes can be summarized as below :
The neural network has L layers.x0 is the input vector, xL is the output vector and t is the truth vector. The weight metrices are W1,W2,...,WL and activation functions are f1,f2,...,fL.
Forward pass:
xi=fi(Wixi−1)E=21∥x2−t∥22
Backward pass:
δL=(xL−T)∘f′(WLxL−1)δi=Wi+1T∘f′(Wixi−1)
Weight update :
∂Wi∂E=δixi−1TWi=Wi−αWi∘∂Wi∂E
Equations for Backpropagation, represented using matrices have two advantages.
One could easily convert these equations to code using either Numpy in Python or Matlab. It is much closer to the way neural networks are implemented in libraries. Using matrix operations speeds up the implementation as one could use high performance matrix primitives from BLAS. GPUs are also suitable for matrix computations as they are suitable for parallelization.
The matrix version of Backpropagation is intuitive to derive and easy to remember as it avoids the confusing and cluttering derivations involving summations and multiple subscripts.