A Derivation of Backpropagation in Matrix Form

Anurag

2022/12/06 09:16

#machine learning #neural network

Backpropagation is an algorithm for training neural networks, used together with an optimization routine such as gradient descent. Gradient descent needs the gradient of the loss function with respect to every weight in the network in order to perform a weight update that reduces the loss. Backpropagation computes these gradients in a systematic way. Backpropagation together with gradient descent is arguably the single most important algorithm for training deep neural networks, and could be said to be the driving force behind the recent emergence of deep learning.

Any layer of a neural network can be viewed as an affine transformation followed by the application of a nonlinear function. A vector is received as input and multiplied by a matrix to produce an output, to which a bias vector may be added before the result is passed through an activation function such as the sigmoid: $\text{Input} = x$, $\text{Output} = f(Wx + b)$.
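As a quick illustration, here is a minimal NumPy sketch of one such layer. The sigmoid activation and the layer sizes (4 inputs, 3 outputs) are arbitrary choices for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, x, b):
    # One layer: an affine transformation W @ x + b followed by a nonlinearity.
    return sigmoid(W @ x + b)

# Example with arbitrary sizes: 4 inputs, 3 outputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 1))
x = rng.standard_normal((4, 1))
print(layer_forward(W, x, b).shape)  # (3, 1)
```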

Consider a neural network with two hidden layers and no bias units. We derive the forward and backward pass equations in their matrix form.

The forward propagation equations are as follows:

Input: $x_{0}$
Hidden layer 1 output: $x_{1} = f_{1}(W_{1}x_{0})$
Hidden layer 2 output: $x_{2} = f_{2}(W_{2}x_{1})$
Output: $x_{3} = f_{3}(W_{3}x_{2})$
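As a running example, here is a minimal NumPy sketch of this forward pass. The hidden layer sizes (5 and 3) and output size (2) match the dimension checks later in the post; the input size of 4 and the use of sigmoid for every $f_{i}$ are arbitrary choices for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# No bias units. W1 is 5 x 4 (input size 4 is arbitrary), W2 is 3 x 5, W3 is 2 x 3.
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
W3 = rng.standard_normal((2, 3))

x0 = rng.standard_normal((4, 1))     # input
x1 = sigmoid(W1 @ x0)                # hidden layer 1 output
x2 = sigmoid(W2 @ x1)                # hidden layer 2 output
x3 = sigmoid(W3 @ x2)                # network output
print(x1.shape, x2.shape, x3.shape)  # (5, 1) (3, 1) (2, 1)
```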

To train this neural network, you could use either batch gradient descent or stochastic gradient descent. Stochastic gradient descent uses a single instance of data to perform a weight update, whereas batch gradient descent uses a complete batch of data.

For simplicity, let's assume this is a multiple regression problem.

Stochastic update loss function: $E = \frac{1}{2}\lVert z - t \rVert_{2}^{2}$

Batch update loss function: $E = \frac{1}{2}\sum_{i \in \text{Batch}}\lVert z_{i} - t_{i} \rVert_{2}^{2}$
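In NumPy both losses are one-liners. In this small sketch, $z$ is a column vector holding the network output for one instance, and the batch versions hold one instance per column; the function names are just for the example:

```python
import numpy as np

def loss_single(z, t):
    # Stochastic update loss: squared error for a single instance.
    return 0.5 * np.sum((z - t) ** 2)

def loss_batch(Z, T):
    # Batch update loss: sum of squared errors over the batch (one column per instance).
    return 0.5 * np.sum((Z - T) ** 2)
```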

Here $t$ is the ground truth for that instance and $z = x_{3}$ is the network output. We will only consider the stochastic update loss function; all the results hold for the batch version as well. Let us look at the loss function from a different perspective: given an input $x_{0}$, the output $x_{3}$ is determined by $W_{1}, W_{2}, W_{3}$. So the only tunable parameters in $E$ are $W_{1}, W_{2}, W_{3}$. To reduce the value of the error function, we have to change these weights in the negative direction of the gradient of the loss function with respect to each weight:

$w = w - \alpha_{w}\frac{\partial E}{\partial w}$ for all the weights $w$.

Here $\alpha_{w}$ is a scalar for this particular weight, called the learning rate. Its value is decided by the optimization technique used. I highly recommend reading An overview of gradient descent optimization algorithms for more information about various gradient descent techniques and learning rates.

Backpropagation equations can be derived by repeatedly applying the chain rule. First we derive them for the weights in $W_{3}$:

\begin{align}
\frac{\partial E}{\partial W_{3}} &= (x_{3}-t)\frac{\partial x_{3}}{\partial W_{3}} \\
&= (x_{3}-t)\circ f'_{3}(W_{3}x_{2})\frac{\partial (W_{3}x_{2})}{\partial W_{3}} \\
&= \left[(x_{3}-t)\circ f'_{3}(W_{3}x_{2})\right]x_{2}^{T}
\end{align}

Let $\delta_{3} = (x_{3}-t)\circ f'_{3}(W_{3}x_{2})$.

$\frac{\partial E}{\partial W_{3}} = \delta_{3}x_{2}^{T}$
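Continuing the forward-pass sketch from above, and assuming sigmoid for $f_{3}$, whose derivative satisfies $f'_{3}(W_{3}x_{2}) = x_{3}(1-x_{3})$ element-wise, this gradient is two lines of code:

```python
# Continues the earlier forward-pass sketch; t is an arbitrary (2, 1) target vector.
t = rng.standard_normal((2, 1))

delta3 = (x3 - t) * x3 * (1 - x3)   # (x3 - t) times f3'(W3 @ x2) element-wise, shape (2, 1)
grad_W3 = delta3 @ x2.T             # shape (2, 3), same as W3
```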

Here $\circ$ is the Hadamard (element-wise) product. Let's sanity-check this by looking at the dimensions. $\frac{\partial E}{\partial W_{3}}$ must have the same dimensions as $W_{3}$. $W_{3}$'s dimensions are 2 x 3. The dimensions of $(x_{3}-t)$ are 2 x 1 and $f'_{3}(W_{3}x_{2})$ is also 2 x 1, so $\delta_{3}$ is 2 x 1. $x_{2}$ is 3 x 1, so the dimensions of $\delta_{3}x_{2}^{T}$ are 2 x 3, which is the same as $W_{3}$. Now for the weights in $W_{2}$:

\begin{align}
\frac{\partial E}{\partial W_{2}} &= (x_{3}-t)\frac{\partial x_{3}}{\partial W_{2}} \\
&= (x_{3}-t)\circ f'_{3}(W_{3}x_{2})\frac{\partial (W_{3}x_{2})}{\partial W_{2}} \\
&= \delta_{3}\frac{\partial (W_{3}x_{2})}{\partial W_{2}} \\
&= W_{3}^{T}\delta_{3}\frac{\partial x_{2}}{\partial W_{2}} \\
&= [W_{3}^{T}\delta_{3}\circ f'_{2}(W_{2}x_{1})]\frac{\partial (W_{2}x_{1})}{\partial W_{2}} \\
&= \delta_{2}x_{1}^{T}
\end{align}

where $\delta_{2} = W_{3}^{T}\delta_{3}\circ f'_{2}(W_{2}x_{1})$.

Let's sanity-check this too. $W_{2}$'s dimensions are 3 x 5. $\delta_{3}$ is 2 x 1 and $W_{3}$ is 2 x 3, so $W_{3}^{T}\delta_{3}$ is 3 x 1 and therefore $\delta_{2}$ is also 3 x 1. $x_{1}$ is 5 x 1, so $\delta_{2}x_{1}^{T}$ is 3 x 5. So this checks out to be the same. Similarly for $W_{1}$:

\begin{align}
\frac{\partial E}{\partial W_{1}} &= [W_{2}^{T}\delta_{2}\circ f'_{1}(W_{1}x_{0})]x_{0}^{T} \\
&= \delta_{1}x_{0}^{T}
\end{align}

where $\delta_{1} = W_{2}^{T}\delta_{2}\circ f'_{1}(W_{1}x_{0})$.
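Continuing the same sketch (again with sigmoid assumed for $f_{2}$ and $f_{1}$), the remaining deltas follow the same recursion, and the dimension checks can be verified numerically:

```python
# Continues the previous snippets: delta2 and delta1 follow the same pattern.
delta2 = (W3.T @ delta3) * x2 * (1 - x2)   # f2'(W2 @ x1) = x2 * (1 - x2) for sigmoid
delta1 = (W2.T @ delta2) * x1 * (1 - x1)   # f1'(W1 @ x0) = x1 * (1 - x1) for sigmoid

# Each gradient has the same shape as its weight matrix.
assert (delta3 @ x2.T).shape == W3.shape   # (2, 3)
assert (delta2 @ x1.T).shape == W2.shape   # (3, 5)
assert (delta1 @ x0.T).shape == W1.shape   # (5, 4); 4 was the arbitrary input size
```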

We can observe a recursive pattern emerging in the backpropagation equations. The forward and backward passes can be summarized as below. The neural network has $L$ layers; $x_{0}$ is the input vector, $x_{L}$ is the output vector, and $t$ is the truth vector. The weight matrices are $W_{1}, W_{2}, \ldots, W_{L}$ and the activation functions are $f_{1}, f_{2}, \ldots, f_{L}$.

Forward pass:

\begin{align}
x_{i} &= f_{i}(W_{i}x_{i-1}) \\
E &= \frac{1}{2}\lVert x_{L}-t \rVert_{2}^{2}
\end{align}

Backward pass:

\begin{align}
\delta_{L} &= (x_{L}-t)\circ f'_{L}(W_{L}x_{L-1}) \\
\delta_{i} &= W_{i+1}^{T}\delta_{i+1}\circ f'_{i}(W_{i}x_{i-1})
\end{align}

Weight update :

\begin{align}
\frac{\partial E}{\partial W_{i}} &= \delta_{i}x_{i-1}^{T} \\
W_{i} &= W_{i}-\alpha_{W_{i}}\frac{\partial E}{\partial W_{i}}
\end{align}
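Putting the summary together, here is a minimal self-contained NumPy sketch of one stochastic training step for an $L$-layer network of this form. Sigmoid activations for every layer, the layer sizes, and the learning rate of 0.1 are all assumptions of the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(weights, x0, t, lr=0.1):
    """One stochastic gradient step for a bias-free network x_i = f_i(W_i x_{i-1})."""
    # Forward pass: store every layer output x_0, x_1, ..., x_L.
    xs = [x0]
    for W in weights:
        xs.append(sigmoid(W @ xs[-1]))

    # Backward pass. For sigmoid, f'(W_i x_{i-1}) = x_i * (1 - x_i).
    delta = (xs[-1] - t) * xs[-1] * (1 - xs[-1])                  # delta_L
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = delta @ xs[i].T                                # dE/dW = delta x^T
        if i > 0:
            delta = (weights[i].T @ delta) * xs[i] * (1 - xs[i])  # next delta down

    # Weight update: W_i <- W_i - alpha * dE/dW_i.
    return [W - lr * g for W, g in zip(weights, grads)]

# Example with the layer sizes used earlier (input size 4 is arbitrary).
rng = np.random.default_rng(0)
weights = [rng.standard_normal(s) for s in [(5, 4), (3, 5), (2, 3)]]
x0, t = rng.standard_normal((4, 1)), rng.standard_normal((2, 1))
weights = train_step(weights, x0, t)
```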

The backpropagation equations, written in matrix form, have two advantages.

One can easily convert these equations into code using NumPy in Python or MATLAB, and the result is much closer to the way neural networks are implemented in libraries. Using matrix operations also speeds up the implementation, since one can rely on high-performance matrix primitives from BLAS. GPUs are likewise well suited to these matrix computations, as they parallelize naturally.

The matrix version of backpropagation is intuitive to derive and easy to remember, as it avoids the confusing, cluttered derivations involving summations and multiple subscripts.
