A Derivation of Backpropagation in Matrix Form

Backpropagation is an algorithm used to train neural networks, used along with an optimization routine such as gradient descent. Gradient descent requires access to the gradient of the loss function with respect to all the weights in the network to perform a weight update, in order to minimize the loss function. Backpropagation computes these gradients in a systematic way. Backpropagation along with Gradient descent is arguably the single most important algorithm for training Deep Neural Networks and could be said to be the driving force behind the recent emergence of Deep Learning.

Any layer of a neural network can be considered as an Affine Transformation followed by application of a non linear function. A vector is received as input and is multiplied with a matrix to produce an output , to which a bias vector may be added before passing the result through an activation function such as sigmoid.
$$ Input = x , Output = f(Wx+b) $$
Consider a neural network with two hidden layer like this one. It has no bias units. We derive forward and backward pass equations in their matrix form.
<div>
<img src="img/Neural_net_1.png" width="800"/>
</div>
The forward propagation equations are as follows:
$Input=x_{0}$
Hidden Layer 1 Output
$x_{1} = f_{1} (W_{1}*x_{0})$
Hidden Layer 2 Output
$x_{1} = f_{2} (W_{2}*x_{1})$
Output
$x_{3} = f_{3} (W_{3}*x_{2})$
To train this neural network, you could either use Batch gradient descent or Stochastic gradient descent. Stochastic gradient descent uses a single instance of data to perform weight updates, whereas the Batch gradient descent uses a a complete batch of data.

For simplicity lets assume this is a multiple regression problem.
Stochastic update loss function: $E= \frac{1}{2}\lVert z-t \rVert_{2}^{2}$
Batch update loss function : $E= \frac{1}{2}\sum_{i\in Batch}\lVert z_{i}-t_{i} \rVert_{2}^{2}$
Here $ t $ is the ground truth for that instance.We will only consider the stochastic update loss function. All the results hold for the batch version as well.Let us look at the loss function from a different perspective. Given an input $ x_{0} $ output $ x_{3} $ is determined by $ W_{1},W_{2},W_{3} $. So the only tuneable parameters in $ E $ are $ W_{1},W_{2},W_{3} $ so reduce the value of the error function, we have to change these weights in the negative direction of the gradient of the loss function with respect to these weights.
$w = w - \alpha_{w}\frac{\partial E}{\partial w}$ for all the weights $w$ .
Here $ \alpha_{w} $ is a scalar for this particular weight, called the learning rate. Its value is decided by the optimization technique used. I highly recommend reading An overview of gradient descent optimization algorithms for more information about various gradient decent techniques and learning rates.
Backpropagation equations can be derived by repeatedly applying the chain rule. First we derive these for the weights in $W_{3}$ :

\begin{align} \frac{\partial E}{\partial W_{3}} = (x_{3}-t)\frac{\partial x_{3}}{\partial W_{3}} \\ = (x_{3}-t)\circ f'_{3}(W_{3}x_{2})\frac{\partial x_{3}}{\partial W_{3}} \\ = (x_{3}-t)\circ f'_{3}(W_{3}x_{2})x_{2}^{T} \\ \end{align}

Let $\delta_{3} = (x_{3}-t)\circ f'_{3}(W_{3}x_{2})$

\frac{\partial E}{\partial W_{3}} = \delta_{3}x_{2}^{T}

Here $ \circ $ is the Hadamard product. Lets sanity check this by looking at the dimensionalities.$ \frac{\partial E}{\partial W_{3}} $ must have the same dimensions as $ W_{3}

. $ W_{3}'s $ dimension are $2 X 3

,Dimensions of $ (x_{3}-t) $ is 2 x 1 and $ f'(W_{3}x_{2}) $ is also 2 x 1.

x_{2}

is 3 x 1, so dimensions of

\delta_{3}x_{2}^{T}

is 2 X 3, which is same as

W_{3}

. Now for the weights of

W_{2}

, So this checks out to be the same. Similarly for

W_{1}

$$
\begin{align}
\frac{\partial E}{\partial W_{2}} = (x_{3}-t)\frac{\partial x_{3}}{\partial W_{2}} \
= (x_{3}-t)\circ f'{3}(W{3}x_{2})\frac{\partial x_{3}}{\partial W_{2}} \
= \delta_{3}\frac{\partial(W_{3}x_{2})}{\partial(W_{2})} \
= W_{3}^{T}\delta_{3}\frac{\partial(x_{2})}{\partial(W_{2})} \
= [W_{3}^{T}\delta_{3}\circ f'{2}(W{2}x_{1})]\frac{\partial_{2}x_{1}}{\partial W_{2}}\
= \delta_{2}x_{1}^{T}
\end{align}
$$
Lets sanity check this too.$ W_{2}'s $ dimensions are 3 x 5. $ \delta_{3} $ is 2 x 1 and $ W_{}3 $ is 2 x 3, so

W_{3}^{T}\delta_{3}

is 3 x 1, so

\delta_{2}

is also 3 x 1. x_{1} is 5 x 1, so

\delta_{2}x_{1}^{T}

is 3 x 5.So this checks out to be the same.Similarly for

W_{1}:

$$
\begin{align}
\frac{\partial E}{\partial W_{1}} = [W_{3}^{T}\delta_{2}\circ f'{2}(W{1}x_{0})]x_{0}^{T}\
= \delta_{1}x_{0}^{T}
\end{align}
$$
We can observe a recursive pattern emerging in the backpropagation equations. The Forward and Backward passes can be summarized as below :
The neural network has $ L $ layers.

x_{0}

is the input vector,

x_{L}

is the output vector and

t

is the truth vector. The weight metrices are

W_{1},W_{2},..., W_{L}

and activation functions are

f_{1},f_{2},..., f_{L}

.
Forward pass:
$$
\begin{align}
x_{i} = f_{i}(W_{i}x_{i-1}) \
E = \frac{1}{2}\lVert x_{2}-t \rVert_{2}^{2}\
\end{align}
$$
Backward pass:
$$
\begin{align}
\delta_{L} = (x_{L}-T)\circ f'(W_{L}x_{L-1}) \
\delta_{i} = W_{i+1}^{T}\circ f'(W_{i}x_{i-1})\
\end{align}
$$
Weight update :
$$
\begin{align}
\frac{\partial E}{\partial W_{i}}= \delta_{i}x_{i-1}^{T} \
W_{i} = W_{i}-\alpha_{W_{i}}\circ \frac{\partial E}{\partial W_{i}} \
\end{align}
$$
Equations for Backpropagation, represented using matrices have two advantages.

One could easily convert these equations to code using either Numpy in Python or Matlab. It is much closer to the way neural networks are implemented in libraries. Using matrix operations speeds up the implementation as one could use high performance matrix primitives from BLAS. GPUs are also suitable for matrix computations as they are suitable for parallelization.

The matrix version of Backpropagation is intuitive to derive and easy to remember as it avoids the confusing and cluttering derivations involving summations and multiple subscripts.

A Derivation of Backpropagation in Matrix Form

Anurag