I have a huge problem trying to derive the backpropagation equations. All the solutions I've found online are not as detailed as I'd like, hence I'm here asking for your help. First of all, sorry for the long preface, but I think it's necessary in order to fully understand what's going on here. If you want to skip it, jump to the backpropagation section.
Consider the following assumptions (for a neural net for a regression task):
- $ \cdot $ denotes the scalar product;
- $ \odot $ denotes the Hadamard (element-wise) product;
- $x = \begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix} \in \mathbb{R}^{n} \equiv \mathbb{R}^{n \times 1}$ denotes a column vector (denominator layout);
- $\left( x^{(i)}, y^{(i)} \right)$ denotes the $i$-th training instance;
- $L$ is the total number of layers in the net (with $1$ being the input layer and $L$ the output one);
- $\varphi(\cdot)$ is the activation function of each neuron in the net;
- $\vartheta^{(l)}_{ij}$ is the weight of the edge connecting the $i$-th neuron in the $(l+1)$-th layer to the $j$-th neuron in the $l$-th layer;
- $n_l$ denotes the number of neurons in the $l$-th layer, $\forall l=1..L$
- $\Theta^{(l)} = \begin{bmatrix} \vartheta^{(l)}_{11} & \vartheta^{(l)}_{12} & \ldots & \vartheta^{(l)}_{1 n_{l}} \\ \vartheta^{(l)}_{21} & \vartheta^{(l)}_{22} & \ldots & \vartheta^{(l)}_{2 n_{l}} \\ \vdots & \vdots & \ddots & \vdots \\ \vartheta^{(l)}_{n_{l+1} 1} & \vartheta^{(l)}_{n_{l+1} 2} & \ldots & \vartheta^{(l)}_{n_{l+1} n_{l}} \\ \end{bmatrix} \in \mathbb{R}^{n_{l+1} \times n_{l}}$ is the weight matrix collecting all the weights between the $l$-th and $(l+1)$-th layers, $\forall l=1..L-1$
- $b^{(l)} \in \mathbb{R}^{n_l}$ is the bias vector for the $l$-th layer
- $a^{(l)} \in \mathbb{R}^{n_l}$ is the output vector of the $l$-th layer
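To make the notation concrete, here is a minimal NumPy sketch of the shapes I have in mind (the layer sizes are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical layer sizes n_1..n_L (values made up for illustration); here L = 3
n = {1: 3, 2: 4, 3: 1}
L = len(n)

# Theta^(l) maps layer l to layer l+1 and has shape (n_{l+1}, n_l), for l = 1..L-1
Theta = {l: rng.standard_normal((n[l + 1], n[l])) for l in range((1), L)}

# b^(l) is a column vector of shape (n_l, 1); biases start at layer 2
b = {l: rng.standard_normal((n[l], 1)) for l in range(2, L + 1)}

for l in range(1, L):
    print(f"Theta^({l}):", Theta[l].shape)   # (n_{l+1}, n_l)
```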
Forward propagation (optional reading)
Then, for the forward propagation you have that \begin{align} a^{(1)} &= x^{(i)} \in \mathbb{R}^{n_1} \\ a^{(2)} &= \varphi \left( \Theta^{(1)} \cdot a^{(1)} + b^{(2)} \right) = \varphi \left( z^{(2)} \right) \\ \vdots \\ a^{(L)} &= h_\Theta \left( x^{(i)} \right) = \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} = z^{(L)} \end{align}
We can generalize the previous equations as \begin{align} &\begin{cases} a^{(1)} = x^{(i)} \\[1ex] z^{(1)} = \text{undefined} \end{cases}\\[2ex] &\begin{cases} a^{(l)} &= \varphi \left( z^{(l)} \right) \\[1ex] z^{(l)} &= \Theta^{(l-1)} \cdot a^{(l-1)} + b^{(l)} \end{cases}, \quad\forall l=2..L-1 \\[2ex] &\begin{cases} a^{(L)} &= h_\Theta (x) = z^{(L)} \\[1ex] z^{(L)} &= \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} \end{cases} \end{align}
From a dimensional point of view everything works fine; indeed, the dimension of $z^{(l)}$ can be computed as follows: \begin{equation} (n_{l} \times n_{l-1}) \cdot (n_{l-1} \times 1) + (n_l \times 1) = (n_l \times 1) + (n_l \times 1) = (n_l \times 1) \end{equation} hence also $a^{(l)} \in \mathbb{R}^{n_l}$ (as we would expect, since the $l$-th layer contains exactly $n_l$ neurons and $a^{(l)}$ collects the outputs of all the neurons in that layer).
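Continuing the toy snippet above, the forward pass then reads as the sketch below (with `phi` as a placeholder activation and a linear output layer, as in the equations):

```python
def phi(z):
    # placeholder activation; any element-wise nonlinearity would do here
    return np.tanh(z)

def forward(x, Theta, b, L):
    """Return the activations a^(l) and pre-activations z^(l), for l = 1..L."""
    a, z = {1: x}, {}
    for l in range(2, L + 1):
        z[l] = Theta[l - 1] @ a[l - 1] + b[l]      # shape (n_l, 1)
        a[l] = z[l] if l == L else phi(z[l])       # linear output layer (regression)
    return a, z

x = rng.standard_normal((n[1], 1))   # a single training instance x^(i)
a, z = forward(x, Theta, b, L)
print(a[L].shape)                    # (n_L, 1), i.e. h_Theta(x^(i))
```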
Gradient descent (optional reading)
The problems arise when I try to derive the backpropagation equations. For gradient descent we know that \begin{equation} \Theta \leftarrow \Theta - \eta \nabla_{\Theta} J \end{equation} where $\eta$ is the learning rate and $J(\Theta)$ denotes the cost function, defined as \begin{equation} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} J^{(i)} \end{equation} with $J^{(i)}$ being the cost function for the $i$-th training sample: \begin{align} J^{(i)} &= \frac{1}{2} \left[ h_\Theta \left(x^{(i)}\right) - y^{(i)} \right]^2 = \frac{1}{2} \left[ a^{(L)} - y^{(i)} \right]^2 \end{align} Note that the gradient $\nabla_\Theta J$ can be written as \begin{equation} \nabla_\Theta J = \begin{bmatrix} \frac{\partial J}{\partial \Theta^{(1)}} \\ \frac{\partial J}{\partial \Theta^{(2)}} \\ \vdots \\ \frac{\partial J}{\partial \Theta^{(L-1)}} \end{bmatrix} \end{equation} Hence we can focus on the $l$-th partial derivative as follows \begin{equation} \Theta^{(l)} \leftarrow \Theta^{(l)} - \eta \frac{\partial J}{\partial \Theta^{(l)}} \end{equation} By replacing the definition of $J$ in the previous equation, we can write \begin{equation} \Theta^{(l)} \leftarrow \Theta^{(l)} - \eta \frac{\partial}{\partial \Theta^{(l)}} \left[ \frac{1}{m} \sum_{i=1}^{m} J^{(i)} \right] \end{equation} which, by the linearity of differentiation, becomes \begin{equation} \Theta^{(l)} \leftarrow \Theta^{(l)} - \frac{\eta}{m} \sum_{i=1}^{m} \frac{\partial J^{(i)}}{\partial \Theta^{(l)}} \end{equation}
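In code, one such update would look roughly like the sketch below (the `grads` argument is a hypothetical placeholder for whatever backpropagation eventually returns):

```python
eta = 0.01   # learning rate (arbitrary value)

def gradient_step(Theta, grads, eta):
    """One step of Theta^(l) <- Theta^(l) - eta * dJ/dTheta^(l), for every l."""
    # grads[l] is assumed to already be (1/m) * sum_i dJ^(i)/dTheta^(l),
    # i.e. the average of the per-sample gradients, with the same shape as Theta[l]
    return {l: Theta[l] - eta * grads[l] for l in Theta}
```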
Backpropagation
Starting from the last equation, we need to compute the partial derivative of $J^{(i)}$ with respect to each weight matrix $\Theta^{(l)}$ of the net, $\forall l=1..L-1$. Beginning at the $L$-th (output) layer and working back towards the 1st (input) one, we can write
\begin{align} \frac{\partial J^{(i)}}{\partial \Theta^{(L-1)}} &= \frac{\partial}{\partial \Theta^{(L-1)}} \left[ \frac{1}{2} \left( a^{(L)} - y^{(i)} \right)^2 \right] \\[1ex] &= \underbrace{\left( a^{(L)} - y^{(i)} \right)}_{\delta^{(L)}} \odot \frac{\partial a^{(L)}}{\partial \Theta^{(L-1)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial z^{(L)}}{\partial \Theta^{(L-1)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial}{\partial \Theta^{(L-1)}} \left( \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} \right) \\[1ex] &= \delta^{(L)} \odot \left( a^{(L-1)} \right)^\top \end{align} As you may notice, $\delta^{(L)} \in \mathbb{R}^{n_L \times 1}$, while $\left(a^{(L-1)}\right)^\top \in \mathbb{R}^{1 \times n_{L-1}}$, hence the Hadamard product is not defined. Based on the dimensions of these vectors, I would expect that product to be a scalar product $\cdot$ rather than a Hadamard one $\odot$.
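For what it's worth, printing the shapes in the toy network above seems to confirm the mismatch (`y_i` is just a random stand-in for $y^{(i)}$), and a matrix product of the two factors would at least have the same shape as $\Theta^{(L-1)}$:

```python
y_i = rng.standard_normal((n[L], 1))     # stand-in for y^(i), shape (n_L, 1)

delta_L = a[L] - y_i
print(delta_L.shape)                     # (n_L, 1)       -> (1, 1) in the toy net
print(a[L - 1].T.shape)                  # (1, n_{L-1})   -> (1, 4)
print((delta_L @ a[L - 1].T).shape)      # (n_L, n_{L-1}) -> (1, 4), same as Theta^(L-1)
```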
The problems continue if I keep backpropagating the error: \begin{align} \frac{\partial J^{(i)}}{\partial \Theta^{(L-2)}} &= \frac{\partial}{\partial \Theta^{(L-2)}} \left[ \frac{1}{2} \left( a^{(L)} - y^{(i)} \right)^2 \right] \\[1ex] &= \underbrace{\left( a^{(L)} - y^{(i)} \right)}_{\delta^{(L)}} \odot \frac{\partial a^{(L)}}{\partial \Theta^{(L-2)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial z^{(L)}}{\partial \Theta^{(L-2)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left( \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} \right) \\[1ex] &= \delta^{(L)} \odot \Theta^{(L-1)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left( a^{(L-1)} + b^{(L)} \right) \\[1ex] &= \delta^{(L)} \odot \Theta^{(L-1)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left[ \varphi \left(z^{(L-1)} \right) + b^{(L)} \right] \\[1ex] &= \underbrace{\delta^{(L)} \odot \Theta^{(L-1)} \odot \varphi' \left(z^{(L-1)} \right)}_{\delta^{(L-1)}} \odot \frac{\partial z^{(L-1)}}{\partial \Theta^{(L-2)}} \\[1ex] &= \delta^{(L-1)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left[ \Theta^{(L-2)} \cdot a^{(L-2)} + b^{(L-1)} \right] \\[1ex] &= \delta^{(L-1)} \odot \left( a^{(L-2)} \right)^\top \end{align}
And here we go again. As you may notice, $\delta^{(L)} \in \mathbb{R}^{n_L}$, $\Theta^{(L-1)} \in \mathbb{R}^{n_L \times n_{L-1}}$ and $\varphi' \left( z^{(L-1)} \right) \in \mathbb{R}^{n_{L-1} \times 1}$, hence the dimension of $\delta^{(L-1)}$ would be \begin{align} (n_L \times 1) \odot (n_L \times n_{L-1}) \odot (n_{L-1} \times 1) \end{align} which, again, cannot be computed; the dimensions make me think it should be a chain of scalar products $\cdot$ instead. Where am I going wrong?
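These are the shapes of the three factors appearing in $\delta^{(L-1)}$, again printed from the toy network (using the derivative of the tanh placeholder):

```python
print(delta_L.shape)                     # (n_L, 1)        -> (1, 1)
print(Theta[L - 1].shape)                # (n_L, n_{L-1})  -> (1, 4)
dphi = 1.0 - np.tanh(z[L - 1]) ** 2      # phi'(z^(L-1)) for the tanh placeholder
print(dphi.shape)                        # (n_{L-1}, 1)    -> (4, 1)
```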