
I have a huge problem trying to derive the backpropagation equations. All the solutions I've found online are not as detailed as I'd like, hence I'm here asking for your help. First of all, sorry for this long preface, but I think it's necessary in order to fully understand what's going on here. If you want to skip it, jump to the backpropagation section.

Consider the following assumptions (for a neural net for a regression task):

  • $ \cdot $ denotes the scalar product;
  • $ \odot $ denotes the Hadamard (element-wise) product;
  • $x \in \mathbb{R}^{n}$, i.e. $x = \begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix} \in \mathbb{R}^{n \times 1}$, denotes a column vector (denominator layout);
  • $\left( x^{(i)}, y^{(i)} \right)$ denotes the $i$-th training instance;
  • $L$ is the total number of layers in the net (with $1$ being the input layer and $L$ the output one);
  • $\varphi(\cdot)$ is the activation function of each neuron in the net;
  • $\vartheta^{(l)}_{ij}$ is the weight of the edge going from the $j$-th neuron in the $l$-th layer to the $i$-th neuron in the $(l+1)$-th layer;
  • $n_l$ denotes the number of neurons in the $l$-th layer, $\forall l=1..L$
  • $\Theta^{(l)} = \begin{bmatrix} \vartheta^{(l)}_{11} & \vartheta^{(l)}_{12} & \ldots & \vartheta^{(l)}_{1 n_{l}} \\ \vartheta^{(l)}_{21} & \vartheta^{(l)}_{22} & \ldots & \vartheta^{(l)}_{2 n_{l}} \\ \vdots & \vdots & \ddots & \ldots \\ \vartheta^{(l)}_{n_{l+1} 1} & \vartheta^{(l)}_{n_{l+1} 2} & \ldots & \vartheta^{(l)}_{n_{l+1} n_{l}} \\ \end{bmatrix} \in \mathbb{R}^{n_{l+1} \times n_{l}}$ is the weight matrix mapping all the weights between the $l$-th and $(l+1)$-th layer, $\forall l=1..L-1$
  • $b^{(l)} \in \mathbb{R}^{n_l}$ is the bias vector for the $l$-th layer
  • $a^{(l)} \in \mathbb{R}^{n_l}$ is the output vector of the $l$-th layer

Forward propagation (optional reading)

Then, for the forward propagation you have that \begin{align} a^{(1)} &= x^{(i)} \in \mathbb{R}^{n_1} \\ a^{(2)} &= \varphi \left( \Theta^{(1)} \cdot a^{(1)} + b^{(2)} \right) = \varphi \left( z^{(2)} \right) \\ \vdots \\ a^{(L)} &= h_\Theta \left( x^{(i)} \right) = \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} = z^{(L)} \end{align}

We can generalize the previous equations as \begin{align} &\begin{cases} a^{(1)} = x^{(i)} \\[1ex] z^{(1)} = \text{undefined} \end{cases}\\[2ex] &\begin{cases} a^{(l)} &= \varphi \left( z^{(l)} \right) \\[1ex] z^{(l)} &= \Theta^{(l-1)} \cdot a^{(l-1)} + b^{(l)} \end{cases}, \quad\forall l=2..L-1 \\[2ex] &\begin{cases} a^{(L)} &= h_\Theta (x) = z^{(L)} \\[1ex] z^{(L)} &= \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} \end{cases} \end{align}

From a dimensional point of view everything works fine; indeed, the dimension of $z^{(l)}$ can be computed as follows: \begin{equation} (n_{l} \times n_{l-1}) \cdot (n_{l-1} \times 1) + (n_l \times 1) = (n_l \times 1) + (n_l \times 1) = (n_l \times 1) \end{equation} hence also $a^{(l)} \in \mathbb{R}^{n_l}$ (as we would expect, since in the $l$-th layer there are exactly $n_l$ neurons and $a^{(l)}$ denotes the output of each neuron in that layer).
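
For concreteness, here is a minimal NumPy sketch of this forward pass; the layer sizes and the tanh activation are arbitrary placeholders, not part of the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = {1: 4, 2: 5, 3: 3}              # made-up layer sizes n_l; here L = 3
L = len(n)
phi = np.tanh                       # placeholder activation

# Theta^{(l)} has shape (n_{l+1}, n_l); b^{(l)} has shape (n_l, 1)
Theta = {l: rng.standard_normal((n[l + 1], n[l])) for l in range(1, L)}
b = {l: rng.standard_normal((n[l], 1)) for l in range(2, L + 1)}

x = rng.standard_normal((n[1], 1))  # one training input x^{(i)}

a, z = {1: x}, {}                   # a^{(1)} = x^{(i)}, z^{(1)} undefined
for l in range(2, L + 1):
    z[l] = Theta[l - 1] @ a[l - 1] + b[l]      # z^{(l)} = Theta^{(l-1)} a^{(l-1)} + b^{(l)}
    a[l] = phi(z[l]) if l < L else z[l]        # linear output layer, as above

for l in range(2, L + 1):
    assert z[l].shape == a[l].shape == (n[l], 1)   # both are n_l x 1, as expected
```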

Gradient descent (optional reading)

The problems arise when I try to derive the backpropagation equations. Recall that gradient descent updates the weights as \begin{equation} \Theta \leftarrow \Theta - \eta \nabla_{\Theta} J \end{equation} where $\eta$ is the learning rate and $J(\Theta)$ denotes the cost function, defined as \begin{equation} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} J^{(i)} \end{equation} with $J^{(i)}$ being the cost function for the $i$-th training sample: \begin{align} J^{(i)} &= \frac{1}{2} \left[ h_\Theta \left(x^{(i)}\right) - y^{(i)} \right]^2 = \frac{1}{2} \left[ a^{(L)} - y^{(i)} \right]^2 \end{align} Note that the gradient $\nabla_\Theta J$ can be written as \begin{equation} \nabla_\Theta J = \begin{bmatrix} \frac{\partial J}{\partial \Theta^{(1)}} \\ \frac{\partial J}{\partial \Theta^{(2)}} \\ \vdots \\ \frac{\partial J}{\partial \Theta^{(L-1)}} \end{bmatrix} \end{equation} Hence we can focus on the $l$-th partial derivative: \begin{equation} \Theta^{(l)} \leftarrow \Theta^{(l)} - \eta \frac{\partial J}{\partial \Theta^{(l)}} \end{equation} Replacing the definition of $J$ in the previous equation gives \begin{equation} \Theta^{(l)} \leftarrow \Theta^{(l)} - \eta \frac{\partial}{\partial \Theta^{(l)}} \left[ \frac{1}{m} \sum_{i=1}^{m} J^{(i)} \right] \end{equation} which, by the linearity of differentiation, becomes \begin{equation} \Theta^{(l)} \leftarrow \Theta^{(l)} - \frac{\eta}{m} \sum_{i=1}^{m} \frac{\partial J^{(i)}}{\partial \Theta^{(l)}} \end{equation}
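
In code, this last update is just an average of per-sample gradients. Below is a minimal sketch assuming a hypothetical helper `grad_J_i` that returns the per-sample gradients $\frac{\partial J^{(i)}}{\partial \Theta^{(l)}}$ for every layer (computing them is exactly what backpropagation, discussed next, is for):

```python
def gradient_descent_step(Theta, samples, grad_J_i, eta=0.1):
    """One batch update: Theta^{(l)} <- Theta^{(l)} - (eta/m) * sum_i dJ^{(i)}/dTheta^{(l)}.

    Theta    : dict {l: weight matrix Theta^{(l)}}
    samples  : list of training pairs (x_i, y_i)
    grad_J_i : hypothetical callable returning {l: dJ^{(i)}/dTheta^{(l)}}
    """
    m = len(samples)
    # Accumulate the per-sample gradients over the whole batch.
    total = {l: 0.0 for l in Theta}
    for x_i, y_i in samples:
        grads = grad_J_i(Theta, x_i, y_i)
        for l in Theta:
            total[l] = total[l] + grads[l]
    # Apply the averaged update (linearity of differentiation, as above).
    return {l: Theta[l] - (eta / m) * total[l] for l in Theta}
```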

Backpropagation

Starting from the last equation, we need to compute the partial derivative of $J^{(i)}$ with respect to all the weight matrices $\Theta^{(l)}$ of the net, $\forall l=1..L-1$. Beginning at the $L$-th (output) layer and working back towards the first (input) layer, we can write

\begin{align} \frac{\partial J^{(i)}}{\partial \Theta^{(L-1)}} &= \frac{\partial}{\partial \Theta^{(L-1)}} \left[ \frac{1}{2} \left( a^{(L)} - y^{(i)} \right)^2 \right] \\[1ex] &= \underbrace{\left( a^{(L)} - y^{(i)} \right)}_{\delta^{(L)}} \odot \frac{\partial a^{(L)}}{\partial \Theta^{(L-1)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial z^{(L)}}{\partial \Theta^{(L-1)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial}{\partial \Theta^{(L-1)}} \left( \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} \right) \\[1ex] &= \delta^{(L)} \odot \left( a^{(L-1)} \right)^\top \end{align} As you may notice, $\delta^{(L)} \in \mathbb{R}^{n_L \times 1}$, while $\left(a^{(L-1)}\right)^\top \in \mathbb{R}^{1 \times n_{L-1}}$, hence the Hadamard product is not defined. Based on the dimensions of these vectors, I would expect that product to be a scalar product $\cdot$ and not a Hadamard product $\odot$.

The problems continue when I keep backpropagating the error: \begin{align} \frac{\partial J^{(i)}}{\partial \Theta^{(L-2)}} &= \frac{\partial}{\partial \Theta^{(L-2)}} \left[ \frac{1}{2} \left( a^{(L)} - y^{(i)} \right)^2 \right] \\[1ex] &= \underbrace{\left( a^{(L)} - y^{(i)} \right)}_{\delta^{(L)}} \odot \frac{\partial a^{(L)}}{\partial \Theta^{(L-2)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial z^{(L)}}{\partial \Theta^{(L-2)}} \\[1ex] &= \delta^{(L)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left( \Theta^{(L-1)} \cdot a^{(L-1)} + b^{(L)} \right) \\[1ex] &= \delta^{(L)} \odot \Theta^{(L-1)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left( a^{(L-1)} + b^{(L)} \right) \\[1ex] &= \delta^{(L)} \odot \Theta^{(L-1)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left[ \varphi \left(z^{(L-1)} \right) + b^{(L)} \right] \\[1ex] &= \underbrace{\delta^{(L)} \odot \Theta^{(L-1)} \odot \varphi' \left(z^{(L-1)} \right)}_{\delta^{(L-1)}} \odot \frac{\partial z^{(L-1)}}{\partial \Theta^{(L-2)}} \\[1ex] &= \delta^{(L-1)} \odot \frac{\partial}{\partial \Theta^{(L-2)}} \left[ \Theta^{(L-2)} \cdot a^{(L-2)} + b^{(L-1)} \right] \\[1ex] &= \delta^{(L-1)} \odot \left( a^{(L-2)} \right)^\top \end{align}

And here we go again. As you may notice, $\delta^{(L)} \in \mathbb{R}^{n_L}$, $\Theta^{(L-1)} \in \mathbb{R}^{n_L \times n_{L-1}}$ and $\varphi' \left( z^{(L-1)} \right) \in \mathbb{R}^{n_{L-1} \times 1}$, hence the dimension of $\delta^{(L-1)}$ is equal to \begin{align} (n_L \times 1) \odot (n_L \times n_{L-1}) \odot (n_{L-1} \times 1) \end{align} which, again, cannot be computed, and these dimensions make me think it should be a chain of scalar products $\cdot$ instead. Where am I going wrong?
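
For concreteness, the dimension bookkeeping in the first step can be checked numerically; in the NumPy sketch below the layer sizes are made up.

```python
import numpy as np

n_Lm1, n_L = 5, 3                      # made-up sizes n_{L-1} and n_L
delta_L = np.ones((n_L, 1))            # delta^{(L)},       shape (n_L, 1)
a_Lm1_T = np.ones((1, n_Lm1))          # (a^{(L-1)})^T,     shape (1, n_{L-1})
Theta_Lm1 = np.ones((n_L, n_Lm1))      # Theta^{(L-1)},     shape (n_L, n_{L-1})

# A Hadamard product requires identical shapes, so it is not defined for
# (n_L, 1) and (1, n_{L-1}).  (NumPy would silently broadcast the two arrays,
# but that is a different operation, not the Hadamard product.)
outer = delta_L @ a_Lm1_T              # matrix (outer) product instead
assert outer.shape == Theta_Lm1.shape  # (n_L, n_{L-1}): same shape as Theta^{(L-1)},
                                       # which is what dJ/dTheta^{(L-1)} must have
```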

frad
    I agree with you: the use of Hadamard products in the backpropagation makes no sense. It should be a standard matrix product, as in greg's answer below. After replacing Hadamard products with matrix products, if you also replace greg's $g_k$ variables with $\delta^{(k)}$, then your gradient formulas are identical to his. – lynn Jan 24 '23 at 03:24
  • But how? Using the scalar product, I can compute $\delta^{(L)}$, resulting in an $n_L \times 1$ matrix, but then $\delta^{(L-1)}$ cannot be computed: $\delta^{(L-1)} = \delta^{(L)} \cdot \Theta^{(L-1)} \cdot \varphi'\left(z^{(L-1)}\right)$. From a dimensional point of view, $(n_L \times 1) \cdot (n_L \times n_{L-1}) \cdot (n_{L-1} \times 1)$: only the product between the second and third terms can be computed, resulting in an $n_L \times 1$ column vector, which corresponds to $tmp = \Theta^{(L-1)} \cdot \varphi'\left(z^{(L-1)}\right)$. Now, the scalar product $\delta^{(L)} \cdot tmp$ cannot be computed. – frad Jan 25 '23 at 10:56
  • The problem you have is that you are doing backpropagation incorrectly. You are trying to compute the whole Jacobian in the chain and then accumulate the factors manually, but backpropagation means you compute Vector-Jacobian Products (VJPs), i.e. you use the trick $v^\top \frac{\partial u}{\partial x} = \frac{\partial (v^\top u)}{\partial x}$. Also, the occurrence of the Hadamard product in the equations is simply wrong. Hadamard products should only appear for the derivatives of the activation functions, as they are applied element-wise. – Hyperplane Mar 29 '23 at 12:51

1 Answer


$ \def\W{\Theta} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\smL{{\small L}} $By using subscripts to denote the layer, superscripts can be reserved to denote standard matrix operations (like transposes!).

The layer-specific variables then become $$\eqalign{ z_k &= b_k + \W_{k-1}a_{k-1} &\qiq dz_k = d\W_{k-1}a_{k-1}+\W_{k-1}da_{k-1} \\ a_k &= \varphi(z_k) &\qiq da_k = \varphi'(z_k)\odot dz_k \\ P_k &= \Diag{\varphi'(z_k)} &\qiq da_k = P_k\:dz_k \\ }$$ The Frobenius product (used below) is extraordinarily useful in Matrix Calculus $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_{\c{F}} \qquad \big({\rm \c{Frobenius}\:norm}\big) \\ }$$ The properties of the underlying trace function allow the terms in such a product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A &= \LR{A^TC}:B \\ }$$ Using a bit of clairvoyance, define the following matrix and vector variables $$\eqalign{ G_\smL &= 0\qquad &g_\smL = z_\smL-y \\ G_k &= g_{k+1}a_k^T\qquad &g_k = P_k^T\W_k^Tg_{k+1}\qquad({\rm for}\:k<L) \\ }$$ Calculate the differential of the cost function (for an arbitrary training sample) starting at the final layer $$\eqalign{ J &= \frac 12\LR{g_\smL:g_\smL} \\ dJ &= g_\smL:{dg_\smL} \\ &= g_\smL:{dz_\smL} \\ &= g_\smL:\LR{d\W_{\smL-1}a_{\smL-1}+\W_{\smL-1}\c{da_{\smL-1}}} \\ &= \LR{g_\smL a_{\smL-1}^T}:d\W_{\smL-1} \;+\; g_\smL:\LR{\W_{\smL-1}\c{P_{\smL-1}dz_{\smL-1}}} \\ &= G_{\smL-1}:d\W_{\smL-1} \;+\; g_{\smL-1}:dz_{\smL-1} \\ }$$ Pause here to note that three gradients can already be identified $$\eqalign{ g_\smL &= \grad{J}{z_\smL} \qquad g_{\smL-1} &= \grad{J}{z_{\smL-1}} \qquad G_{\smL-1} &= \grad{J}{\W_{\smL-1}} \\ }$$ Continue to the next level by expanding $\:dz_{\smL-1}$ $$\eqalign{ dJ &= G_{\smL-1}:d\W_{\smL-1} \;+\; &g_{\smL-1}:\LR{d\W_{\smL-2}a_{\smL-2}+\W_{\smL-2}\c{da_{\smL-2}}} \\ &= G_{\smL-1}:d\W_{\smL-1} \;+\;&\LR{g_{\smL-1}a_{\smL-2}^T}:d\W_{\smL-2} \;+\; g_{\smL-1}:\LR{\W_{\smL-2}\c{P_{\smL-2}dz_{\smL-2}}} \\ &= G_{\smL-1}:d\W_{\smL-1} \;+\; &G_{\smL-2}:d\W_{\smL-2} \;+\; g_{\smL-2}:dz_{\smL-2} \\ &&G_{\smL-2} = \grad{J}{\W_{\smL-2}} \quad g_{\smL-2} = \grad{J}{z_{\smL-2}} \\ }$$ This pattern repeats for all subsequent layers.
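
The rearrangement rules above are easy to verify numerically; here is a quick NumPy sanity check with random matrices (the shapes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
frob = lambda X, Y: np.trace(X.T @ Y)              # X : Y  =  Tr(X^T Y)

m, p, n = 4, 5, 3
A  = rng.standard_normal((m, p))
A2 = rng.standard_normal((m, p))
B  = rng.standard_normal((p, n))
C  = rng.standard_normal((m, n))

assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro') ** 2)   # A:A = ||A||_F^2
assert np.isclose(frob(A, A2), frob(A2, A))                    # A:B = B:A
assert np.isclose(frob(A, A2), frob(A.T, A2.T))                # A:B = A^T:B^T
assert np.isclose(frob(C, A @ B), frob(C @ B.T, A))            # C:(AB) = (CB^T):A
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B))            # C:(AB) = (A^T C):B
```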

The key idea is that replacing the Hadamard product with a Diag() function allows the use of standard matrix notation throughout the derivation.
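
For illustration, here is a minimal NumPy sketch of the resulting recursion for the $g_k$ and $G_k$ variables, verified against a central finite difference; the layer sizes, the tanh activation, and the random data are placeholders, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = {1: 4, 2: 5, 3: 3}                          # made-up layer sizes; L = 3
L = len(n)
phi = np.tanh                                   # placeholder activation
dphi = lambda t: 1.0 - np.tanh(t) ** 2          # its derivative

Theta = {k: rng.standard_normal((n[k + 1], n[k])) for k in range(1, L)}
b = {k: rng.standard_normal((n[k], 1)) for k in range(2, L + 1)}
x = rng.standard_normal((n[1], 1))              # one (made-up) training pair
y = rng.standard_normal((n[L], 1))

def forward(Theta):
    a, z = {1: x}, {}
    for k in range(2, L + 1):
        z[k] = Theta[k - 1] @ a[k - 1] + b[k]
        a[k] = phi(z[k]) if k < L else z[k]     # linear output layer
    return a, z

def cost_and_grads(Theta):
    a, z = forward(Theta)
    J = 0.5 * ((z[L] - y).T @ (z[L] - y)).item()
    g = {L: z[L] - y}                           # g_L = z_L - y
    G = {}
    for k in range(L - 1, 0, -1):
        G[k] = g[k + 1] @ a[k].T                # G_k = g_{k+1} a_k^T = dJ/dTheta_k
        if k > 1:
            g[k] = dphi(z[k]) * (Theta[k].T @ g[k + 1])   # g_k = P_k Theta_k^T g_{k+1}
    return J, G

J, G = cost_and_grads(Theta)

# Central-difference check of one entry of dJ/dTheta_1
eps = 1e-6
Tp = {k: v.copy() for k, v in Theta.items()}
Tm = {k: v.copy() for k, v in Theta.items()}
Tp[1][0, 0] += eps
Tm[1][0, 0] -= eps
fd = (cost_and_grads(Tp)[0] - cost_and_grads(Tm)[0]) / (2 * eps)
assert abs(fd - G[1][0, 0]) < 1e-6
```

Note that the element-wise multiplication by $\varphi'(z_k)$ plays the role of $P_k = \operatorname{Diag}\left(\varphi'(z_k)\right)$: multiplying a vector by a diagonal matrix is the same as taking the Hadamard product with its diagonal, which is the only place such a product belongs.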

greg
  • Thanks for your detailed comment, but this is not the answer to my question. I'd like to avoid further complicated notation; I'd just like to know where I'm going wrong in order to complete my proof. Anyway, using the index $l$ of the layer as a superscript is just the notation used by my teacher. Changing it would only make things uneven, leading me to make mistakes more easily. – frad Jan 22 '23 at 14:55