
Cross-post: I asked this question on Cross Validated, the neural networks StackExchange, but got no response, so I'm hoping I'll have better luck here.

I'm trying to find a rigorous derivation of the backpropagation algorithm, and I've gotten myself somewhat confused. The confusion is about when and why people transpose the weight matrices, and how we know when to use the Hadamard product and when to use the dot product. When these things are worked through element by element, as is the case in a wonderful answer here, the arguments seem to make sense. That said, there's always something a little artificial about the derivations, and people often write "we do this to make the dimensions agree", which is of course not at all rigorous, and not really actual maths.

If I were approaching the problem without having seen the solution, I would come up with the solution below. Although I know this solution is definitely incorrect, I can't work out why.

Beginning with $$ \begin{align} a^l &= \sigma(z^l)\\ z^l &= w^l\cdot a^{l-1}+b^l \end{align} $$ in which $a$, $z$ and $b$ are vectors, $w$ is a matrix, and the index $l$ indicates a layer number. $\sigma$ is an activation function, and $C$ is the cost function.

We want to find $$\frac{\partial C}{\partial z^l}.$$ Let's assume we have $$\delta^{l+1}=\frac{\partial C}{\partial z^{l+1}}.$$

Now, via the chain rule, I would find that $$ \begin{align} \frac{\partial C}{\partial z^l}&=\frac{\partial C}{\partial a^l}\frac{\partial a^l}{\partial z^l}\\ &=\underbrace{\frac{\partial C}{\partial z^{l+1}}}_A\underbrace{\frac{\partial z^{l+1}}{\partial a^l}}_B\underbrace{\frac{\partial a^l}{\partial z^l}}_C \end{align} $$

Now each of these factors is simple. We have that $$\begin{align} A&=\delta^{l+1}\\ B&=\frac{\partial}{\partial a^l} \left(w^{l+1}a^l+b^{l+1}\right)\\ &=w^{l+1}\\ C&=\frac{\partial}{\partial z^l} \sigma(z^l)\\ &=\sigma'(z^l) \end{align}$$

So, putting these back in, I ought to get $$ \frac{\partial C}{\partial z^l} = \delta^{l+1}\cdot w^{l+1}\cdot\sigma'(z^l) $$ which is of course completely wrong, the correct answer being $$\frac{\partial C}{\partial z^l}=((w^{l+1})^T\cdot\delta^{l+1})\odot\sigma'(z^l).$$

I can see that my answer couldn't be right anyway, since it would end up with the product of two vectors. But what I can't see is where I've actually gone wrong, or done something mathematically incorrect.
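For what it's worth, here is a minimal NumPy sketch that checks the two expressions numerically. It is only a sanity check under assumptions of my own that are not part of the derivation: a sigmoid activation, a quadratic cost on the activations of layer $l+1$, made-up layer widths, and random values for everything. It compares the known-correct formula with central finite differences of $C$ with respect to $z^l$, and also evaluates my naive product to show that it collapses to a single number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer widths and random parameters (assumptions for illustration).
n_l, n_next = 4, 3
w_next = rng.normal(size=(n_next, n_l))  # w^{l+1}
b_next = rng.normal(size=n_next)         # b^{l+1}
z_l = rng.normal(size=n_l)               # z^l
target = rng.normal(size=n_next)         # hypothetical target for the cost


def sigma(z):
    """Sigmoid activation (assumed; any differentiable activation would do)."""
    return 1.0 / (1.0 + np.exp(-z))


def sigma_prime(z):
    """Derivative of the sigmoid."""
    s = sigma(z)
    return s * (1.0 - s)


def cost_from_z(z):
    """Assumed quadratic cost C, viewed as a function of z^l only."""
    z_next = w_next @ sigma(z) + b_next
    return 0.5 * np.sum((sigma(z_next) - target) ** 2)


# delta^{l+1} = dC/dz^{l+1} for the assumed quadratic cost.
z_next = w_next @ sigma(z_l) + b_next
delta_next = (sigma(z_next) - target) * sigma_prime(z_next)

# The known-correct formula: ((w^{l+1})^T delta^{l+1}) Hadamard sigma'(z^l).
delta_l = (w_next.T @ delta_next) * sigma_prime(z_l)

# Central finite differences of C with respect to each component of z^l.
eps = 1e-6
numeric = np.array([
    (cost_from_z(z_l + eps * e) - cost_from_z(z_l - eps * e)) / (2 * eps)
    for e in np.eye(n_l)
])

# My naive product delta^{l+1} . w^{l+1} . sigma'(z^l), read as plain dot products.
naive = delta_next @ w_next @ sigma_prime(z_l)

print("formula matches finite differences:", np.allclose(delta_l, numeric, atol=1e-6))
print("shape of correct gradient:", delta_l.shape)
print("shape of naive product:", np.shape(naive))
```

Note that NumPy does not distinguish row vectors from column vectors for 1-D arrays, so the naive product runs without error and returns a scalar, whereas the gradient must have one entry per component of $z^l$.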

Any help much appreciated!
