
With neural networks, back-propagation is an implementation of the chain rule. However, the chain rule only applies to differentiable functions; for non-differentiable functions there is no chain rule that works in general. So it seems that back-propagation is invalid when we use a non-differentiable activation function (e.g. ReLU).

The usual justification offered for this apparent error is that "the chance of hitting a non-differentiable point during learning is practically 0". It's not clear to me, though, that landing on a non-differentiable point during learning is required in order to invalidate the chain rule.

Is there some reason why we should expect back-propagation to yield an estimate of the (sub)gradient? If not, why does training a neural network usually work?
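
To make the setup concrete, here is a minimal sketch (in NumPy, with hypothetical helper names relu and relu_grad rather than the API of any particular framework) of the forward and backward rules that implementations typically use for ReLU. The value used at $x = 0$ is purely a convention, since no derivative exists there; it happens to be one valid choice of subgradient at that point.

```python
import numpy as np

def relu(x):
    """Forward pass: elementwise max(x, 0)."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """The quantity back-propagation typically uses as ReLU's "derivative".

    For x != 0 this is the true derivative (0 or 1); at x == 0, where ReLU is
    not differentiable, it returns 0 by convention, which is one valid
    subgradient at that point.
    """
    return (x > 0.0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```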

asked by NicNic8
  • Just a note, there is an SE site focused on machine learning: stats.stackexchange.com, in case you don't get an answer here. Also somewhat related (on yet another SE site, currently in beta): Differentiable activation function; DuttaA's answer seems especially interesting. Also look at https://www.quora.com/Why-does-ReLU-work-with-backprops-if-its-non-differentiable , which mentions an interesting concept called the subderivative (https://en.m.wikipedia.org/wiki/Subderivative). – Sil Jul 01 '18 at 17:02
  • A NN with a smooth activation function like the logistic function depends continuously on its parameters. Back-propagation arises when minimizing the data-fitting error, which is also a continuous function of those parameters. In short, with a smooth activation function the chain rule is fully applicable, and back-propagation is simply an application of it. – Cesareo Jul 01 '18 at 17:07
  • @Cesareo The point is that non-smooth activation functions are used, like ReLU. What happens then? – rubik Jul 02 '18 at 06:40
  • It's generally not a problem as long as the functions are differentiable almost everywhere, so ReLU is fine, as are the max and min functions. Keep in mind we are taking random steps of finite size in a huge parameter space, on random inputs, so the probability that we hit a non-differentiable point on an input is rather tiny (see the quick check after these comments). Even if we hit or are near such a point, we simply get a biased estimate of the stochastic gradient. Empirically, this seems to generally not be an issue (a biased estimate being better than nothing). – user3658307 Jul 02 '18 at 16:35
  • @user3658307 You state "It's generally not a problem as long as the functions are differentiable almost everywhere." How do you know this? Why do you believe this? (I understand that neural networks generally work. I'm looking for a rigorous mathematical explanation. For example, there are non-differentiable functions that can't be optimized with gradient descent even when the non-differentiable points aren't hit during the optimization. Why isn't this a problem with neural networks?) – NicNic8 Jul 03 '18 at 01:57
  • @NicNic8 (1) It's differentiable almost everywhere (in the rigorous sense) so we are unlikely to hit it, (2) we get to "choose" the derivative at 0, meaning we can choose it to be a sub-derivative, so even there it is "ok" in some sense, (3) the noise in SGD overshadows the bias in the gradient estimate near 0 anyway. Overall, I guess my belief is due to how well it seems to work in practice :3. (Sorry I know that's not rigorous I guess). – user3658307 Jul 03 '18 at 02:50
  • @NicNic8 As for applying gradient descent in strange situations, I'd be interested in some examples of such problems. Derivatives are fundamentally local, so I'm not sure why behaviour far away would be an issue :) Actually I would argue we have gotten good at estimating gradients in very non-differentiable situations: e.g. the REINFORCE rule for gradient estimation for policies, or other methods for even backpropagating through discrete random variables – user3658307 Jul 03 '18 at 02:53
  • @user3658307 Vandenberghe's 236C notes on gradient descent contain an example where gradient descent with exact line search fails to find a global minimizer for a convex but nondifferentiable function, despite the fact that the method never encounters a point where the objective function is nondifferentiable. See slide 1-5 here: http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf – littleO Jul 04 '18 at 09:19
  • "It's not clear to me, though, that landing on a non-differentiable point during learning is required in order to invalidate the chain rule" The chain rule itself works fine as long as we have not landed on a non-differentiable point. If $f = g \circ h$ and $h$ is differentiable at $x$, and $g$ is differentiable at $h(x)$, then $f$ is guaranteed to be differentiable at $x$ and $f'(x) = g'(h(x)) h'(x)$. It seems to me that the real question is: is there any theoretical guarantee that gradient descent performs well provided that we avoid nondifferentiable points? (See previous comment.) – littleO Jul 04 '18 at 09:31

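As a quick check of the "probability zero" argument made in the comments above (a NumPy sketch with arbitrarily chosen layer sizes, not taken from any of the references): with continuously distributed weights and inputs, the pre-activations fed into ReLU essentially never land exactly on the kink at 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random "weights" and "inputs" standing in for one linear layer of a network.
W = rng.normal(size=(512, 128))
X = rng.normal(size=(128, 10000))

# Pre-activations that would be passed through ReLU in a forward pass.
Z = W @ X

# Fraction of pre-activations landing exactly on ReLU's non-differentiable point.
print(np.mean(Z == 0.0))  # 0.0 in this run: the kink is essentially never hit
```

This supports the "practically 0" claim quoted in the question, although, as littleO's comments note, it does not by itself guarantee that gradient descent behaves well on a nonsmooth objective.
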
1 Answer


The answer to this question may be clearer now in light of the following two papers:

  1. Kakade and Lee (2018) https://papers.nips.cc/paper/7943-provably-correct-automatic-sub-differentiation-for-qualified-programs.pdf

  2. Bolte and Pauwels (2019) https://arxiv.org/pdf/1909.10300.pdf

As you say, it is wrong to apply the chain rule formally with ReLU activation functions. Moreover, the argument that "the output is differentiable almost everywhere, hence the classical chain rule of differentiation applies almost everywhere" is false; see Remark 12 in the second reference.
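
A standard toy example of this failure (a minimal sketch assuming PyTorch; the construction is a well-known one in this literature rather than a quote from either paper): $f(x) = \operatorname{relu}(x) - \operatorname{relu}(-x)$ is just the identity, so its true derivative is $1$ everywhere, yet composing the conventional ReLU "derivative" (which returns $0$ at $0$) through the formal chain rule yields $0$ at $x = 0$, which is not even a subgradient of $f$ there.

```python
import torch

# f(x) = relu(x) - relu(-x) equals x for every real x, so f'(x) = 1 everywhere.
x = torch.tensor(0.0, requires_grad=True)
f = torch.relu(x) - torch.relu(-x)
f.backward()

print(float(f))       # 0.0  (the value of the identity at x = 0)
print(float(x.grad))  # 0.0  (the formal chain rule's answer; the true derivative is 1)
```

The two papers above analyze when, and in what sense, the quantity that back-propagation returns can still be trusted despite such counterexamples.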