
ReLU is an activation function defined as $h = \max(0, a)$ where $a = Wx + b$.

Normally, we train neural networks with first-order methods such as SGD, Adam, RMSprop, Adadelta, or Adagrad. Backpropagation in first-order methods requires only the first derivative: the positive part $x$ of ReLU differentiates to $1$.

But if we use second-order methods, would ReLU's second derivative be $0$? Because $x$ differentiates to $1$, which differentiates again to $0$. Would that be an error? For example, with Newton's method, you would be dividing by $0$. (I don't really understand Hessian-free optimization yet; IIRC, it's a matter of using an approximate Hessian instead of the real one.)

What is the effect of this $h''=0$? Can we still train a neural network with ReLU using second-order methods, or would it be non-trainable/erroneous (NaN/infinity)?


For clarity, this is ReLU as $f(x)$:

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$$

$$f'(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}$$

$f''(x) = 0$
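
A quick numerical sanity check of these derivatives (a minimal sketch using NumPy and central finite differences; the `second_derivative` helper is purely for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def second_derivative(f, x, eps=1e-4):
    # central finite-difference estimate of f''(x)
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2

for x in [-2.0, 0.5, 3.0]:
    # prints 0.0 for every point away from the kink at x = 0
    print(x, second_derivative(relu, x))
```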

Rizky Luthfianto

1 Answer


Yes, the ReLU second-order derivative is $0$. Technically, neither $\frac{dy}{dx}$ nor $\frac{d^2y}{dx^2}$ is defined at $x=0$, but we ignore that - in practice an exact $x=0$ is rare and not especially meaningful, so this is not a problem. Newton's method does not work on the ReLU transfer function itself because it has no stationary points. It also doesn't work meaningfully on most other common transfer functions, though - they cannot be minimised or maximised for finite inputs.
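
Concretely (a one-line illustration, not part of the original answer): a Newton step for optimising $f$ itself would be $x \leftarrow x - f'(x)/f''(x)$, which divides by $0$ everywhere for ReLU, and there is no finite minimiser to converge to anyway.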

When you combine multiple ReLU functions with layers of matrix multiplications in a structure such as a neural network, and wish to minimise an objective function, the picture is more complicated. This combination does have stationary points. Even a single ReLU neuron with a mean square error objective behaves differently enough that the second-order derivative with respect to its single weight varies and is not guaranteed to be $0$.
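
For instance (a small worked example under assumed notation, not from the original answer): take a single neuron with scalar weight $w$, fixed input $x$, target $t$, and squared error $E = \tfrac{1}{2}(\max(0, wx) - t)^2$. In the region where $wx > 0$,

$$\frac{\partial E}{\partial w} = (wx - t)\,x, \qquad \frac{\partial^2 E}{\partial w^2} = x^2,$$

which is generally non-zero even though the transfer function's own second derivative is $0$.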

The nonlinearity that arises when multiple layers combine is what creates a more interesting optimisation surface. It also means that it is harder to calculate useful second-order partial derivatives (the Hessian matrix); it is not just a matter of taking second-order derivatives of the transfer functions.

The fact that $\frac{d^2y}{dx^2} = 0$ for the transfer function will make some terms in the matrix zero (those for the second-order effect of the same neuron's activation), but the majority of terms in the Hessian are of the form $\frac{\partial^2 E}{\partial x_i\partial x_j}$, where $E$ is the objective and $x_i$, $x_j$ are different parameters of the neural network. A fully-realised Hessian matrix has $N^2$ terms, where $N$ is the number of parameters; with large neural networks having upwards of 1 million parameters, even with a simple calculation process and many terms being $0$ (e.g. w.r.t. two weights in the same layer) this may not be feasible to compute.
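
To make these cross-parameter terms concrete, here is a minimal sketch (a toy two-parameter network and a finite-difference Hessian of my own construction, purely for illustration) showing that the off-diagonal entries are non-zero even though $f'' = 0$ for the transfer function:

```python
import numpy as np

# toy two-parameter "network": E(w1, w2) = 0.5 * (w2 * max(0, w1 * x) - t)^2
x, t = 1.5, 2.0

def loss(w):
    h = max(0.0, w[0] * x)            # ReLU hidden activation
    return 0.5 * (w[1] * h - t) ** 2  # squared error on a single example

def hessian(f, w, eps=1e-4):
    # central finite-difference estimate of every second-order partial derivative
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                v = np.array(w, dtype=float)
                v[i] += si * eps
                v[j] += sj * eps
                return f(v)
            H[i, j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * eps**2)
    return H

# off-diagonal (cross-parameter) entries are non-zero
print(hessian(loss, [0.8, 1.2]))
```

Here the Hessian has only $2^2 = 4$ entries, but the same construction scales as $N^2$, which is why forming it explicitly is infeasible for large networks.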

There are techniques to estimate effects of second-order derivatives used in some neural network optimisers. RMSProp can be viewed as roughly estimating second-order effects, for example. The "Hessian-free" optimisers more explicitly calculate the impact of this matrix.
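
As a rough intuition for the "Hessian-free" idea (again only a sketch in the same toy setup, not the actual algorithm used in practice): the full matrix is never formed, because the optimiser only needs Hessian-vector products, and those can be estimated from two gradient evaluations:

```python
import numpy as np

x, t = 1.5, 2.0

def loss(w):
    # the same toy two-parameter ReLU network as above
    return 0.5 * (w[1] * max(0.0, w[0] * x) - t) ** 2

def grad(f, w, eps=1e-5):
    # central finite-difference gradient (a stand-in for backpropagation)
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (f(w + d) - f(w - d)) / (2.0 * eps)
    return g

def hessian_vector_product(f, w, v, eps=1e-4):
    # H v ~= (grad(w + eps v) - grad(w - eps v)) / (2 eps), without ever forming H
    return (grad(f, w + eps * v) - grad(f, w - eps * v)) / (2.0 * eps)

w = np.array([0.8, 1.2])
print(hessian_vector_product(loss, w, np.array([1.0, 0.0])))  # ~ first column of the Hessian
```

Real Hessian-free optimisers obtain these products more accurately (e.g. via an additional differentiation pass) and feed them into conjugate gradient, but the point is the same: second-order information is used without ever computing the $N^2$-entry matrix.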

Neil Slater