
I am unable to understand when to use ReLU, Leaky ReLU, and ELU. How do they compare to other activation functions (like sigmoid and tanh), and what are their pros and cons?

Ayazzia01

2 Answers


Look at this ML glossary:

ELU

ELU is very similar to ReLU except for negative inputs. Both are the identity function for non-negative inputs; for negative inputs, ELU smoothly saturates towards $-\alpha$, whereas ReLU has a sharp corner at zero.
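For reference, the standard ELU definition, with hyperparameter $\alpha$ (typically $\alpha = 1$), is:

$$\text{ELU}(x) = \begin{cases} x & x \ge 0 \\ \alpha\,(e^{x} - 1) & x < 0 \end{cases}$$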

Pros

  • ELU smoothly saturates towards $-\alpha$ for negative inputs, whereas ReLU has a sharp corner at zero.
  • ELU is a strong alternative to ReLU.
  • Unlike ReLU, ELU can produce negative outputs.

Cons

  • For $x > 0$, the output range is $[0, \infty)$, so it can blow up the activation.

ReLU

Pros

  • It avoids the vanishing gradient problem.
  • ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations (a max with zero rather than an exponential).

Cons

  • One of its limitations is that it should only be used within hidden layers of a neural network model.
  • Some gradients can be fragile during training and can die: a weight update can leave a unit in a state where it never activates on any data point again. In other words, ReLU can result in dead neurons.
  • More precisely, for activations in the region $x<0$ the gradient is 0, so the corresponding weights are not adjusted during gradient descent. Neurons that go into that state stop responding to variations in the error or the input (the gradient is 0, so nothing changes). This is called the dying ReLU problem; see the short sketch after this list.
  • The range of ReLU is $[0,\infty)$, so it can blow up the activation.
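A minimal sketch of the dying-ReLU effect, assuming NumPy; the pre-activation values are hypothetical and purely for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for z > 0, 0 for z <= 0 (0 chosen at z = 0)
    return (z > 0).astype(float)

z = np.array([-2.5, -0.1, 0.0, 0.3, 4.0])  # hypothetical pre-activations
print(relu(z))       # [0.  0.  0.  0.3 4. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.] -> no gradient flows where z <= 0,
                     # so the weights feeding those units stop updating
```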

Leaky ReLU

Leaky ReLU is a variant of ReLU. Instead of being 0 when $z<0$, a leaky ReLU allows a small, non-zero, constant gradient $\alpha$ (normally $\alpha=0.01$). However, the consistency of the benefit across tasks is presently unclear. [1]
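Concretely, the leaky ReLU is usually defined as:

$$\text{LeakyReLU}(z) = \begin{cases} z & z \ge 0 \\ \alpha z & z < 0 \end{cases} \qquad \text{(typically } \alpha = 0.01\text{)}$$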

Pros

  • Leaky ReLUs are one attempt to fix the “dying ReLU” problem by having a small negative slope (of 0.01, or so).

Cons

  • As it possesses linearity, it can't be used for complex classification tasks; it lags behind sigmoid and tanh for some use cases.

Further reading

  • [1] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He et al. (2015)
OmG

<ReLU>

Pros:

  • It mitigates the Vanishing Gradient Problem.

Cons:

  • It can cause the Dying ReLU Problem.
  • It's non-differentiable at $x=0$.

<Leaky ReLU>

Pros:

  • It mitigates the Vanishing Gradient Problem.
  • It mitigates the Dying ReLU Problem. *0 is still produced for the input value 0, so the Dying ReLU Problem is not completely avoided.

Cons:

  • It's non-differentiable at $x=0$.

<ELU>

Pros:

  • It smoothly saturates negative input values (pushing mean activations closer to zero), so convergence with negative input values is stable.
  • It mitigates the Vanishing Gradient Problem.
  • It mitigates the Dying ReLU Problem. *0 is still produced for the input value 0, so the Dying ReLU Problem is not completely avoided.

Cons:

  • It's computationally more expensive than ReLU because of the exponential operation.
  • It's non-differentiable at $x = 0$ if $\alpha$ is not 1 (see the note below).
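For the last point, using the standard ELU definition ($x$ for $x \ge 0$, $\alpha(e^{x}-1)$ for $x < 0$), the one-sided derivatives at $x = 0$ show why $\alpha = 1$ is special:

$$\frac{d}{dx}\,\text{ELU}(x) = \begin{cases} 1 & x > 0 \\ \alpha\, e^{x} & x < 0 \end{cases}, \qquad \lim_{x \to 0^{-}} \alpha\, e^{x} = \alpha, \qquad \lim_{x \to 0^{+}} 1 = 1,$$

so the left and right derivatives agree only when $\alpha = 1$.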