
Let's say I have a deep neural network with 50 hidden layers, and the ReLU activation function is used at every neuron of the hidden layers. My question is:

  • Is it possible for the vanishing gradient problem to occur during backpropagation for the weight updates, even though ReLU is used?
  • Or can we say that the vanishing gradient problem will never occur when all the activation functions are ReLU?
Bits

3 Answers


It can always happen. If the weights are really tiny numbers close to zero, the gradients follow suit: when the dot product at a neuron is positive, the gradient passed back through it is just the weights of that layer, which can be small; when it is negative, the gradient is exactly zero. So the answer to your question is yes, I think. The chances are of course much better than with something like a sigmoid, but saying that it will never happen is, I think, totally wrong.
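
To make this concrete, here is a minimal NumPy sketch (my own illustration, not part of the original answer, with an arbitrary width and weight scale): a deep ReLU stack initialized with deliberately tiny weights. The gradient that reaches the input is a product of those small weight matrices and the ReLU masks, and it effectively vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

# Initialize every layer with deliberately tiny weights (close to zero).
weights = [0.01 * rng.standard_normal((width, width)) for _ in range(depth)]

# Forward pass with ReLU, keeping the activation masks for backprop.
x = rng.standard_normal(width)
masks = []
for W in weights:
    z = W @ x
    masks.append(z > 0)          # ReLU derivative: 1 where z > 0, else 0
    x = np.maximum(z, 0.0)

# Backward pass: gradient of an arbitrary scalar loss w.r.t. the input.
grad = np.ones(width)
for W, mask in zip(reversed(weights), reversed(masks)):
    grad = W.T @ (grad * mask)   # zeroed where ReLU was inactive, scaled by tiny W elsewhere

print(np.linalg.norm(grad))      # effectively zero after 50 layers -> vanished gradient
```

With a more standard initialization (e.g. He initialization) the same loop keeps the gradient norm at a reasonable scale, which is the answer's point: it depends on the weights, not on ReLU alone.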

Lucid

The ReLU activation function doesn't easily cause the vanishing gradient problem, but it sometimes causes the dying ReLU problem.

  • The dying ReLU problem occurs during backpropagation: once nodes (neurons) with the ReLU activation function receive only negative input values, they output zero and their gradient is zero, so they are never recovered to produce positive values again, and the model cannot be trained effectively.

So instead of the ReLU activation function, you can use the Leaky ReLU, PReLU or ELU activation function, which don't easily cause either the vanishing gradient problem or the dying ReLU problem, as sketched below.
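
For illustration, a small sketch (with assumed slope and alpha values) comparing the local gradients of ReLU, Leaky ReLU and ELU on negative pre-activations, which is exactly where dying ReLU comes from:

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)                     # exactly 0 for all negative inputs

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)               # small but nonzero gradient for z < 0

def elu_grad(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(z))   # smooth, nonzero gradient for z < 0

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(z))        # [0.   0.   1.   1. ]   -> the neuron gets no signal on the negative side
print(leaky_relu_grad(z))  # [0.01 0.01 1.   1. ]
print(elu_grad(z))         # [~0.05 ~0.61 1.  1. ]
```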


Are you talking about LeakyReLU by chance and not ReLU? Because ReLU is known for vanishing gradients, since any values less than zero are mapped to zero. This is true regardless of the number of layers. LeakyReLU, on the other hand, multiplies values less than zero by a small positive slope, so a small nonzero gradient still flows through them. This prevents the vanishing gradient from occurring.

EDIT: LeakyReLU prevents dying ReLU from occurring, not vanishing gradients. PReLU prevents vanishing gradients from occurring.
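
As a quick illustration of the distinction (my own sketch, using PyTorch's built-in activations and an arbitrary negative input), this prints the gradient each activation lets through:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0], requires_grad=True)

for act in [nn.ReLU(), nn.LeakyReLU(negative_slope=0.01), nn.PReLU(init=0.25)]:
    x.grad = None
    act(x).sum().backward()
    print(type(act).__name__, x.grad.item())

# ReLU       0.0   -> gradient blocked entirely for negative inputs
# LeakyReLU  ~0.01 -> small fixed slope lets some gradient through
# PReLU      0.25  -> the slope is itself a learnable parameter
```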

EDIT 2: To answer the comments. VGG was proposed in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" and was one of the top-performing models for the ImageNet challenge at its time. Architecture-wise, VGG wasn't completely different from what was done in the past. However, it was much deeper. This is in part where the vanishing gradient becomes a problem. It does not really have to do with ReLU alone, but with the combination of every single layer.

Enter ResNet, which uses skip connections. These effectively make parts of the network shallow and make it easier for the network to learn both easy and difficult tasks (i.e. low and high frequencies in images). More difficult tasks require more learnable parameters, while easier tasks require fewer.
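
For reference, a minimal residual block sketch (hypothetical layer sizes, not the exact ResNet architecture): the skip connection adds the input straight to the block's output, so gradients have a direct path back even when the transformed branch contributes little.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # y = F(x) + x: the "+ x" is the skip connection.
        return self.act(self.body(x) + x)

block = ResidualBlock()
out = block(torch.randn(8, 64))
print(out.shape)  # torch.Size([8, 64])
```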

I believe PReLU, being a learnable activation function, can help deal with this.

J Houseman