2

I'm writing my own CNN code from scratch. Although I get fast, converged, and satisfactory results (the cost/loss function drops rapidly and appears to converge), the trained weights change very little in value. My initial weights: convolution kernels as non-zero unit matrices; fully connected layer weights as zeros. The activation function is sigmoid. The data are scaled from 0 to 1. Why do the weights change so little?

feynman
  • 237
  • 1
  • 8

3 Answers

2

In machine learning, the vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.

Source: https://en.wikipedia.org/wiki/Vanishing_gradient_problem

Thus, the gradient keeps shrinking as it propagates back toward the initial layers of the network, which in turn causes very little change in those layers' weights.
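You can see the effect numerically: the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) is at most 0.25, so each layer the gradient passes through shrinks it by at least a factor of 4. A minimal sketch (the 8-layer depth is just an illustrative assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# sigma'(x) = sigma(x) * (1 - sigma(x)) peaks at x = 0,
# where it equals 0.25 -- the best case for a sigmoid unit.
x = 0.0
d = sigmoid(x) * (1 - sigmoid(x))
print(d)  # 0.25

# Backprop multiplies one such derivative factor per layer,
# so even in the best case the gradient shrinks 4x per layer.
grad = 1.0
for _ in range(8):  # hypothetical 8-layer network
    grad *= d
print(grad)  # 0.25**8, about 1.5e-05: effectively vanished
```

With saturated units the per-layer factor is far below 0.25, so real networks vanish even faster than this best-case estimate.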

Preet
  • 638
  • 3
  • 5
1

I would venture that the problem you are having is at least partly due to bad initialisation, and possibly also a bad learning rate or choice of activation function.

You mentioned that you are initializing conv kernels to "non zero unit matrices" and "fully connected layer weights as 0's". Firstly, I do not know what you mean by "non zero unit matrices", but you should definitely not be initializing fully connected layers to 0. If all weights in a layer start at the same value, the neurons compute the same output and receive the same gradient, so they behave identically (or nearly so, depending on the network). This symmetry produces redundant features and prevents the layer from learning anything useful.

I recommend using a random initialisation for both the Conv kernels and Dense kernels, and zeros for any biases.
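Here is a quick sketch of both points: why zero weights never differentiate, and what a random initialisation might look like (the Glorot/Xavier-style scaling below is my assumption, not something from the question's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero init: every neuron in the layer produces the same output
# and will receive the same gradient, so they never differentiate.
W_zero = np.zeros((4, 3))
x = rng.random(4)
print(x @ W_zero)  # [0. 0. 0.] -- all three neurons identical

# Glorot/Xavier-style uniform init breaks the symmetry while keeping
# pre-activations in a reasonable range for sigmoid units.
fan_in, fan_out = 4, 3
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_rand = rng.uniform(-limit, limit, size=(fan_in, fan_out))
b = np.zeros(fan_out)  # zeros are fine for biases
print(x @ W_rand + b)  # three distinct pre-activations
```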

Secondly, your choice of activation function can compound the problem. With a sigmoid activation, for instance, a high learning rate can push pre-activations into the flat tails of the sigmoid, where the gradient is nearly zero and the weight updates become vanishingly small.
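A minimal illustration of that saturation effect: the sigmoid gradient near zero versus deep in its tail differs by several orders of magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Near zero the sigmoid is at its steepest; far from zero it is flat,
# so a saturated unit contributes almost nothing to the weight update.
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ~4.5e-05: saturated, near-zero gradient
```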

Also, what do you consider [weights changing] "so little"? Off the top of my head, for data scaled to [-1, 1] I would expect update magnitudes around 1e-3 for the first layers and 1e-4 for the last ones.

Gouda
  • 171
  • 4
0

@feynman Because Batch Normalization does not let the activations vanish or explode: it normalizes the batch between every layer. Since each activation function then receives inputs close to zero, the effect of growing or shrinking weights in the earlier layers is suppressed, avoiding the snowball effect.
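A minimal sketch of what that normalization step does to a batch of pre-activations (simplified: no learnable gamma/beta parameters and no running statistics, which a full batch-norm layer would also have):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch dimension to zero mean
    # and unit variance, so the next sigmoid sees inputs near its
    # steep region regardless of how earlier weights scaled them.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Pre-activations blown up by large upstream weights...
x = np.array([[100.0, -50.0],
              [102.0, -48.0],
              [ 98.0, -52.0]])
y = batch_norm(x)
print(y.mean(axis=0))  # ~[0, 0]
print(y.std(axis=0))   # ~[1, 1]
```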

Ugur MULUK
  • 500
  • 3
  • 8