
PLEASE NOTE: I am not trying to improve on the following example. I know you can get over 99% accuracy; the whole code is in the question. When I tried this simple code I get around 95% accuracy, but if I simply change the activation function from sigmoid to ReLU, it drops to less than 50%. Is there a theoretical reason why this happens?

I have found the following example online:

from keras.datasets import mnist
from keras.models import Sequential 
from keras.layers.core import Dense, Activation
from keras.utils import np_utils

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

classes = 10
Y_train = np_utils.to_categorical(Y_train, classes)
Y_test = np_utils.to_categorical(Y_test, classes)

batch_size = 100      
epochs = 15

model = Sequential()     
model.add(Dense(100, input_dim=784)) 
model.add(Activation('sigmoid'))     
model.add(Dense(10)) 
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='sgd')

model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1)

score = model.evaluate(X_test, Y_test, verbose=1)
print('Test accuracy:', score[1])

This gives about 95% accuracy, but if I replace the sigmoid with ReLU, I get less than 50% accuracy. Why is that?


3 Answers

6

I took your exact code, replaced

model.add(Activation('sigmoid'))

by

model.add(Activation('relu'))

and indeed I experienced the same problem as you: only 55% accuracy, which is bad...

Solution: I rescaled the input image values from [0, 255] to [0, 1] and it worked: 93% accuracy with ReLU! (inspired by here):

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.utils import np_utils

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

# rescale pixel values from [0, 255] to [0, 1]
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255

Y_train = np_utils.to_categorical(Y_train, 10)
Y_test = np_utils.to_categorical(Y_test, 10)

batch_size = 100
epochs = 15

model = Sequential()
model.add(Dense(100, input_dim=784))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='sgd')

model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs, verbose=1)

score = model.evaluate(X_test, Y_test, verbose=1)
print('Test accuracy:', score[1])

Output:

Test accuracy: 0.934


Potential explanation: when the input is in [0, 255], the weighted sum for layer $L$, $z = a^{(L-1)} w^{(L)} + b^{(L)}$, will often be big too. If $z$ is often big (or even just often $> 0$), say around 100, then $ReLU(z) = z$ and we completely lose the "non-linear" aspect of this activation function. Put another way: if the input is in [0, 255], then $z$ is often far from 0, and we never visit the region where the "interesting non-linear things" happen (around 0 the ReLU function is non-linear and looks like __/).

Now when the input is in [0, 1], the weighted sum $z$ can often be close to 0: it sometimes goes below 0 (since the weights are randomly initialized with small values around zero, this is possible), sometimes above 0, and so on. Then much more neuron activation/deactivation happens. This could be a potential explanation of why it works better with inputs in [0, 1].
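To see the scale effect concretely, here is a minimal NumPy sketch (mine, not part of the original answer) that compares the hidden-layer pre-activations $z$ for the same random weights when the inputs are left in [0, 255] versus rescaled to [0, 1]; the synthetic "images" and the weight scale of 0.05 are illustrative assumptions, roughly what a default initializer produces:

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.integers(0, 256, size=(1000, 784)).astype('float32')  # synthetic "images" in [0, 255]
X_scaled = X_raw / 255.0                                          # the same data rescaled to [0, 1]

# small zero-mean random weights (the 0.05 scale is an illustrative assumption)
W = rng.normal(0.0, 0.05, size=(784, 100)).astype('float32')
b = np.zeros(100, dtype='float32')

for name, X in [('raw [0, 255]', X_raw), ('scaled [0, 1]', X_scaled)]:
    z = X @ W + b  # pre-activations of the 100-unit hidden layer
    print(name, '-> mean |z| =', round(float(np.abs(z).mean()), 3))

Because the rescaling is just a division by 255 and the bias starts at zero, the raw inputs yield pre-activations exactly 255 times larger; the first-layer weight gradients are proportional to the input values as well, so they grow by a similar factor, which is another way to see why training with unscaled pixels goes wrong.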

2

I got around 98% accuracy using the ReLU activation function. I used the following architecture:

  1. fully connected layer with 300 hidden units
  2. ReLU activation
  3. fully connected layer with 10 output units
  4. Softmax layer
  5. Output clipping to [1e-10, 0.999999] to avoid log(0) and values greater than 1
  6. Cross-entropy loss

I think you should add output clipping and then train it; hopefully that will work fine.
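For reference, here is a rough sketch of that architecture in Keras (my reconstruction, not the answerer's exact code); it assumes the rescaled MNIST inputs and one-hot labels prepared as in the answer above, and uses a Lambda layer to apply the clipping in step 5:

from keras.models import Sequential
from keras.layers import Dense, Activation, Lambda
from keras import backend as K

model = Sequential()
model.add(Dense(300, input_dim=784))   # 1. fully connected layer with 300 hidden units
model.add(Activation('relu'))          # 2. ReLU activation
model.add(Dense(10))                   # 3. fully connected layer with 10 output units
model.add(Activation('softmax'))       # 4. softmax layer
# 5. clip the predicted probabilities to [1e-10, 0.999999] so the loss never sees log(0)
model.add(Lambda(lambda p: K.clip(p, 1e-10, 0.999999)))

# 6. cross-entropy loss
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='sgd')
model.fit(X_train, Y_train, batch_size=100, epochs=15, verbose=1)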

-1

Because with MNIST, you are trying to predict based on probabilities.

The sigmoid function squishes the $x$ value between $0$ and $1$. This helps to pick the most probable digit that matches the label.

The ReLU function doesn't squish anything. If the $x$ value is less than $0$, then the output is $0$. If it's more than $0$, the answer is the $x$ value itself. No probabilities are being created.

Honestly, I'm surprised you got anything more than 10% when you plugged it in.
