
Are there publications which mention numerical problems in neural network optimization?

(Blog posts, articles, workshop notes, lecture notes, books - anything?)

Background of the question

I've recently run into a strange phenomenon: when I trained a convolutional network on the GTSRB dataset with a given script on my machine, it got state-of-the-art results (99.9% test accuracy), ten times, with no outliers. When I used the same scripts on another machine, I got much worse results (around 80% test accuracy, again ten times with no outliers). I assumed I had made a mistake on one of the machines (e.g. used differently pre-processed data), but I couldn't find out where the mistake happened, and since the dataset was not important for my publication I simply removed all results for it.

Now a friend wrote to me that he has a network, a training script and a dataset that converge on machine A but do not converge on machine B, with exactly the same setup in both cases (a fully connected network trained as an autoencoder).

I have only one guess as to what might be happening: the machines have different hardware. TensorFlow might use different algorithms for matrix multiplication / gradient calculation on them, and those algorithms might have different numerical properties. Such differences might allow one machine to optimize the network while the other cannot.

Of course, this needs further investigation. But no matter what is happening in these two cases, I think the question is interesting. Intuitively, I would say that numerical issues should not matter much: sharp minima are not desired anyway, and differences in a single multiplication are less important than the update of the next step.

Martin Thoma

1 Answer

Are there publications which mention numerical problems in neural network optimization?

Of course, there has been a lot of research on vanishing gradients, which is an entirely numerical problem. There is also a fair amount of research on training with low-precision operations, with a perhaps surprising result: reduced floating-point precision doesn't seem to hurt neural network training much. This makes precision loss a rather unlikely cause of the phenomenon you describe.

Still, the environment can affect the computation (as suggested in the comments):

  • Most obviously, the random-number generator. Use a fixed seed in your script and first make the result reproducible on a single machine. After that you can record summaries of the activations and gradients (e.g. via tf.summary in TensorFlow) and compare the tensors across the machines. Basic operations such as matrix multiplication or element-wise exponentiation should give very close, if not identical, results no matter what hardware is used, so you should be able to see whether the tensors diverge immediately (which would mean there is another source of randomness) or gradually. See the sketch after this list.

  • Versions of the Python interpreter, CUDA, the cuDNN driver and the key libraries (NumPy, TensorFlow, etc.). You can go as far as matching the Linux kernel and libc versions, but you should expect reproducibility even without that. The cuDNN version is important because convolutions are natively optimized; the TensorFlow version is also very important because Google rewrites the core all the time.

  • Environment variables (e.g. PATH, LD_LIBRARY_PATH) and Linux configuration parameters (e.g. limits.conf, permissions).
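
A minimal sketch of the seed-and-compare idea, assuming TensorFlow 2.x; the model and the input batch below are toy placeholders standing in for your own network and data:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary but fixed

# Fix every source of randomness we control.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Placeholder model and batch; build/load yours identically on both machines.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
x_batch = np.random.rand(8, 32).astype(np.float32)

# Dump the output of every layer so the tensors can be diffed across machines.
extractor = tf.keras.Model(inputs=model.inputs,
                           outputs=[layer.output for layer in model.layers])
activations = extractor(x_batch)
np.savez("activations_machineA.npz",
         **{f"layer_{i}": a.numpy() for i, a in enumerate(activations)})

# On the other machine, save a second file and compare, e.g.:
# a = np.load("activations_machineA.npz")
# b = np.load("activations_machineB.npz")
# for k in a.files:
#     print(k, np.max(np.abs(a[k] - b[k])))
```

For the version check, something as simple as the following can be run on both machines and diffed (the build-info call is only available on recent TensorFlow releases):

```python
import sys
import numpy as np
import tensorflow as tf

print("python    :", sys.version)
print("numpy     :", np.__version__)
print("tensorflow:", tf.__version__)
# CUDA / cuDNN versions TensorFlow was built against (TF >= 2.3):
print("build info:", dict(tf.sysconfig.get_build_info()))
```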

Extra precautions:

  • Explicitly specify the type of each variable; don't rely on defaults (see the sketch after this list).

  • Double check that the training / test data is identical and is read in the same order.

  • Does your computation use any pre-trained models? Any networking involved? Check that as well.
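
A small sketch of the first two precautions, assuming NumPy arrays on disk (the file names below are hypothetical): explicit dtypes everywhere, plus a short fingerprint of the data that can be compared across machines instead of shipping the arrays themselves:

```python
import hashlib
import numpy as np
import tensorflow as tf

# Be explicit about dtypes instead of relying on backend defaults.
tf.keras.backend.set_floatx("float32")            # global Keras float type
x = np.load("train_x.npy").astype(np.float32)     # hypothetical file names
y = np.load("train_y.npy").astype(np.int64)
w = tf.Variable(tf.zeros([32, 10], dtype=tf.float32))  # explicit variable dtype

# Fingerprint contents *and* order: any difference in the data or in the
# reading order changes the hash.
def fingerprint(arr):
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()[:16]

print("x:", fingerprint(x), "y:", fingerprint(y))
```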

I would suspect hardware differences last: it would be an extraordinary case if a high-level computation without explicit concurrency produced different results (beyond floating-point precision differences) depending on the number of cores or the cache size.
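
A quick way to see how large such floating-point differences actually are, independent of any training code: summing the same numbers in a different order changes only the last bits of the result.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1_000_000).astype(np.float32)

s_forward = float(np.sum(x))                     # one summation order
s_reverse = float(np.sum(x[::-1]))               # the reverse order
s_ref     = float(np.sum(x.astype(np.float64)))  # higher-precision reference

print("forward :", s_forward)
print("reverse :", s_reverse)
print("float64 :", s_ref)
print("relative spread:", abs(s_forward - s_reverse) / s_ref)
```

Discrepancies of that size are rounding noise; on their own they cannot turn a 99.9% model into an 80% one, which is why the points above are worth checking first.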

Maxim