
A basic assumption in machine learning is that training and test data are drawn from the same population, and thus follow the same distribution. In practice, however, this assumption often does not hold. Covariate shift addresses this issue. Can someone clarify the following doubts about it?

How does one check whether two distributions are statistically different? Can a kernel density estimate (KDE) be used to estimate each probability distribution and then tell the difference? Say I have 100 images of a specific category: 50 are test images, and I vary the number of training images from 5 to 50 in steps of 5. After estimating both distributions by KDE, can I say they are different when using, for example, 5 training images and 50 test images? A sketch of the kind of check I have in mind is below.
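
For concreteness, here is a rough sketch of what I am trying (assuming each image is first reduced to a single scalar feature, e.g. mean pixel intensity, which is my own simplification):

```python
# Rough sketch: fit a KDE to a small training sample and a larger test
# sample. Placeholder Gaussian features stand in for my real images.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

train_feats = rng.normal(loc=0.5, scale=0.1, size=5)   # 5 training images
test_feats = rng.normal(loc=0.5, scale=0.1, size=50)   # 50 test images

kde_train = gaussian_kde(train_feats)
kde_test = gaussian_kde(test_feats)

# Evaluate both estimated densities on a common grid to compare them.
grid = np.linspace(0.0, 1.0, 200)
p_train = kde_train(grid)
p_test = kde_test(grid)
```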

Kyle.
Daniel Wonglee

2 Answers


A good way to measure the difference between two probability distributions is the Kullback–Leibler (KL) divergence. You have to take into account that each distribution must integrate to one. Also note that it is not a distance, because it is not symmetric: KL(A‖B) ≠ KL(B‖A).
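
A minimal sketch of the computation, assuming both distributions are 1-D KDEs evaluated on a common grid (synthetic data here for illustration):

```python
# Sketch: numerical KL divergence between two KDE-estimated densities.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
kde_a = gaussian_kde(rng.normal(0.0, 1.0, size=50))
kde_b = gaussian_kde(rng.normal(0.5, 1.2, size=50))

grid = np.linspace(-5.0, 6.0, 1000)
dx = grid[1] - grid[0]
p = kde_a(grid)
q = kde_b(grid)

# Renormalize on the grid so each density integrates to one, as noted above.
p /= p.sum() * dx
q /= q.sum() * dx

eps = 1e-12  # guard against log(0)
kl_ab = np.sum(p * np.log((p + eps) / (q + eps))) * dx
kl_ba = np.sum(q * np.log((q + eps) / (p + eps))) * dx
print(kl_ab, kl_ba)  # generally different: KL is not symmetric
```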

hoaphumanoid

If you are working with a large dataset, the training and test set distributions should not be too different: in theory, the law of large numbers ensures that both empirical distributions converge to the same population distribution. For smaller datasets, this is a good point to take care of. As hoaphumanoid said, the Kullback–Leibler divergence can be used to measure the difference between the distributions of the two sets, as the sketch below illustrates.
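
To illustrate the sample-size point, a rough sketch (with synthetic 1-D features standing in for the image data in the question) of how the KDE-based KL estimate typically shrinks as the training sample grows from 5 to 50:

```python
# Sketch: as the training sample grows, its KDE usually gets closer to
# the test KDE, so the estimated KL divergence tends to shrink.
import numpy as np
from scipy.stats import gaussian_kde

def kl_on_grid(p_sample, q_sample, grid):
    """Numerical KL(P||Q) between KDEs of two 1-D samples on a grid."""
    p = gaussian_kde(p_sample)(grid)
    q = gaussian_kde(q_sample)(grid)
    dx = grid[1] - grid[0]
    p /= p.sum() * dx  # renormalize on the grid
    q /= q.sum() * dx
    eps = 1e-12
    return float(np.sum(p * np.log((p + eps) / (q + eps))) * dx)

rng = np.random.default_rng(1)
test = rng.normal(0.0, 1.0, size=50)
grid = np.linspace(-5.0, 5.0, 500)

for n in range(5, 55, 5):
    train = rng.normal(0.0, 1.0, size=n)
    print(n, round(kl_on_grid(train, test, grid), 4))
```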

Pranav Waila