
I currently have very accurate data, and I would like to train my classifiers on this clean data set to learn the important markers for distinguishing between classes. In the future, however, my trained classifiers will not be making decisions on clean data; their inputs will likely contain much more noise, following some unknown distribution(s). So I am wondering: is it 'better' to train on noisy data, since I will likely see noisy data in the future, or to train on clean data, since the noisy data should (ideally) correspond to the clean data once the noise is removed?

Intuitively, if my goal is simply to classify, then training on noisy data seems the 'better' approach, since it is more representative of my expected future inputs. But if my goal is to learn about the data and what drives a particular decision, then training on clean data appears to be the 'better' approach.

But am I overlooking anything? Would training on the clean and/or noisy data be preferable for different reasons (e.g. generalization, overfitting, reducing dimensionality)?

Mathews24

1 Answer


The answer depends heavily on what you mean by 'noisy' data. Are the labels noisy, i.e. wrong? Are the features noisy? Or both? If only the features are noisy, definitely use the noisy data, and probably the clean data as well. If only the labels are noisy, definitely do not use the noisy data. If both are noisy, is it possible to correct the labels? At the very least, can you get a reliable test set (representative of future inputs, with correct labels)? You could then train on the noisy data, the clean data, and both, and see which gives the best performance on that test set. Important methods to consider here are regularization and early stopping, since both limit how closely a model can fit the noise.
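
As a rough illustration of the "train on each and compare" suggestion for feature noise, here is a minimal NumPy sketch. Everything in it is an assumption made for the example, not something from the question: the synthetic two-feature data, the choice to corrupt only feature 0 with Gaussian noise at deployment time, and the small hand-written L2-regularized logistic regression (`fit_logreg`). All three models are scored on the same noisy test set with correct labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_clean(n):
    """Synthetic clean data: feature 0 is very precise, feature 1 is moderately informative."""
    y = rng.integers(0, 2, size=n)
    sign = 2 * y - 1  # -1 for class 0, +1 for class 1
    X = np.column_stack([
        sign + rng.normal(0.0, 0.1, size=n),
        sign + rng.normal(0.0, 1.0, size=n),
    ])
    return X, y

def add_feature_noise(X):
    """Assumed deployment noise: heavy Gaussian noise on feature 0 only."""
    X_noisy = X.copy()
    X_noisy[:, 0] += rng.normal(0.0, 3.0, size=len(X))
    return X_noisy

def fit_logreg(X, y, lr=0.1, n_iter=1000, l2=1e-3):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30, 30)  # clip to keep exp() well-behaved
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(model, X, y):
    w, b = model
    return float(np.mean((X @ w + b > 0) == (y == 1)))

X_train, y_train = make_clean(2000)
X_test, y_test = make_clean(2000)
X_train_noisy = add_feature_noise(X_train)
X_test_noisy = add_feature_noise(X_test)  # reliable test set: noisy features, correct labels

models = {
    "clean": fit_logreg(X_train, y_train),
    "noisy": fit_logreg(X_train_noisy, y_train),
    "clean + noisy": fit_logreg(
        np.vstack([X_train, X_train_noisy]),
        np.concatenate([y_train, y_train]),
    ),
}

# Score every model on the data it will actually face: the noisy test set.
results = {name: accuracy(m, X_test_noisy, y_test) for name, m in models.items()}
for name, acc in results.items():
    print(f"trained on {name:13s} -> accuracy on noisy test set: {acc:.3f}")
```

In this constructed setup the model trained only on clean data leans on the feature that later gets corrupted, which is the failure mode the comparison is meant to expose. With real data the noise distribution is unknown, so the test set has to come from actual future-like inputs rather than simulated noise.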

It also depends on which algorithm you are using. A low-capacity model such as linear regression is unlikely to over-fit, while neural networks can be extremely sensitive to noise, and there is a range of approaches between those two extremes.
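
To make the dependence on model capacity concrete, here is a small NumPy sketch under label noise. It is only an illustration with assumed ingredients: synthetic Gaussian classes, 30% of training labels flipped at random, a nearest-centroid classifier standing in for a low-capacity model, and 1-nearest-neighbour standing in for a high-capacity model that memorizes its training set. Neither is the linear regression or neural network mentioned above; they are simply easy to write without extra libraries.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Two Gaussian classes in 2-D, centred at (0, 0) and (2, 2)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0.0, 1.0, size=(n, 2)) + 2.0 * y[:, None]
    return X, y

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

X_train, y_train = make_data(500)
X_test, y_test = make_data(1000)  # test labels are kept correct

# Assumed label noise: flip 30% of the training labels at random.
flip = rng.random(len(y_train)) < 0.3
y_train_noisy = np.where(flip, 1 - y_train, y_train)

# Low-capacity stand-in: nearest centroid (averages over many points per class).
centroids = np.stack([X_train[y_train_noisy == k].mean(axis=0) for k in (0, 1)])
pred_centroid = sq_dists(X_test, centroids).argmin(axis=1)

# High-capacity stand-in: 1-nearest-neighbour (copies a single training label).
nearest = sq_dists(X_test, X_train).argmin(axis=1)
pred_1nn = y_train_noisy[nearest]

acc_centroid = float(np.mean(pred_centroid == y_test))
acc_1nn = float(np.mean(pred_1nn == y_test))
print(f"nearest centroid accuracy: {acc_centroid:.3f}")
print(f"1-nearest-neighbour accuracy: {acc_1nn:.3f}")
```

The averaging model is affected mainly through shifted class means, while the memorizing model reproduces individual wrong labels, which is why regularization and early stopping matter most for flexible models.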

Prachi