9

Continuous feature discretization usually loses information because of the binning process. However, most of the top solutions for Kaggle Titanic are based on discretization (age, fare).

When should continuous features be discretized? Are there criteria for deciding, and what are the pros and cons in terms of accuracy?

timleathart
  • 3,960
  • 22
  • 35
drichlet
  • 91
  • 1
  • 4

2 Answers

7

One reason to discretize continuous features is to improve the signal-to-noise ratio. Fitting a model on binned features reduces the impact that small fluctuations in the data have on the model; often those small fluctuations are just noise. Each bin "smooths out" the fluctuations within its section of the data.
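A minimal sketch of this smoothing effect, using synthetic data (the signal `y = 2x` plus Gaussian noise is an assumption for illustration): binning `x` and replacing `y` with each bin's mean averages out the within-bin noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear signal plus noise.
x = np.linspace(0, 10, 200)
y = 2 * x + rng.normal(scale=1.0, size=x.size)

# Discretize x into 10 equal-width bins and replace y with the
# per-bin mean: small fluctuations within a bin average out.
edges = np.linspace(x.min(), x.max(), 11)
bins = np.clip(np.digitize(x, edges) - 1, 0, 9)
y_smoothed = np.array([y[bins == b].mean() for b in range(10)])[bins]

resid_raw = np.std(y - 2 * x)       # deviation of raw values from the signal
resid_binned = np.std(y_smoothed - 2 * x)
print(resid_raw, resid_binned)      # the binned version deviates less
```

The trade-off is visible in the same sketch: the bin means sit closer to the underlying signal, but all within-bin variation (information as well as noise) is discarded.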

Brian Spiering
  • 23,131
  • 2
  • 29
  • 113
3

I can think of three reasons why discretization might help in some problems.

It makes sense for your problem

Continuous variables such as age are often better understood when discretized into meaningful groups: infants, youngsters, young adults, adults, seniors, and so on. This is common in the field of marketing, because a difference of a few years does not really make much difference in a person's interests.

To give another example, when working with a dataset of GPS locations, it might be more useful to discretize them into country or state locations.
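The age example above can be sketched with `pandas.cut`; the particular bin edges and labels here are one hypothetical grouping, not a standard:

```python
import pandas as pd

# Hypothetical ages and one possible marketing-style grouping.
ages = pd.Series([2, 9, 17, 24, 35, 51, 70])
groups = pd.cut(
    ages,
    bins=[0, 3, 12, 18, 30, 60, 120],   # right-inclusive intervals, e.g. (18, 30]
    labels=["infant", "child", "teen", "young adult", "adult", "senior"],
)
print(groups.tolist())
# ['infant', 'child', 'teen', 'young adult', 'adult', 'adult', 'senior']
```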

Interpretability

A continuous feature might not correlate linearly with your target, but instead have a more complex non-linear relationship. In that case, obtaining an interpretable explanation for that feature is not easy. However, if you discretize it into a set of groups or levels, you might find that some of them correlate (or anti-correlate) with your target, giving you some interpretability.
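As an illustration (the U-shaped relationship below is a made-up example): the linear correlation of the raw feature with the target is close to zero, yet the per-group means after binning make the pattern visible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical U-shaped relationship: linear correlation is ~0,
# so the raw feature looks uninformative to a linear measure.
x = rng.uniform(-3, 3, 5000)
y = x**2 + rng.normal(scale=0.5, size=x.size)
print(round(np.corrcoef(x, y)[0, 1], 2))  # near 0

# After binning, the per-group target means expose the pattern:
# "low" and "high" groups have clearly larger means than "middle".
edges = np.array([-3.0, -1.0, 1.0, 3.0])
bins = np.digitize(x, edges) - 1          # 0: low, 1: middle, 2: high
for b, name in enumerate(["low", "middle", "high"]):
    print(name, round(y[bins == b].mean(), 1))
```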

Model limitations

Some machine learning models and feature selection methods cannot handle continuous features, such as entropy-based methods or some variants of decision trees or neural networks. You must either discretize your features or forgo such models.

albarji
  • 241
  • 2
  • 3