
When dealing with missing data in categorical variables, common approaches include imputation by mode or predictive models. However, in some cases, certain categories have extremely low frequency or exhibit characteristics that could be considered outliers within the dataset. I wonder if these outliers could distort the distribution of the imputed variable and introduce bias into the final model. How should this situation be handled to ensure accurate imputation without compromising the quality of the dataset?
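To make the concern concrete, here is a minimal sketch (with made-up data) of how mode imputation can distort a categorical distribution: every missing value is filled with the majority category, so the majority is inflated and any rare, outlying category is diluted:

```python
import pandas as pd

# Hypothetical categorical column with a rare ("outlying") category
# and some missing values.
s = pd.Series(["a"] * 90 + ["b"] * 8 + ["rare"] * 2 + [None] * 20)

# Mode imputation: every missing value becomes the majority category.
imputed = s.fillna(s.mode()[0])

# The relative frequency of the rare category shrinks after imputation,
# which is exactly the distributional distortion asked about.
print(s.value_counts(normalize=True))        # proportions among observed values
print(imputed.value_counts(normalize=True))  # "a" inflated, "rare" diluted
```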

1 Answer

The question suggests that there is an inherent problem with imputation when there are outlying points. However, there are a few key elements one must consider:

  1. Is the outlier a consequence of distributional aspects? For example, outliers under an inverse Gaussian distribution are often just a consequence of its high skew, which is a natural part of the data-generating process.

  2. Is the outlier a consequence of poor fitting? Looking at this from a model perspective (which should be the default), outliers can suggest many things, but sometimes they are simply a symptom of a bad fit (e.g. a quadratic-like association fit with no polynomial terms, which induces outliers that are purely an artifact of the regression fit).

  3. Outliers are important: we should keep them if they are not erroneous, and try to understand why they exist. Indeed, robust methods were developed precisely so that one can average over their effects while still incorporating realistic points that fall outside the norm.
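The second point can be sketched numerically (with simulated data): fitting a straight line to a genuinely quadratic relationship manufactures "outliers" at the extremes, and those same points stop looking outlying once a quadratic term is included:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.3, size=x.size)  # truly quadratic association

# Degree-1 fit: the unmodeled curvature shows up as large residuals at the
# extremes -- apparent outliers that are purely an artifact of the bad fit.
lin_resid = y - np.polyval(np.polyfit(x, y, 1), x)

# Degree-2 fit: with the correct functional form, the residual spread
# collapses to the noise level and the "outliers" disappear.
quad_resid = y - np.polyval(np.polyfit(x, y, 2), x)

print(lin_resid.std(), quad_resid.std())
```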

Given that imputation is, like many things in statistics, a regression-based method, we need only recognize that outliers inform the predictions that produce the imputations, and are therefore desirable to retain.
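A minimal sketch of that last point, using invented data and a deliberately simple nearest-centroid rule as a stand-in for a proper classifier (e.g. multinomial logistic regression): because the outlying-but-valid row is kept in the training data, its rare category remains available to the imputation model.

```python
import pandas as pd

# Toy data: 'color' is categorical with missing values; 'x' is a numeric
# predictor. The "rare" row is a legitimate outlier we want to keep.
df = pd.DataFrame({
    "x":     [1.0, 1.2, 0.9, 5.0, 5.2, 9.8, 1.1, 5.1, 10.0],
    "color": ["a", "a", "a", "b", "b", "rare", None, None, None],
})

# "Fit" on complete rows: class centroids of x (a minimal stand-in for a
# fitted classifier).
centroids = df.dropna().groupby("color")["x"].mean()

def predict(x):
    # Assign the class whose centroid is nearest. The rare class can still
    # be imputed only because its outlying training row was retained.
    return (centroids - x).abs().idxmin()

mask = df["color"].isna()
df.loc[mask, "color"] = df.loc[mask, "x"].map(predict)
print(df)
```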