How to deal with missing data for only some categories

Question

Put another way: data for category A is irrelevant to category B, so it is simply not present. How can imputing this missing data distort or otherwise affect learning models in general? I can't find any guidance on how to deal with this kind of category-dependent data, so I am sorry that I can't show any effort.

In the following example, geographical zone is only present for Gaz entries.

Data sample:

Mnng · Answer 1 · 2018-09-20T21:30:43.067

There are three types of missing data: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR).

Your case is the second, which according to Wikipedia:

occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information

This means that whether an entry has a value in zone can be derived from the column Produit.
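A quick way to check this on your own data is to look at the share of missing zone values per product. This is only a minimal sketch with pandas: the column names `Produit` and `zone` and the value `Gaz` come from the question, while the rows, the `Electricite` product and the zone labels are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names and the "Gaz" value come from the question.
df = pd.DataFrame({
    "Produit": ["Gaz", "Gaz", "Electricite", "Electricite", "Gaz"],
    "zone": ["Z1", "Z2", None, None, "Z1"],
})

# Fraction of missing zone values within each product.
missing_share = df["zone"].isna().groupby(df["Produit"]).mean()
print(missing_share)
```

If every product shows a share of exactly 0 or 1, the missingness is fully explained by Produit.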

Because the values aren't missing completely at random, standard imputation techniques (e.g. filling with the most common value) shouldn't be applied. Instead, I'd recommend treating the missing values as their own category: create a category (say, "not available") and fill the missing entries with it. This makes more sense from a data science point of view, because zone simply doesn't apply to the other products, and an imputed zone would be a made-up value for them.
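As a rough sketch of that idea with pandas (again, the rows, the `Electricite` product and the zone labels are hypothetical; only `Produit`, `zone` and `Gaz` come from the question):

```python
import pandas as pd

# Hypothetical rows; zone is only filled in for "Gaz" entries.
df = pd.DataFrame({
    "Produit": ["Gaz", "Electricite", "Gaz", "Electricite"],
    "zone": ["Z1", None, "Z2", None],
})

# Replace missing zones with an explicit label so "no zone" becomes its own level.
df["zone"] = df["zone"].fillna("not available").astype("category")

# One-hot encode zone; the placeholder level gets its own indicator column.
encoded = pd.get_dummies(df, columns=["zone"])
print(encoded.columns.tolist())
```

The model then sees "not available" as just another zone level, rather than a guessed zone for products that have none.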
