Treating missing data in categorical features

Question

I have a dataset in which one categorical column has a considerable number of missing values. The interesting thing about this column is that it has values only for one particular category of another column.

For example:

column1                         column2
========================================
Google                             -
Google                             -
Google                             -
Google                             -
Facebook                        Image
Facebook                        Video
Facebook                        Image

My column of interest has values only for one category (Facebook) of the other column. Therefore, the missing values for Google cannot be imputed with an average or predicted, and those rows cannot be dropped either.

In such a situation, is it wise to consider the missing values '-' as a separate category in one-hot encoding? Or will this affect my machine learning model badly?

score 3 · Answer 1 · answered Aug 23 '20 at 20:39

You could break column2 from your example into a number of indicator columns, one per category: Image, Video, and so on. Rows with a missing value simply get 0 in every indicator column.

So the new features will be like:

Column1  Image  Video  
Google     0      0
Google     0      0
Facebook   1      0
Facebook   0      1
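
A minimal sketch of how that table could be built by hand, assuming the data sits in a pandas DataFrame and the '-' entries are stored as missing values (None/NaN). The DataFrame below is reconstructed from the question's example, and the loop over category names is only an illustration:

```python
import pandas as pd

# Example data reconstructed from the question; '-' is assumed to be stored as None.
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": [None] * 4 + ["Image", "Video", "Image"],
})

# One 0/1 indicator column per known category.
# Comparing a missing value with a string yields False, so Google rows get 0 everywhere.
for category in ["Image", "Video"]:
    df[category] = (df["column2"] == category).astype(int)

# Drop the original column once the indicators exist.
features = df.drop(columns="column2")
print(features)
```

Listing the categories explicitly keeps the set of output columns fixed even if some category happens to be absent from a given batch of data.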

Shiv

score 2 · Answer 2 · answered Aug 24 '20 at 18:35

You can try this:

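import pandas as pd
# Assumes df holds the question's data with the missing column2 entries stored as NaN;
# get_dummies skips NaN by default. dtype=int gives 0/1 rather than booleans on newer pandas.
df_new = pd.get_dummies(df, columns=['column2'], dtype=int)
print(df_new)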

Output:

    column1  column2_Image  column2_Video
0    Google              0              0
1    Google              0              0
2    Google              0              0
3    Google              0              0
4  Facebook              1              0
5  Facebook              0              1
6  Facebook              1              0
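
The question also asks about keeping "missing" as its own category. One way to try that is the `dummy_na` option of `pd.get_dummies`, which adds an extra indicator column for NaN. This is a sketch under the assumption that the '-' entries are stored as None/NaN; the DataFrame is reconstructed from the question's example:

```python
import pandas as pd

# Example data reconstructed from the question; '-' is assumed to be stored as None.
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": [None] * 4 + ["Image", "Video", "Image"],
})

# Default behaviour: missing values produce all-zero indicator rows.
implicit = pd.get_dummies(df, columns=["column2"], dtype=int)

# dummy_na=True appends one more indicator column that flags the missing values.
explicit = pd.get_dummies(df, columns=["column2"], dummy_na=True, dtype=int)

print(implicit.columns.tolist())
print(explicit)
```

In this particular example the extra indicator is 1 exactly on the Google rows, so it carries the same information as column1. If column1 is one-hot encoded as well, the two columns will be duplicates, which can matter for linear models and is usually harmless for tree-based ones.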
