3

I have a dataset in which one categorical column has a considerable number of missing values. The interesting thing about this column is that it has values only for one particular category of another column.

For example:

column 1                        column2
========================================
Google                             -
Google                             -
Google                             -
Google                             -
Facebook                        Image
Facebook                        Video
Facebook                        Image

My column of interest has values only for one category (Facebook) of the other column. Therefore, the missing values for Google cannot be imputed with an average or predicted, and those rows cannot be dropped either.

In such a situation, is it wise to treat the missing value '-' as a separate category in one-hot encoding? Or will this hurt my machine learning model?

Stephen Rauch
Bharathi

2 Answers

3

You could break column2 from your example into a number of indicator columns: Image, Video, ....

So the new features will be like:

Column1  Image  Video  
Google     0      0
Google     0      0
Facebook   1      0
Facebook   0      1
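
As a rough sketch of the split described above, the indicator columns can be built by hand with a comparison per category. The DataFrame below is a hypothetical reconstruction of the question's example, and the list of labels is an assumption based on the two values shown there.

```python
import pandas as pd

# Hypothetical reconstruction of the example data from the question
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": ["-", "-", "-", "-", "Image", "Video", "Image"],
})

# One 0/1 indicator column per real category; '-' rows match none of them
for label in ["Image", "Video"]:
    df[label] = (df["column2"] == label).astype(int)

# Drop the original categorical column once the indicators exist
features = df.drop(columns="column2")
print(features)
```

With this layout the Google rows are simply all zeros in the new columns, so no extra category is needed for '-'.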
Shiv
2

You can try this:

import pandas as pd

# Assumes df holds the example data, with the missing column2 values stored as NaN
df_new = pd.get_dummies(df, columns=['column2'])
print(df_new)

Output:

    column1  column2_Image  column2_Video
0    Google              0              0
1    Google              0              0
2    Google              0              0
3    Google              0              0
4  Facebook              1              0
5  Facebook              0              1
6  Facebook              1              0
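
For comparison, here is a self-contained sketch of both ways to handle the '-' placeholder. The DataFrame is a hypothetical reconstruction of the question's example, and the variable names are illustrative. `dtype=int` is passed so the indicators print as 0/1, since recent pandas versions default to booleans.

```python
import pandas as pd

# Hypothetical reconstruction of the example data from the question
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": ["-", "-", "-", "-", "Image", "Video", "Image"],
})

# Option A: turn '-' into NaN first; get_dummies skips NaN by default,
# so the Google rows end up with all-zero indicators
cleaned = df.assign(column2=df["column2"].where(df["column2"] != "-"))
as_missing = pd.get_dummies(cleaned, columns=["column2"], dtype=int)

# Option B: leave '-' in place so it becomes its own 'column2_-' indicator
as_category = pd.get_dummies(df, columns=["column2"], dtype=int)

print(as_missing)
print(as_category)
```

In this particular example the `column2_-` indicator from Option B is 1 exactly when column1 is Google, so it repeats information the model already gets from column1.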
Soumendra Mishra