6

I am trying to learn data analysis and machine learning by trying out some problems.

I found a competition called "House Prices", which is a playground competition. Since I am very new to this field, I got confused after exploring the data. The data has 81 columns, one of which is the target column, the house value. It contains multiple columns where the majority of values are "NaN". When I ran:

nulls = data.isnull().sum()
nulls[nulls > 0]

This shows the columns with missing values:

LotFrontage     259 
Alley           1369
MasVnrType      8   
MasVnrArea      8   
BsmtQual        37  
BsmtCond        37  
BsmtExposure    38  
BsmtFinType1    37  
BsmtFinType2    38  
Electrical      1   
FireplaceQu     690 
GarageType      81  
GarageYrBlt     81  
GarageFinish    81  
GarageQual      81  
GarageCond      81  
PoolQC          1453
Fence           1179
MiscFeature     1406

At this point I am totally lost and I don't know how to get rid of these "NaN" values.
Any help would be appreciated.

Ola Ström
Ahmed Dhanani

3 Answers

7

You can use the DataFrame.fillna function to fill the NaN values in your data. For example, assuming your data is in a DataFrame called df,

df.fillna(0, inplace=True)

will replace the missing values with the constant value 0. You can also do more clever things, such as replacing the missing values with the mean of that column:

df.fillna(df.mean(numeric_only=True), inplace=True)

or take the last value seen for a column:

df.ffill(inplace=True)

Filling the NaN values is called imputation. Try a range of different imputation methods and see which ones work best for your data.
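
As a minimal sketch of trying these options side by side, assuming a small made-up DataFrame (the values below are illustrative and not taken from the competition file; `numeric_only=True` is added so that the mean is only computed for numeric columns):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the actual House Prices file.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, 60.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 350.0, 0.0],
})

# Three imputation strategies, each returning a new DataFrame.
filled_zero = df.fillna(0)                            # constant value
filled_mean = df.fillna(df.mean(numeric_only=True))   # per-column mean
filled_ffill = df.ffill()                             # last value seen

# Check that no missing values remain after each strategy.
for name, filled in [("zero", filled_zero),
                     ("mean", filled_mean),
                     ("ffill", filled_ffill)]:
    print(name, int(filled.isnull().sum().sum()))
```

Note that the mean only exists for numeric columns. Text columns such as Alley would need a constant or the most frequent value instead.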

timleathart
1
  # Taking care of missing data
  import numpy as np
  from sklearn.impute import SimpleImputer
  # Imputer was removed from scikit-learn; SimpleImputer replaces it
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
  X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

Suppose the array is named $X$ and I want to handle missing data in the columns indexed $1$ and $2$ by replacing each missing value with its column mean. SimpleImputer, from the scikit-learn library (it replaced the older Imputer class), does this.

Siong Thye Goh
smit patel
0

While timleathart has already provided the answer, I would like to add that in some cases, rather than using df.mean() to fill your NA values, it is better to use df.median(), which computes the median of each column.

The mean is sensitive to outliers: a few extreme values can shift it considerably, whereas the median is largely unaffected.

Since you are a beginner, you might want to try both.
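
A minimal sketch of the difference, assuming a made-up column with one extreme value (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one outlier (1000.0).
s = pd.Series([60.0, 65.0, 70.0, np.nan, 75.0, 1000.0])

mean_value = s.mean()      # pulled upwards by the outlier
median_value = s.median()  # middle of the observed values

# Fill the missing entry using each statistic.
filled_with_mean = s.fillna(mean_value)
filled_with_median = s.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
```

Here the mean is far above every typical value in the column, so filling with it would insert an unrepresentative number, while the median stays close to the bulk of the data.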

Ethan