I want to impute missing values in German Credit Risk dataset.
df['Saving accounts'].value_counts(dropna=False)
output:
little 603
NaN 183
moderate 103
quite rich 63
rich 48
There is almost 20% of data missing, but this predictor seems to be one of the most powerful to predict the credit risk.
Let's see, how 'Risk' depends on 'Saving accounts'
field = 'Saving accounts'
unique = [np.NaN, 'little', 'moderate', 'quite rich', 'rich']
for acc in unique:
if pd.isna(acc):
slc = df[df[field].isna()]
else:
slc = df[df[field] == acc]field = 'Saving accounts'
unique = [np.NaN, 'little', 'moderate', 'quite rich', 'rich']
for acc in unique:
if pd.isna(acc):
slc = df[df[field].isna()]
else:
slc = df[df[field] == acc]
vc = slc['Risk'].value_counts()
print('{:<10} {}'.format(acc, np.round(vc['bad'] / (vc['good'] + vc['bad']), 4)))
output:
nan 0.1749
little 0.3599
moderate 0.3301
quite rich 0.1746
rich 0.125
Clearly, more money on account, less credit risk.
But how do I have to deal with NaN values? I can try to fill them by mode, and mode is 'little'. Also, I could suppose, that NaN value in that field means the absence of an account (and I can fill NaNs by 'little' or even 'absence')
But the ratio of failed credits if 'Saving accounts' is NaN which is 0.1749, and that is different from 0.3599 (if 'Saving accounts' is 'little') According to this ratio, I have to fill NaNs as 'quite rich'.
So the question is - could I fill in missing values based on the target variable?