Imputation of missing values based on target variable

Question

I want to impute missing values in the German Credit Risk dataset.

df['Saving accounts'].value_counts(dropna=False)

output:

little        603
NaN           183
moderate      103
quite rich     63
rich           48

Almost 20% of the values are missing, but this predictor seems to be one of the most powerful for predicting credit risk.

Let's see how 'Risk' depends on 'Saving accounts':

import numpy as np
import pandas as pd

# df is the German Credit Risk DataFrame loaded earlier
field = 'Saving accounts'
unique = [np.nan, 'little', 'moderate', 'quite rich', 'rich']
for acc in unique:
    # NaN never compares equal to itself, so select missing rows with isna()
    if pd.isna(acc):
        slc = df[df[field].isna()]
    else:
        slc = df[df[field] == acc]
    vc = slc['Risk'].value_counts()
    # share of 'bad' credits within this category
    print('{:<10} {}'.format(acc, np.round(vc['bad'] / (vc['good'] + vc['bad']), 4)))

output:

nan        0.1749
little     0.3599
moderate   0.3301
quite rich 0.1746
rich       0.125
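
The same per-category table can be produced without an explicit loop. This is a minimal sketch, assuming a pandas version whose `groupby` accepts `dropna=False` (1.1 or later). The small frame is invented toy data standing in for the real dataset, so its numbers are not the ones shown above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration).
df = pd.DataFrame({
    'Saving accounts': ['little', 'little', np.nan, np.nan, 'rich', 'little', np.nan, 'rich'],
    'Risk':            ['bad',    'good',   'good', 'good', 'good', 'bad',    'bad',  'good'],
})

field = 'Saving accounts'

# Share of 'bad' outcomes per category; dropna=False keeps NaN as its own group.
bad_rate = (
    df['Risk'].eq('bad')
      .groupby(df[field], dropna=False)
      .mean()
      .round(4)
)
print(bad_rate)
```

Grouping a boolean "is bad" series and taking the mean gives the bad rate directly, and it avoids a KeyError when a category happens to contain only 'good' or only 'bad' rows.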

Clearly, the more money in the account, the lower the credit risk.

But how should I deal with the NaN values? I could fill them with the mode, which is 'little'. I could also suppose that a NaN in that field means the absence of an account (and fill the NaNs with 'little' or even 'absence').

But the ratio of failed credits when 'Saving accounts' is NaN is 0.1749, which is quite different from 0.3599 (when 'Saving accounts' is 'little'). According to this ratio, I should fill the NaNs with 'quite rich'.

So the question is: can I fill in missing values based on the target variable?

score 6 · Answer 1 · answered Feb 14 '23 at 16:20

You can absolutely do this. Whether it's optimal depends on the missingness mechanism.

If the missing values in this column are independent of other columns, then this is probably the best possible handling. If not, then it's hard to say: maybe a missing value here doesn't actually contribute new signal about the default rate, because these rows also have other feature values that already account for the difference. I'm not sure there's a great way to tease this out besides cross-validating different imputation strategies.

You do need to be careful when looking at the target in preprocessing, but it's not necessarily leakage. You need to do this on the training set/folds only: if you perform this analysis on the entire dataset (including your test sets/folds), then you have leakage and cannot trust your scores as unbiased. But if you do this analysis only on the training data, then it's fine. Your model perhaps gets access to more information, but the model eventually sees the training targets anyway. You might want to regularize your final model more strongly to reduce overfitting, but the test scores will be unbiased in any case.
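
To make the "training only" point concrete, here is a rough sketch of the idea from the question (filling NaN with the category whose bad rate is closest), where the fill value is learned from the training rows and then applied unchanged to the held-out rows. The toy data, the split, and the helper name `fit_nan_fill` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fit_nan_fill(train, field, target='Risk', positive='bad'):
    """Return the category whose bad rate is closest to that of the NaN rows.

    Uses only the rows passed in, so call it on the training split alone.
    """
    is_bad = train[target].eq(positive)
    nan_rate = is_bad[train[field].isna()].mean()
    rates = is_bad.groupby(train[field]).mean()  # NaN group is dropped by default
    return (rates - nan_rate).abs().idxmin()

# Invented toy data; in practice the split would come from your CV scheme.
train = pd.DataFrame({
    'Saving accounts': ['little', 'little', 'little', 'rich', 'rich', 'rich', np.nan, np.nan],
    'Risk':            ['bad',    'bad',    'good',   'good', 'good', 'good', 'good', 'good'],
})
test = pd.DataFrame({
    'Saving accounts': [np.nan, 'little'],
    'Risk':            ['bad',  'good'],
})

field = 'Saving accounts'
fill_value = fit_nan_fill(train, field)  # learned from the training rows only

# Apply the same fill value to both splits; test targets are never consulted.
train_imputed = train.copy()
train_imputed[field] = train[field].fillna(fill_value)
test_imputed = test.copy()
test_imputed[field] = test[field].fillna(fill_value)
print(fill_value)
```

With cross-validation, the helper would be called once per fold on that fold's training part, so each validation score is computed on rows that played no role in choosing the fill value.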

Notice that there's a reasonably common strategy that reveals even more information directly to the model: "target encoding" replaces the categories with their average target value.
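
For reference, a bare-bones sketch of target encoding with the missing level treated as its own category. The encoding is computed from training rows only. The toy data and the fallback to the global rate for categories unseen in training are assumptions, and practical implementations usually add smoothing or cross-fitting on top of this.

```python
import numpy as np
import pandas as pd

# Invented toy data for illustration.
train = pd.DataFrame({
    'Saving accounts': ['little', 'little', 'rich', 'rich', np.nan, np.nan],
    'Risk':            ['bad',    'good',   'good', 'good', 'good', 'bad'],
})
test = pd.DataFrame({'Saving accounts': ['rich', np.nan, 'moderate']})

field = 'Saving accounts'

# Make missingness an explicit level so it gets its own encoding.
train_cat = train[field].fillna('missing')
test_cat = test[field].fillna('missing')

is_bad = train['Risk'].eq('bad')
global_rate = is_bad.mean()                   # fallback for unseen categories
encoding = is_bad.groupby(train_cat).mean()   # bad rate per training category

train_encoded = train_cat.map(encoding)
test_encoded = test_cat.map(encoding).fillna(global_rate)
print(test_encoded)
```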

score 3 · Answer 2 · answered Mar 27 '24 at 07:11

On the one hand, in your case the meaning of the missing values isn't fully understood, and in particular you can't assume that they're MCAR (missing completely at random), and you even have good reason to suppose otherwise.

On the other hand, it's a categorical variable and it's fair to treat "missing" as its own category.

So, I would treat the missing values as a new category - "missing account".
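
In pandas this is a small change. A minimal sketch on an invented column, using the label suggested above and one-hot encoding as one possible (assumed) downstream encoding:

```python
import numpy as np
import pandas as pd

# Invented toy column for illustration.
s = pd.Series(['little', np.nan, 'rich', np.nan], name='Saving accounts')

# Replace NaN with an explicit category label.
filled = s.fillna('missing account')

# One-hot encode; the missing level gets its own indicator column.
dummies = pd.get_dummies(filled)
print(dummies)
```

The model can then learn a separate effect for the missing level without any reference to the target during preprocessing.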

In general, using the target values in imputation has its pros and cons and they've been discussed in other answers here, but in your case as described in the question, you don't need to do it.

score 0 · Answer 3 · answered Feb 13 '23 at 18:16

could I fill missing values based on target variable?

No, it wouldn't be proper methodology in my opinion.

For starters, performance would be overestimated, since all these cases would be easy for the model to solve due to the artificial correlation between the feature and the target.

More generally, what would be the point of this? Normally imputation can be used in the context of a moderate amount of missing data, assuming the rest of the data in the imputed instances is still useful for the model to predict the target variable. But here the model would probably not even use the other variables, since it has a very good predictor designed to correlate with the target. As far as I can tell these instances would be pointless, so they can simply be discarded. Sometimes removing noisy or bad quality data is the best option.

score 0 · Answer 4 · answered Feb 14 '23 at 13:55 · edited Feb 15 '23 at 12:35 · Brian Spiering

No - missing feature values should not be imputed based on target values.

That would be an example of data leakage. The model is being provided with information during training that will not be available at prediction time. The trained model will fail to perform on new data because the new data will not have target values.
