Suppose I have the following dataset:

| | code | category | energy | sugars | proteins |
|---|---|---|---|---|---|
| 0 | 01 | B | 936 | NaN | 7.8 |
| 1 | 02 | NaN | NaN | 15.0 | NaN |
| 2 | 03 | A | 1569.0 | 23 | 4.1 |
| 3 | 04 | NaN | 826 | NaN | 3 |
| 4 | 05 | B | 1345 | 22 | 5.1 |
| 5 | 06 | A | NaN | 17 | NaN |
| 6 | 10 | C | 826 | NaN | 3 |
| 7 | 11 | C | 1345 | 26 | 5.1 |
| 8 | 101 | B | NaN | 18 | 6.1 |
| 9 | 102 | B | 636 | NaN | 7.8 |
| 10 | 103 | NaN | NaN | 15.0 | NaN |
| 11 | 104 | A | 1569.0 | 23 | 4.1 |
| 12 | 105 | C | 813 | NaN | 3.5 |

I would like to impute the missing values with SimpleImputer, taking the category column into account: each NaN should be replaced by the mean of that column over the products in the same category.
If a product has no category, I would like to use the mean over the products without a category.

For example, to fill in sugars for code 01, I would only consider the sugars values of the products in category B:

| | code | category | energy | sugars | proteins |
|---|---|---|---|---|---|
| 0 | 01 | B | 936 | NaN | 7.8 |
| 4 | 05 | B | 1345 | 22 | 5.1 |
| 8 | 101 | B | NaN | 18 | 6.1 |
| 9 | 102 | B | 636 | NaN | 7.8 |
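
As a quick check of the result I expect for this subset, here is a minimal sketch. The small DataFrame `b` is just the four category B rows retyped by hand, and the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

# The four category B rows from the table above (only the columns needed here).
b = pd.DataFrame({
    "code": ["01", "05", "101", "102"],
    "category": ["B", "B", "B", "B"],
    "sugars": [np.nan, 22.0, 18.0, np.nan],
})

# Series.mean() skips NaN by default, so this is (22 + 18) / 2.
b_mean = b["sugars"].mean()

# Codes 01 and 102 should both receive the category B mean.
filled = b["sugars"].fillna(b_mean)
print(b_mean)
print(filled.tolist())
```

So the sugars value for code 01 should come from the two known category B values only, not from the whole column.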

I did something similar with pandas, shown below, but I need to do it with SimpleImputer.
Note that in the code below, NaN values in rows without a category are filled with the mean of the whole column.

    for col in df.columns:
        if df[col].dtype == "float64":
            # Rows with a category: fill NaN with the mean of that category
            df.loc[df[col].isna() & df["category"].notna(), col] = df["category"].map(df.groupby("category")[col].mean())
            # Remaining NaN (rows without a category): fill with the overall column mean
            df[col] = df[col].fillna(df[col].mean())
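
For reference, the behaviour I am after could be sketched roughly as follows, fitting one SimpleImputer per category. This is only a sketch under assumptions: the `"__missing__"` placeholder label, the `num_cols` list, and the `out` variable are names I made up for illustration, and it assumes every group has at least one non-missing value in each numeric column (by default SimpleImputer drops columns that are entirely NaN during fit, which would break the assignment).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "code": ["01", "02", "03", "04", "05", "06", "10",
             "11", "101", "102", "103", "104", "105"],
    "category": ["B", np.nan, "A", np.nan, "B", "A", "C",
                 "C", "B", "B", np.nan, "A", "C"],
    "energy": [936, np.nan, 1569, 826, 1345, np.nan, 826,
               1345, np.nan, 636, np.nan, 1569, 813],
    "sugars": [np.nan, 15, 23, np.nan, 22, 17, np.nan,
               26, 18, np.nan, 15, 23, np.nan],
    "proteins": [7.8, np.nan, 4.1, 3, 5.1, np.nan, 3,
                 5.1, 6.1, 7.8, np.nan, 4.1, 3.5],
})

num_cols = ["energy", "sugars", "proteins"]

# Give rows without a category their own group label (hypothetical placeholder),
# so they are imputed from each other and not from the whole column.
groups = df["category"].fillna("__missing__")

out = df.copy()
for _, idx in df.groupby(groups).groups.items():
    # A fresh imputer per group, so each mean is computed within the group only.
    imputer = SimpleImputer(strategy="mean")
    out.loc[idx, num_cols] = imputer.fit_transform(df.loc[idx, num_cols])

print(out)
```

With this, sugars for code 01 is filled from category B only, and the uncategorized rows (codes 02, 04, 103) are filled from each other.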