SimpleImputer with groupby

Question

Suppose I have the following dataset:

	code	category	energy	sugars	proteins
0	01	B	936	NaN	7.8
1	02	NaN	NaN	15.0	NaN
2	03	A	1569.0	23	4.1
3	04	NaN	826	NaN	3
4	05	B	1345	22	5.1
5	06	A	NaN	17	NaN
6	10	C	826	NaN	3
7	11	C	1345	26	5.1
8	101	B	NaN	18	6.1
9	102	B	636	NaN	7.8
10	103	NaN	NaN	15.0	NaN
11	104	A	1569.0	23	4.1
12	105	C	813	NaN	3.5

I would like to impute the missing values with SimpleImputer, taking the category column into account.

Namely, I would like to fill each missing value with the mean for the product's category.
If a product doesn't have a category, I would like to use the mean of the products without a category.

So, to fill in sugars for code 01, I would only consider the sugars of products in category B:

	code	category	energy	sugars	proteins
0	01	B	936	NaN	7.8
4	05	B	1345	22	5.1
8	101	B	NaN	18	6.1
9	102	B	636	NaN	7.8

I did something similar, as shown below, but I need to do it with SimpleImputer.
To clarify, in the code below I filled the NaN values of products without a category with the mean of the whole column.

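for col in df.columns:
    if df[col].dtypes == "float64":
        # fill NaN with the mean of the row's category, where the category is known
        df.loc[df[col].isna() & df["category"].notnull(), col] = df["category"].map(df.groupby("category")[col].mean())
        # remaining NaN (no category) get the overall column mean
        df[col] = df[col].fillna(df[col].mean())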

See also https://stackoverflow.com/q/42724040/10495893, https://stackoverflow.com/q/64048937/10495893 — Ben Reiniger, May 13 '21 at 14:31

Ric S · Answer 1 · 2021-05-13T08:34:55.840

I'm afraid you cannot use only SimpleImputer for this kind of problem (at least as far as I know).

However, you can write a custom imputer class on top of scikit-learn's very flexible base classes BaseEstimator and TransformerMixin.

A very basic class would be something like the following:

from sklearn.base import BaseEstimator, TransformerMixin

class WithinGroupMeanImputer(BaseEstimator, TransformerMixin):
    def __init__(self, group_var):
        self.group_var = group_var
    
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        # the copy leaves the original dataframe intact
        X_ = X.copy()
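        for col in X_.columns:
            if X_[col].dtypes == 'float64':
                # fill NaN with the mean of the row's group, where the group is known
                X_.loc[X_[col].isna() & X_[self.group_var].notna(), col] = X_[self.group_var].map(X_.groupby(self.group_var)[col].mean())
                # remaining NaN (no group) get the overall column mean
                X_[col] = X_[col].fillna(X_[col].mean())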
        return X_

On your sample dataset:

imp = WithinGroupMeanImputer(group_var='category')

imp.fit(df)

imp.transform(df)

   code category       energy     sugars  proteins
0    01        B   936.000000  20.000000  7.800000
1    02     None  1127.848485  15.000000  4.881818
2    03        A  1569.000000  23.000000  4.100000
3    04     None   826.000000  20.916667  3.000000
4    05        B  1345.000000  22.000000  5.100000
5    06        A  1569.000000  17.000000  4.100000
6    10        C   826.000000  26.000000  3.000000
7    11        C  1345.000000  26.000000  5.100000
8   101        B   972.333333  18.000000  6.100000
9   102        B   636.000000  20.000000  7.800000
10  103     None  1127.848485  15.000000  4.881818
11  104        A  1569.000000  23.000000  4.100000
12  105        C   813.000000  26.000000  3.500000

Original data:

import pandas as pd

df = pd.DataFrame({
    'code': ['01', '02', '03', '04', '05', '06', '10', '11', '101', '102', '103', '104', '105'],
    'category': ['B', None, 'A', None, 'B', 'A', 'C', 'C', 'B', 'B', None, 'A', 'C'],
    'energy': [936, None, 1569, 826, 1345, None, 826, 1345, None, 636, None, 1569, 813],
    'sugars': [None, 15, 23, None, 22, 17, None, 26, 18, None, 15, 23, None],
    'proteins': [7.8, None, 4.1, 3, 5.1, None, 3, 5.1, 6.1, 7.8, None, 4.1, 3.5]
})

You should probably learn the relevant statistics at `fit` time, so that test data doesn't impute using its own distribution. — Ben Reiniger, May 13 '21 at 14:30
Yeah, you're right. How would you implement it? Do you have a link to a guide where I can learn it? — Ric S, May 13 '21 at 14:35
In the questions I linked, I've also linked these two implementations: https://datascience.stackexchange.com/q/71856/55122 and https://towardsdatascience.com/coding-a-custom-imputer-in-scikit-learn-31bd68e541de — Ben Reiniger, May 13 '21 at 14:44
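
Following the comment about learning the statistics at fit time, here is a minimal sketch of how that might look. The class name FittedGroupMeanImputer and the attributes columns_, group_means_ and overall_means_ are made up for illustration and are not part of scikit-learn. The sketch assumes the input is a pandas DataFrame and, like WithinGroupMeanImputer above, only touches float64 columns. One difference to be aware of: rows with a missing or previously unseen category fall back to the plain column mean of the training data, computed before any group filling, so the values for those rows will not match the output shown in the answer exactly.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedGroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant that learns its means in fit(), not transform()."""

    def __init__(self, group_var):
        self.group_var = group_var

    def fit(self, X, y=None):
        # columns to impute are fixed at fit time (float64 only, as in the answer)
        self.columns_ = X.select_dtypes(include="float64").columns.tolist()
        # per-group means; groupby drops rows whose group key is missing
        self.group_means_ = X.groupby(self.group_var)[self.columns_].mean()
        # fallback: plain column means of the training data
        self.overall_means_ = X[self.columns_].mean()
        return self

    def transform(self, X):
        # the copy leaves the original dataframe intact
        X_ = X.copy()
        for col in self.columns_:
            # look up the learned mean for each row's group;
            # missing or unseen groups map to NaN
            group_fill = X_[self.group_var].map(self.group_means_[col])
            # group mean first, then the training column mean as a fallback
            X_[col] = X_[col].fillna(group_fill).fillna(self.overall_means_[col])
        return X_
```

With this version you would call fit on the training dataframe once and then transform on the test dataframe, so the test rows are filled from the training means rather than from their own distribution.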
