
I am building a classification model based on some machine performance data. Unfortunately for me, it seems to over-fit no matter what I change. The dataset is quite large so I'll share the final feature importance & cross validation scores after feature selection.

# preparing the data
from sklearn.model_selection import train_test_split

X = df.drop('target', axis='columns')
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=10, stratify=y)

[image: feature importances after feature selection]

I then cross-validate as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

logreg = LogisticRegression()
kf = KFold(n_splits=25)
score = cross_val_score(logreg, X, y, cv=kf)

print("Cross Validation Scores: {}".format(score)) print("Average Cross Validation score : {}".format(score.mean()))

Here are the results that I get:

> Cross Validation Scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.94814175 1. 1. 1. 1. 1. 1.]
> Average Cross Validation score : 0.9979256698357821

When I run a random forest, the accuracy is 100%. What could be the problem? P.S. The classes were imbalanced, so I "randomly" under-sampled the majority class.
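Roughly, the under-sampling looked like this (a simplified sketch of what I did, not my exact code, assuming the same df with a binary target):

import pandas as pd

# split the frame by class and shrink the majority class to the minority size
counts = df['target'].value_counts()
minority = df[df['target'] == counts.idxmin()]
majority = df[df['target'] == counts.idxmax()]

majority_down = majority.sample(n=len(minority), random_state=10)
df = pd.concat([minority, majority_down]).sample(frac=1, random_state=10)  # shuffle the rows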

UPDATE: I overcame this challenge by eliminating some features from the final dataset. I retrained my models using a few features at a time and was able to identify the ones that caused the "over-fitting". In short, better feature selection did the trick.
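The screening was roughly like this (a simplified sketch, not my exact code): score the features individually with the same cross-validation and treat anything that scores near 1.0 on its own as a suspect.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# score each feature on its own; a feature that scores ~1.0 by itself is a leakage suspect
for col in X.columns:
    single = cross_val_score(LogisticRegression(max_iter=1000), X[[col]], y, cv=5)
    print(col, round(single.mean(), 3))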

1 Answer


This isn't overfitting. You're reporting cross-validation scores as very high (and are not reporting training set scores, which are presumably also very high); your model is just performing very well (on unseen data).
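If you do want to check for overfitting explicitly, compare training and validation scores directly; a minimal sketch reusing your logistic regression setup (overfitting would show up as training scores well above the test scores):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

cv = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=KFold(n_splits=25),
                    return_train_score=True)
print("mean train score:", cv["train_score"].mean())
print("mean test score: ", cv["test_score"].mean())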

That said, you should be asking yourself if something is wrong. There are two common culprits that come to mind:

  1. One of your features is very informative but wouldn't be available at prediction time ("future information", or in the extreme case, you accidentally left the target variable in the independent-variable dataframe).
  2. Your train-test splits don't respect some grouping (in the extreme case, rows of the frame are repeated and show up in both training and test folds). Both are quick to sanity-check, as sketched below.
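For example, with the X, y, and df from your question (a rough sketch; the correlation check assumes numeric features and a numeric target):

# 1. leakage check: a feature whose correlation with the target is ~1 is suspect
print(X.corrwith(y).abs().sort_values(ascending=False).head(10))

# 2. grouping/duplication check: identical rows can land in both train and test folds
print(df.duplicated().sum(), "exact duplicate rows in the full frame")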

Otherwise, it's entirely possible your problem is just easily solved by your model.

See also:

- Why does my model produce too good to be true output?
- Quote on too good to be true model performance?

Ben Reiniger