
Below is the linear regression model I fitted. I am not sure whether I am doing it the right way, as I am getting near 99% accuracy.

Fitting Simple Linear Regression to the Training set

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

ln_regressor = LinearRegression()
mse = cross_val_score(ln_regressor, X_train, Y_train , scoring = 'neg_mean_squared_error', cv = 5)
mean_mse = np.mean(mse)
print(mean_mse)

ln_regressor.fit(X_train, Y_train)

**MSE score = -6.612466691367042e-06**

Predicting the Test set results

y_pred = ln_regressor.predict(X_test)

Evaluating accuracy of test data

mse2 = cross_val_score(ln_regressor, X_test, y_pred , scoring = 'neg_mean_squared_error', cv = 5)
mean_mse2 = np.mean(mse2)
print(mean_mse2)

**MSE score = -4.645751512870382e-31**

Please note: my data is on a log scale and was later transformed with standard scaling.

R2= cross_val_score(ln_regressor,X_test, y_pred,cv = 10)

R2.mean()

R2 mean is '0.9999030728571852'

yathislax

2 Answers


Your first code block seems fine, and you do get a low cross-validated MSE, though we would need more details to diagnose whether that result is real. I do want to point out that your second code block uses cross_val_score incorrectly. With this code:

cross_val_score(ln_regressor, X_test, y_pred, ...)

you ignore the fact that ln_regressor is already fitted: cross_val_score refits it from scratch on some folds of X_test and scores it on the remaining fold.

But worse, the targets for these models are not the true labels y_test; they are your first model's predictions on the test set, y_pred. And of course you can recreate those nearly perfectly, just by refitting the original model!

If you want the test score, just compute mean_squared_error(y_test, y_pred).
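A minimal sketch of that correct evaluation, using toy data in place of your original arrays (the variable names here are hypothetical stand-ins):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the original X/y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Held-out evaluation: compare predictions against the TRUE test labels,
# not against the predictions themselves.
print("test MSE:", mean_squared_error(y_test, y_pred))
print("test R^2:", r2_score(y_test, y_pred))
```

No cross-validation is involved here: the model is fit once on the training set and scored once on the untouched test set.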

Ben Reiniger

First things first: accuracy is a classification concept. You can't say you have 99% accuracy for a regression problem.

Your code seems OK. Cross-validation is not necessary here, since you are not doing any hyper-parameter tuning or model selection. The MSE is indeed low, so I would suggest you go back and check your normalization: if your target $y$ has a very small span, i.e. a low $\sigma$ in the Gaussian case, you are guaranteed a meaninglessly low MSE.
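To illustrate why a raw MSE can mislead, here is a small sketch (with made-up data) where the same regression problem is scored on two target scales. The absolute MSE shrinks by a factor of a million when the target is shrunk 1000x, while a scale-free measure like MSE divided by the target variance stays the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
noise = rng.normal(size=300)

# Identical signal-to-noise ratio; the second target is just scaled down 1000x.
y_big = X @ np.array([2.0, -1.0]) + 0.5 * noise
y_small = y_big / 1000

mse_big = mean_squared_error(
    y_big, LinearRegression().fit(X, y_big).predict(X))
mse_small = mean_squared_error(
    y_small, LinearRegression().fit(X, y_small).predict(X))

# The raw MSE differs hugely, the variance-normalized MSE does not.
print(f"original scale:    MSE = {mse_big:.3e}, MSE/var(y) = {mse_big / y_big.var():.3f}")
print(f"scaled down 1000x: MSE = {mse_small:.3e}, MSE/var(y) = {mse_small / y_small.var():.3f}")
```

So a tiny MSE on log-scaled, standardized data says little on its own; compare it to the spread of the target.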

plpopk