I am modeling the House Prices dataset. My goal is to get the MSE and to predict the price from input variables.

I scale the data with MinMaxScaler() and train the model with LinearRegression(). After this I get the score, MSE, MAE, and RMSE.

But when I make a prediction, the result is scaled. How can I get the prediction as an actual price?

Dataset: https://www.kaggle.com/code/bsivavenu/house-price-calculation-methods-for-beginners/data

This is my script:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

train = pd.read_csv('train.csv')

column = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

train = train[column]

# Convert Feature/Column with Scaler
scaler = MinMaxScaler()
train[column] = scaler.fit_transform(train[column])

X = train.drop('SalePrice', axis=1)
y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Calling LinearRegression
model = LinearRegression()

# Fit linearregression into training data
model = model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Calculate MSE (Lower better)
mse = mean_squared_error(y_test, y_pred)
print("MSE of testing set:", mse)

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print("MAE of testing set:", mae)

# Calculate RMSE (Lower better)
rmse = np.sqrt(mse)
print("RMSE of testing set:", rmse)

# Predict the Price House by input:
overal_qual = 6
grlivarea = 1217
garage_cars = 1
totalbsmtsf = 626
fullbath = 1
year_built = 1980

predicted_price = model.predict([[overal_qual, grlivarea, garage_cars, totalbsmtsf, fullbath, year_built]])
print("Predicted price:", predicted_price)
```

The result:

MSE of testing set: 0.0022340806066149734
MAE of testing set: 0.0334447655149599
RMSE of testing set: 0.04726606189027147

Predicted price: [811.51843959]

The price should be something like 208500, 181500, or 121600, i.e. in the hundreds of thousands of dollars.

What step did I miss here?

desertnaut
MADFROST

1 Answer

  • First, you can't use anything from the test set before training. This means that the scaler should be fitted using only the training set, otherwise there's a risk of data leakage.
  • Then remember that scaling your features means that the model learns to predict with scaled features, therefore the test set should be passed after it has been scaled as well (using the same scaling as the training set, of course).
  • Finally, you could recover the real price by "unscaling" the predictions with inverse_transform. Instead, I decided not to scale the target variable in the code below because it's not needed (unless you really want scaled evaluation scores). It's also simpler ;)
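As a small aside before the full script, the first two points can be illustrated with toy numbers (made up for this sketch, not taken from the Kaggle data). The scaler is fitted on the training rows only, so its minimum and maximum ignore the test rows, and the test rows are then transformed with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy living-area values (hypothetical, for illustration only)
X_train = np.array([[1000.0], [1500.0], [2000.0]])
X_test = np.array([[2500.0]])

# Fit on the training rows only, then reuse the same scaling for the test rows
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned range comes from the training rows alone
print(scaler.data_min_, scaler.data_max_)
# A test value outside the training range lands outside [0, 1]; that is expected
print(X_test_scaled)
```

Calling fit_transform on the test set instead would silently give it a different scale from the one the model was trained on.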
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

full = pd.read_csv('train.csv')

column = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

full = full[column]

X = full.drop('SalePrice', axis=1)
y = full['SalePrice']

# always split between training and test set first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Then fit the scaling on the training set
# Convert Feature/Column with Scaler
scaler = MinMaxScaler()
# Note: the columns have already been selected
X_train_scaled = scaler.fit_transform(X_train)

# Calling LinearRegression
model = LinearRegression()

# Fit linearregression into training data
model = model.fit(X_train_scaled, y_train)

# Now we need to scale the test set features
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)

# y has not been scaled so nothing else to do

# Calculate MSE (Lower better)
mse = mean_squared_error(y_test, y_pred)
print("MSE of testing set:", mse)

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print("MAE of testing set:", mae)

# Calculate RMSE (Lower better)
rmse = np.sqrt(mse)
print("RMSE of testing set:", rmse)

# ... evaluation etc.
```
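For the last part of the question (predicting the price of one house from raw inputs), the new row has to go through the same fitted scaler before it reaches the model. The sketch below is only an illustration: it uses synthetic data in place of train.csv, with two features and made-up coefficients, so all the numbers are assumptions. It also shows the inverse_transform route with a separate scaler for the target (named y_scaler here), which is one possible way to do the "unscaling".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for train.csv (hypothetical values, two features only)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=200),
    "GrLivArea": rng.integers(500, 4000, size=200),
})
y = 20000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 5000, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Feature scaler fitted on the training set only; the target stays in dollars
x_scaler = MinMaxScaler()
model = LinearRegression().fit(x_scaler.fit_transform(X_train), y_train)

# A new house must be scaled with the same fitted scaler before predicting
new_house = pd.DataFrame({"OverallQual": [6], "GrLivArea": [1217]})
price = model.predict(x_scaler.transform(new_house))
print("Predicted price:", price[0])

# Alternative: scale the target with its own scaler, then undo it on the prediction
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train.to_numpy().reshape(-1, 1))
model_scaled = LinearRegression().fit(x_scaler.transform(X_train), y_train_scaled)
scaled_pred = model_scaled.predict(x_scaler.transform(new_house))
price_unscaled = y_scaler.inverse_transform(scaled_pred)
print("Predicted price (unscaled):", price_unscaled[0, 0])
```

Both routes should give the same price up to floating-point error, since min-max scaling of the target is a linear transformation. Note that a single scaler fitted on all seven columns at once, as in the question, cannot unscale a lone price column directly, which is why the target gets its own scaler in this sketch.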

Erwan