RMSE too high when trying to create a machine learning model in Python

Question

I am new to Python and ML. Right now, I am trying to build a model to forecast the expected call volume for a company. However, the RMSE that I am getting is higher than expected.

Here is my code. I don't know if the problem is that the hyperparameters are not tuning the model properly.

I also tested XGBoost, ARIMA and Neural Prophet, but I'm getting worse results.

Sensitive information was removed from the dataset that I'm using.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
from prophet import Prophet
data = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTwg4wrDxAIyyig9y_jtyfe8fvY-1NjShR6-sHacQebyip9ahowWynJ5kC2gBX_mfP6V4hTpo8axyl_/pub?gid=0&single=true&output=csv")
# Changing the Sale Date to the proper format
data['Sale Date'] = pd.to_datetime(data['Sale Date'])
# Dropping unnecessary columns
data = data.drop(['First Name', 'Order No', 'Zip', 'Last Name', 'City', 'State', 'Address 1', 'Address 2', 'Debit Or Credit', 'Agent Name', 'Verifier', 'BS Approved Or Declined', 'TM Approved Or Declined', 'LID Approved Or Declined', 'Telemed Approve Or Decline', 'Pet Approved Or Declined', 'Pet Cat Or Dog', 'Pets Name', 'Benefits Savings', 'Top Magazine', 'Pet', 'Telemed', 'Locked ID'], axis=1)
date = data['Sale Date'].value_counts()
df = date.reset_index()
df.columns = ['ds', 'y']
df = df.sort_values(by=['ds'])
# Split train and test and set the prediction size
pred_size = 30
train_df = df.iloc[:len(df) - pred_size]
test_df = df.iloc[len(df) - pred_size:]
# Plotting train_df and test_df
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(train_df['ds'], train_df['y'], label='Train Data')
ax.plot(test_df['ds'], test_df['y'], label='Test Data')
# Adding labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Train and Test Data')
# Adding legend
ax.legend()
# Displaying the plot
plt.show()
holidays = pd.DataFrame({
    'holiday': 'holiday',
    'ds': pd.to_datetime(['2022-05-30', '2022-06-20', '2022-07-04', '2022-09-05', '2022-10-10', '2022-11-11', '2022-11-24', '2022-12-26', '2023-01-02', '2023-01-16', '2023-02-20']),
    'lower_window': 0,
    'upper_window': 1,
})
model = Prophet(changepoint_prior_scale=0.08, seasonality_prior_scale=1, seasonality_mode='additive', holidays=holidays)
model.add_seasonality(name='daily', period=1, fourier_order=10, prior_scale=10)
model.add_country_holidays(country_name='US')
model.fit(train_df)
future = model.make_future_dataframe(periods=pred_size)
forecast = model.predict(future)
forecast['y'] = test_df['y']
rmse = np.sqrt(mean_squared_error(test_df['y'], forecast['yhat'].tail(pred_size)))
print('Root Mean Squared Error (RMSE):', rmse)

Thanks in advance for your time. I'm open to feedback.

score 1 · Answer 1 · answered Jun 19 '23 at 17:46

I suggest checking your data for outliers first, because RMSE is very sensitive to outliers.
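As a rough illustration of the outlier check, here is a minimal sketch using the interquartile-range (IQR) rule. The series `y`, the helper `iqr_outlier_mask`, the threshold `k=1.5` and the constant forecast are all made up for this example; they are not taken from the question's dataset, and IQR is only one of several reasonable ways to flag outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical daily call counts; the value 400 is a deliberately planted outlier.
y = pd.Series([100, 104, 98, 101, 97, 103, 400, 99, 102, 100], name="y")

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outlier_mask(y)
outliers = y[mask]
print("Flagged outliers:")
print(outliers)

# A constant forecast of 100 calls, used only to show how one outlier inflates RMSE.
yhat = np.full(len(y), 100.0)
keep = ~mask.to_numpy()
rmse_all = float(np.sqrt(np.mean((y.to_numpy() - yhat) ** 2)))
rmse_clean = float(np.sqrt(np.mean((y.to_numpy()[keep] - yhat[keep]) ** 2)))
print("RMSE with outlier:", rmse_all)
print("RMSE without outlier:", rmse_clean)
```

Because RMSE squares each error, the single extreme day dominates the score here. Whether to drop, cap, or keep such days depends on whether they are data errors or real events (for example holidays) that the model should learn.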
Second, normalize or standardize the data.
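The mechanics of standardization could look like the sketch below. The numbers, the split point, and the helpers `standardize` and `unstandardize` are hypothetical, and the zero "forecast" is only a placeholder for whatever model is fitted on the scaled values. Whether scaling actually improves the result depends on the model, so treat this as an illustration rather than a guaranteed fix.

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts standing in for the question's df['y'].
y = pd.Series([100.0, 120.0, 90.0, 110.0, 130.0, 95.0, 105.0, 125.0])
train, test = y.iloc[:6], y.iloc[6:]

# Fit the scaling parameters on the training split only,
# so no information from the test period leaks into the model.
mu, sigma = train.mean(), train.std(ddof=0)

def standardize(s):
    """Scale values to zero mean and unit variance using the training statistics."""
    return (s - mu) / sigma

def unstandardize(z):
    """Map scaled values back to the original units (call counts)."""
    return z * sigma + mu

train_z = standardize(train)
test_z = standardize(test)

# Placeholder forecast in the scaled space: 0 corresponds to the training mean.
yhat_z = np.zeros(len(test))

# Convert predictions back to call counts before computing RMSE,
# so the error stays comparable with an unscaled model.
yhat = unstandardize(yhat_z)
rmse = float(np.sqrt(np.mean((test.to_numpy() - yhat) ** 2)))
print("RMSE in original units:", rmse)
```

Computing RMSE on the scaled values would give a smaller number simply because the units changed, so inverting the transform first keeps the comparison fair.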

Harshad Patil
