I am new to Python and machine learning. I am trying to build a model to forecast the expected call volume for a company, but the RMSE I am getting is higher than expected.
Here is my code. I don't know whether the problem is that the hyperparameters are not tuned properly for the model.
I also tested XGBoost, ARIMA, and NeuralProphet, but got worse results.
Sensitive information was removed from the dataset I'm using.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
from prophet import Prophet
data = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTwg4wrDxAIyyig9y_jtyfe8fvY-1NjShR6-sHacQebyip9ahowWynJ5kC2gBX_mfP6V4hTpo8axyl_/pub?gid=0&single=true&output=csv")
# Convert the Sale Date column to datetime
data['Sale Date'] = pd.to_datetime(data['Sale Date'])
# Drop unnecessary columns
data = data.drop(['First Name', 'Order No', 'Zip', 'Last Name', 'City', 'State', 'Address 1', 'Address 2', 'Debit Or Credit', 'Agent Name', 'Verifier', 'BS Approved Or Declined', 'TM Approved Or Declined', 'LID Approved Or Declined', 'Telemed Approve Or Decline', 'Pet Approved Or Declined', 'Pet Cat Or Dog', 'Pets Name', 'Benefits Savings', 'Top Magazine', 'Pet', 'Telemed', 'Locked ID'], axis=1)
date = data['Sale Date'].value_counts()
df = date.reset_index()
df.columns = ['ds', 'y']
df = df.sort_values(by=['ds'])
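One thing worth checking at this step: `value_counts()` only returns dates that appear in the data, so days without any calls are absent rather than zero. Below is a minimal sketch of filling those gaps. It assumes that a missing date really means zero calls, which is an assumption about the data and not something confirmed here. The helper name `to_daily_series` and the toy frame are hypothetical.

```python
import pandas as pd

def to_daily_series(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per calendar day; days absent from df get y = 0."""
    out = df.copy()
    # Drop any time-of-day part so every row maps to a calendar day.
    out["ds"] = pd.to_datetime(out["ds"]).dt.normalize()
    daily = out.groupby("ds")["y"].sum()
    # Build the full calendar between the first and last observed day.
    full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
    daily = daily.reindex(full_range, fill_value=0)
    return daily.rename_axis("ds").reset_index(name="y")

# Toy data: 2023-01-03 is missing and should come back with y = 0.
example = pd.DataFrame({"ds": ["2023-01-01", "2023-01-02", "2023-01-04"],
                        "y": [5, 3, 7]})
daily_df = to_daily_series(example)
```

With the frame from this post, the call would be `df = to_daily_series(df)` before the train/test split.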
# Split into train and test sets and set the prediction size
pred_size = 30
train_df = df.iloc[:len(df) - pred_size]
test_df = df.iloc[len(df) - pred_size:]
# Plot train_df and test_df
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(train_df['ds'], train_df['y'], label='Train Data')
ax.plot(test_df['ds'], test_df['y'], label='Test Data')
# Add labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Train and Test Data')
# Add legend
ax.legend()
# Display the plot
plt.show()
holidays = pd.DataFrame({
'holiday': 'holiday',
'ds': pd.to_datetime(['2022-05-30', '2022-06-20', '2022-07-04', '2022-09-05', '2022-10-10', '2022-11-11', '2022-11-24', '2022-12-26', '2023-01-02', '2023-01-16', '2023-02-20']),
'lower_window': 0,
'upper_window': 1,
})
model = Prophet(changepoint_prior_scale=0.08, seasonality_prior_scale=1, seasonality_mode='additive', holidays=holidays)
model.add_seasonality(name='daily', period=1, fourier_order=10, prior_scale=10)
model.add_country_holidays(country_name='US')
model.fit(train_df)
future = model.make_future_dataframe(periods=pred_size)
forecast = model.predict(future)
forecast = forecast.merge(test_df, on='ds', how='left')  # attach actual values by date, not by row index
rmse = np.sqrt(mean_squared_error(test_df['y'], forecast['yhat'].tail(pred_size)))
print('Root Mean Squared Error (RMSE):', rmse)
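Because `make_future_dataframe` produces consecutive calendar days while `test_df` only contains dates present in the data, `tail(pred_size)` may pair predictions with the wrong dates. The sketch below computes the RMSE on rows matched by date, and adds a seasonal-naive baseline (repeat the value from the same weekday in the last training weeks) to help judge whether the RMSE is actually high. The helper names `aligned_rmse` and `seasonal_naive` are hypothetical, the 7-day lag is an assumption about weekly patterns in call volume, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def aligned_rmse(actual: pd.DataFrame, forecast: pd.DataFrame) -> float:
    """RMSE over dates present in both frames, matched on the 'ds' column."""
    merged = actual[["ds", "y"]].merge(forecast[["ds", "yhat"]], on="ds", how="inner")
    merged = merged.dropna(subset=["y", "yhat"])
    if merged.empty:
        raise ValueError("no overlapping dates between actual and forecast")
    return float(np.sqrt(np.mean((merged["y"] - merged["yhat"]) ** 2)))

def seasonal_naive(train: pd.DataFrame, test_dates: pd.Series, lag_days: int = 7) -> pd.DataFrame:
    """Predict each test date with the training value lag_days (or a multiple) earlier."""
    history = train.set_index("ds")["y"]
    last_seen = history.index.max()
    step = pd.Timedelta(days=lag_days)
    preds = []
    for ds in test_dates:
        ref = ds
        # Step back whole lags until the reference date falls inside the training data.
        while ref > last_seen:
            ref -= step
        preds.append(history.get(ref, np.nan))  # NaN if that training day is missing
    return pd.DataFrame({"ds": list(test_dates), "yhat": preds})

# Toy data with a perfect weekly pattern; the test week is the pattern plus 1.
toy_train = pd.DataFrame({"ds": pd.date_range("2023-01-01", periods=14),
                          "y": [10, 20, 30, 40, 50, 60, 70] * 2})
toy_test = pd.DataFrame({"ds": pd.date_range("2023-01-15", periods=7),
                         "y": [11, 21, 31, 41, 51, 61, 71]})
baseline = seasonal_naive(toy_train, toy_test["ds"])
baseline_rmse = aligned_rmse(toy_test, baseline)
```

With the objects from this post, the calls would be `aligned_rmse(test_df, forecast)` for the Prophet forecast and `aligned_rmse(test_df, seasonal_naive(train_df, test_df['ds']))` for the baseline. If Prophet does not clearly beat the baseline, the issue is more likely in the data or features than in the hyperparameters.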
Thanks in advance for your time. I'm open to any feedback.