
I am new to Python and ML. I am trying to build a model to forecast the expected call volume for a company, but the RMSE I am getting is higher than expected.

Here is my code. I don't know whether the problem is that the hyperparameters are not tuned properly for the model.

I also tested XGBoost, ARIMA, and NeuralProphet, but I'm getting worse results.

Sensitive information was removed from the dataset I'm using.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
from prophet import Prophet

data = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTwg4wrDxAIyyig9y_jtyfe8fvY-1NjShR6-sHacQebyip9ahowWynJ5kC2gBX_mfP6V4hTpo8axyl_/pub?gid=0&single=true&output=csv")

# Changing the Sale Date to the proper format

data['Sale Date'] = pd.to_datetime(data['Sale Date'])

# Dropping unnecessary columns

data = data.drop(['First Name', 'Order No', 'Zip', 'Last Name', 'City', 'State', 'Address 1', 'Address 2', 'Debit Or Credit', 'Agent Name', 'Verifier', 'BS Approved Or Declined', 'TM Approved Or Declined', 'LID Approved Or Declined', 'Telemed Approve Or Decline', 'Pet Approved Or Declined', 'Pet Cat Or Dog', 'Pets Name', 'Benefits Savings', 'Top Magazine', 'Pet', 'Telemed', 'Locked ID'], axis=1)

date = data['Sale Date'].value_counts()
df = date.reset_index()
df.columns = ['ds', 'y']
df = df.sort_values(by=['ds'])

# Split train and test and set the prediction size
pred_size = 30
train_df = df.iloc[:len(df) - pred_size]
test_df = df.iloc[len(df) - pred_size:]

# Plotting train_df and test_df
fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(train_df['ds'], train_df['y'], label='Train Data')
ax.plot(test_df['ds'], test_df['y'], label='Test Data')

# Adding labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Train and Test Data')

# Adding legend
ax.legend()

# Displaying the plot
plt.show()

holidays = pd.DataFrame({
    'holiday': 'holiday',
    'ds': pd.to_datetime(['2022-05-30', '2022-06-20', '2022-07-04', '2022-09-05',
                          '2022-10-10', '2022-11-11', '2022-11-24', '2022-12-26',
                          '2023-01-02', '2023-01-16', '2023-02-20']),
    'lower_window': 0,
    'upper_window': 1,
})

model = Prophet(changepoint_prior_scale=0.08, seasonality_prior_scale=1,
                seasonality_mode='additive', holidays=holidays)
model.add_seasonality(name='daily', period=1, fourier_order=10, prior_scale=10)
model.add_country_holidays(country_name='US')
model.fit(train_df)
future = model.make_future_dataframe(periods=pred_size)
forecast = model.predict(future)
forecast['y'] = test_df['y']

rmse = np.sqrt(mean_squared_error(test_df['y'], forecast['yhat'].tail(pred_size)))
print('Root Mean Squared Error (RMSE):', rmse)

Thanks in advance for your time. I'm open to feedback.

1 Answer

  • First, check your data for outliers, because RMSE is very sensitive to them.
  • Second, normalize or standardize the data.
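As a rough illustration of both suggestions, here is a minimal sketch that flags outliers with the common 1.5 × IQR rule and then standardizes the series. The `demo` frame is synthetic stand-in data, and the helper name `flag_iqr_outliers` and the threshold `k=1.5` are assumptions made for this example, not something taken from the question. The same steps could be applied to the `y` column of `train_df`.

```python
import numpy as np
import pandas as pd


def flag_iqr_outliers(y: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where y lies outside the IQR fences."""
    q1, q3 = y.quantile(0.25), y.quantile(0.75)
    iqr = q3 - q1
    return (y < q1 - k * iqr) | (y > q3 + k * iqr)


# Synthetic daily counts with one obvious spike (hypothetical data).
demo = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=10, freq="D"),
    "y": [100, 104, 98, 101, 500, 99, 103, 97, 102, 100],
})

# Step 1: flag outliers and blank them out instead of dropping the rows,
# so the date index stays complete.
demo["is_outlier"] = flag_iqr_outliers(demo["y"])
demo["y_clean"] = demo["y"].where(~demo["is_outlier"], np.nan)

# Step 2: standardize using statistics from the cleaned values only
# (in practice, compute them on the training split to avoid leakage).
mean, std = demo["y_clean"].mean(), demo["y_clean"].std()
demo["y_scaled"] = (demo["y_clean"] - mean) / std

print(demo[["ds", "y", "is_outlier", "y_scaled"]])
```

Prophet accepts missing values in `y`, so setting flagged days to NaN before fitting is one way to keep them from distorting the fit. If the model is trained on the scaled column, remember to transform predictions back with `yhat * std + mean` before computing RMSE, otherwise the number is not comparable to the original one.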
Harshad Patil