In short, I get prediction interval smaller than confidence interval while they should be wider. Any help to understand why is certainly welcome :)
Let me start by stating the problem at hand. I use a GLM to predict some variable (let's say some sale amount). If I sum the predicted mean for each line in the data, I get a prediction for the total amount of sales. After one year, I get the true value. The goal is to have a range around the prediction such that, if the true value is outside that bound, it can be a sign the model needs retraining.
From a statistical point of view, this seems like a good use case of prediction intervals. So I was trying to implement them in Python. Here is my code on a toy example, I'll specify my issues after.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
Generating data
np.random.seed(0)
n = 100
x1 = np.random.normal(size=n)
x2 = np.random.normal(size=n)
y = np.random.poisson(lam=np.exp(0.5 + 0.3 * x1 - 0.2 * x2), size=n)
DataFrame
data = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
GLM
model = smf.glm(formula='y ~ x1 + x2', data=data, family=sm.families.Poisson()).fit()
#Getting predictions
predictions=model.get_prediction(data)
pred_mean=predictions.summary_frame()['mean']
pred_lower_ci=predictions.summary_frame()['mean_ci_lower']
pred_upper_ci=predictions.summary_frame()['mean_ci_upper']
#Defining the confidence intervals
lower_bound_CI=np.sum(pred_lower_ci)
upper_bound_CI=np.sum(pred_upper_ci)
#Defining the prediction intervals
alpha=0.05
total_mean=np.sum(pred_mean)
lower_bound_PI=stats.poisson.ppf(alpha/2,total_mean)
upper_bound_PI=stats.poisson.ppf(1-alpha/2,total_mean)
I get as a result a predicted value of 169. The confidence interval is [131.65,217.96] and the prediction interval is [144,195].
What are my issues?
As mentioned in the beginning, my CI is wider than my PI while statistically it should be the opposite. I was thinking it is maybe due to the fact that I sum the 'mean_ci_lower' and 'mean_ci_upper', but the same issue happens with n=1 (so without any summation). I know that the summing assume independence of the data. Does anyone see something wrong about that?
When I look back at my way to compute the PI, I think I'd use the same formula if I wanted to compute myself the CI. So I wonder two things about this. Is my way of computing the PI correct? Maybe for a Poisson, using the true quantiles, both intervals are the same? And if not and the fact is that I compute a CI, why is it not the same result as what the package statsmodels get? Are they using a normal approximation? I tried to read the code in the package but got lost somehow. And how acceptable is it to use the normal approximation? Also if the glm family is the Gamma family (I expect to mostly encounter Gaussian, Poisson and Gamma)?
If you are still reading here, thank you already. And if you can/have the time to help, I'd definitely be most grateful.