Should I remove or interpolate missing values?

Question

I have a dataset containing a very long time series of hourly traffic congestion in a certain city, covering a period of ~22 years (number of data points: roughly 24 × 365 × 22 = 192,720). I want to use this time series to forecast future hourly traffic congestion values. I have 2 types of missing values in the series:

A "single" missing value - ~30 values that are missing, with no certain pattern, i.e. the missing values are sporadically spread across the time series.
A missing day - 20 days that are missing altogether, with not a single data point for those days. 10 of those 20 days are sporadically spread across the time series, while the other 10 are adjacent (10 days in a row).

The overall missing-value rate is around 0.25%, so I'm not worried about removing them altogether for descriptive statistics etc. I'm just wondering whether it's correct to remove them for the forecasting part. Also, I'm not sure if I should treat the 2 types of missing values differently.

Thanks!

score 2 · Accepted Answer · answered Jan 10 '24 at 09:57

Easy answer: try both and see what works best.

Obviously, whatever works best is what you should choose, but let's look into why one method might work better than the other and in what scenario it's more likely that it does.

What does "null" mean in your case?

This is a very common data question. Does "null" mean that (a) there is just no value AND we know that there isn't a value? Or does "null" mean, (b) there is a value AND we don't know it? Whether your data is continuous or discrete typically can help answer that question.

In your case, I'd say (b). Congestion could be measured at any time and would have a value; it just wasn't recorded. So your data are measurements/samples out of a continuous distribution.

So, in this case, interpolating might be a good idea. You are in a "luxury" situation where the amount of missing data is relatively small, so whichever option you choose likely won't greatly impact the end solution. My personal take:

Do interpolate the data, as having an interpolation mechanism could help your end-model deal with missing data in production from a functional perspective. You don't want your code to break if it encounters a null value.

Also, not sure if I should treat the 2 types of missing values differently.

Interesting question. You might want to interpolate the two types differently. When a whole day is missing, instead of taking the average of the window near the missing values, you might want to use the average of similar days (e.g. the average of other Mondays), as you might have some seasonality. With sporadic single missing values, looking at the direct "neighbours" might be sufficient.
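
A minimal pandas sketch of that idea, under some assumptions of mine: the data is a pd.Series with an hourly DatetimeIndex, a gap of up to 3 hours counts as "sporadic", and the seasonality is captured well enough by a weekday/hour average. The function name fill_gaps and the short_gap_hours threshold are hypothetical, so tune them to your data.

```python
import pandas as pd


def fill_gaps(series: pd.Series, short_gap_hours: int = 3) -> pd.Series:
    """Fill short gaps from direct neighbours, long gaps from similar days."""
    # Put the series on a complete hourly grid so that rows which are
    # absent altogether (e.g. a missing day) show up as NaN.
    full_index = pd.date_range(
        series.index.min(), series.index.max(), freq=pd.Timedelta(hours=1)
    )
    s = series.reindex(full_index)

    # Label each run of consecutive equal flags, then measure how many
    # missing values each run contains (0 for runs of observed values).
    is_na = s.isna()
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    run_len = is_na.astype(int).groupby(run_id).transform("sum")
    short_gap = is_na & (run_len <= short_gap_hours)
    long_gap = is_na & (run_len > short_gap_hours)

    # Short gaps: linear-in-time interpolation between the neighbours.
    interpolated = s.interpolate(method="time")

    # Long gaps: mean of the observed values with the same weekday and hour
    # (e.g. all other Mondays at 08:00). NaNs are skipped by the mean.
    profile = s.groupby([s.index.dayofweek, s.index.hour]).transform("mean")

    filled = s.copy()
    filled[short_gap] = interpolated[short_gap]
    filled[long_gap] = profile[long_gap]
    return filled
```

Calling fill_gaps(congestion) on your hourly series returns a gap-free series on a regular hourly grid. For the 10 adjacent missing days, a weekday/hour average over all 22 years may be too coarse if congestion has drifted over time; restricting the average to nearby weeks would be a reasonable variation.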
