8

I am looking for an algorithms to maintain several statistics over a sliding window. The setup is as follows: There is a datastream consisting of (real value,timestamp) tuples. The values for the last x seconds are stored. In each iteration at least one new tuple arrives and between zero and more values are retired. Now I want to maintain mean and variance without recomputing everything from scratch. There are methods if values are never retired, e.g. Welford's method. But what if values are retired?

The second problem I am facing is, how to compute covariance between two sliding windows in a similar fashion?

Are you aware of a solution or references?

Johannes
  • 81
  • 1
  • 2

2 Answers2

2

To compute the mean of all the values in your window, separately compute (a) the sum of all the values in the window, and (b) a count of the number of values in the window.

To keep track of the sum of all the values in the sliding window, each time you receive a new value, add it to the sum. Each time a value leaves the window, subtract it from the sum. Now you'll always have a sum of the values in the window.

To keep track of the number of values in the window, each time you receive a new value, increment the counter. Each time a value leaves the window, decrement the counter.

Finally, the mean is the sum divided by the count.

To keep track of the variance, you need the count, the sum of the values, and the sum of the squares of the values. You can keep track of the sum of the squares of the values in the window in the same way (when a new value arrives, add its square to the running sum; when a value leaves, subtract its square). These three values are enough to let you compute the variance of the values in the sliding window.

Note that when using the sum of squares method to compute the variance, you will need to compute the three intermediate values to enough digits of precision. If you don't have enough digits of precision for those intermediate values, then you get a very inaccurate estimate. This is the robustness problem that is mentioned in the blog post you link to and that is addressed by Welford's method. However, there is another pragmatic way to address it that will work for most practical settings: just compute those three intermediate values to more digits of precision.

For the covariance between $x_1,\dots,x_n$ and $y_1,\dots,y_n$, you need to know the sum of the values $x_i$, $x_i^2$, $y_i^2$, $x_i y_i$, and the count $n$. So, it's basically the same principle as above, but now there are five intermediate values you need to maintain, instead of just three.

D.W.
  • 167,959
  • 22
  • 232
  • 500
2

Usually, an exact solution is impossible (unless you keep the entire window in memory), however it is possible to approximate the output as a function of the amount of memory you are allowed to have. A good starting point would be

They show how to maintain the sum of a positive-value data stream (then, computing the average is just dividing in the window's size). The mainly assume a fixed window size (i.e., at each time step a new item arrives and one old item expires), however they also discuss the model you are interested in, where each element has a time-stamp, and several items may expire at the same time. Check if Exponential Histogram (or their extension, smooth histograms) may give you what you need.

Another good source of information for data streams in general would be the survey

Ran G.
  • 20,884
  • 3
  • 61
  • 117