0

Let's say I have n elements to begin with, and over time more elements are added in. What's the minimum amount of data I have to store in order to be able to continuously update the variance, and how would I do it (assuming I can get away with purely aggregate data such as number of elements, current mean, and current variance)? Aggregate data would be preferable to individual element data, although if the latter is unavoidable then I guess I will not have a choice!

2 Answers

0

You won't need to store much data, and the amount you store does not grow with the number of elements. The variance can be calculated iteratively, and since a good answer has already been given by did, there is no need to repeat it here.
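For illustration only, here is a minimal sketch of one such iterative update, assuming you keep exactly the three aggregates mentioned in the question (count, mean, and population variance). The function name `update` and the sample values are hypothetical, and this is not necessarily the method given in the other answer.

```python
def update(n, mean, var, x):
    """Return (count, mean, population variance) after adding one value x.

    Hypothetical helper: n, mean, var describe the data seen so far.
    """
    new_n = n + 1
    delta = x - mean
    new_mean = mean + delta / new_n
    # n * var is the sum of squared deviations from the old mean;
    # delta * (x - new_mean) is the amount the new value adds to it.
    new_var = (n * var + delta * (x - new_mean)) / new_n
    return new_n, new_mean, new_var


# Start from an empty data set and feed the values in one at a time.
state = (0, 0.0, 0.0)
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    state = update(*state, x)
```

Only the three numbers in `state` are carried from one step to the next, so the individual elements never need to be kept.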

0

A useful computational formula for the sample variance is as follows: $$s^2 = \frac{1}{n-1} \left(\sum y_i^2 - \frac{1}{n} \left(\sum y_i \right)^2 \right)$$

and the population variance is... $$\sigma^2 = \frac{1}{n} \left( \sum y_i^2 - \frac{1}{n} \left(\sum y_i \right)^2 \right)$$

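So, you only need to store three numbers: $n$, $\sum y_i$ and $\sum y_i^2$. When a new element $y$ arrives, add $1$ to $n$, add $y$ to $\sum y_i$, and add $y^2$ to $\sum y_i^2$.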
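As a rough sketch of how those three stored quantities could be maintained in code (the class name `RunningVariance` and its method names are made up for this example):

```python
class RunningVariance:
    """Hypothetical accumulator that stores only n, sum(y) and sum(y^2)."""

    def __init__(self):
        self.n = 0           # number of elements seen so far
        self.total = 0.0     # running sum of y_i
        self.total_sq = 0.0  # running sum of y_i^2

    def add(self, y):
        """Fold one new element into the three aggregates."""
        self.n += 1
        self.total += y
        self.total_sq += y * y

    def _centered_sum(self):
        # sum(y_i^2) - (1/n) * (sum(y_i))^2, shared by both formulas
        return self.total_sq - self.total ** 2 / self.n

    def population_variance(self):
        """Variance with denominator n (requires n >= 1)."""
        return self._centered_sum() / self.n

    def sample_variance(self):
        """Variance with denominator n - 1 (requires n >= 2)."""
        return self._centered_sum() / (self.n - 1)


rv = RunningVariance()
for y in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.add(y)
```

One caveat: in floating-point arithmetic the subtraction inside `_centered_sum` can lose precision when the mean is large compared with the spread of the data, so this direct form is best suited to well-scaled values.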

Brad S.
  • 1,866
  • OP isn't talking about sample variance, so the first denominator should be $n$, not $n-1$ – Ross Millikan Mar 18 '14 at 23:13
  • @RossMillikan Either way, the OP will still only need to store $n$, $\sum y_i$ and $\sum y_i^2$. Where does the OP imply that it is not the sample variance? – Brad S. Mar 18 '14 at 23:19
  • OP seems to me to be looking for the variance of the whole data set at any time, not viewing the first items as a sample of the whole population and getting an estimate of the population variance. – Ross Millikan Mar 18 '14 at 23:21
  • Still does not change that answer to the question. The OP will only need to store $n$, $\sum y_i$ and $\sum y_i^2$ – Brad S. Mar 18 '14 at 23:24
  • That is true. I was just trying to get the formula for what OP wants. – Ross Millikan Mar 18 '14 at 23:26
  • @RossMillikan Thanks. I've amended my answer. – Brad S. Mar 18 '14 at 23:30