Can I work out the variance in batches?

Question

I have data divided into chunks, and because of software limitations I can only calculate the variance within each chunk. But I want the variance of the whole dataset, not of the chunks. I know the variance is not a linear operator. I would like something like an average of the chunk variances, but it has to be the same number I would get if I calculated the variance of the whole dataset at once. Example: rolling a die in 3 groups of 2 rolls, I can calculate the variance of each group, and with this data I want to calculate the variance of the whole set of 6 rolls. Thank you for your help.

heropup · Accepted Answer · 2020-04-03T16:41:06.340

Refer to the following answer to this question: How do I combine standard deviations of two groups?

In particular, the final formula

$$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}$$

illustrates how to compute the total variance of two samples, one of size $n$, sample mean $\bar x$, and sample variance $s_x^2$, and one of size $m$, sample mean $\bar y$, and sample variance $s_y^2$. Those are the quantities you need to track. Also note that the total sample mean is given by the formula $$\bar z = \frac{n \bar x + m \bar y}{n + m}.$$ These formulas readily lend themselves to an extended calculation for any number of groups:

1. Set $i = 1$.
2. Compute $n_i$, $\bar x_i$, and $s_{x_i}^2$, the sample size, sample mean, and sample variance of dataset $i$.
3. Increment $i$.
4. Repeat Step 2.
5. Use the above two formulas to compute a new $n_T$, $\bar x_T$, and $s_T^2$ representing the sample size, sample mean, and sample variance of all datasets up to set $i$.
6. If the last dataset was used to compute the result in Step 5, stop. Otherwise, go to Step 3.
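
The procedure above can be sketched in Python as a running merge of per-chunk summaries. This is only an illustration under stated assumptions: the function names `summarize` and `merge` and the roll values are hypothetical, chosen to mirror the question's "3 groups of 2 rolls" scenario, and exact fractions are used so the comparison with the direct computation is not blurred by rounding.

```python
from fractions import Fraction
from functools import reduce
from statistics import mean, variance


def summarize(chunk):
    """Return (size, sample mean, sample variance) of one chunk."""
    return len(chunk), mean(chunk), variance(chunk)


def merge(a, b):
    """Pool two (size, mean, variance) summaries with the two formulas above."""
    (n, x, sx), (m, y, sy) = a, b
    t = n + m
    pooled_mean = (n * x + m * y) / t
    pooled_var = (((n - 1) * sx + (m - 1) * sy) / (t - 1)
                  + n * m * (x - y) ** 2 / (t * (t - 1)))
    return t, pooled_mean, pooled_var


# Hypothetical die rolls (not from the answer): 3 groups of 2 rolls.
groups = [[1, 4], [6, 2], [3, 3]]
groups = [[Fraction(v) for v in g] for g in groups]

# Only the per-chunk summaries are needed to build the overall result.
running = reduce(merge, map(summarize, groups))

# Direct computation on all rolls at once, for comparison.
everything = [v for g in groups for v in g]
print(running)
print(running == summarize(everything))
```

Note that `statistics.variance` needs at least two values per chunk, so a chunk holding a single observation would have to be given a variance of 0 by hand before merging.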

Since the original poster has claimed that the formula does not work, I will furnish a numerical example to illustrate. This example will employ discrete data to match the scenario described in the question, but realizations from a continuous distribution can just as easily be provided.

Let $D_i$ represent dataset $i$. Then

$$\begin{align*} D_1 &= \{1, 1, 3, 4, 1, 5, 6, 3, 5, 5\} \\ D_2 &= \{5, 6, 2, 4, 2, 1, 1, 4, 2, 4, 4, 1, 3, 5, 6\} \\ D_3 &= \{3, 2, 6, 4, 1, 5, 2, 1, 3, 1, 5, 2, 2\} \\ D_4 &= \{5, 3, 1, 5, 1\} \end{align*}$$

Consequently, $$\begin{array}{|c|c|c|c|} \hline i & n_i & \bar x_i & s_{x_i}^2 \\ \hline 1 & 10 & \frac{17}{5} & \frac{18}{5} \\ \hline 2 & 15 & \frac{10}{3} & \frac{65}{21} \\ \hline 3 & 13 & \frac{37}{13} & \frac{73}{26} \\ \hline 4 & 5 & 3 & 4 \\ \hline \end{array}$$

We now calculate the combined sample sizes, means, and variances of datasets $1$ through $i$:

$$\begin{array}{|c|c|c|c|} \hline T & n_T & \bar x_T & s_T^2 \\ \hline 1 & 10 & \frac{17}{5} & \frac{18}{5} \\ \hline 2 & 25 & \frac{84}{25} & \frac{947}{300} \\ \hline 3 & 38 & \frac{121}{38} & \frac{4245}{1406} \\ \hline 4 & 43 & \frac{136}{43} & \frac{2749}{903} \\ \hline \end{array}$$

The last row represents the total sample size, sample mean, and sample variance for the $4$ combined datasets.

Here is a sample calculation of the aggregate variance of datasets $1$ through $3$:

$$s_T^2 (T = 3) = \frac{(25 - 1)(\frac{947}{300}) + (13 - 1)(\frac{73}{26})}{25 + 13 - 1} + \frac{(25)(13)(\frac{84}{25} - \frac{37}{13})^2}{(25 + 13)(25 + 13 - 1)} = \frac{4245}{1406},$$

which matches the direct calculation based on datasets $D_1, D_2, D_3$.
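
For comparison, the $T = 2$ entry can be worked out the same way from the individual summaries of $D_1$ and $D_2$; the intermediate fractions are shown so the two terms of the formula can be checked separately:

$$s_T^2 (T = 2) = \frac{(10 - 1)(\frac{18}{5}) + (15 - 1)(\frac{65}{21})}{10 + 15 - 1} + \frac{(10)(15)(\frac{17}{5} - \frac{10}{3})^2}{(10 + 15)(10 + 15 - 1)} = \frac{142}{45} + \frac{1}{900} = \frac{947}{300}.$$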

Finally, Mathematica code to replicate the above computations:

d1 = {1, 1, 3, 4, 1, 5, 6, 3, 5, 5};
d2 = {5, 6, 2, 4, 2, 1, 1, 4, 2, 4, 4, 1, 3, 5, 6};
d3 = {3, 2, 6, 4, 1, 5, 2, 1, 3, 1, 5, 2, 2};
d4 = {5, 3, 1, 5, 1};

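(* Summarize a dataset as {size, mean, sample variance}. *)
stat[x_] := {Length[x], Mean[x], Variance[x]}
data = stat /@ {d1, d2, d3, d4}

(* Combine two summaries using the two formulas above. *)
var[{n_, x_, sx_}, {m_, y_, sy_}] := {n + m, (n x + m y)/(n + m),
     ((n - 1) sx + (m - 1) sy)/(n + m - 1) + n m (x - y)^2/((n + m) (n + m - 1))}

(* Running results for datasets 1 through i; the {0, 0, 0} seed stands for an empty dataset and is dropped by Rest. *)
Rest@FoldList[var[#1, #2] &, {0, 0, 0}, data]

(* Direct computation on the joined data, to compare with the last running result. *)
stat[Join[d1, d2, d3, d4]]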

In the future, rather than simply asserting that the formula doesn't work, it would be more polite and instructive to provide your own computations showing where you are encountering problems, so that your error can be found.

The sample sizes are subtracted by $1$ to get unbiased estimators in an estimation context. But is there a reason we are bothered about unbiasedness here? — StubbornAtom, Apr 01 '20 at 09:51
When you say set $i = 1$, I am not sure what you mean. Maybe I can work out the variance of the first two groups, then combine the result with the third, and so on? — yomismito, Apr 01 '20 at 13:50
@yomismito Please see the edited answer, in which I have provided a detailed numerical example to demonstrate that the formulas do in fact work. — heropup, Apr 03 '20 at 16:42
OK, I'm looking at your example and how you applied the formula, and it seems to work when each dataset is added to the previous combined result, as you did. In my case the data is divided, which is why the formula wasn't working for me. Is there a formula to do it this way? Thank you so much — yomismito, May 04 '20 at 14:18
