Suppose you have two samples from two populations $X$ and $Y$ (e.g. heights of males in the USA and heights of females in the USA).
You are only given:
- the sample means $\bar{x}$ and $\bar{y}$, one for each population
- the sample sizes $n$ and $m$ available from each population
- the variances of the two sample means, $Var(\bar{x}) = s_1^2$ and $Var(\bar{y}) = s_2^2$
- you are NOT given the individual observations: $x_1, x_2, \dots, x_n , y_1, y_2, \dots, y_m$
Based on this information, you are interested in estimating the "Weighted Mean" of the entire population (i.e. heights of all people in the USA). From "first principles", you can estimate the Weighted Mean (https://en.wikipedia.org/wiki/Weighted_arithmetic_mean) like this:
\begin{equation} \bar{x}_{weighted} = \frac{\bar{x}n + \bar{y}m}{n+m} \end{equation}
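As a quick illustration, the weighted mean can be computed from the summary statistics alone. The numbers below are made up for the example, and `weighted_mean` is a hypothetical helper name, not something from a library:

```python
def weighted_mean(x_bar, n, y_bar, m):
    """Sample-size-weighted mean of two group means."""
    return (x_bar * n + y_bar * m) / (n + m)

# Hypothetical summary statistics (heights in cm), not real data.
x_bar, n = 175.0, 40
y_bar, m = 162.0, 60

combined = weighted_mean(x_bar, n, y_bar, m)
print(combined)
```

The result always lies between the two group means and sits closer to the mean of the larger sample.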
Given this information, I am now interested in estimating the Variance of the Weighted Mean (of the Population). So far, I have found two approaches to do this, and I am not sure which of them is better suited.
Approach 1: We simply use the rules of Expectations and Variances (treating the two samples as independent) to calculate the Variance of the Weighted Mean:
\begin{equation} Var(\bar{x}_{weighted}) = \left(\frac{1}{n+m-1}\right)^2 \left(n^2 Var(\bar{x}) + m^2 Var(\bar{y})\right) = \left(\frac{1}{n+m-1}\right)^2 \left(n^2 s_1^2 + m^2 s_2^2\right) \end{equation}
Approach 2: In Approach 2, the formula for the Variance of the Weighted Mean comes from here (https://en.wikipedia.org/wiki/Pooled_variance#Sample-based_statistics, https://wikimedia.org/api/rest_v1/media/math/render/svg/0224c1c53591c619794682f2bc3560dc86530e2b). Below, I try my best to derive this formula myself:
Let's hypothetically assume that we have access to $x_1, x_2, \dots, x_n , y_1, y_2, \dots, y_m$ - then in this case, we could say that $\sum x_i = n \bar{x} \text{ and } \sum y_i = m \bar{y} $
From first principles, we also know that $Var(\bar{x}) = \frac{1}{n} Var(x) \implies Var(x) = n Var(\bar{x})$ and $Var(\bar{y}) = \frac{1}{m} Var(y) \implies Var(y) = m Var(\bar{y})$
Using both these facts, we can write: \begin{align*} Var(x) &= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n} \\ &= \frac{\sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2)}{n} \\ &= \frac{\sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2}{n} \\ &= \frac{\sum_{i=1}^n x_i^2 - n\bar{x}^2}{n} \\ n Var(x) &= \sum_{i=1}^n x_i^2 - n\bar{x}^2 \\ \sum_{i=1}^n x_i^2 &= n Var(x) + n\bar{x}^2 \end{align*}
By the same logic for $Y$: $$\sum_{i=1}^m y_i^2 = m Var(y) + m\bar{y}^2$$
We can extend these to the means: \begin{align*} \sum_{i=1}^n x_i^2 &= n(nVar(\bar{x})) + n\bar{x}^2 = n^2s_1^2 + n\bar{x}^2 \\ \sum_{i=1}^m y_i^2 &= m(mVar(\bar{y})) + m\bar{y}^2 = m^2s_2^2 + m\bar{y}^2 \end{align*}
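The identity for $\sum x_i^2$ can be checked numerically on made-up data. This is only a sketch: it assumes, as in the derivation above, that $Var(x)$ is the divide-by-$n$ variance (NumPy's default `ddof=0`) and that $s_1^2 = Var(x)/n$. The data values are arbitrary.

```python
import numpy as np

# Arbitrary illustrative observations (normally these would not be available).
x = np.array([170.0, 182.0, 176.0, 168.0, 179.0])
n = x.size
x_bar = x.mean()

var_x = x.var()      # divide-by-n variance, matching the derivation
s1_sq = var_x / n    # variance of the sample mean, Var(x_bar) = Var(x) / n

lhs = np.sum(x ** 2)                 # sum of squared observations
rhs = n**2 * s1_sq + n * x_bar**2    # reconstruction from summary statistics
print(np.isclose(lhs, rhs))
```

Note that the identity is exact only with the divide-by-$n$ variance; with the $n-1$ (Bessel-corrected) variance the two sides would differ.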
Now, if we were to treat both populations $X$ and $Y$ as a single population,
i.e. $X \cup Y = x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_m$, then we could write the Weighted Mean as (same as before):
$$ \bar{x}_{weighted} = \frac{\bar{x}n + \bar{y}m}{n+m} = \frac{n\left(\frac{x_1+x_2+\dots+x_n}{n}\right) + m\left(\frac{y_1+y_2+\dots+y_m}{m}\right)}{n+m} = \frac{x_1 + x_2 + \dots + x_n + y_1 + y_2 + \dots + y_m}{n+m} $$
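A small check of this equality on made-up data (the arrays are arbitrary and only stand in for the unobserved $x_i$ and $y_i$):

```python
import numpy as np

# Arbitrary illustrative observations for the two groups.
x = np.array([170.0, 182.0, 176.0, 168.0, 179.0])
y = np.array([160.0, 165.0, 158.0, 171.0])
n, m = x.size, y.size

# Weighted mean built from the two group means and sample sizes.
weighted = (x.mean() * n + y.mean() * m) / (n + m)

# Plain mean of the combined data X ∪ Y.
combined_mean = np.concatenate([x, y]).mean()

print(np.isclose(weighted, combined_mean))
```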
- Using this logic, I can write the following relations:
\begin{align*} Data &= X \cup Y = x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_m \\ Var(Data) &= \frac{\sum_{i=1}^n (x_i - \bar{x}_{weighted})^2 + \sum_{i=1}^m (y_i - \bar{x}_{weighted})^2}{n+m-1} \\ Var(\bar{x}_{weighted}) = Var(\overline{Data}) &= \frac{Var(Data)}{n+m-1} = \frac{\sum_{i=1}^n (x_i - \bar{x}_{weighted})^2 + \sum_{i=1}^m (y_i - \bar{x}_{weighted})^2}{(n+m-1)^2} \end{align*}
- This means that the Variance of the Weighted Mean can be written as:
\begin{align} Var(\bar{x}_{weighted}) &= \frac{\sum_{i=1}^n (x_i - \bar{x}_{weighted})^2 + \sum_{i=1}^m (y_i - \bar{x}_{weighted})^2}{(n+m-1)^2} \\ &= \frac{\sum (x_i^2 - 2\bar{x}_{weighted}x_i + \bar{x}_{weighted}^2) + \sum (y_i^2 - 2\bar{x}_{weighted}y_i + \bar{x}_{weighted}^2)}{(n+m-1)^2} \\ &= \frac{\sum x_i^2 - 2\bar{x}_{weighted}\sum x_i + n\bar{x}_{weighted}^2 + \sum y_i^2 - 2\bar{x}_{weighted}\sum y_i + m\bar{x}_{weighted}^2}{(n+m-1)^2} \\ &= \frac{n^2 s_1^2 + n\bar{x}^2 - 2\bar{x}_{weighted} n \bar{x} + n\bar{x}_{weighted}^2 + m^2 s_2^2 + m\bar{y}^2 - 2\bar{x}_{weighted} m \bar{y} + m\bar{x}_{weighted}^2}{(n+m-1)^2} \end{align}
- And after simplification (using $n(\bar{x} - \bar{x}_{weighted})^2 + m(\bar{y} - \bar{x}_{weighted})^2 = \frac{nm}{n+m}(\bar{x} - \bar{y})^2$), we can write:
$$Var(\bar{x}_{weighted}) = \frac{n^2}{(n+m-1)^2}s_1^2 + \frac{m^2}{(n+m-1)^2}s_2^2 + \frac{nm}{(n+m)(n+m-1)^2} (\bar{x} - \bar{y})^2$$
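The sum-of-squares decomposition behind this result can be checked numerically on made-up data: the total squared deviation about $\bar{x}_{weighted}$ splits into the within-group parts $n^2 s_1^2 + m^2 s_2^2$ and a between-group part $\frac{nm}{n+m}(\bar{x} - \bar{y})^2$. This is a sketch under the same assumptions as the derivation, namely $s_1^2 = Var(x)/n$ and $s_2^2 = Var(y)/m$ with divide-by-$n$ variances. The arrays are arbitrary.

```python
import numpy as np

# Arbitrary illustrative observations for the two groups.
x = np.array([170.0, 182.0, 176.0, 168.0, 179.0])
y = np.array([160.0, 165.0, 158.0, 171.0])
n, m = x.size, y.size
x_bar, y_bar = x.mean(), y.mean()

# Variances of the sample means, using divide-by-n variances (ddof=0).
s1_sq, s2_sq = x.var() / n, y.var() / m

# Weighted mean of the combined data.
w = (n * x_bar + m * y_bar) / (n + m)

# Total squared deviation about the weighted mean, from the raw data.
total_ss = np.sum((x - w) ** 2) + np.sum((y - w) ** 2)

# The same quantity rebuilt from summary statistics only.
from_summaries = (n**2 * s1_sq + m**2 * s2_sq
                  + n * m / (n + m) * (x_bar - y_bar) ** 2)

# Divide by (n + m - 1)^2, following the derivation in the text.
var_direct = total_ss / (n + m - 1) ** 2
var_summary = from_summaries / (n + m - 1) ** 2
print(np.isclose(var_direct, var_summary))
```

The between-group term is what distinguishes the two approaches: it vanishes only when $\bar{x} = \bar{y}$.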
My Question: As we can see here, Approach 1 and Approach 2 seem to provide different answers on how to calculate the Pooled Variance. Are there any advantages to using one of these approaches over the other? Can we mathematically prove that Approach 2 performs better than Approach 1 in certain situations (e.g. unbiased, consistent, etc.)?
Thanks!
Note: Approach 2 can be extended to the general case https://wikimedia.org/api/rest_v1/media/math/render/svg/0224c1c53591c619794682f2bc3560dc86530e2b