Suppose you have two samples from two populations $X$ and $Y$ (e.g. heights of males in the USA and heights of females in the USA).

You are only given:

  • the sample means of each population $\bar{x}$ and $\bar{y}$.
  • the sample sizes available from each population $n$ and $m$
  • the sample variances of both sample means $Var(\bar{x}) = s_1^2$ and $Var(\bar{y}) = s_2^2$.
  • you are NOT given the individual observations: $x_1, x_2, \dots, x_n , y_1, y_2, \dots, y_m$

Based on this information, you are interested in estimating the "Weighted Mean" of the entire population (i.e. heights of all people in the USA). From "first principles", you can estimate the Weighted Mean (https://en.wikipedia.org/wiki/Weighted_arithmetic_mean) like this:

\begin{equation} \bar{x}_{weighted} = \frac{\bar{x}n + \bar{y}m}{n+m} \end{equation}
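As a minimal sketch of the formula above, the weighted mean can be computed from the summary statistics alone. The function name `weighted_mean` and the input values (500 males with mean height 1.77 m, 200 females with mean height 1.63 m) are illustrative assumptions, not data from a real survey.

```python
def weighted_mean(xbar: float, n: int, ybar: float, m: int) -> float:
    """Combine two sample means, weighting each by its sample size."""
    if n <= 0 or m <= 0:
        raise ValueError("sample sizes must be positive")
    return (xbar * n + ybar * m) / (n + m)


# Illustrative inputs only (hypothetical sample sizes and means).
combined = weighted_mean(xbar=1.77, n=500, ybar=1.63, m=200)
print(round(combined, 4))
```

Note that this weights by sample size, exactly as the formula does; it does not use the actual sizes of the two populations.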

Given this information, I am now interested in estimating the Variance of the Weighted Mean (of the Population). So far I have found two approaches to do this, and I am not sure which of them is better suited.

Approach 1: In Approach 1, we simply use the rules of Expectations and Variances to calculate the Variance of the Weighted Mean:

\begin{equation} Var(\bar{x}_{weighted}) = \left(\frac{1}{n+m-1}\right)^2 \left(n^2 Var(\bar{x}) + m^2 Var(\bar{y})\right) = \left(\frac{1}{n+m-1}\right)^2 \left(n^2 s_1^2 + m^2 s_2^2\right) \end{equation}

Approach 2: In Approach 2, the formula for the Variance of the Weighted Mean comes from here (https://en.wikipedia.org/wiki/Pooled_variance#Sample-based_statistics, https://wikimedia.org/api/rest_v1/media/math/render/svg/0224c1c53591c619794682f2bc3560dc86530e2b). Below, I try my best to derive this formula myself:

  • Let's hypothetically assume that we have access to $x_1, x_2, \dots, x_n , y_1, y_2, \dots, y_m$ - then in this case, we could say that $\sum x_i = n \bar{x} \text{ and } \sum y_i = m \bar{y} $

  • From first principles, we also know that $Var(\bar{x}) = \frac{1}{n} Var(x) \implies Var(x) = n Var(\bar{x})$ and $Var(\bar{y}) = \frac{1}{m} Var(y) \implies Var(y) = m Var(\bar{y})$

  • Using both these facts, we can write: \begin{align*} Var(x) &= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n} \\ &= \frac{\sum_{i=1}^n (x_i^2 - 2x_i\bar{x} + \bar{x}^2)}{n} \\ &= \frac{\sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2}{n} \\ &= \frac{\sum_{i=1}^n x_i^2 - n\bar{x}^2}{n} \\ n Var(x) &= \sum_{i=1}^n x_i^2 - n\bar{x}^2 \\ \sum_{i=1}^n x_i^2 &= n Var(x) + n\bar{x}^2 \end{align*}

  • Using the above logic: $\sum_{i=1}^m y_i^2 = m Var(y) + m\bar{y}^2$

  • We can extend these to the means: \begin{align*} \sum_{i=1}^n x_i^2 &= n(nVar(\bar{x})) + n\bar{x}^2 = n^2s_1^2 + n\bar{x}^2 \\ \sum_{i=1}^m y_i^2 &= m(mVar(\bar{y})) + m\bar{y}^2 = m^2s_2^2 + m\bar{y}^2 \end{align*}

  • Now, if we were to treat both populations $X$ and $Y$ as a single population,
    i.e. $X \cup Y = x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_m$, then we could write the Weighted Mean as (same as before):

$$ \bar{x}_{weighted} = \frac{\bar{x}n + \bar{y}m}{n+m} = \frac{n\left(\frac{x_1+x_2+\dots+x_n}{n}\right) + m\left(\frac{y_1+y_2+\dots+y_m}{m}\right)}{n+m} = \frac{x_1 + x_2 + \dots + x_n + y_1 + y_2 + \dots + y_m}{n+m} $$

  • Using this logic, I can write the following relations:

\begin{align*} Data &= X \cup Y = x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_m \\ Var(Data) &= \frac{\sum_{i=1}^n (x_i - \bar{x}_{weighted})^2 + \sum_{i=1}^m (y_i - \bar{x}_{weighted})^2}{n+m-1} \\ Var(\bar{x}_{weighted}) = Var(\overline{Data}) &= \frac{Var(Data)}{n+m-1} = \frac{\sum_{i=1}^n (x_i - \bar{x}_{weighted})^2 + \sum_{i=1}^m (y_i - \bar{x}_{weighted})^2}{(n+m-1)^2} \end{align*}

  • This means that the Variance of the Weighted Mean can be written as:

\begin{align} Var(\bar{x}_{weighted}) &= \frac{\sum_{i=1}^n (x_i - \bar{x}_{weighted})^2 + \sum_{i=1}^m (y_i - \bar{x}_{weighted})^2}{(n+m-1)^2} \\ &= \frac{\sum (x_i^2 - 2\bar{x}_{weighted}x_i + \bar{x}_{weighted}^2) + \sum (y_i^2 - 2\bar{x}_{weighted}y_i + \bar{x}_{weighted}^2)}{(n+m-1)^2} \\ &= \frac{\sum x_i^2 - 2\bar{x}_{weighted}\sum x_i + n\bar{x}_{weighted}^2 + \sum y_i^2 - 2\bar{x}_{weighted}\sum y_i + m\bar{x}_{weighted}^2}{(n+m-1)^2} \\ &= \frac{n^2 s_1^2 + n\bar{x}^2 - 2\bar{x}_{weighted} n \bar{x} + n\bar{x}_{weighted}^2 + m^2 s_2^2 + m\bar{y}^2 - 2\bar{x}_{weighted} m \bar{y} + m\bar{x}_{weighted}^2}{(n+m-1)^2} \end{align}

  • And after much simplification, we can write:

$$Var(\bar{x}_{weighted}) = \frac{n^2}{(n+m-1)^2}s_1^2 + \frac{m^2}{(n+m-1)^2}s_2^2 + \frac{nm}{(n+m)(n+m-1)^2} (\bar{x} - \bar{y})^2$$
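The simplification step can be sketched as follows. This assumes only the definition $\bar{x}_{weighted} = \frac{\bar{x}n + \bar{y}m}{n+m}$ given earlier and concerns just the terms of the numerator that involve the means, so it does not depend on the choice of denominator:

\begin{align*} \bar{x} - \bar{x}_{weighted} &= \frac{m(\bar{x} - \bar{y})}{n+m}, \qquad \bar{y} - \bar{x}_{weighted} = \frac{n(\bar{y} - \bar{x})}{n+m} \\ n\bar{x}^2 - 2\bar{x}_{weighted} n \bar{x} + n\bar{x}_{weighted}^2 &= n(\bar{x} - \bar{x}_{weighted})^2, \qquad m\bar{y}^2 - 2\bar{x}_{weighted} m \bar{y} + m\bar{x}_{weighted}^2 = m(\bar{y} - \bar{x}_{weighted})^2 \\ n(\bar{x} - \bar{x}_{weighted})^2 + m(\bar{y} - \bar{x}_{weighted})^2 &= \frac{nm^2 + mn^2}{(n+m)^2}(\bar{x} - \bar{y})^2 = \frac{nm}{n+m}(\bar{x} - \bar{y})^2 \end{align*}

This identity is where the $(\bar{x} - \bar{y})^2$ term in Approach 2 comes from.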

My Question: As we can see here, Approach 1 and Approach 2 seem to provide different answers on how to calculate the Pooled Variance. Are there any advantages to using one of these approaches over the other? Can we mathematically prove that Approach 2 performs better than Approach 1 in certain situations (e.g. unbiased, consistent, etc.)?
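To see how far apart the two approaches are numerically, here is a rough sketch that evaluates both from summary statistics. The function names and all input values are hypothetical, chosen only for illustration. Approach 1 is transcribed as written in the question, including its $n+m-1$ denominator. Approach 2 is computed from the expanded numerator of the derivation, using $\sum x_i^2 = n^2 s_1^2 + n\bar{x}^2$, rather than from any simplified closed form.

```python
def weighted_mean(xbar, n, ybar, m):
    """Sample-size-weighted mean of the two sample means."""
    return (xbar * n + ybar * m) / (n + m)


def approach1(n, m, s1_sq, s2_sq):
    """Approach 1 as written in the question: variance terms only."""
    return (n**2 * s1_sq + m**2 * s2_sq) / (n + m - 1) ** 2


def approach2(n, m, xbar, ybar, s1_sq, s2_sq):
    """Approach 2: expanded numerator over (n + m - 1)^2."""
    w = weighted_mean(xbar, n, ybar, m)
    # sum((x_i - w)^2) = n^2 s1^2 + n (xbar - w)^2, and likewise for y.
    numerator = (n**2 * s1_sq + n * (xbar - w) ** 2
                 + m**2 * s2_sq + m * (ybar - w) ** 2)
    return numerator / (n + m - 1) ** 2


# Hypothetical inputs: s1_sq and s2_sq stand for Var(xbar) and Var(ybar).
n, m = 500, 200
xbar, ybar = 1.77, 1.63
s1_sq, s2_sq = 1e-5, 2e-5

v1 = approach1(n, m, s1_sq, s2_sq)
v2 = approach2(n, m, xbar, ybar, s1_sq, s2_sq)
print("Approach 1:", v1)
print("Approach 2:", v2)
print("Gap (depends only on n, m and the difference in means):", v2 - v1)
```

Under these assumptions the gap vanishes when $\bar{x} = \bar{y}$ and grows with the squared difference between the two sample means.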

Thanks!

Note: Approach 2 can be extended to the general case https://wikimedia.org/api/rest_v1/media/math/render/svg/0224c1c53591c619794682f2bc3560dc86530e2b

stats_noob
  • Are you trying to compute the variance of the total population, or the variance of the sample mean of the total population? – leonbloy May 17 '23 at 16:58
  • @leonbloy: Thank you for your reply! I am trying to find the variance of the mean of the population. – stats_noob May 17 '23 at 18:00
  • I ask because in your first approach $s_1^2$ seems to be the variance of the individual sample mean, but in the other, $s_1^2$ seems to be the variance of the population – leonbloy May 17 '23 at 18:24
  • @leonbloy: Please see the updated/edited version. Thank you so much! – stats_noob May 18 '23 at 06:40
  • The exact formula is as you derived in Approach 2. In Approach 1, there is an implied assumption that the within-group population means are equal; i.e., $\mu_X = \mu_Y$. When this is not true, the estimator in the first approach will be biased. – heropup May 18 '23 at 09:00
  • @heropup: Thank you so much for your reply! Do you know how to show these proofs? If you have time, can you please vote to reopen my question and show the proofs? Thank you so much! – stats_noob May 18 '23 at 13:42
  • I edited my question - I am clarifying that I am interested in mathematically comparing the properties of Approach 1 vs Approach 2. This point is not addressed in the alleged duplicate question. Thanks! – stats_noob May 19 '23 at 15:07
  • The difference between the two formulas (literally the difference: subtract the first formula from the second formula) is $\frac{n}{(n+m)^2} \cdot \frac{m}{(n+m)^2} (\bar{x} - \bar{y})^2$. But there seem to be several errors in both formulas. Compare your Approach 2 with the Wikipedia formula. – David K May 19 '23 at 18:03
  • I think the difficulty in answering this question is that nobody is quite sure what you're trying to say. There is a lot of talk about sample mean and sample variance, but then the formulas that are applied are ones designed for populations, not samples. For example, to get the mean height of all adults in the U.S., you would want to weight the mean male height by the number of males in the population, not by the size of the sample. Then there is the peculiar wording, "sample variances of both sample means", which I cannot make sense of in this context. – David K May 22 '23 at 00:06
  • Are you looking for the standard error of the estimated population mean? That's a well-defined concept and it's different from estimating the variance of heights in the population. Is standard error of the mean also what a "sample variance of a sample mean" is? Why is the formula for $Var(x)$ written with $n$ rather than $n-1$ in the denominator? – David K May 22 '23 at 00:08
  • The last couple of edits addressed none of the points in my last two comments. You changed some variable names and subtracted $1$ from $n + m$ in a few places, which was something I had not yet gotten around to asking about because I wanted to get clarity on the other points first. I literally asked only about the formula for $Var(x)$ (which is the same as before) and several points in the text you wrote before any of the equations. But I notice that you still have not matched the Wikipedia formula you were supposedly deriving. – David K Jun 02 '23 at 11:26
  • Consider this example: you have a sample of $n=500$ U.S. males with mean height $1.77$ meters and a sample of $m=200$ U.S. females with mean height $1.63$ meters. What is your estimate of the mean height of the U.S. population, and how do you use your formulas to get that result? – David K Jun 03 '23 at 17:50
