What do we need to know about the heights of men and women to conclude that John is likely to be taller than Kate?

Question

UPDATE:

Changed IQ to height to make the question less controversial.

Suppose we read an article that says men have a higher mean height than women (the distributions of both populations are approximately normal and relatively unskewed). We would be tempted to conclude that a randomly chosen man, John (whose height we don't know), is likely (i.e., with more than a 50% chance) to be taller than a randomly selected woman, Kate (whose height we don't know either). But there is a problem: I don't see how to reach that conclusion mathematically (or, alternatively, how to prove that it is false). It feels like some important details are missing. What additional information do we need to conclude that John is likely to be taller than Kate?

The distributions are relatively unskewed, so we can rule out the possibility that either a minority of very tall men or a minority of very short women drastically influences the mean of the respective population.

The simplest case would be if we knew that every man is taller than every woman. Then the mean height of men would be higher than the mean height of women. So that is one possible answer to my question: if we knew that even the shortest man is taller than the tallest woman, we could conclude that John is certainly taller than Kate (no probability involved).

But in less clear-cut circumstances (i.e., when it is NOT true that every man is taller than every woman), what do we need in order to reasonably conclude that John is probably taller than Kate? We could try to say something like "John is likely to be taller than Kate because 51% of men are taller than 51% of women". But this approach looks dubious on closer examination, because there are different ways to form those 51% groups. We could pick the tallest available men for the 51% of men and the shortest women for the 51% of women. In that case we could say that 51% of men are taller than 51% of women EVEN IF the two populations are literally identical! Thus the conclusion "John is likely to be taller than Kate" would NOT follow from the premise "51% of men are taller than 51% of women".
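The selection effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the mean of 170 cm and standard deviation of 8 cm are made-up values, both groups are drawn from the same distribution, and "51% of men are taller than 51% of women" is read as a one-to-one pairing of the hand-picked groups (under a strict every-man-versus-every-woman reading the two 51% groups would overlap slightly around the median).

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical setup: both groups come from the SAME distribution
# (mean 170 cm and standard deviation 8 cm are made-up values).
N = 10_000
men = sorted(rng.gauss(170.0, 8.0) for _ in range(N))
women = sorted(rng.gauss(170.0, 8.0) for _ in range(N))

k = N * 51 // 100
tallest_men = men[-k:]       # tallest 51% of men, in ascending order
shortest_women = women[:k]   # shortest 51% of women, in ascending order

# Pair the i-th man of one group with the i-th woman of the other.
pairs_man_taller = sum(m > w for m, w in zip(tallest_men, shortest_women))

# Chance that a randomly drawn man is taller than a randomly drawn woman.
trials = 100_000
random_pair_rate = sum(
    rng.choice(men) > rng.choice(women) for _ in range(trials)
) / trials

print(f"{pairs_man_taller} of {k} hand-picked pairs have the man taller")
print(f"random pairs with the man taller: {random_pair_rate:.3f}")
```

The hand-picked pairs all favour the man, while the rate for random pairs stays near one half, so the "51%" premise by itself says nothing about John and Kate.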

P.S. I have found a video on Khan Academy that explains how to calculate the probability that a normal random variable W is greater than a normal random variable M: https://www.khanacademy.org/math/ap-statistics/random-variables-ap/combining-random-variables/v/analyzing-the-difference-in-distributions?modal=1

This seems like a reasonable question about statistics and interpreting probability, but I just want to caution that the motivating example of comparing intelligence of males and females is prone to causing controversy (which doesn't appear to be your intent). If you could perhaps abstract your question, or replace this intelligence comparison with a different analogous situation, that would be appreciated. :-) — Theo Bendit, May 17 '19 at 04:39
@TheoBendit I chose IQ because IQ has normal distribution by design. Probably I could just add warning that my example is fictional and that IQ isn't proper measure of intelligence. — KarmaPeasant, May 17 '19 at 04:44
While it's interesting, I agree you should make this more math-like. For instance, some confidence interval for the difference between two sample populations' distributions. — , May 17 '19 at 04:46
@Shogun Sorry, but difficulty with proper formalization of the problem is why I posted it here. Otherwise I would just formalize and solve it myself. It would help if you explained how a "confidence interval for the difference between two sample populations distributions" would be useful here. Personally I haven't worked with such confidence intervals yet. — KarmaPeasant, May 17 '19 at 04:52
Why not replace intelligence here with height? It seems like your question would then be mathematically the same, without having to make a caveat about whether we can measure intelligence and other potentially problematic claims. — Theoretical Economist, May 17 '19 at 07:12
Otherwise, instead of asking who is smarter, why not ask who has the higher IQ. This allows you to sidestep some of the non-mathematical issues with your question. Though comparing heights is probably better. (Suppose you can’t compare their heights visually.) — Theoretical Economist, May 17 '19 at 07:15

angryavian · Accepted Answer · 2019-05-17T16:52:29.350

1

If $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ are independent, then their difference follows the distribution $X_1 - X_2 \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$. You can then compute $P(X_1 > X_2) = \Phi\left(\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2+\sigma_2^2}}\right)$ where $\Phi$ is the CDF of the standard normal distribution. When $\mu_1 > \mu_2$, the above probability can be anything strictly between $0.5$ and $1$, depending on the values of $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$.

Edit: Since $X_1 - X_2 \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$ we know $Z := \frac{(X_1 - X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 + \sigma_2^2}} \sim N(0, 1)$. So $$P(X_1 > X_2) = P(X_1 - X_2 > 0) = P\left(Z > - \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2+\sigma_2^2}}\right) = \Phi\left(\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2+\sigma_2^2}}\right).$$
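As a rough numerical illustration of the formula above, the following sketch evaluates $\Phi$ through the error function and compares the closed form with a simulation. The means and standard deviations are hypothetical values chosen for illustration, not real survey data, and the function names are my own.

```python
import math
import random

def normal_cdf(z: float) -> float:
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_first_exceeds(mu1, sigma1, mu2, sigma2):
    """P(X1 > X2) for independent X1 ~ N(mu1, sigma1^2), X2 ~ N(mu2, sigma2^2)."""
    # math.hypot(s1, s2) is sqrt(s1^2 + s2^2), the std. dev. of X1 - X2.
    return normal_cdf((mu1 - mu2) / math.hypot(sigma1, sigma2))

# Hypothetical parameters in cm (illustrative only).
MU_M, SD_M = 175.0, 7.0
MU_W, SD_W = 162.0, 6.5

exact = prob_first_exceeds(MU_M, SD_M, MU_W, SD_W)

# Monte Carlo check: draw independent pairs and count how often X1 > X2.
rng = random.Random(0)
n = 200_000
hits = sum(rng.gauss(MU_M, SD_M) > rng.gauss(MU_W, SD_W) for _ in range(n))
estimate = hits / n

print(f"closed form: {exact:.4f}, simulation: {estimate:.4f}")

# Close means with a large spread push the probability down toward 0.5.
near_half = prob_first_exceeds(100.1, 15.0, 100.0, 15.0)
print(f"close means, large spread: {near_half:.4f}")
```

The second call shows the lower end of the range: with nearly equal means and large variances the probability stays above $0.5$ but only barely.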

One crucial bit of information that is missing from your scenario is how the two people are chosen. John and Kate are two people with fixed heights, so talking about the "chance" that one's height is larger than the other's does not make sense: it either is or it isn't.

However, if you chose John uniformly at random from the population of men, and the distribution of men's height is $N(\mu_1, \sigma_1^2)$, then John's height can be viewed as a normal random variable. Similarly for Kate's height if she is chosen uniformly at random from the population of women. If these two choices are made independently, then you may use the above computation.

In real life, I seriously doubt you have the opportunity to get a uniformly chosen person from the population, so I would not think the above computation would be particularly applicable in your height scenario.

edited May 17 '19 at 16:52

answered May 17 '19 at 06:03

angryavian

What is "uniform random sample"? – KarmaPeasant May 17 '19 at 06:10
@user161005 Each man was equally likely to be chosen. (Then the distribution of IQs is a discrete distribution, so it can't be normal, but if it is approximately normal, then the calculation at the beginning of my post is approximate.) – angryavian May 17 '19 at 06:18
How is it different from simple random sample that has size of one? – KarmaPeasant May 17 '19 at 06:19
@user161005 It's the same. That is what I meant. – angryavian May 17 '19 at 06:20
Also, where did you get $$P(X_1 > X_2) = \Phi(\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2+\sigma_2^2}})$$ ? – KarmaPeasant May 17 '19 at 06:20
@user161005 See my edit. – angryavian May 17 '19 at 06:23
Why "their difference follows the distribution $$X_1 - X_2 \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$$" ? – KarmaPeasant May 17 '19 at 06:24
@user161005 https://math.stackexchange.com/questions/961765/why-are-linear-combinations-of-independent-standard-normal-random-variables-also – angryavian May 17 '19 at 06:30
I have replaced IQ with height, please edit your answer accordingly. – KarmaPeasant May 17 '19 at 07:50
"above probability can be between 0.5 and 1". Can you give me example of situation when it can be 0.5? – KarmaPeasant May 19 '19 at 06:49
@user161005 When $\mu_1$ and $\mu_2$ are close to each other, and $\sigma_1^2$ and $\sigma_2^2$ are large. – angryavian May 19 '19 at 19:51

score 0 · Answer 2 · answered May 18 '19 at 02:26

A more general solution...

Let $X =$ the height of a "random" man, and $Y =$ the height of a "random" woman. Assume $X,Y$ are continuous random variables, independent, and $E[X] - E[Y] = \delta > 0$. Under what conditions can we conclude $P(X>Y) > 1/2$?

The case of two Gaussians has been solved exactly by @angryavian, based on the fact that the difference of two Gaussians is a Gaussian. However, I think the result holds more generally.

Theorem: Suppose (i) both $X$ and $Y$ are symmetric about their respective means, and (ii) both $X$ and $Y$ have continuous support on which the PDF is strictly $>0$. (This allows Gaussian, triangular, uniform, etc., and also allows $X,Y$ to be of different "types".) Then $P(X > Y) > 1/2$.

Proof: Let $U = X - E[X], V = Y - E[Y]$ and let $p_U, p_V$ be the PDFs. Then symmetry means $p_U(a) = p_U(-a), p_V(b) = p_V(-b)$ for any $a,b \in \mathbb{R}$.

We will first show $P(U>V) = P(U<V)$. Consider the joint distribution $p_{UV}$ and we have:

$p_{UV}(a, b) = p_U(a) p_V(b) = p_U(-a) p_V(-b) = p_{UV}(-a, -b)$
$P(U > V) = \int_{b\in \mathbb{R}} \int_{a>b} p_{UV}(a,b) \, da \, db$
$P(U < V) = \int_{b\in \mathbb{R}} \int_{a<b} p_{UV}(a,b) \, da \, db= \int_{-b\in \mathbb{R}} \int_{-a > -b} p_{UV}(-a, -b) \, d(-a) \, d(-b) = P(U>V)$

Back to the main result:

$X < Y \iff U + E[X] < V+ E[Y] \iff U-V < E[Y] - E[X] = -\delta < 0$
Since $X,Y$ have continuous support where PDF $>0$, so do $U, V$ and $U-V$. Therefore, $-\delta < 0 \implies P(U-V \in (-\delta, 0)) =\epsilon > 0$
Finally $P(X<Y) = P(U - V < -\delta) = P(U-V<0) - \epsilon < 1/2$. QED
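A quick simulation can serve as a sanity check of the theorem when $X$ and $Y$ are of different types. This is a sketch under assumed, made-up parameters: $X$ is uniform on $[160, 180]$ (mean 170) and $Y$ is a symmetric triangular variable on $[155, 183]$ (mean 169), so $\delta = 1$.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

def sample_x() -> float:
    # Uniform on [160, 180]: symmetric about its mean, 170.
    return rng.uniform(160.0, 180.0)

def sample_y() -> float:
    # Triangular on [155, 183] with mode 169: symmetric about its mean, 169.
    return rng.triangular(155.0, 183.0, 169.0)

n = 200_000
estimate = sum(sample_x() > sample_y() for _ in range(n)) / n
print(f"estimated P(X > Y) = {estimate:.3f}")
```

Both distributions satisfy conditions (i) and (ii), and the estimate lands above $1/2$, in line with the theorem. A simulation like this illustrates the claim but of course does not prove it.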

Further note: Condition (ii) is needed. Without it, here is a counter-example:

$X =$ uniform in the discontinuous support $[50, 60] \cup [70, 80]$
$Y =$ uniform in the range $[63,65]$
$E[X] = 65 > E[Y] = 64$ and yet $P(X>Y) = P(X \in [70,80]) = 1/2$, i.e. $\not> 1/2$.
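The numbers in this counter-example can be checked with exact arithmetic. The sketch below relies on the fact that every piece of $X$'s support lies entirely above or entirely below the range of $Y$, so $X > Y$ happens exactly when $X$ falls in a piece above $65$; the variable names are my own.

```python
from fractions import Fraction

# X is uniform on the union of these intervals; Y is uniform on [63, 65].
pieces = [(50, 60), (70, 80)]
y_low, y_high = 63, 65

total_len = sum(b - a for a, b in pieces)

# Mean of X: weight each piece by its share of the total length.
mean_x = sum(Fraction(b - a, total_len) * Fraction(a + b, 2) for a, b in pieces)
mean_y = Fraction(y_low + y_high, 2)

# No piece overlaps [63, 65], so X > Y exactly when X is in a piece above 65.
p_x_gt_y = sum(Fraction(b - a, total_len) for a, b in pieces if a >= y_high)

print(mean_x, mean_y, p_x_gt_y)  # prints: 65 64 1/2
```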
