I am currently studying statistics but am looking at Simpson's paradox, which makes use of a number theory claim which I do not dispute. The paradox is described more as an example, and I will describe it below:

Assume we have a sample space $\Omega$ that is discrete with 100 elements. Let there be events A and B within $\Omega$ such that $A \cup B = \Omega$. So $A$ and $B$ are a partition.

Consider a different partition $C$, $D$ so again $C = D^C$. We could have the following probabilities:

$P(A \cap C) = 0.25, P(A \cap D) = 0.25, P(B \cap C) = 0.28, P(B \cap D) = 0.22 \hspace{2mm}(Equ. 1)$.

You can see this as four groups with some proportion of the whole sample space. According to the example, consider a third partition $E,F$ with the following probabilities:

$P(A \cap C \cap E) = 0.15, P(A \cap D \cap E) = 0.22, P(B \cap C \cap E) = 0.05, P(B \cap D \cap E) = 0.08 \hspace{2mm}(Equ. 2)$.

$P(A \cap C \cap F) = 0.10, P(A \cap D \cap F) = 0.03, P(B \cap C \cap F) = 0.23, P(B \cap D \cap F) = 0.14 \hspace{2mm}(Equ. 3)$.

With these numbers, we assume $A$ is the event of elements that are 'red' and $B$ the event of elements that are 'blue'. Looking at $(Equ. 1)$, we see $50\%$ of red elements are in $C$, and $56\%$ of blue elements are in $C$.

Looking at $(Equ. 2)$, $41\%$ of red elements are in $C$, and $38\%$ of blue elements are in $C$. So conditional on event $E$, a larger proportion of red elements than blue elements is in $C$, despite the overall sample space showing the reverse.

The same is seen with $(Equ. 3)$, where again the proportion in $C$ is larger for red than for blue. This seems misleading intuitively because we expect the ordering of proportions in a population not to reverse depending on how we 'slice' or condition the sample space.
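The percentages quoted above can be re-derived from the twelve cell probabilities. This is only a rough sketch: the dictionary layout and the helper name `share_in_C` are my own illustrative choices, and I am treating the entries of $(Equ. 2)$ and $(Equ. 3)$ as the cell masses inside $E$ and $F$ (they add up, cell by cell, to $(Equ. 1)$).

```python
# Cell probabilities copied from Equ. 1-3; keys are (colour, C-or-D).
# Assumption: red = A, blue = B, as in the text above.
overall = {("red", "C"): 0.25, ("red", "D"): 0.25,
           ("blue", "C"): 0.28, ("blue", "D"): 0.22}
within_E = {("red", "C"): 0.15, ("red", "D"): 0.22,
            ("blue", "C"): 0.05, ("blue", "D"): 0.08}
within_F = {("red", "C"): 0.10, ("red", "D"): 0.03,
            ("blue", "C"): 0.23, ("blue", "D"): 0.14}


def share_in_C(cells, colour):
    """Proportion of the given colour that falls in C (hypothetical helper)."""
    in_c = cells[(colour, "C")]
    return in_c / (in_c + cells[(colour, "D")])


for name, cells in [("overall", overall), ("E", within_E), ("F", within_F)]:
    red = share_in_C(cells, "red")
    blue = share_in_C(cells, "blue")
    print(f"{name}: red in C {red:.1%}, blue in C {blue:.1%}")
```

Red comes out ahead of blue inside $E$ and inside $F$, but behind blue overall, which is the reversal described above.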

The number theory inequality that explains the paradox is:

$\frac{A}{B} > \frac{a}{b}$ and $\frac{C}{D} > \frac{c}{d}$ does not imply $\frac{A + C}{B + D} > \frac{a + c}{b + d}$.
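As a concrete instance (my own reading of the numbers, not taken from the notes), scale $(Equ. 2)$ and $(Equ. 3)$ by 100 so that each cell becomes a count. The red and blue proportions in $C$ then give

$$\frac{15}{37} > \frac{5}{13}, \qquad \frac{10}{13} > \frac{23}{37}, \qquad \text{but} \qquad \frac{15 + 10}{37 + 13} = \frac{25}{50} < \frac{28}{50} = \frac{5 + 23}{13 + 37}.$$

One way to see why this can happen is to write the combined fraction as a weighted average of the two original fractions, with weights set by the denominators:

$$\frac{A + C}{B + D} = \frac{B}{B + D} \cdot \frac{A}{B} + \frac{D}{B + D} \cdot \frac{C}{D}.$$

Here red puts weight $\frac{37}{50}$ on its lower fraction $\frac{15}{37}$, while blue puts weight $\frac{37}{50}$ on its higher fraction $\frac{23}{37}$. The two sides are averaged with different weights, so winning each comparison separately says nothing about the combined comparison.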

I can appreciate that this implication may not be true, and I am given an example of it failing. However, what is the 'failing point' that makes the implication break down? That is, how does Simpson's paradox manage to produce this misleading but true (and I don't doubt it's true) result?

Apologies if this is not exactly clear; hopefully the last paragraph clarifies what I am looking for.

  • 1
    Not sure I understand what you are after. The desired implication is simply false. You can't compare portions unless you know the total amounts. What more is there to say? The "paradox" is that we, incorrectly, expect averages to be easily combined in a way that simply does not work. – lulu Dec 30 '24 at 23:17
  • That is a fair response, and I agree that the paradox is this 'intuition going wrong' type of problem. I was wondering if there was any number theory showing 'how' the implication can be pushed to fail: so, for example 'increase A in proportion to c' or similar. Basically so I know where the paradox moves from following intuition to not, and how to push it in this direction. – user21764386 Dec 30 '24 at 23:23
  • 1
    I still don't understand what you want. To generate "real world" examples of Simpson, just have one high average occur with very few samples and another (slightly lower) average occur with a huge number of samples. Then, in a second pool, have one very low average with a high number of samples beat an even lower average with very few samples. Then the first beats the second in each pool, but combined the second wins, as they had the best average with a lot of samples. – lulu Dec 30 '24 at 23:27
  • Numerically: On Monday I get one hit in one try, where you get $999$ hits in $1000$ tries. So my average beats yours on Monday. On Tuesday, I get one hit in $1000$ tries while you try only once but miss. So my average beats yours on Tuesday too. But combined, I got two hits in $1001$ tries while you got $999$ hits in $1001$ tries, so your combined average is far, far better than mine. – lulu Dec 30 '24 at 23:29
  • If you view the mediants as diagonal slopes as here then it is geometrically obvious that we can choose "extreme" parallelograms that make that mediant inequality fail. – Bill Dubuque Dec 30 '24 at 23:34
  • Thanks @lulu, makes perfect sense. What I was looking for. – user21764386 Dec 30 '24 at 23:47
  • Also thanks @BillDubuque, the geometric slant is useful. – user21764386 Dec 30 '24 at 23:50
  • "So $A$ and $B$ are a partition." Why? Why can't $A = B = \Omega$? – Eric Towers Dec 31 '24 at 00:21
  • @EricTowers That is true, my question leaves that possibility open. In the example I am looking at I am sure $A$ and $B$ will be non-empty as they are 'presence' or 'absence' of disease (and I assume at least one patient is ill), so I should have made this clear and actually $A = B^C$ (similarly $C = D^C, E = F^C$ in the notes). – user21764386 Dec 31 '24 at 00:23
  • Reading about the mediant may highlight the number theory (rather than the Simpson's paradox-ness). – Eric Towers Dec 31 '24 at 00:29
  • @EricTowers Thanks, Bill beat you to it. Very interesting visually though. – user21764386 Dec 31 '24 at 00:33
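
The Monday/Tuesday example from the comments above can be checked with exact fractions. This is a minimal sketch; the names `me`, `you`, `average` and `combined` are illustrative choices, not anything from the thread.

```python
from fractions import Fraction

# (hits, tries) per day, as stated in the comment's numerical example.
me = {"Monday": (1, 1), "Tuesday": (1, 1000)}
you = {"Monday": (999, 1000), "Tuesday": (0, 1)}


def average(hits, tries):
    """Exact hit rate for a single day."""
    return Fraction(hits, tries)


def combined(record):
    """Exact hit rate over all days: total hits over total tries."""
    total_hits = sum(hits for hits, _ in record.values())
    total_tries = sum(tries for _, tries in record.values())
    return Fraction(total_hits, total_tries)


for day in me:
    # True on both days: "my" average beats "yours" within each day.
    print(day, average(*me[day]) > average(*you[day]))

# False: combined, 2/1001 is far below 999/1001.
print("combined", combined(me) > combined(you))
```

The per-day comparisons use very different denominators (1 versus 1000), which is what lets the combined comparison go the other way.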
