I have two binomial distributions, each with a different number of trials. How can I model the distribution of the variable $X$ in $$X=Y+Z$$ where $Y$ is Binomial($n$, $P_Y$) and $Z$ is Binomial($m$, $P_Z$)?


I've read threads on binomials and Poisson binomials, such as

None of them address what happens when the number of trials differs ($n$ and $m$).

I'd like to get a percentile rank or something similar for the resulting distribution. Calculating the variance is simple, assuming $Y$ and $Z$ are independent: $$\operatorname{Var}(X) = \operatorname{Var}(Y) + \operatorname{Var}(Z) = n P_Y (1-P_Y) + m P_Z (1-P_Z)$$

Likewise the standard deviation is just the square root of that. But I'm stuck on finding any sort of percentile rank for a given result, or finding where the $N$th percentile of the result is. The Poisson binomial formulas don't work because they require the same $n$ for each binomial.

$P_Y$ and $P_Z$ differ by about an order of magnitude, and $n$ and $m$ differ by a couple orders of magnitude. So it's hard to just weight each piece and hope it comes out in the wash. I tried approximating with a straight Poisson distribution too since the mean of $X$ is simple to find, but that doesn't seem to do any better. I think the wide variation between $P_Y$ and $P_Z$ may be the culprit.

I can't separate out the distribution for $n$ and $m$ because all I see is the combined result $X$. Each event involves one trial of $Y$ and multiple trials of $Z$. In concrete terms, I have separate events with values like $n = 1, P_Y = .25$ and $m = 40, P_Z = .02$. That gives one result, a combined number of successes. Then that's repeated as another event, yielding another combined result.

Taking a number of these events together, I'll end up with something like $n = 10$ and $m = 400$ yielding 15 total successes where the mean is 10.5. I'm trying to figure out how that total ranks within the distribution of possible totals. Any ideas appreciated.
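
As a worked check of the pooled figures above, assuming every trial is independent and using $P_Y = 0.25$ and $P_Z = 0.02$ from the single-event example:

$$\begin{aligned} E[X] &= n P_Y + m P_Z = 10(0.25) + 400(0.02) = 2.5 + 8 = 10.5 \\ \operatorname{Var}(X) &= n P_Y (1-P_Y) + m P_Z (1-P_Z) = 10(0.25)(0.75) + 400(0.02)(0.98) = 1.875 + 7.84 = 9.715 \\ \sigma_X &= \sqrt{9.715} \approx 3.12 \end{aligned}$$

So an observed total of 15 sits roughly $(15 - 10.5)/3.12 \approx 1.44$ standard deviations above the mean. That locates it loosely, but it is not a percentile rank.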

Ed_

1 Answer


By using convolution,

\begin{align}
P(X=x) &= P(Y+Z = x) \\
&= \sum_{z=\max(0,x-n)}^{\min(m,x)} P(Y+Z=x \mid Z=z)\,P(Z=z) \\
&= \sum_{z=\max(0,x-n)}^{\min(m,x)} P(Y=x-z)\,P(Z=z) \\
&= \sum_{z=\max(0,x-n)}^{\min(m,x)} \binom{n}{x-z}P_Y^{x-z}(1-P_Y)^{n-(x-z)}\binom{m}{z}P_Z^{z}(1-P_Z)^{m-z}
\end{align}
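
The sum above translates almost line for line into code. The following is a minimal sketch, assuming $Y$ and $Z$ are independent; the function names `binom_pmf` and `sum_pmf` are made up for illustration, and the parameter values are the pooled figures from the question ($n = 10$, $P_Y = 0.25$, $m = 400$, $P_Z = 0.02$).

```python
from math import comb


def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def sum_pmf(x, n, p_y, m, p_z):
    """P(Y + Z = x) for independent Y ~ Binomial(n, p_y), Z ~ Binomial(m, p_z)."""
    lo, hi = max(0, x - n), min(m, x)  # same limits as the sum over z
    return sum(
        binom_pmf(x - z, n, p_y) * binom_pmf(z, m, p_z) for z in range(lo, hi + 1)
    )


# Full PMF over the support 0..n+m for the question's pooled parameters.
n, p_y, m, p_z = 10, 0.25, 400, 0.02
pmf = [sum_pmf(x, n, p_y, m, p_z) for x in range(n + m + 1)]
```

Each call costs at most $\min(n, m) + 1$ terms, so tabulating the whole PMF is cheap for sizes like these.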

Assuming $Y$ and $Z$ are independent (which the convolution above also requires), the characteristic function of $X$ is

$$\varphi_X(t) = \left( P_Y e^{it} + 1 - P_Y \right)^{n} \left( P_Z e^{it} + 1 - P_Z \right)^{m}$$
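
Because $X$ only takes the values $0, \dots, n+m$, the PMF can also be recovered from the characteristic function by sampling it at $N = n + m + 1$ equally spaced points and applying an inverse discrete Fourier transform: $P(X=x) = \frac{1}{N}\sum_{k=0}^{N-1} \varphi_X(2\pi k/N)\, e^{-2\pi i k x / N}$. The sketch below does this with a plain $O(N^2)$ loop for clarity; `char_fn` and `pmf_from_cf` are illustrative names, not taken from any library, and the parameters are again the pooled figures from the question.

```python
import cmath
from math import pi


def char_fn(t, n, p_y, m, p_z):
    """Characteristic function of Y + Z for independent binomials."""
    e = cmath.exp(1j * t)
    return (p_y * e + 1 - p_y) ** n * (p_z * e + 1 - p_z) ** m


def pmf_from_cf(n, p_y, m, p_z):
    """Recover P(X = x), x = 0..n+m, by inverting the characteristic function."""
    size = n + m + 1
    # Sample the characteristic function at t_k = 2*pi*k/size.
    phi = [char_fn(2 * pi * k / size, n, p_y, m, p_z) for k in range(size)]
    # Inverse DFT; the imaginary parts are rounding noise, so keep the real part.
    return [
        sum(phi[k] * cmath.exp(-2j * pi * k * x / size) for k in range(size)).real
        / size
        for x in range(size)
    ]


pmf_cf = pmf_from_cf(10, 0.25, 400, 0.02)
```

For much larger $n + m$ an FFT routine would be the natural replacement for the double loop; the direct convolution is usually simpler when one of $n$, $m$ is small.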

Siong Thye Goh
  • Thanks, appreciate it! That's a bit hairy to look at, but actually pretty straightforward (dare I say simple) once you wrap your head around it! Just the sums of probabilities for each possible split of y + z = x. Doesn't that just give you the probability of one particular x though? How do you turn that into a percentile rank, short of calculating that sum for every possible X? I'm interested in the characteristics of the entire distribution, not just one result. Thanks again, great help! – Ed_ Jul 25 '17 at 03:09
  • How does one compute percentile rank for a binomial distribution? – Siong Thye Goh Jul 25 '17 at 18:11
  • According to this, brute force or asymptotic estimation. However Excel has a binom.dist function that calculates cumulative probability (basically, percentile), so there must be a straightforward closed-form solution or at least approximation for ordinary binomials. Not sure how they do it though. – Ed_ Jul 25 '17 at 19:47
  • I see, we have the expression for $P(X=x)$, i.e. we know the PMF, we can compute the CDF from there. – Siong Thye Goh Jul 25 '17 at 20:11
  • FYI I coded it up in Python and it works perfectly! Both PMF and CDF, which just iterates over the PMF. Runs very fast too. Surprisingly, the percentile results are very close to Poisson, within 1-2% on my data set. Amazing! Thanks so much. – Ed_ Jul 26 '17 at 00:17
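
The approach discussed in the comments (sum the PMF to get the CDF, then read off a percentile rank or an $N$th percentile) can be sketched as follows. This is not Ed_'s actual code, which was not posted. It is a minimal illustration assuming independent $Y$ and $Z$, with made-up helper names (`sum_pmf_list`, `quantile`, `poisson_cdf`) and the pooled figures from the question. The Poisson CDF with the same mean is included only to allow the kind of comparison mentioned in the last comment.

```python
from itertools import accumulate
from math import comb, exp


def binom_pmf_list(n, p):
    """[P(K = 0), ..., P(K = n)] for K ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def sum_pmf_list(n, p_y, m, p_z):
    """PMF of Y + Z on 0..n+m by discrete convolution (independence assumed)."""
    out = [0.0] * (n + m + 1)
    for i, a in enumerate(binom_pmf_list(n, p_y)):
        for j, b in enumerate(binom_pmf_list(m, p_z)):
            out[i + j] += a * b
    return out


def quantile(cdf, q):
    """Smallest x with P(X <= x) >= q."""
    for x, c in enumerate(cdf):
        if c >= q:
            return x
    return len(cdf) - 1  # guards against q slightly above the rounded total


def poisson_cdf(x, mean):
    """P(K <= x) for K ~ Poisson(mean), summed term by term."""
    term = total = exp(-mean)
    for k in range(1, x + 1):
        term *= mean / k
        total += term
    return total


pmf = sum_pmf_list(10, 0.25, 400, 0.02)
cdf = list(accumulate(pmf))        # cdf[x] = P(X <= x)

rank_of_15 = cdf[15]               # percentile rank of 15 total successes
p95 = quantile(cdf, 0.95)          # 95th percentile of X
poisson_rank = poisson_cdf(15, 10.5)
```

Here `rank_of_15` uses the "at or below" convention $P(X \le 15)$; use `cdf[14]` instead if the rank should count only strictly smaller totals.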