Probability - Interview Question - Hidden Assumptions and Phrasing Issues

Question

I’ve encountered the following seemingly simple probability interview question in my workplace:

Two reviewers were tasked with finding errors in a book. The first found 40 errors and the second found 60; 20 errors were found by both. Give an estimate of the total number of errors in the book.

A few clarifications were given:

The errors are not false positives.
The probability of the reviewers to find any error is independent of each other. (Problematic phrasing?)
The lower bound is not what is being asked for (i.e., at least 80 errors).

It was my opinion that this problem is not well defined and any answer would rely on hidden assumptions.

My coworker said that the solution is easily calculable using the following method, where $x$ denotes the total number of errors:

$$P(A) = \frac{40}{x}, \qquad P(B) = \frac{60}{x}, \qquad P(A\cap B) = \frac{20}{x}$$
$$P(A\cap B) = P(A) \cdot P(B)$$
$$\frac{20}{x} = \frac{40}{x} \cdot \frac{60}{x}$$
$$20x = 2400$$
$$x = 120$$
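For reference, the calculation above collapses to the closed form $x = n_A n_B / n_C$, where $n_A$, $n_B$ and $n_C$ are the counts found by the first reviewer, the second reviewer, and both. A minimal Python sketch of that arithmetic follows; the function name `lincoln_estimate` and its parameter names are illustrative choices, not part of the original question.

```python
from fractions import Fraction


def lincoln_estimate(n_a: int, n_b: int, n_common: int) -> Fraction:
    """Solve n_common/x = (n_a/x) * (n_b/x) for x, i.e. x = n_a * n_b / n_common."""
    if n_common == 0:
        # With no overlap the equation has no finite solution.
        raise ValueError("no common errors: the estimate is undefined")
    return Fraction(n_a * n_b, n_common)


print(lincoln_estimate(40, 60, 20))  # 120
```

Note that the formula uses nothing but the three counts, so any modelling assumptions enter only through the step $P(A\cap B) = P(A) \cdot P(B)$ and through reading the observed fractions as probabilities.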

I found this answer unsatisfying, but I am struggling to coherently explain why. I believe there are various assumptions hidden in the above “solution”.

I need help identifying these assumptions or phrasing issues with the question itself that make it not well defined. It could be that I’m mistaken and the problem is well defined and I’ve complicated it.

I am also interested in alternative solutions that could be based on different assumptions but don’t negate the clarifications made.

The interviewer is essentially asking you to re-derive the Lincoln index for estimating a population size from a mark-and-recapture study. — Mike Earnest, Dec 27 '23 at 18:41
I thought about it and found $120$, was happy to see that's what your coworker found. There are assumptions, yes, and these have been somewhat described in the answers. In the context of an interview question, I still believe $120$ was the most sensible answer to make. — Jean-Armand Moroni, Dec 27 '23 at 21:40

Robertmg · Accepted Answer · 2023-12-26T22:40:11.230

Let $A_i \sim \text{Ber}(p)$ be a random variable describing whether or not person $A$ found error $i$, and let $B_i$ be the same for person $B$. The answer posted assumes that $\mathbb{P}(A_i = 1) = \mathbb{P}(A_j = 1)$ for all $i, j$, which doesn't feel right. For example, if the errors are typos, then the typo "dwjaiodajwio" is more obvious than using "there" instead of "their". We should also consider types of error: maybe person $B$ is better at finding grammatical errors than person $A$, but person $A$ can find all of the spelling errors.

Even if we choose to assume this, setting $\mathbb{P}(A_i = 1) = \frac{40}{x}$ is still not justified. Let $A \sim \text{Bin}(x, \frac{40}{x})$ and $B \sim \text{Bin}(x, \frac{60}{x})$ be the numbers of errors each person finds. Then $\mathbb{E}[A] = 40$ and $\mathbb{E}[B] = 60$ given $x$ total errors, but these are expectations; on any single trial the observed counts need not equal them. The biggest problem here is that the solution treats this one trial as if it were equal to the expectation.

The answer given has assumed that the true expectation is equal to the number of errors found (i.e. $A = x \cdot \frac{40}{x} = 40 = \mathbb{E}[A]$). That is the big "hidden assumption" the answer makes without stating it. On just one trial it is unreasonable to assume this, and the other answer, from heropup, gives an example of why this becomes a problem if we find $0$ errors in common. You are certainly correct that this is not a well-defined problem; it should have these things specified to make sense.
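To illustrate how far a single trial can sit from its expectation, here is a small simulation sketch. It assumes a model that the question does not state: exactly $x = 120$ true errors, each found independently by $A$ with probability $1/3$ and by $B$ with probability $1/2$. The function name, parameters and seed are all illustrative.

```python
import random


def simulate_estimates(x, p_a, p_b, trials, seed=0):
    """Simulate reviews of a book with x errors and return n_a*n_b/n_common per trial."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        n_a = n_b = n_common = 0
        for _ in range(x):
            a = rng.random() < p_a  # did A find this error?
            b = rng.random() < p_b  # did B find this error?
            n_a += a
            n_b += b
            n_common += a and b
        if n_common > 0:  # the estimate is undefined when nothing overlaps
            estimates.append(n_a * n_b / n_common)
    return estimates


estimates = simulate_estimates(x=120, p_a=1 / 3, p_b=1 / 2, trials=2000)
print(min(estimates), max(estimates))
```

Even with the true total fixed at $120$, the printed range shows the spread of the estimate across trials, which is the variability the coworker's solution ignores.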

It would be hard to get an estimate of the true probability, since we don't know the number of true errors or how the errors arise. By analogy, suppose there is a disease in a country and we know that at least $100$ people are sick out of some population; it is hard to estimate the number of sick people when we know literally nothing about the disease. It could be exactly $100$ if the disease is a rare genetic condition, or it could be $100{,}000$ if the disease is like the common cold. We don't know, and no estimate will feel satisfactory, since any estimate would rest on major assumptions about the data.

Final edit: What if they found $40$ errors in common, and $A$ still found $40$ while $B$ still found $60$? Then the same method says we should expect only $60$ errors in total, but it makes no sense to simply assume that $B$ is perfect.

heropup · Answer 2 · 2023-12-27T01:22:58.727

You have good reason to suspect this analysis. An obvious way to see why this estimate cannot be appropriate is to observe that if $0$ errors are found in common between the two reviewers, this would imply $$\frac{0}{x} = \frac{40}{x} \cdot \frac{60}{x},$$ and the estimate is $x = \infty$ errors. This is obviously absurd. The number of errors might be very large, but it cannot be infinite even when there is a small but nonzero probability that no common errors are found. For instance, in a book with $x = 10$ errors, if reviewer $A$ finds $2$ errors and $B$ finds $3$ errors, it is quite reasonable to think that none of those errors are common.

If I were to take the time to solve this question, I'd first state some additional but reasonable assumptions:

Within each reviewer, each error has the same fixed probability of being discovered.
Within each reviewer, the probability that a given error is found is independent of any other errors.

Such a model would involve a binomial and/or hypergeometric distribution approach, and estimating $x$ would be done by maximum likelihood.
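As a sketch of what the maximum-likelihood approach could look like for the original data, assume a hypergeometric model that conditions on the observed counts: given $x$, the $60$ errors found by $B$ are a uniformly random subset of the $x$ errors, independent of the $40$ found by $A$. The likelihood of $20$ common errors is then $\binom{40}{20}\binom{x-40}{40}/\binom{x}{60}$. This model, the function name and the search range are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb


def overlap_likelihood(x, n_a, n_b, n_common):
    """P(overlap = n_common | x total errors) under the assumed hypergeometric model."""
    if x < n_a + n_b - n_common:
        return Fraction(0)  # fewer total errors than distinct errors found
    favourable = comb(n_a, n_common) * comb(x - n_a, n_b - n_common)
    return Fraction(favourable, comb(x, n_b))


candidates = range(80, 400)
best = max(overlap_likelihood(x, 40, 60, 20) for x in candidates)
maximizers = [x for x in candidates if overlap_likelihood(x, 40, 60, 20) == best]
print(maximizers)  # [119, 120]
```

Under this particular model the likelihood is maximised jointly at $x = 119$ and $x = 120$, so the value $120$ can be recovered as a maximum-likelihood estimate, but only after the model assumptions have been made explicit.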

If we do assume such a model, then what is the probability of the aforementioned outcome: $x = 10$ errors, but $N_A = 2$ and $N_B = 3$ are found, and $N_C = 0$ common errors?

If $x$ is unknown, what is the corresponding likelihood function for $x$ in the above case?
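One way to explore these two questions numerically is sketched below, again assuming a hypergeometric model conditioned on $N_A$ and $N_B$ (an assumption, not something the question specifies). With no common errors, the likelihood is $\binom{x - N_A}{N_B} / \binom{x}{N_B}$. The function name is illustrative.

```python
from fractions import Fraction
from math import comb


def prob_no_common(x, n_a, n_b):
    """P(no common errors | x total) when B's n_b finds are a uniformly random subset."""
    return Fraction(comb(x - n_a, n_b), comb(x, n_b))


print(prob_no_common(10, 2, 3))  # 7/15

# Likelihood of x given N_A = 2, N_B = 3, N_C = 0, for x = 5, 6, ..., 199.
values = [prob_no_common(x, 2, 3) for x in range(5, 200)]
print(all(earlier < later for earlier, later in zip(values, values[1:])))  # True
```

In this sketch the outcome has probability $7/15$ when $x = 10$, and the likelihood keeps increasing with $x$ over the range checked, so it has no finite maximiser there.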

Actually, under a binomial model, I believe the MLE is $x=\infty$ in this case, though the likelihood increases very slowly. — K. A. Buhr, Dec 26 '23 at 23:27
@K.A.Buhr After actually taking the time to try the calculation, I believe you are correct. I think this makes the question ill-posed, since a reasonable estimate should not include infinity. — heropup, Dec 27 '23 at 01:24
