
There are some related questions on the net but I did not understand their solutions.

I am reading a textbook section about methods of finding a collision. It considers a hash function with a 256-bit output size and states that if we pick random inputs and compute their hash values, we will find a collision with high probability; in fact, if we choose just $2^{130} + 1$ inputs, there is a 99.8% chance that at least two of them will collide. It also says we can find a collision by trying roughly the square root of the number of possible outputs.

Questions:

  1. What is the formula used to calculate that if we choose $2^{130} + 1$ inputs, at least two of them will collide with 99.8% probability?

    From my research it looks like this is related to the "birthday attack" problem, where you first calculate the probability that the hash inputs DO NOT collide and subtract it from 1. Whenever I tried to plug and chug using the formulas I found online, I wasn't getting the results the book stated.

    Edit: I tried using the formula $P(A) = 1 - \frac{k!}{(k-n)!\,k^{n}}$ and obtained 99.99...%. I do not know how the authors obtained the 99.8% figure. Can someone please state the explicit equation used, so I can see how the book gets a 99.8% probability of collision with $2^{130} + 1$ randomly chosen inputs? Where does the $+1$ come from?

  2. How is the square root function related to this? If I take the square root of 256, I get 16. What does this mean in the context of collisions?

  3. The size of the output is 256 bits. Does this mean the total number of possible outputs is $2^{256}$?

  4. Does the pigeonhole principle apply here? You are mapping an infinite number of inputs to a finite number of outputs.

Max

1 Answer


Birthday problem for cryptographic hashing, 101.

Let $p_n$ be the probability of collision for a number $n$ of random distinct inputs hashed to $k$ possible values (that is, probability that at least two hashes are identical), on the assumption that the hash is perfect. That $p_n$ is also the minimum probability of collision with no hypothesis on the hash.

Obviously, $p_0=p_1=0$ (until there are at least two hashes, no collision is possible); and $p_j=1$ if $j>k\;$ (by the pigeonhole principle, with more than $k$ pigeons for $k$ holes, at least one pigeon needs to share a hole with another).

Let $q_j=1-p_j\;$ be the probability of no collision for $j$ hashes, with $q_0=1$. When we add another hash after there was no collision among the previous $j$ hashes, the new hash has probability $\frac j k$ of matching one of the $j$ previous hashes (creating the first collision), and probability $1-\frac j k$ of not creating a collision. Hence the recurrence $q_{j+1}=q_j\,\bigl(1-\frac j k\bigr)$, which gives the exact expression $$p_n=1-q_n\quad\text{with}\quad q_n=\;\prod_{j=0}^{n-1}{\Bigl(1-\frac j k\Bigr)}\;=\;{\frac{k!}{(k-n)!\;k^n}}$$

That is also $\displaystyle q_n\,=\,P(k,n)/k^n\,=\,C(k,n)\,n!/k^n\,=\,{k\choose n}\,\frac{n!}{k^n}$, where $P(k,n)$ is the number of permutations of $n$ things among $k$, and $C(k,n)={k\choose n}$ is the number of combinations of $n$ things among $k$ (beware that $k$ and $n$ are reversed in this answer compared to the linked references).
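
As a quick sanity check of the exact product, here is a minimal Python sketch (the function name is mine) that evaluates $p_n$ for the classic birthday setting $k=365$, $n=23$:

```python
from fractions import Fraction

def collision_probability(n, k):
    """Exact p_n = 1 - prod_{j=0}^{n-1} (1 - j/k), for n random inputs hashed to k values."""
    q = Fraction(1)
    for j in range(n):
        q *= Fraction(k - j, k)   # the (j+1)-th hash must avoid the j already seen
    return 1 - q

# Classic birthday paradox: 23 people, 365 equally likely birthdays -> ~50.7%.
print(float(collision_probability(23, 365)))
```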

This is fine for low $k$ and $n$, but impractical for large values. With $n\ll k\,$, applying that for small $\epsilon>0$ it holds that $\ln(1-\epsilon)\lessapprox-\epsilon\;$, we get $$\ln(q_n)\;=\;\sum_{j=0}^{n-1}{\ln\Bigl(1-\frac j k\Bigr)}\;\lessapprox\;-\frac 1 k\sum_{j=0}^{n-1}j\;=\;{-\,\frac{n\,(n-1)}{2k}}$$


In summary we have exactly $$ \begin{align*} p_n=1-q_n&=1-\frac{k!}{(k-n)!\;k^n}&&=1-P(k,n)/k^n\\\\ &=1-{k\choose n}\,\frac{n!}{k^n}&&=1-C(k,n)\,n!/k^n \end{align*}$$ and as fair approximations when the above is inconvenient $$p_n=1-q_n\quad\text{with}\quad \begin{align} &q_n\lessapprox e^{-n\,(n-1)/(2k)}&\text{(assuming $n\ll k$)}\\ &q_n\approx e^{-n^2/(2k)}&\text{(additionally assuming large $n$)} \end{align}$$
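
As an illustration, the following sketch (toy parameters chosen by me) compares the exact product with the two approximations for a small 20-bit hash, where the product is still cheap to evaluate:

```python
import math

def q_no_collision(n, k):
    """Exact probability of no collision, computed as the product over j of (1 - j/k)."""
    q = 1.0
    for j in range(n):
        q *= 1 - j / k
    return q

k = 2 ** 20    # toy 20-bit hash, small enough for the exact product
n = 2 ** 10    # n = sqrt(k)
print(1 - q_no_collision(n, k))              # exact p_n, ~0.393
print(1 - math.exp(-n * (n - 1) / (2 * k)))  # approximation assuming n << k
print(1 - math.exp(-n * n / (2 * k)))        # additionally assuming large n
```

All three values agree to about three significant digits, as expected for $n\approx\sqrt k\ll k$.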

From this we can derive that for large $k$ as used in cryptography, or a $b$-bit hash (where $k=2^b$), probability and odds of collision for $n$ random messages are approximately $$\begin{array}{c|c|c} n&p_n\text{ of collision}&\text{odds of collision}\\ \hline \sqrt k=2^{b/2}&1-e^{-1/2}\approx39.3\%&\approx11/17\\ 1.177\sqrt k\approx2^{b/2+0.236\ldots}&1/2=50\%&1/1\\ 1.414\sqrt k\approx2^{(b+1)/2}&1-e^{-1}\approx63.2\%&\approx12/7\\ 2\sqrt k=2^{b/2+1}&1-e^{-2}\approx86.5\%&\approx19/3 \end{array}$$
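
The constant $1.177\ldots$ in the second row comes from solving $1-e^{-n^2/(2k)}=1/2$ for $n$, which gives $n=\sqrt{2\ln 2}\,\sqrt k$. A one-line check in Python:

```python
import math

# Solving 1 - exp(-n^2/(2k)) = 1/2 for n gives n = sqrt(2*ln 2) * sqrt(k).
print(math.sqrt(2 * math.log(2)))              # 1.1774..., the constant in the table
print(math.log2(math.sqrt(2 * math.log(2))))   # 0.2356..., hence n ~ 2^(b/2 + 0.236)
```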

Without proof: the expected number of hashes to get a collision converges to $\sqrt{k\pi/2}=1.2533141\ldots\sqrt k$. For $k\ge2^{16}$, the approximation $\frac54\sqrt k$ is accurate to within 0.5%.
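
That expectation is easy to check empirically for a small $k$; here is a rough Monte Carlo sketch (parameters and names are mine, not part of the derivation above):

```python
import math
import random

def hashes_until_collision(k, rng):
    """Draw uniform hashes in [0, k) until one repeats; return how many draws that took."""
    seen = set()
    while True:
        v = rng.randrange(k)
        if v in seen:
            return len(seen) + 1   # count the colliding draw as well
        seen.add(v)

k = 2 ** 16
rng = random.Random(1)
trials = 20_000
mean = sum(hashes_until_collision(k, rng) for _ in range(trials)) / trials
print(mean)                         # empirical mean over the trials, close to the next line
print(math.sqrt(k * math.pi / 2))   # sqrt(k*pi/2) ~ 320.8
print(1.25 * math.sqrt(k))          # the 5/4 * sqrt(k) rule of thumb = 320
```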


The applied cryptographer is often interested in very low probabilities of collision $p_n$, as occur for $n\ll\sqrt k$. The above approximation remains mathematically valid and usable for $q_n$, but can give inaccurate results for $p_n=1-q_n$, because that difference is orders of magnitude smaller than either term in the subtraction. We need another method.

Observe there's a collision when the $i$th hash equals the $j$th hash, for any pair $(i,j)$ with $1\le i<j\le n\;$; and for each such pair, of which there are $n(n-1)/2$, there is probability $1/k$ of a collision. These $n(n-1)/2$ events are not quite independent, but when the overall probability of collision is low, treating them as independent remains a valid approximation, as does approximating the very low probability that at least one of them occurs by the sum of their individual probabilities. We get these approximations $$\begin{align*}p_n&\lessapprox{\frac{n\;(n-1)}{2k}}&\text{(assuming $n\ll\sqrt k$)}\\ &\lessapprox{\frac{n^2}{2k}}&\text{(additionally assuming large $n$)}\end{align*}$$

For example, with $n=2^{100}$ distinct random inputs (hypothetically) hashed to $b=256$ bits with a perfect (or even passable) hash, probability of collision is about $2^{2\cdot100-1-256}=2^{-57}$ (less than 8 chances in a million million millions, which can safely be discounted).
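
For completeness, a small sketch showing both the cancellation problem mentioned above and the pair-counting shortcut, using this $n=2^{100}$, $b=256$ example (variable names are mine):

```python
import math

k = 2.0 ** 256
n = 2.0 ** 100

x = n * (n - 1) / (2 * k)       # ~ n^2/(2k) = 2^-57 (the -1 is lost in float anyway)
naive = 1 - math.exp(-x)        # prints 0.0: the true value drowns in the subtraction
stable = -math.expm1(-x)        # same quantity computed without cancellation
print(naive, x, stable)
print(math.log2(x))             # ~ -57, matching the 2^-57 figure above
```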


In the question, we have a hash with a $b=256$-bit output, which can thus take $k=2^b=2^{256}$ values, and $n=2^{130}+1\;$. Here $k$ is large, $n\ll k$, and $n\not\ll\sqrt k$, therefore the appropriate formulas are

$$p_n=1-q_n\quad\text{with}\quad q_n\approx e^{-n^2/(2k)}\quad\text{(assuming large $k$ and $n\ll k$)}$$

The $+1$ term in $n=2^{130}+1$ is negligible, and we get $-n^2/(2k)\approx-\left(2^{130}\right)^2/\left(2\cdot2^{256}\right)=-2^{2\cdot130-1-256}=-2^3=-8\;$, giving $q_n\approx e^{-8}\approx0.034\%$ and $p_n\approx99.966\%\;$.
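
The same arithmetic in a few lines of Python (a sketch, with the $+1$ dropped as negligible):

```python
import math

b = 256                          # output size in bits, so k = 2^b
n_log2 = 130                     # n = 2^130 (the +1 is negligible)
e = 2 * n_log2 - 1 - b           # log2( n^2 / (2k) ) = 2*130 - 1 - 256 = 3
q = math.exp(-2.0 ** e)          # q_n ~ e^-8
print(e)                         # 3
print(q)                         # ~0.000335, i.e. ~0.034%
print(1 - q)                     # ~0.999665, i.e. ~99.966%
```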

Notes: $n\approx2^{130}$ is so large that it is practically impossible to carry out that number of operations, much less hashes. The question's body, and the present answer, thus do not contain anything that could help practically towards finding collisions for a hash function with a 256-bit output size (as in the question's original title).
The approximations 99.8% and 99.99% in the question are incorrect in their last digit, whatever the rounding convention.

fgrieu