15

Suppose I have a 128-bit random binary string (128 bits of entropy), I hash it using SHA-256, and then I take the first 128 bits of the output hash. Does the resulting bit string still have (almost) 128 bits of entropy, or is the entropy reduced to 64 bits? (I mean, perhaps the other 64 bits of entropy lie in the second 128 bits of the output hash.)

I am confused because the definitions of cryptographic hash functions I have read in the past say that if one bit of the input is changed, then every bit of the output changes with probability $1/2$. From this it seems we can deduce that the truncated output still has (almost*) the same entropy as the input. Is that right?

*: I added 'almost' to mean ignoring the possible hash collisions.
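For concreteness, the procedure I mean is something like this (a minimal Python sketch; the names are just for illustration):

```python
import hashlib
import secrets

x = secrets.token_bytes(16)              # 128-bit random string (128 bits of entropy)
digest = hashlib.sha256(x).digest()      # 256-bit SHA-256 output
truncated = digest[:16]                  # keep only the first 128 bits
print(truncated.hex())
```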

Biv
user40602

2 Answers

20

Expected entropy in the output of a random oracle

The expected entropy in the output of an $h$-bit random oracle fed with random $h$-bit input is close to $h-0.8272$ bit, even for moderate $h$ (e.g. at least $32$). As $h$ grows, that expected entropy becomes arbitrarily close to $h-\eta$ bit with $$\begin{align}\eta&=\frac 1{e\ln(2)}\sum_{i=1}^\infty\frac{\ln(i+1)}{i!}\\ &=0.82724538915300508343173\dots\text{bit}\end{align}$$ where the sum is given by A193424.

Proof, where I'll be using $a\approx b$ as a convenient shorthand for $\displaystyle\lim_{h\to\infty}\frac a b\ =\ 1$

  • For a particular distribution implemented by the oracle, let $n_j$ be the number of output values appearing exactly $j$ times among the outputs for all inputs. The exact entropy $H$ for that particular distribution can be computed from the $n_j$ by applying the definition of entropy, giving $$\begin{align}H&=\sum_{j=1}^{2^h}n_j\;\frac j {2^h}\;\log_2\left(\frac{2^h}j\right)\\ &=\frac h{2^h}\sum_{j=1}^{2^h}j\;n_j\;-\frac 1{2^h}\sum_{j=1}^{2^h}n_j\;j\;\log_2(j)\end{align}$$ where we have (by merely counting what all inputs lead to) $$\sum_{j=1}^{2^h}j\;n_j\;=2^h$$ thus $$h-H=\frac 1{2^h}\sum_{j=1}^{2^h}n_j\;j\;\log_2(j)$$
  • For fixed $j$ and as $h$ grows, by counting the possibilities, we can establish that for a random distribution, the probability that any particular value is reached exactly $j$ times is $\displaystyle\approx\frac 1{e\;j!}$ (the Poisson distribution with mean $1$). Thus, for fixed $j$ and as $h$ grows, the expected $n_j$ is $\displaystyle\approx\frac{2^h}{e\;j!}$.
  • In the exact expression for $h-H$, all terms of the sum are non-negative. To obtain an asymptotic estimate of the expected $h-H$ as $h$ grows, we can thus replace $n_j$ by its expected value, obtaining that the expected value of $h-H$ is $\displaystyle\approx\frac 1 e\sum_{j=1}^\infty\frac{j\;\log_2(j)}{j!}$.
  • The stated result follows by substituting $i=j-1$, dropping the sum's first term (which is zero), and writing $\log_2(i+1)$ as $\ln(i+1)/\ln(2)$.
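As a quick numeric check, here is a minimal Python sketch (my own, not part of the derivation) that sums the series for $\eta$ directly:

```python
import math

# eta = (1 / (e * ln 2)) * sum_{i>=1} ln(i+1) / i!   (series given by A193424)
s = sum(math.log(i + 1) / math.factorial(i) for i in range(1, 60))
eta = s / (math.e * math.log(2))
print(eta)   # ≈ 0.827245389153...
```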

I've been unable to locate an earlier mathematical derivation. The closest I found is an empirical estimation of $\eta$ to 4 decimals by Andrea Röck: Collision Attacks based on the Entropy Loss caused by Random Functions, WEWoRC 2007, slides; with more in her thesis.

Update: In a 2020 presentation (page 41), William R. Cordwell and Mark D. Torgerson of Sandia National Labs give $0.827245$, without attribution.

My first empirical estimate used a program that draws $2^h$ pseudo-random $h$-bit values and counts how many values are reached how many times; for $h=35$ (the largest I could handle with 20 GB of RAM), three runs gave the following (left column: number of times a value is reached; a minimal sketch of such a program, for much smaller $h$, follows the table):

             run 1                run 2                run 3
  0  12640123427 36.79%   12640183855 36.79%   12640308584 36.79%
  1  12640408212 36.79%   12640365800 36.79%   12640104651 36.79%
  2   6320124091 18.39%    6320013534 18.39%    6320174710 18.39%
  3   2106681541  6.13%    2106762262  6.13%    2106726749  6.13%
  4    526645276  1.53%     526674914  1.53%     526679947  1.53%
  5    105334156  0.31%     105325000  0.31%     105330269  0.31%
  6     17561277  0.05%      17551924  0.05%      17556150  0.05%
  7      2507918  0.01%       2508727  0.01%       2505282  0.01%
  8       313971  0.00%        313943  0.00%        313406  0.00%
  9        34748  0.00%         34553  0.00%         34755  0.00%
 10         3424  0.00%          3542  0.00%          3546  0.00%
 11          291  0.00%           287  0.00%           292  0.00%
 12           31  0.00%            24  0.00%            24  0.00%
 13            4  0.00%             3  0.00%             2  0.00%
 14            1  0.00%             0  0.00%             1  0.00%
 15+           0  0.00%             0  0.00%             0  0.00%
entropy   34.172763 bit        34.172758 bit        34.172751 bit
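For reference, here is a minimal Python sketch of such a counting experiment (my reconstruction, not the original program, and kept to a much smaller $h$ so it runs quickly in memory):

```python
import random
from collections import Counter
from math import log2

h = 20                     # much smaller than 35, so this runs in seconds
N = 1 << h

# Draw 2^h pseudo-random h-bit values (simulating a random h-bit-to-h-bit
# function evaluated on all inputs) and count how often each value appears.
hits = Counter(random.getrandbits(h) for _ in range(N))    # value -> times reached
n = Counter(hits.values())                                  # j -> number of values reached j times
n[0] = N - len(hits)                                        # values never reached

for j in sorted(n):
    print(f"{j:3d} {n[j]:10d} {n[j] / N:7.2%}")

# h - H = (1/2^h) * sum_j n_j * j * log2(j); the j = 0 and j = 1 terms vanish.
entropy = h - sum(c * j * log2(j) for j, c in n.items() if j >= 2) / N
print(f"entropy {entropy:.6f} bit (expected about {h - 0.8272:.4f} for moderate h)")
```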

Application to the question: the entropy for the output of SHA-256 truncated to its first $128$ bits when fed a random $128$-bit input is about $127.173$ bit, down from very close to $128$ bit before truncation (see final note). The truncation does not halve the entropy, because the halves are not independent. The right line of thought is that SHA-256 truncated to its first $128$ bits is a fine $128$-bit hash, and behaves like a random oracle.
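To tie this directly to SHA-256, here is a small Python sketch (my own illustration, scaled down to an $h$-bit truncation over an $h$-bit input space so that the whole input space can be enumerated):

```python
import hashlib
from collections import Counter
from math import log2

# Feed all 2^h values of an h-bit input to SHA-256, keep only the first h bits
# of each digest, and measure the Shannon entropy of the truncated output.
h = 20
N = 1 << h

def trunc_sha256(i: int) -> int:
    digest = hashlib.sha256(i.to_bytes(4, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - h)       # first h bits

hits = Counter(trunc_sha256(i) for i in range(N))
entropy = -sum((c / N) * log2(c / N) for c in hits.values())
print(f"h = {h}: output entropy ≈ {entropy:.4f} bit (≈ h - 0.83)")
```

The measured loss should come out close to the $\eta\approx0.83$ bit predicted by the random-oracle model above.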

Note: if we consider a random function from $\{0,1\}^{128}$ to $\{0,1\}^{256}$, most likely there is a small $k$ (most often $0$ or $1$, sometimes $2$, rarely $3$ or more) such that $k$ outputs have exactly two corresponding inputs, $2^{128}-2k$ outputs have exactly one, and $2^{256}-2^{128}+k$ outputs have none (the odds that any output is reached three or more times are negligible).

Therefore, in this most likely case, the entropy of the output of that function, when fed a random $128$-bit input, is $-k\;2^{-127}\log_2(2^{-127})-(2^{128}-2k)\;2^{-128}\log_2(2^{-128})$, that is $128-k\;2^{-127}$.
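Spelling out that simplification:
$$\begin{align}-k\;2^{-127}\log_2(2^{-127})-(2^{128}-2k)\;2^{-128}\log_2(2^{-128})&=127\,k\,2^{-127}+(2^{128}-2k)\,128\cdot2^{-128}\\&=127\,k\,2^{-127}+128-128\,k\,2^{-127}\\&=128-k\,2^{-127}\end{align}$$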

The best model we have for SHA-256 (not truncated) on $128$-bit inputs is a particular function chosen at random among functions from $\{0,1\}^{128}$ to $\{0,1\}^{256}$. Thus we can conclude that the entropy of the output of SHA-256, when fed a random $128$-bit input, is likely exactly $128-k\;2^{-127}$ for some $k\in\{0,1,2\}$, which is very nearly $128$ bit: the deficit is at most about $2^{-126}\approx1.2\times10^{-38}$ bit.

fgrieu
-1

Remember that a hash function does not create entropy (though it can make a lack of entropy very hard to detect). Therefore, if the input has 128 bits of entropy, the output has at most 128 bits of entropy. If the output size is 256 bits, then each output bit has $0.5$ bits of entropy. Taking half of the output bits will cut the entropy in half.

To be on the safe side (due to loss from collisions and such), assume that the output has only $0.85\times$ the input entropy.

For a good understanding of what NIST requires in connection with this topic (the entropy of truncated function outputs), read NIST SP 800-90B.

Raoul722
George