
From Serious Cryptography: "Entropy is maximized when the distribution is uniform because a uniform distribution maximizes uncertainty: no outcome is more likely than the others. Therefore, n-bit values can’t have more than n bits of entropy."

The Mersenne Twister (MT) strives to produce output that is as uniformly distributed as possible, so one would think its entropy is maximized and equal to the bit length of the seed.

But the MT becomes entirely predictable after 624 iterations, at which point its entropy should be considered to be zero; yet the subsequent distribution is still uniform.

These two ideas seem to contradict each other.

Bastien

3 Answers


But the MT becomes entirely predictable after 624 iterations, at which point its entropy should be considered to be zero; yet the subsequent distribution is still uniform.

The subsequent distribution is not, in fact, independently uniform.

Given 624 32-bit outputs $$V = V_1 \mathbin\| V_2 \mathbin\| \cdots \mathbin\| V_{624},$$ I can deterministically recover the Mersenne twister state by $$S(V) = \tau^{-1}(V_1) \mathbin\| \tau^{-1}(V_2) \mathbin\| \cdots \mathbin\| \tau^{-1}(V_{624}),$$ where $\tau$ is the tempering transform, and then compute the next output by deterministically running the generator $G$ on the recovered state $S(V)$. If the $V_i$ were independently uniformly distributed, we would have $$\Pr[V_{625} = G(S(V))] = 1/2^{32},$$ but if the $V_i$ are instead generated by the Mersenne twister with uniform random initial state, we have $$\Pr[V_{625} = G(S(V))] = 1.$$

This test—evaluate whether $V_{625} = G(S(V))$—will distinguish the Mersenne twister under a uniform random initial state from a uniform random string with high success rate.
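For concreteness, here is a minimal Python sketch of that test against CPython's `random` module, which uses MT19937 internally. The helper names and the seed `1234` are illustrative: `untemper` plays the role of $\tau^{-1}$, and loading the recovered words with `setstate` amounts to running $G$ on the recovered state $S(V)$.

```python
import random

def unshift_right_xor(y, shift):
    """Invert y ^= y >> shift on a 32-bit word."""
    result, i = 0, 0
    while i * shift < 32:
        part_mask = ((0xFFFFFFFF << (32 - shift)) & 0xFFFFFFFF) >> (i * shift)
        part = y & part_mask          # these bits of the original are now known
        y ^= part >> shift            # cancel their contribution to lower bits
        result |= part
        i += 1
    return result

def unshift_left_mask_xor(y, shift, mask):
    """Invert y ^= (y << shift) & mask on a 32-bit word."""
    result, i = 0, 0
    while i * shift < 32:
        part_mask = ((0xFFFFFFFF >> (32 - shift)) << (i * shift)) & 0xFFFFFFFF
        part = y & part_mask
        y ^= (part << shift) & mask
        result |= part
        i += 1
    return result

def untemper(v):
    """Invert the MT19937 tempering transform tau on one 32-bit output."""
    v = unshift_right_xor(v, 18)
    v = unshift_left_mask_xor(v, 15, 0xEFC60000)
    v = unshift_left_mask_xor(v, 7, 0x9D2C5680)
    v = unshift_right_xor(v, 11)
    return v

# "Victim" generator whose initial state is unknown to the attacker.
victim = random.Random(1234)                             # illustrative seed
outputs = [victim.getrandbits(32) for _ in range(624)]   # V_1 .. V_624

# S(V) = tau^{-1}(V_1) || ... || tau^{-1}(V_624); index 624 forces the
# clone to regenerate (twist) before producing its next output.
state = [untemper(v) for v in outputs]
clone = random.Random()
clone.setstate((3, tuple(state) + (624,), None))

# The clone now predicts V_625 (and every later output) with probability 1.
assert clone.getrandbits(32) == victim.getrandbits(32)
print("V_625 predicted correctly")
```

The assertion holding every time is exactly the statement $\Pr[V_{625} = G(S(V))] = 1$; for a truly uniform, independent string it would hold only with probability $2^{-32}$.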

Does this nonuniformity matter for the Monte Carlo physics simulation you're doing? Perhaps not: your physics model is unlikely to, by happenstance, interact with the Mersenne twister output generation function $G$ in a way that would be a problem, because physics itself is usually not adversarially trying to screw you up personally, no matter how much the years of pain in grad school might make it seem otherwise.

On the other hand, if there is an intelligent adversary, as we assume in cryptography, that intelligent adversary will exploit this relation to decipher your messages, steal your money by forging signatures, etc.

In other words, the Mersenne twister is designed to appear uniform to stupid algorithms that don't even know it's the Mersenne twister they're trying to distinguish, but it falls flat on its face when it runs into an adversary who knows it's the Mersenne twister and just doesn't know the initial state.

Squeamish Ossifrage

But the MT becomes entirely predictable after 624 iterations, at which point its entropy should be considered to be zero; yet the subsequent distribution is still uniform.

You are mixing up two different concepts of "entropy" here.

  • The first is the "true" (information-theoretic) entropy of a random variable $X\colon\;\Omega\to A$, defined as $$ H(X) = -\sum_{a\in A}\;p_a\cdot\log_2 p_a \text, $$ where $p_a=\Pr[X=a]$ and the term $p_a\cdot\log_2 p_a$ is taken to be $0$ in case $p_a=0$.

    In this case, you would e.g. consider $A$ to be the set of all (infinite) sequences of numbers in the output range $[0,\dots,2^{32}-1]$ of the Mersenne Twister, and $X$ should be the random variable that outputs all MT values starting from some randomized internal state (i.e. $624$ uniformly random $32$-bit words). Then, indeed, there are exactly $2^{32\cdot 624}$ distinct outcomes of $X$, and they are all equally likely, hence the entropy of $X$ is $32\cdot624=19968$ bits.

  • On the other hand, there is something one may call "statistical entropy" or "empirical entropy": This essentially consists of "guessing" a random variable $X$ that may have been used to generate a concrete observed outcome and then computing the entropy of that variable.

    Concretely, given just an MT output sequence of length $n$, one may (wrongly!) guess that it is the output of $n$ independent identically distributed random variables $X_1,\dots,X_n$, each producing a $32$-bit word $i$ with probability equal to its observed relative frequency $p_i$; the entropy of the random variable $(X_1,\dots,X_n)$ would then be $$n\cdot\sum_{i=0}^{2^{32}-1}-p_i\log_2 p_i$$ bits. This is clearly different from the information-theoretic entropy (above), but there is no contradiction: it is simply a guess for the true entropy based on incomplete information (a small numerical illustration follows after this list). There are other sets of assumptions (e.g. different word sizes; uniform distribution instead of estimating based on frequencies; bases different from powers of two; dependencies between the words in the sequence; ...) that typically lead to different estimates for the entropy.

    In summary, the information-theoretic entropy of any MT instance (without reseeding) is upper bounded by $19968$ bits, but without knowing that it comes from an MT, the generated sequence may look like a uniformly random sequence to statistical tests.

    Note that this is essentially the same problem as determining the "entropy" of a password: Unless you are generating a password yourself using a known randomized process, there simply is no way to tell whether a password is "random enough" or not. Just by looking at the characters, e.g. an encoded version of $\mathrm{SHA256}(w)$ for a random dictionary word $w$ may look very "random", but once you know the process used to generate it, the entropy is clearly upper bounded by the $\log_2$ of the number of words in the dictionary.
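To make the distinction concrete, here is a small Python sketch of the "empirical entropy" estimate described above, applied to MT output. The seed and sample size are arbitrary illustrative choices, and bytes are used instead of full $32$-bit words so that every symbol value is actually observed:

```python
import math
import random
from collections import Counter

def empirical_entropy_bits_per_byte(data: bytes) -> float:
    """Entropy of the i.i.d. model fitted to the observed byte frequencies."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# One megabyte of MT output from a seed the observer does not know.
rng = random.Random(0xC0FFEE)                          # illustrative seed
stream = rng.getrandbits(8 * 10**6).to_bytes(10**6, "little")

print(empirical_entropy_bits_per_byte(stream))         # very close to 8.0
# The empirical estimate is near the 8 bits/byte maximum, yet the true
# entropy of the whole (arbitrarily long) stream is at most
# 32 * 624 = 19968 bits, because everything past the seed is deterministic.
```

The gap between those two numbers is exactly the point: the estimate reflects only what the guessed model assumes, not the process that actually produced the data.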

yyyyyyy

Your question encompasses both the cryptographic and the information-theoretic definitions of entropy. In cryptography, one is generally concerned with unknown entropy, i.e. entropy as seen by an observer who does not know the seed.

Entropy is maximized when the distribution is uniform...

This is true, and it holds whether the sequence is known or unknown. In the standard $ H(X)=-\sum_{x \in X} P(x) \log_2{P(x)} $ formula, the summation is maximised when all $ P(x) $ are equal. It drops as the distribution moves away from uniform towards other types (Poisson, Gaussian, etc.). You can envisage it as related to how flat the sequence's histogram is, as long as the bins remain the same.
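As a quick illustration of that formula (the two distributions below are made up for the example), a few lines of Python show the sum peaking for the uniform case and dropping for a skewed one:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum p * log2(p), with the p = 0 terms taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8                                  # uniform over 8 outcomes
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05, 0.0, 0.0]    # same outcomes, uneven

print(shannon_entropy(uniform))   # 3.0 bits, the maximum for 8 outcomes
print(shannon_entropy(skewed))    # about 2.06 bits
```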

But the MT becomes entirely predictable after 624 iterations, at which point its entropy should be considered to be zero; yet the subsequent distribution is still uniform.

Mostly correct. Assume that the seed fed into the MT is unknown to the observer (unknown entropy). The MT will therefore start out producing output that is unpredictable to that observer. The ratio $\frac{H_{\text{unknown}}}{H_{\text{known}}}$ then decreases linearly from high to 0 as the output lengthens to 624 words. After 624 words, all future MT output is entirely predictable and thus known. Yet according to information theory the known entropy remains at 8 bits/byte, as the distribution is still uniform; storing that output would still require 8 bits/byte. The unknown entropy that is useful for cryptography will of course then be zero.

Nobody at all does this, but if you did use the MT for cryptographic randomness, I suggest reseeding after 616 outputs, which by this accounting leaves 624 - 616 = 8 unobserved 32-bit state words, i.e. a 256-bit security level. There are other MT warm-up nuances that make this a sub-optimal idea, but it illustrates the principle of entropy transforming from unknown to known.

Paul Uszak