7

If we start with a set of possible input values, apply the MD5 algorithm to every element of this set, and then keep only the unique results (i.e. discard the duplicates caused by collisions), we are left with a possibly smaller set. Consider this small piece of pseudocode:

Set<String> inputs = ALL_UNIQUE_INPUTS;
while (inputs.size() > 1) {
    // A set stores each digest once, so colliding inputs merge here.
    Set<String> newInputs = new HashSet<>();
    for (String input : inputs) {
        newInputs.add(md5(input));
    }
    inputs = newInputs;
}

I believe that most iterations of the while loop will decrease the size of inputs, until it reaches 1. Is this true?

Furthermore, can we somehow determine how many iterations this would take for a given input space?

Note: I am aware that you should not attempt to increase the computational complexity of password hashing in this manner, but use salts, reapply those and other neat tricks to make the process more computationally complex whilst not increasing the chance of collisions.

CodesInChaos

4 Answers

8

For an $n$-bit hash ($n=128$ for MD5), after about $2^{n/2}=2^{64}$ iterations an input will enter a cycle (of length approximately $2^{n/2}=2^{64}$). If two inputs have not collided by the time they enter a cycle, they never will.

If you have more than about $2^{n/4}=2^{32}$ inputs, you'll get collisions after $2^{n/2}=2^{64}$ iterations, as per the birthday problem. But of course you won't have the patience to wait that long.
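As a rough check of these numbers, here is a minimal sketch of the birthday estimate. It assumes an ideal $n$-bit hash, treats the rounds as independent, and ignores that the set shrinks slightly after each collision. The function name `collision_probability` and its parameters are made up for this illustration.

```python
import math

def collision_probability(k: int, t: int = 1, n: int = 128) -> float:
    """Approximate chance that k distinct values produce at least one
    collision within t rounds of an ideal n-bit hash (birthday estimate)."""
    pairs = k * (k - 1) // 2            # number of unordered pairs in the set
    expected = t * pairs / 2**n         # expected number of colliding pairs
    return -math.expm1(-expected)       # 1 - exp(-expected), stable when tiny

# One round over 2^64 values, one round over 2^32 values,
# and 2^64 rounds over 2^32 values.
print(collision_probability(2**64))
print(collision_probability(2**32))
print(collision_probability(2**32, t=2**64))
```

Under these assumptions a single round over $2^{32}$ values is practically collision free, while $2^{64}$ rounds over the same set give roughly the same chance as a single round over $2^{64}$ values.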

CodesInChaos
3

This question has been addressed here quite a few times, just phrased differently, for example in Cycles in SHA256.

Another answer here made the following statement:

Hashes have a fixed size output. After one round, all your new inputs will be equally sized, being the size of the hash output. An ideal hash function operating on such inputs will be bijective and you will constantly just be rearranging your inputs with none of them ever colliding.

An ideal hash function in this sense is quite similar to a perfect hash function. However, this does not fit cryptographic hash functions, where the "ideal" version is a random oracle or a truly random function (with the specified domain), for which collisions can happen.

Exactly this question was already addressed here:

But to come back to the question: Cryptographic hashes are designed to be as close to random functions as possible. In an answer to the first linked question, fgrieu drew a really nice visualization here.

A few of the key points of what to expect:

  • The graph is probably disconnected
  • The graph contains cycles of different lengths
  • It might contain fixed points (cycles of length 1)
  • It also contains nodes, which lead to a cycle but are not part of it.
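These properties can be observed on a toy version of the problem. The sketch below is not from the linked answers: it assumes a hypothetical 12-bit hash `toy_md5` (MD5 truncated to `BITS` bits, a size chosen only so the whole graph fits in memory) and counts cycles, fixed points and non-cyclic nodes of its graph.

```python
import hashlib

BITS = 12          # toy digest size (assumption); real MD5 outputs 128 bits
N = 1 << BITS

def toy_md5(x: int) -> int:
    """Hypothetical toy hash: MD5 of a 2-byte encoding of x, cut to BITS bits."""
    digest = hashlib.md5(x.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big") % N

succ = [toy_md5(x) for x in range(N)]   # edge x -> succ[x]

def cyclic_nodes(succ):
    """Repeatedly strip nodes without a predecessor; the rest lies on cycles."""
    indeg = [0] * len(succ)
    for y in succ:
        indeg[y] += 1
    stack = [x for x, d in enumerate(indeg) if d == 0]
    removed = [False] * len(succ)
    while stack:
        x = stack.pop()
        removed[x] = True
        y = succ[x]
        indeg[y] -= 1
        if indeg[y] == 0:
            stack.append(y)
    return {x for x in range(len(succ)) if not removed[x]}

def split_cycles(succ, on_cycle):
    """Group the cyclic nodes into the individual cycles they belong to."""
    seen, result = set(), []
    for start in sorted(on_cycle):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = succ[x]
        result.append(cycle)
    return result

on_cycle = cyclic_nodes(succ)
all_cycles = split_cycles(succ, on_cycle)
# Every connected component of such a graph contains exactly one cycle.
print("connected components:", len(all_cycles))
print("cycle lengths:", sorted(len(c) for c in all_cycles))
print("fixed points:", sum(1 for c in all_cycles if len(c) == 1))
print("nodes that only lead into a cycle:", N - len(on_cycle))
```

The exact numbers depend on the truncation and say nothing precise about full MD5; the point is only the shape of the graph.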

So to answer the initial questions:

I believe that most iterations of the while loop will decrease the size of inputs, until it reaches 1. Is this true?

No. An iteration can decrease the size of the set, but with a cryptographic hash that is unlikely. The set size only decreases when there is a collision, which is really unlikely (and certainly does not happen in "most iterations").

Considering the second part of the question (reaching a size of 1): that could happen, but it is unlikely. If we remember the graph of the random function, then all elements of the original set have to

  • be in the same connected subgraph, and
  • end up in a fixed point instead of a longer cycle; or alternatively they have to collide with each other, so that only one element is left travelling around the cycle.
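To see what the loop from the question does on a graph like this, here is a small sketch on a hypothetical 12-bit toy hash (`toy_md5`, MD5 truncated to `BITS` bits; both names and the size are assumptions for illustration). Unlike the original loop it stops once the set size no longer changes. That stopping rule is valid here because the starting set is the whole domain, so each new set is contained in the previous one.

```python
import hashlib

BITS = 12          # toy digest size (assumption); real MD5 outputs 128 bits
N = 1 << BITS

def toy_md5(x: int) -> int:
    """Hypothetical toy hash: MD5 of a 2-byte encoding of x, cut to BITS bits."""
    digest = hashlib.md5(x.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big") % N

def shrink_until_stable(inputs):
    """Hash every element, drop duplicates, and repeat until the set size
    stops changing. Returns the size after each round and the final set."""
    sizes = [len(inputs)]
    while True:
        inputs = {toy_md5(x) for x in inputs}
        sizes.append(len(inputs))
        if sizes[-1] == sizes[-2]:
            return sizes, inputs

sizes, survivors = shrink_until_stable(set(range(N)))
print("set size per iteration:", sizes)
print("stable size:", len(survivors))
```

The surviving elements are exactly the values that lie on cycles, so the size reaches 1 only if the whole graph drains into a single fixed point. With a tiny digest and the full domain as input, the set shrinks quickly at first; with 128-bit digests and a comparatively small input set, collisions are far rarer.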

Furthermore, can we somehow determine how many iterations this would take for a given input space?

Well, for practical purposes with MD5: much, much too long. The cycles can have any length within the graph, and the straightforward way to notice that you are in a cycle is to save all previous steps. With a graph of $2^{128}$ nodes, you would have to estimate the number of nodes in cycles, and then estimate how many values you need to store to be able to determine that you are in a cycle. It is quite likely that you need close to $2^{128}$ steps anyway.
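For a single starting value, storing every step is not the only option. As an alternative that is not part of this answer, Floyd's tortoise-and-hare method detects the cycle with constant memory, at the cost of extra hash evaluations. The sketch below runs it on a hypothetical 16-bit toy hash (`toy_md5`, MD5 truncated to two bytes), since the same walk on full MD5 would be expected to take on the order of $2^{64}$ steps.

```python
import hashlib

def toy_md5(x: int) -> int:
    """Hypothetical toy hash: MD5 of a 2-byte encoding of x, cut to 16 bits."""
    digest = hashlib.md5(x.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")

def floyd(f, x0):
    """Return (tail_length, cycle_length) of the walk x0, f(x0), f(f(x0)), ..."""
    # Phase 1: the hare moves twice as fast until both meet inside the cycle.
    tortoise, hare = f(x0), f(f(x0))
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(f(hare))
    # Phase 2: restart the tortoise; they now meet at the cycle entry point.
    tail = 0
    tortoise = x0
    while tortoise != hare:
        tortoise, hare = f(tortoise), f(hare)
        tail += 1
    # Phase 3: walk once around the cycle to measure its length.
    length = 1
    hare = f(tortoise)
    while tortoise != hare:
        hare = f(hare)
        length += 1
    return tail, length

tail, length = floyd(toy_md5, 0)
print("steps before entering the cycle:", tail)
print("cycle length:", length)
```

This only follows one input. Tracking a whole set of inputs, as in the question, still means following every element.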

tylo
1

Your set of inputs will not necessarily reach a single input, and for any well designed cryptographic hash function, it won't.

Hashes have a fixed size output. After one round, all your new inputs will be equally sized, being the size of the hash output. An ideal hash function operating on such inputs will be bijective and you will constantly just be rearranging your inputs with none of them ever colliding.

Evaluating the difference between an ideal hash function and the actual MD5 algorithm in this scenario is many PhDs' worth of research.

1

I believe that most iterations of the while loop will decrease the size of inputs, until it reaches 1. Is this true?

No, not for input that has no pre-computed collisions.

Finding collisions is supposed to be a hard problem for secure hash functions. For SHA-1 none have been published yet at the time of writing, and that's a hash function that is considered pretty weak.

Furthermore, can we somehow determine how many iterations this would take for a given input space?

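The chance of an accidental collision over a small input space is very low. The chance of finding at least one collision is high for a set of $2^{64}$ elements (the birthday bound for a 128-bit hash); for $2^{32}$ elements it is negligible in any single round and only becomes significant after a very large number of rounds.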

In other words, you'll wait forever. Again, for input that has no pre-computed collisions.

Note: I am aware that you should not attempt to increase the computational complexity of password hashing in this manner, but use salts, reapply those and other neat tricks to make the process more computationally complex whilst not increasing the chance of collisions.

I don't know about the other neat tricks, but for MD5 it's easy to find collisions because MD5 is broken. Anybody could create a set of distinct inputs in such a way that the while loop would immediately terminate (once it gets past that, it will probably never end).

It is better to use a secure hash such as one of the SHA-2 or SHA-3 variants.

Maarten Bodewes