How much can I assume hashes to be different for different contents?

Question

I know that some hashes, like MD5 or SHA-1, that were previously thought to be safe are now known to be vulnerable to collision attacks. But it is obvious that collisions exist for all hashes, given that the space of possible hashes is smaller than the space of possible contents. For example, if one considers all possible files whose size is smaller than or equal to the hash size, there must be some collisions.

However, I wonder if I can be sure that hashes will be different for “small enough” differences in contents. For example, for a given hash, can I assume that:

All contents whose size is = the hash size will have different hashes (so that if $H(m_1) ≠ H(m_2)$ then $H(H(m_1)) ≠ H(H(m_2))$)?
All contents smaller than $m$ bits/bytes will have different hashes?
All contents that differ by less than $m$ bits/bytes will have different hashes?
All contents that differ by less than $m$ consecutive bits/bytes will have different values?
Inserting fewer than $m$ bits/bytes within a content will change its hash?
Inserting fewer than $m$ bits/bytes at the end/beginning of a content will change its hash?
Anything else?

If there are such assumptions that are true, do they survive the hash being truncated?

I guess the answers to these questions depend heavily on the chosen hash function. I’m very interested in answers about hashes of the SHA-2 and SHA-3 families, but answers about other hash functions (even MD5 and SHA-1) are welcome as well.

score 6 · Answer 1 · edited Apr 16 '17 at 03:07

All contents whose size is = the hash size will have different hashes (so that if hash(file1) ≠ hash(file2) then hash(hash(file1)) ≠ hash(hash(file2)))?

No, but finding such a pair should be computationally infeasible for a secure hash.

All contents smaller than m bits/bytes will have different hashes?

That depends on the value of m. If m = 1 (bit or byte) then it will be true for any secure hash. If m is very large we get back into the situation that there must be identical hashes because of the pigeonhole principle.
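The pigeonhole threshold can be written down explicitly. This is only a counting sketch: it assumes an $n$-bit hash (the symbol $n$ is introduced here and is not part of the question) and measures $m$ in bits.

$$\#\{\text{messages shorter than } m \text{ bits}\}=\sum_{k=0}^{m-1}2^k=2^m-1,\qquad \#\{\text{hash values}\}=2^n$$

A collision is therefore guaranteed as soon as $2^m-1>2^n$, that is for $m\ge n+1$. For $m\le n$ counting alone does not force a collision, although nothing forbids one either.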

All contents that differ by less than m bits/bytes will have different hashes?

No, but finding a pair of messages that collide should be computationally infeasible for a secure hash.

All contents that differ by less than m consecutive bits/bytes will have different values?

See above.

Inserting less than m bits/bytes within a content will change its hash?

See above.

Inserting less than m bits/bytes at the end/beginning of a content will change its hash?

See above.

Anything else?

Basically it all comes down to the basic properties of secure hash functions.

If there are such assumptions that are true, do they survive the hash being truncated?

In general, truncating a secure hash of course limits the security, but it should only cost 1 bit of security for each 2 bits removed (for collision attacks; possibly more for other attacks, but those start from a higher security level in the first place).
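The "1 bit per 2 bits removed" rule for collisions can be observed on a toy scale. The following is a minimal sketch, assuming SHA-256 truncated to its leading bits as the hash under test; the helper names `truncated_sha256` and `find_collision` are made up for this illustration. By the birthday bound, a collision on a $t$-bit truncation is expected after roughly $2^{t/2}$ hashes.

```python
import hashlib
from itertools import count


def truncated_sha256(data: bytes, bits: int) -> int:
    """Return the first `bits` bits of SHA-256(data) as an integer."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)


def find_collision(bits: int):
    """Hash distinct counter-based messages until two share a truncated digest.

    Returns the colliding pair and the number of hashes computed.
    """
    seen = {}  # truncated digest -> message that produced it
    for i in count():
        msg = str(i).encode()
        t = truncated_sha256(msg, bits)
        if t in seen:
            return seen[t], msg, i + 1
        seen[t] = msg


if __name__ == "__main__":
    for bits in (16, 24, 32):
        a, b, tries = find_collision(bits)
        print(f"{bits}-bit truncation: {a!r} and {b!r} collide after {tries} hashes")
```

Adding 8 bits to the truncation length should multiply the work by about $2^4=16$ on average, not by $2^8$, although individual runs fluctuate.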

I guess the answers to these questions depend heavily on the chosen hash function. I’m very interested in answers about hashes of the SHA-2 and SHA-3 families, but answers about other hash functions (even MD5 and SHA-1) are welcome as well.

The answers above are for generic secure hash functions. MD5 / SHA-1 are obviously not considered secure anymore.

Detailing each and every security property of each and every secure hash and testing if it is vulnerable to attacks is way too broad for any answer.

fgrieu · Answer 2 · 2017-04-22T13:37:41.533

Additions to Maarten Bodewes's answer:

It is possible to construct a hash (collision-resistant, preimage-resistant, and behaving mostly like a random function) with the property that

All contents whose size is = the hash size will have different hashes (so that if $H(m_1)\ne H(m_2)$ then $H(H(m_1))\ne H(H(m_2))$ )

One method is to start from a normal hash of $b$ bits, and special-case what happens when the message $m$ is exactly $b$ bits, where the hash is defined to be $P(m)$ with $P$ a fixed one-way permutation of $b$ bits. For large enough $b$, we can construct $P$ based on the discrete logarithm problem, and cycling.

Example with $b=2048$: let $p$ be the smallest prime at least ${\pi\over3}2^b$ with $q=(p-1)/2$ prime, and $g$ the smallest integer at least ${\sqrt5-1\over2}p$ with $g^q\not\equiv1\pmod p$; that is $p=\left\lceil{\pi\over3}2^b\right\rceil+3115515$ and $g=\left\lceil{\sqrt5-1\over2}p\right\rceil$

If the message $m$ is exactly $b$ bits long:
1. convert $m$ to integer using big-endian convention, giving $x$;
2. let $x\gets(g^{x+1}-1)\bmod p$ and repeat until $x<2^b$;
3. convert $x$ to $b$-bit bitstring using big-endian convention, giving $H(m)$.
Otherwise (message shorter or longer than $b$ bits), let $h\gets\operatorname{SHA-512}(m)$ and let $H(m)$ be $\operatorname{SHA-512}(h\|'0')\|\operatorname{SHA-512}(h\|'1')\|\operatorname{SHA-512}(h\|'2')\|\operatorname{SHA-512}(h\|'3')$

Given how $p$ and $g$ are chosen, $x\to g^x\bmod p$ is a permutation of the set $\{1,2,\dots,p-1\}$. It follows that step 2 implements a permutation of the set $\{0,1,\dots,2^b-1\}$, and hence that no two $b$-bit messages collide. Without proof: the best methods we have to find a collision or preimage involve breaking $\operatorname{SHA-512}$ or solving a hard discrete logarithm problem.
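The permutation argument can be checked exhaustively on a toy instance. This is a sketch under loud assumptions: $b=8$ and the safe prime $p=263$ are illustrative values chosen here (they offer no security whatsoever and are not the answer's parameters), and the names `pick_generator` and `permute` are hypothetical. Floating-point `sqrt` is only adequate at this toy size.

```python
from math import ceil, sqrt

B = 8     # toy hash size in bits (the answer uses b = 2048)
P = 263   # safe prime just above 2**B: (263 - 1) // 2 = 131 is also prime
Q = (P - 1) // 2


def pick_generator(p: int, q: int) -> int:
    """Smallest g >= ceil((sqrt(5) - 1) / 2 * p) with g**q != 1 (mod p)."""
    g = ceil((sqrt(5) - 1) / 2 * p)
    while pow(g, q, p) == 1:
        g += 1
    return g


G = pick_generator(P, Q)


def permute(x: int) -> int:
    """Step 2: iterate x <- (g^(x+1) - 1) mod p until the result is below 2^b."""
    if not 0 <= x < 2**B:
        raise ValueError("x must fit in b bits")
    while True:
        x = (pow(G, x + 1, P) - 1) % P
        if x < 2**B:
            return x


if __name__ == "__main__":
    images = sorted(permute(x) for x in range(2**B))
    # Every b-bit value is hit exactly once, so no two b-bit inputs collide.
    print(images == list(range(2**B)))
```

The loop always terminates because the underlying map is a permutation of $\{0,\dots,p-2\}$: following its cycle from a value below $2^b$ must eventually come back to a value below $2^b$.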

Other one-way permutations that allow reducing $b$ are discussed there.

From this, it is easy to construct a hash of $b$ bits such that all messages strictly shorter than $b$ bits have distinct hashes: right-pad the message $m$ with a single 1 bit; if the result is shorter than $b$ bits, pad it with enough 0 bits to reach $b$ bits; finally, apply the hash defined above.
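The padding step is injective, which is what makes the construction work: distinct short messages become distinct $b$-bit strings, and those never collide under the special-cased hash. Below is a small sketch that checks this exhaustively for an assumed toy size $b=8$, representing messages as strings of `'0'` and `'1'`; the function names are invented for the example.

```python
from itertools import product


def pad_to_b_bits(bits: str, b: int) -> str:
    """Append a single '1', then '0's, so a message shorter than b bits becomes exactly b bits."""
    if len(bits) >= b:
        raise ValueError("message must be strictly shorter than b bits")
    return bits + "1" + "0" * (b - len(bits) - 1)


def all_messages_shorter_than(b: int):
    """Yield every bit string of length 0 .. b-1 (there are 2**b - 1 of them)."""
    for length in range(b):
        for tup in product("01", repeat=length):
            yield "".join(tup)


if __name__ == "__main__":
    b = 8  # toy size, for illustration only
    padded = [pad_to_b_bits(m, b) for m in all_messages_shorter_than(b)]
    # All padded strings are b bits long and pairwise distinct.
    print(len(padded), len(set(padded)))
```

The original message can be recovered by stripping the trailing 0 bits and then the final 1 bit, which is another way to see that no two inputs share a padded form.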

Guut Boy · Answer 3 · 2017-04-18T19:16:04.993

TLDR: All but the second property cannot be assumed of general cryptographic hash functions. The first property could possibly hold for specific hash functions, but cannot generally be assumed. ~~The remainder are impossible for any hash function (given that the space of contents is larger than the space of possible hash values).~~

Below I explain in more detail.

All contents whose size is equal to the hash size will have different hashes (so that if $H(m_1)≠H(m_2)$ then $H(H(m_1))≠H(H(m_2))$ )?

It may be possible to specifically design a hash function to have this property, but I do not know of any commonly used function with this property.

However, generally you cannot give this guarantee. A hash function could easily be secure while having a collision between two messages with size equal to the hash function output. By collision resistance of a cryptographic hash function it would be hard to find such a collision though.

In fact it is very likely that there is such a collision. To see this consider that there are as many contents of this size as there are possible hash values. Thus given the set of contents of this size and just one additional content we are certain to have a collision within this set.
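A toy experiment illustrates why such a collision is very likely. The sketch below assumes a 16-bit hash obtained by truncating SHA-256 (the helpers `sha256_16` and `image_size` are made up for illustration) and hashes every 16-bit input. For a function that behaves randomly, one would expect only about $1-1/e$ (roughly 63%) of the possible outputs to be hit, so many same-size inputs must collide.

```python
import hashlib


def sha256_16(data: bytes) -> bytes:
    """Toy 16-bit hash: the first two bytes of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[:2]


def image_size() -> int:
    """Number of distinct toy hashes over all 2**16 two-byte inputs."""
    return len({sha256_16(i.to_bytes(2, "big")) for i in range(2**16)})


if __name__ == "__main__":
    distinct = image_size()
    print(f"{distinct} distinct outputs for {2**16} inputs "
          f"({distinct / 2**16:.1%} of the output space)")
```

The same reasoning applies at full size: a random-looking function from $2^n$ inputs to $2^n$ outputs is a permutation only with probability $(2^n)!/(2^n)^{2^n}$, which is vanishingly small.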

All contents smaller than $m$ bits/bytes will have different hashes?

For "small enough" values of $m$ this property is actually required for a hash function to be collision resistant. This is because for sufficiently small $m$ we could simply brute-force our way to a collision for any function that does not have this property.
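The brute-force argument can be run directly for a very small bound. As a minimal sketch, assuming SHA-256 as the hash and a limit of two bytes (both choices made here for illustration, with hypothetical helper names), the code hashes every message of at most two bytes and reports whether any two digests coincide.

```python
import hashlib
from itertools import product


def short_messages(max_len: int):
    """Yield every byte string of length 0 .. max_len."""
    for length in range(max_len + 1):
        for tup in product(range(256), repeat=length):
            yield bytes(tup)


def has_collision(max_len: int) -> bool:
    """True if two distinct messages of at most max_len bytes share a SHA-256 digest."""
    seen = set()
    for msg in short_messages(max_len):
        digest = hashlib.sha256(msg).digest()
        if digest in seen:
            return True
        seen.add(digest)
    return False


if __name__ == "__main__":
    print(has_collision(2))
```

The search space grows as $256^{\text{length}}$, so this only stays feasible for tiny bounds; that is exactly why the property is forced for small $m$ and says nothing about large $m$.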

All contents that differ by less than $m$ bits/bytes will have different hashes?

EDIT: as pointed out in the comments the following argument does not hold (hence the strike-through).

~~A hash function cannot have this property. To see this consider that all contents differ from some other contents by $m$ bits/bytes or less. In fact, we can go from any content $c$ to any other content $c'$ by $m$ bits/bytes increments. Thus this property actually implies a hash function without collisions, which is generally not possible.~~

All contents that differ by less than $m$ consecutive bits/bytes will have different values?

Inserting fewer than $m$ bits/bytes within a content will change its hash?

Inserting fewer than $m$ bits/bytes at the end/beginning of a content will change its hash?

~~For these questions the same argument as above holds. I.e., these properties all imply a hash function without collisions.~~

score 1 · Answer 4 · answered Apr 15 '17 at 16:19

No, you can't assume that messages of the same size as the hash have different hashes whenever the messages differ. Hash algorithms like MD5, SHA-1 and SHA-256 work on 512-bit blocks, which are larger than the resulting hash. They also pad the message, appending its length, so that the padded input is unambiguous for every message. This means that a message of the same length as the resulting hash is digested as a single block holding the message followed by padding. Hash functions are also, in almost all cases, not bijective.
For SHA-3 (Keccak) this is slightly more complex because the algorithm uses a sponge construction, while the others mentioned all use the Merkle–Damgård construction. However, SHA-3 was also not designed to reveal any information about the input message through the hash. Any special property could be exploited in attacks. Because of this, cryptologists often want a hash algorithm to behave like a random oracle, a criterion under which every finalist of the SHA-3 competition (including Keccak, which became SHA-3) was evaluated. 1, 2

The same should be true for smaller messages. There may be algorithms which have this property, but the most common ones like MD5, SHA-1 and SHA-256 were not designed with this in mind. You could brute-force every small message to see if there are duplicates, but as long as the algorithm is still secure (MD5 and SHA-1 are NOT), with overwhelming probability you won't find any.

Changing parts of the message should always (with overwhelming probability) result in a different hash. Some algorithms, especially older ones like MD5, are susceptible to length extension attacks. 3 This does not mean that you can easily create a message which has the same hash as another one, but it can still cause security problems depending on your protocol. Note that the entries for the SHA-3 competition were required to have defenses against length extension attacks, and Keccak is not susceptible to them as far as we know.
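The first claim above, that changing part of a message changes the hash, can be observed with a short experiment. This is only an illustrative sketch, assuming SHA-256 and an arbitrary sample message; `flip_bit` and `differing_bits` are names invented for the example. It flips each bit of the message in turn and counts how many digest bits change.

```python
import hashlib


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with bit number `index` (most significant bit first) inverted."""
    out = bytearray(data)
    out[index // 8] ^= 0x80 >> (index % 8)
    return bytes(out)


def differing_bits(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    msg = b"hello world"  # arbitrary sample message
    base = hashlib.sha256(msg).digest()
    distances = [
        differing_bits(base, hashlib.sha256(flip_bit(msg, i)).digest())
        for i in range(len(msg) * 8)
    ]
    print(f"digest bits changed per single-bit flip: "
          f"min {min(distances)}, max {max(distances)} (out of 256)")
```

For a hash behaving like a random function, each flip should change about half of the 256 digest bits. This shows typical behaviour only; it is not a guarantee that no two close messages ever collide.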

Standard disclaimer: Please note that MD5 and SHA-1 are broken. Don't use them for anything anymore unless you are really, really sure that you know what you are doing. All these statements only apply to hash algorithms that are still secure. MD5 and SHA-1 are not secure.
