3

For quite some time I've been thinking about the idea of constructing a hashing algorithm whose output is contained in the content it hashes, so that the content can verify itself. With hashing algorithms like SHA1 and MD5 this seems to be difficult, although not impossible, as explained here. The hashed content could never change afterwards, but I can think of a few situations where this is exactly what is wanted, for example certificates containing their own thumbprint calculated over all fields. I once tried to design the basics myself, but that was more ambitious than I had foreseen.

The way I see it there are two approaches:

  • Narrow down the possible hash values by analyzing the content, then race through all remaining possibilities to see whether the content and the hash match. I have implemented this, and although it did work, it was anything but usable.
  • Calculate a hash and adjust the content to match the hash value. For hashing algorithms like MD5 this is nearly impossible, and any new algorithm that allowed it would possibly be impaired by it.
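To make the first approach concrete, here is a minimal Python sketch of such a search. It is not the implementation mentioned above; the field layout (`nonce=`, `thumbprint=`), the function names and the truncation to a few hex digits are all assumptions chosen for illustration. With a truncated thumbprint of k hex digits the search needs roughly 16^k attempts on average, which is why it stops being usable long before the full digest length is reached.

```python
import hashlib
from typing import Optional


def find_self_referencing(body: str, hex_digits: int = 4,
                          max_tries: int = 5_000_000) -> Optional[str]:
    """Search for content that ends with the first `hex_digits` hex
    characters of its own SHA-256 digest (hypothetical field layout)."""
    space = 16 ** hex_digits
    for i in range(max_tries):
        # Walk through every (nonce, guessed thumbprint) pair in turn.
        nonce, guess = divmod(i, space)
        thumbprint = f"{guess:0{hex_digits}x}"
        candidate = f"{body}|nonce={nonce}|thumbprint={thumbprint}"
        digest = hashlib.sha256(candidate.encode()).hexdigest()
        if digest.startswith(thumbprint):
            return candidate
    return None  # gave up within the attempt budget


def verifies_itself(message: str) -> bool:
    """Check that the embedded thumbprint is a prefix of the real digest."""
    embedded = message.rsplit("=", 1)[1]
    digest = hashlib.sha256(message.encode()).hexdigest()
    return bool(embedded) and digest.startswith(embedded)


if __name__ == "__main__":
    found = find_self_referencing("subject=example", hex_digits=4)
    print(found, verifies_itself(found) if found else None)
```

Every extra hex digit multiplies the expected work by 16, so this only illustrates the idea for short, truncated thumbprints.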

I'm convinced it is both possible and usable in some way, so it surprises me how little there is to find on the subject. Are there any case studies or algorithms related to this idea?

Yorick de Wid

2 Answers

4

I think what you're missing here is that a cryptographic hash by itself is not actually sufficient to verify the integrity of a message. Consider this:

I want to send a message over the Internet (on an insecure connection, e.g. UDP), but have it be protected from tampering. I take the message and attach at the end a cryptographic hash of the message (e.g. SHA256(message)). When my friend receives the message, she verifies that the hash at the end matches SHA256(message).

Now, despite the fact that she has "verified" the message using a cryptographic hash, it turns out that this is not sufficient to prevent tampering. If an attacker is able to intercept and modify the message before it reaches my friend, they can remove the old hash, SHA256(message), and add a new hash, SHA256(tamperedMessage). When my friend receives the tampered message, she won't notice anything is wrong, because the message she receives still ends with a matching SHA256 hash.
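The attack is easy to reproduce. The following Python sketch assumes a made-up packet layout (message followed by its 32-byte SHA-256 digest); the function names are illustrative only.

```python
import hashlib

DIGEST_LEN = 32  # SHA-256 digest size in bytes


def attach_hash(message: bytes) -> bytes:
    """Sender: append SHA256(message) to the message."""
    return message + hashlib.sha256(message).digest()


def verify(packet: bytes) -> bool:
    """Receiver: recompute the hash and compare it with the attached one."""
    message, attached = packet[:-DIGEST_LEN], packet[-DIGEST_LEN:]
    return hashlib.sha256(message).digest() == attached


def tamper(packet: bytes, new_message: bytes) -> bytes:
    """Attacker: drop the old hash and attach a fresh one. No secret needed."""
    return attach_hash(new_message)


if __name__ == "__main__":
    packet = attach_hash(b"pay Bob 10")
    forged = tamper(packet, b"pay Eve 1000")
    print(verify(packet), verify(forged))  # both checks pass
```

The check only catches accidental corruption. Anyone who can rewrite the message can also rewrite the hash.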

Hashes do not (by themselves) provide integrity protection (in other words, prevent tampering). Hashes provide a convenient way of identifying the content of a message.

We can use hashes to build schemes that do provide integrity protection. Consider the earlier example. Let's say I again transmit a message to my friend over the Internet, this time without a hash. She receives the message, but knows that she cannot yet trust that it has not been tampered with. If I meet up with her in person, I can provide her with the hash, say, diligently scrawled on a sheet of paper. If she checks this hash against the message she received over the Internet, she will know that it has not been tampered with. Why? Because (1) the hash uniquely identifies the original, legitimate message and (2) the hash was provided over a secure channel (i.e. in person).
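As a small Python sketch of that check (the function names are made up, and the "sheet of paper" is simply a hex string the receiver already trusts):

```python
import hashlib
import hmac


def fingerprint(message: bytes) -> str:
    """Hex SHA-256 digest, e.g. what gets written on the sheet of paper."""
    return hashlib.sha256(message).hexdigest()


def matches_trusted(received: bytes, trusted_hex: str) -> bool:
    """Compare the received message against a digest obtained out of band."""
    return hmac.compare_digest(fingerprint(received), trusted_hex)


if __name__ == "__main__":
    trusted = fingerprint(b"meet at noon")           # handed over in person
    print(matches_trusted(b"meet at noon", trusted))      # untouched message
    print(matches_trusted(b"meet at midnight", trusted))  # tampered message
```

The code is the same hash comparison as before. What changes is where the reference digest comes from: the attacker cannot replace a value that never travelled over the insecure channel.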

This is why having a hash that "verifies itself" does not prevent tampering. You could attach to a hash a hash of that hash, then attach to that a hash of the second hash, and so on, but your scheme would still not provide any integrity protection unless the hashes are provided over a secure channel. Hashes identify, and further identifying the identifier does not actually improve security.

Note (in anticipation of nitpicks): I used the term "uniquely identifies". Hashes aren't actually unique from a theoretical perspective, but they are in practice.

Tim McLean
2

Maybe you can create a hash function like that, but it will have a major security weakness, because in order to achieve your goal you need correlated changes in the input and the output, and those can be exploited by an attacker for cryptanalysis.

  • Assume you have a given input and know its corresponding hash value.
  • If you flip a single bit in the input, each bit of the hash value should flip with probability $0.5$. Any deviation from this is a weakness against differential cryptanalysis; formally, it can be seen as a linear correlation in some part of the algorithm.

Such a correlation is exactly what you would need for your self-containing hash function, but at the same time it gives the attacker an advantage in finding collisions or even preimages, calculating the internal algorithm of the hash backwards, etc.
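The bit-flip property can be observed empirically. Below is a rough Python sketch, using SHA-256 as the example hash (my choice, any modern hash should behave alike) and made-up helper names. It flips each input bit in turn and averages how many of the 256 output bits change; for a well-behaved hash the average should sit close to 128.

```python
import hashlib


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of `data` with the bit at position `index` inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)


def changed_bits(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def mean_avalanche(message: bytes) -> float:
    """Average number of SHA-256 output bits that change when a single
    input bit is flipped, taken over every bit of the message."""
    base = hashlib.sha256(message).digest()
    n_bits = len(message) * 8
    total = sum(
        changed_bits(base, hashlib.sha256(flip_bit(message, i)).digest())
        for i in range(n_bits)
    )
    return total / n_bits


if __name__ == "__main__":
    print(mean_avalanche(b"self-verifying hash"))
```

A function designed so that content and hash can be adjusted towards each other would have to deviate from this behaviour somewhere, and that deviation is what an attacker would look for.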

The fact that something exists does not mean there is a more efficient way to find it than just testing all candidates.

tylo