13

There are more inputs to the SHA-256 function than outputs, so it must be a many-to-one function by the Pigeonhole Principle. However, that doesn't automatically imply that it sends more than one input to each of its outputs. For instance, squaring (on the integers) is many-to-one, but maps only 0 to 0.

Question 1: Is it the case that for each 256-bit sequence $H$ to which SHA-256 sends at least one input, SHA-256 sends more than one input to $H$? If so, have mathematicians shown this thoroughly?

Question 2: What is the biggest number $n$ such that we currently know that for each 256-bit sequence $H$ to which SHA-256 sends at least one input, SHA-256 maps at least $n$ inputs to $H$?

Bonus question: Is SHA-256 surjective (onto), that is, does it send to each 256-bit sequence at least one input?

Remark: My practical motivation for asking these questions is to know whether it's really secure to upload SHA-256 hashes of private files, e.g. security camera footage, to a blockchain à la OriginStamp.

Tristan Laguz

3 Answers

18

We do not currently know any two distinct inputs of SHA-256 that map to the same output, and we have no practical method to find such a pair. So there is no compelling reason to fear that someone could exhibit a file with the same hash as a private file but different content, even with assistance from whoever makes the private file.

On the other hand

is it really secure to upload SHA-256 hashes of private files ?

Not necessarily: a published hash allows anyone to confirm a guess of the private file. So if you post the SHA-256 of a credit card number (even with the expiration date appended), determined adversaries can recover that information with near certainty.

Addition 1: If we publish the hash of something private, we can still be safe against brute force (guess-and-verify-by-hash) when there are so many plausible values of that private thing that hashing candidates until the right value is hit is infeasible. One way to be protected is to have at least, say, 128 bits (16 bytes) that are uniformly random and kept secret in that something private. More precisely, we want the (Rényi or min) entropy of that something private to be high enough. As rightly noted in a comment, even large private things can be vulnerable: knowing the hash of a file allows an attacker to determine whether it belongs to a collection of files, and if so, which file it is. In the case of security camera footage, we are safe if the hashed footage is not available in any form to the attacker; but if it's an extract of a larger footage known to the attacker, there's a chance that the attacker can determine the start and finish, and thus the extract.
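To make the guess-and-verify attack concrete, here is a minimal sketch. The 4-digit PIN, the value `4071`, and the helper names are all hypothetical, chosen only so the search space is small enough to enumerate instantly; real attacks differ only in the size of the candidate set.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def recover_by_guessing(published_hash, candidates):
    """Hash each candidate and return the first one matching the published hash."""
    for guess in candidates:
        if sha256_hex(guess.encode()) == published_hash:
            return guess
    return None


# Hypothetical low-entropy secret: a 4-digit PIN (about 13 bits of entropy).
secret = "4071"
published = sha256_hex(secret.encode())

# The attacker enumerates every plausible value and compares hashes.
all_pins = (f"{i:04d}" for i in range(10_000))
recovered = recover_by_guessing(published, all_pins)
print(recovered == secret)
```

The same loop is hopeless when the hashed data includes 128 secret random bits, because the candidate set can then no longer be enumerated.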

Addition 2: Publishing the hash of something can be a security issue for a different reason: when that something is intended to be processed with that hash, or a variant thereof. For example, publishing the SHA-256 of an HMAC-SHA-256 key of more than 64 bytes does not allow finding that key, but reveals a functionally equivalent 32-byte key (one producing the same results when used as key in HMAC-SHA-256). Same for publishing the SHA-512 of an Ed25519 private key, or even one half of that.
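The HMAC case can be checked with Python's standard `hmac` and `hashlib` modules. This is only an illustration of the key-equivalence property described above; the 100-byte key length and the message are arbitrary choices for the example.

```python
import hashlib
import hmac
import os

# A key longer than the 64-byte block size of SHA-256 (length chosen arbitrarily).
long_key = os.urandom(100)

# What would leak if the SHA-256 of that key were published.
short_key = hashlib.sha256(long_key).digest()

message = b"example message"
tag_long = hmac.new(long_key, message, hashlib.sha256).digest()
tag_short = hmac.new(short_key, message, hashlib.sha256).digest()

# HMAC hashes over-long keys before use, so both keys yield the same tag.
print(tag_long == tag_short)
```

Anyone holding `short_key` can therefore compute valid tags without ever learning `long_key`.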


Question 1: Does SHA-256 send more than one input to each of its outputs (the bit sequences to which it maps at least one input)?

That's most probably true. A heuristic argument can be made similarly to the one for the bonus question in the last section of this answer.

If so, have mathematicians shown this thoroughly?

No, that's not mathematically proven.


Question 2: What is the biggest number $n$ such that we currently know that SHA-256 maps to each of its outputs at least $n$ inputs?

I think $n=1$ for "know" in the sense mathematicians use, and "each of its outputs" defined as in Question 1. Argument: I don't see how to prove that there does not exist a message giving a hash that is not reached by any other message.

But heuristically, it's expected $n$ is extremely large. I'm betting the house on $n\gg2^{(2^{63})}$, with heuristic arguments as in the next section of this answer.


Bonus question: does SHA-256 send to each 256-bit sequence at least one input?

That's very likely, but not mathematically proven.

A first-level heuristic argument models SHA-256 as a random function, and then the coupon collector problem shows that in any set of more than $2^{263.48}$ messages (chosen without knowledge of the function) it's likely that all values are reached, with the probability of the contrary dropping fast as the exponent in the number of messages grows. And there are $2^{(2^{64})}-1$ messages for SHA-256, with $2^{64}$ immensely more than $263.48$.
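The exponent quoted above can be reproduced numerically. The sketch below assumes it comes from the usual coupon collector estimate $N(\ln N+\gamma)$ for the expected number of draws needed to see all $N=2^{256}$ values; that reading is an assumption, and the function name is made up for the example.

```python
import math

# Euler-Mascheroni constant, from the asymptotic H_N ~ ln N + gamma.
EULER_GAMMA = 0.5772156649015329


def coupon_collector_log2(bits: int) -> float:
    """log2 of N * (ln N + gamma) for N = 2**bits, computed without forming N."""
    ln_n = bits * math.log(2)
    return bits + math.log2(ln_n + EULER_GAMMA)


exponent = coupon_collector_log2(256)
print(f"expected draws ~ 2^{exponent:.2f}")
```

Working with logarithms avoids building the 256-bit quantities explicitly and makes the comparison with $2^{64}$ in the exponent immediate.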

A finer model of SHA-256 must take into account its Merkle-Damgård (MD) structure, that is, an iterated compression function aiming at being a random function, and the length padding. With this, it's arguable that for 119-byte messages (the maximum byte size of a SHA-256 message processed in two compressions), the number of inputs to the first compression is so far above the coupon collector bound ($2^{512}$ vs $2^{263.48}$) that almost certainly almost all of the $2^{256}$ values are reached by the first compression. And then we have even more inputs in the second compression (nearly $2^{256+512-9\times 8}=2^{696}$), making it even more certain that all outputs are reached. And even if there was some curse for 64-byte blocks ending in 80 00 00 00 00 00 00 03 B8 (corresponding to 119-byte messages), we have plenty of other message sizes to find a non-cursed one.
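The padding bytes mentioned above follow from the SHA-256 padding rule: a `0x80` byte, then zero bytes up to 56 modulo 64, then the message length in bits as an 8-byte big-endian integer. A small sketch (the function name is illustrative, and byte-aligned messages are assumed):

```python
def sha256_padding(message_len: int) -> bytes:
    """Padding appended by SHA-256 to a message of message_len bytes."""
    # Zero bytes needed so that message + 0x80 + zeros is 56 mod 64.
    zeros = (55 - message_len) % 64
    bit_length = (8 * message_len).to_bytes(8, "big")
    return b"\x80" + b"\x00" * zeros + bit_length


pad = sha256_padding(119)
total_blocks = (119 + len(pad)) // 64

# For 119 bytes there is no room for zero bytes: 0x80 plus the 8-byte length.
print(pad.hex(" "), total_blocks)
```

A 120-byte message would spill into a third block, which is why 119 bytes is the largest size handled by two compressions.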

fgrieu
8

Question 1: Does SHA-256 send more than one input to each of its outputs (the bit sequences to which it maps at least one input)? If so, have mathematicians shown this thoroughly?

No, they haven't, but if the output were not well distributed, that would negatively affect collision resistance. It is almost certain that every output has an almost endless number of input messages that map to it. It is "almost" endless because there is an actual limit to the input space for SHA-2, due to the padding containing the size of the input message in bits. But the design of the hash function also makes this hard to prove.

Question 2: What is the biggest number n such that we currently know that SHA-256 maps to each of its outputs (the bit sequences to which it sends at least one input) at least n inputs?

It would be $n=1$ for any output that has been observed as otherwise the collision resistance and thus the algorithm would be broken. It is not possible to know if all outputs can be generated, and for the same reason we don't know if the outputs can be generated more than once.

What we do know is that every input generates an output, which means that, given unlimited runtime, at least some output values must be repeated for some messages due to the pigeonhole principle.

Bonus question: Is SHA-256 surjective (onto), that is, does it send to each 256-bit sequence at least one input?

That's not certain either and obviously it cannot be brute-forced. But as indicated, it is certainly designed to have a good distribution, so you would expect this to be the case. It doesn't matter much for your use case though; you just want collision resistance.

Remark: My practical motivation for asking these questions is to know whether it's really secure to upload SHA-256 hashes of private files, e.g. security camera footage, to a blockchain à la OriginStamp.

Sure. If it wasn't it would not be feasible to use it for signatures either. The same collision resistance is in effect.

I'd be more worried about e.g. sending clips shorter than 1 ms or something like that, because those could possibly be brute forced, and SHA-256 obviously does not protect you against collisions of the input message itself (identical inputs give identical hashes); those can be easily detected. Prefixing a random salt stored with the hash can possibly protect you against the latter, but of course not against the former brute-force attack, as the salt will be known to the attacker as well.

And for the ultra-nervous there is always SHA-512, which may be surprisingly speedy on many systems as it uses 64-bit calculations rather than 32-bit. Then again, it is usually not hardware accelerated. As poncho indicates, it might also be a good idea to use, for instance, SHA3-512 or one of the faster BLAKE hashes.

Maarten Bodewes
3

My practical motivation for asking these questions is to know whether it's really secure to upload SHA-256 hashes of private files, e.g. security camera footage, to a blockchain à la OriginStamp.

Yes, as long as the input file contains enough entropy, this is quite secure (from a privacy/confidentiality perspective), even if you were to distrust any security/randomness properties of SHA-256.

Even if we cannot fully disprove that there might be some outputs of SHA-256 that don't have many preimages, it is actually easy to see that most inputs will map to an output that does have a lot of them: For example, there are at most $2^{256+128}$ different inputs to SHA-256 that map to an output with at most $2^{128}$ different preimages. That's because there are at most $2^{256}$ such outputs to begin with, and by definition at most $2^{128}$ different inputs can map to any given one. In particular, this implies that if your input has at least, say, 512 bits of entropy (as any video file will trivially satisfy), then the probability of it being hashed to a hash value with at most $2^{128}$ preimages is at most about $2^{384}/2^{512}=2^{-128}$.
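The counting argument can be checked exhaustively on a scaled-down stand-in. The sketch below uses a hypothetical toy hash (the first byte of SHA-256, so 256 outputs) over all 2-byte inputs; the threshold of 240 is an arbitrary choice, and the bound holds for any hash function and any threshold.

```python
import hashlib
from collections import Counter


def toy_hash(data: bytes) -> int:
    """Toy 8-bit hash: first byte of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[0]


# Enumerate every 2-byte input and count the preimages of each output.
inputs = [i.to_bytes(2, "big") for i in range(2**16)]
preimage_counts = Counter(toy_hash(x) for x in inputs)


def inputs_in_sparse_outputs(threshold: int) -> int:
    """Number of inputs whose output has at most `threshold` preimages."""
    return sum(c for c in preimage_counts.values() if c <= threshold)


threshold = 240  # arbitrary cut-off for "few preimages"
sparse = inputs_in_sparse_outputs(threshold)
bound = 256 * threshold  # (number of outputs) * (max preimages per sparse output)

# Fraction of inputs landing on a sparse output can never exceed bound / total.
print(sparse, bound, sparse / len(inputs))
```

For SHA-256 itself the same inequality reads $2^{256}\cdot 2^{128}$ inputs at most, out of at least $2^{512}$ equally likely ones.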

This argument holds independently of any security or randomness properties of the hash function itself. The only technicality is that this doesn't mean that SHA-256 could not still theoretically leak some useful information about your input. For example, if you were to replace it with a function that simply outputs the first 256 bits of the input, that would clearly still give away some information, even if all outputs have infinitely many preimages. In practice, it is still very unlikely that SHA-256 hashes would contain any theoretical structure that leaks anything "interesting" about the input, even if SHA-256 were to be "fully broken" in a cryptographic sense tomorrow.

ManfP