18

If you hash a string using SHA-256 on your computer, and I hash the same string using SHA-256 on my computer, will we generate the same value? Does the algorithm depend on a seed (so we'd both need the same seed) or some other such parameter?

edit: To clarify, by 'string' I meant 'the same byte input', which as the comments and @ispiro's answer point out, may be different for the same character string depending on the encoding.

conor

2 Answers

33

Yes, if you hash the same input with the same function, you will always get the same result.

This follows from the fact that it is a hash-function. By definition a function is a relation between a set of inputs and a set of permissible outputs with the property that each input is related to exactly one output.

In practice there is no seed involved in evaluating a hash-function.
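As a quick check of the claim above, here is a minimal sketch using Python's standard `hashlib` module. The helper name `sha256_hex` is just an illustrative choice. No seed or key is passed anywhere, and the digest of the bytes `abc` matches the published SHA-256 test vector on any machine.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string (illustrative helper)."""
    return hashlib.sha256(data).hexdigest()


message = b"abc"

# Two independent evaluations of the same function on the same bytes.
first = sha256_hex(message)
second = sha256_hex(message)

print(first == second)  # True: no seed or other hidden parameter is involved
print(first)
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```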

Now, this is how things work in practice. On the theoretical side of things, we often talk about families of hash-functions. In that case there does exist a key that selects which member of the family we are using. The reason for this is a technical problem with the definition of collision resistance.

The naive definition of collision resistance for a single hash function $H : \{0,1\}^* \to \{0,1\}^n$ would be that for all efficient algorithms $\mathcal{A}$ the following probability is negligible: $$\Pr[(x_1,x_2)\gets\mathcal{A}(1^n): x_1 \neq x_2 \land H(x_1)=H(x_2)]$$

The problem is that this is impossible to achieve. Since $H$ is compressing, collisions necessarily exist. So an algorithm $\mathcal{A}$ that simply has one of those collisions hardcoded and outputs it has $$\Pr[(x_1,x_2)\gets\mathcal{A}(1^n): x_1 \neq x_2 \land H(x_1)=H(x_2)] = 1.$$ The definition is therefore not achievable, since this $\mathcal{A}$ exists by definition, even though nobody might know what it is.

To solve this problem, we define collision resistance for a family of hash-functions $\{H_k : \{0,1\}^* \to \{0,1\}^n\}_k$. We then say that such a family is collision resistant if, for all efficient algorithms $\mathcal{A}$, the following probability is negligible: $$\Pr_{k\gets\{0,1\}^n}[(x_1,x_2)\gets\mathcal{A}(k): x_1 \neq x_2 \land H_k(x_1)=H_k(x_2)].$$

Here we do not run into the same problem, because the exact function for which $\mathcal{A}$ must find a collision is chosen uniformly at random from an exponentially large family. Since $\mathcal{A}$ could have hardcoded collisions for at most a polynomial number of functions in the family, such hash-function families are not trivially impossible.
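To make the hardcoding argument concrete, here is a toy sketch. Everything in it is an assumption for illustration only: `toy_hash` is a made-up family built by truncating SHA-256 of `key || message` to 16 bits, so that collisions are cheap to find by brute force, and the keys `key_a` and `key_b` are arbitrary. It is not a secure construction and not how SHA-256 is used in practice.

```python
import hashlib
from itertools import count


def toy_hash(key: bytes, message: bytes) -> bytes:
    """Toy family member H_k: SHA-256 of key || message, truncated to 2 bytes."""
    return hashlib.sha256(key + message).digest()[:2]


def find_collision(key: bytes) -> tuple[bytes, bytes]:
    """Brute-force two distinct messages that collide under H_key.

    With only 2**16 possible outputs, the pigeonhole principle guarantees
    a repeat within 2**16 + 1 distinct messages.
    """
    seen: dict[bytes, bytes] = {}
    for i in count():
        message = str(i).encode("ascii")
        digest = toy_hash(key, message)
        if digest in seen:
            return seen[digest], message
        seen[digest] = message


key_a = b"key-A"  # hypothetical key the adversary prepared for
key_b = b"key-B"  # hypothetical key sampled afterwards

x1, x2 = find_collision(key_a)

# The pair can be "hardcoded" as a collision for H_{key_a} ...
print(toy_hash(key_a, x1) == toy_hash(key_a, x2))  # True by construction

# ... but nothing ties it to H_{key_b}; for this toy family the same pair
# collides there only with probability about 1 in 65536.
print(toy_hash(key_b, x1) == toy_hash(key_b, x2))
```

An adversary with a fixed table of such pairs covers only the keys it prepared for, which is the intuition behind sampling $k$ at random in the definition.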

Note that this means that there is somewhat of a disconnect between the theoretical treatment of hash-functions and their practical use.

Maeher
11

Strings aren't byte arrays.

The accepted answer deals with the question of whether SHA256 includes a seed. (Though the proof from the word "function" is arguable, since we call password-to-key functions "functions" though they can include "salt and pepper".) But strings still need to be encoded into bytes to be hashed.

Expounding on Drunix's comment, a quick search has revealed that it's quite likely that identical strings return different hash values owing to the strings being encoded in different encodings.

Here's a highly upvoted answer on StackOverflow suggesting using either UTF8 or UTF16 ("unicode" in the answer), which would nominally return different bytes and therefore different hashes.

And here's an answer using ASCII, which "uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character" (MSDN). For any string containing non-ASCII characters, this again returns a different hash than the UTF8-encoded string.
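The effect of the encoding choice can be sketched in Python. The linked answers use .NET; the sample string and the encoding names below are illustrative assumptions. `utf-16-le` is used because Python's plain `utf-16` codec prepends a byte order mark, and `errors="replace"` mimics the question-mark fallback described above.

```python
import hashlib

# Sample text with one non-ASCII character (U+00E9, "e" with acute accent).
text = "h\u00e9llo"

# The same character string turned into three different byte strings.
encodings = {
    "utf-8": text.encode("utf-8"),
    "utf-16-le": text.encode("utf-16-le"),
    "ascii (replace)": text.encode("ascii", errors="replace"),  # b"h?llo"
}

# SHA-256 sees only the bytes, so each encoding yields its own digest.
digests = {name: hashlib.sha256(raw).hexdigest() for name, raw in encodings.items()}

for name, raw in encodings.items():
    print(f"{name:16} {raw!r:32} {digests[name]}")
```

For a pure ASCII string, the UTF-8 and ASCII byte strings coincide and so do their hashes; the UTF-16 bytes still differ.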

Additionally, take the following answer on StackOverflow. It mentions how macOS (Apple's Mac operating system) stores file names in a specific (unexpected?) way so that certain strings will "change" (at least in their byte representation).
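Assuming the behaviour referred to is Unicode normalization (file names stored in a decomposed form), the effect can be reproduced with Python's standard `unicodedata` module. The sample character and the `digest` helper are illustrative choices, not taken from the linked answer.

```python
import hashlib
import unicodedata


def digest(s: str) -> str:
    """SHA-256 of the UTF-8 encoding of `s` (illustrative helper)."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


composed = "\u00e9"                                  # single code point U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0301

# Both render as the same character, but the code points and bytes differ.
print(composed == decomposed)                  # False
print(digest(composed) == digest(decomposed))  # False

# Normalizing both sides to one form before encoding removes the mismatch.
renormalized = unicodedata.normalize("NFC", decomposed)
print(digest(composed) == digest(renormalized))  # True
```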

And, of course, if your string comes from a text file, it will depend on the file's encoding. Notepad defaults (at least on my computer) to ANSI.

ispiro