28

I understand that MD5, for example, produces a 128-bit hash value from an input text of variable size. My question is whether there is a hash-like algorithm that lets one specify the length of the output, so that for any input the hash value is, say, 1000 bits long.

For example, I would like to produce a hash value of the same length as the input. One way I thought of doing this would be to simply encrypt the input somehow, but this would probably be easy to break, since one could just decrypt it.

Another way I thought of would be to divide the input into, say, 128-bit chunks, run MD5 (or some other hash) on each chunk, and concatenate the hashes of all the chunks into one long string. However, I can see that changing a single byte of the input would then change only 128 bits of the output.
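To make that weakness concrete, here is a minimal Python sketch of the chunk-and-concatenate idea. The function names `chunked_md5` and `differing_blocks` are made up for illustration, and the 16-byte chunk size is just the 128 bits mentioned above; this is not a recommended construction.

```python
import hashlib

def chunked_md5(data: bytes, chunk_size: int = 16) -> bytes:
    """Hash each 16-byte (128-bit) chunk with MD5 and concatenate the digests."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return b"".join(hashlib.md5(chunk).digest() for chunk in chunks)

def differing_blocks(a: bytes, b: bytes, block_size: int = 16) -> int:
    """Count how many aligned 16-byte blocks differ between two equal-length outputs."""
    return sum(
        a[i:i + block_size] != b[i:i + block_size]
        for i in range(0, len(a), block_size)
    )

message = bytes(range(64))           # 64 bytes = four 128-bit chunks
tweaked = bytearray(message)
tweaked[20] ^= 0x01                  # flip one bit in the second chunk

out_original = chunked_md5(message)
out_tweaked = chunked_md5(bytes(tweaked))

# Only the output block belonging to the modified chunk changes.
print(len(out_original), differing_blocks(out_original, out_tweaked))
```

The output has the same length as the input, but each output block depends on only one input chunk, so the change does not spread across the whole digest.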

Paŭlo Ebermann
Thomas

6 Answers

20

As D.W. notes, you can use the output of any conventional hash function to key a stream cipher (or a block cipher in a streaming mode like CTR), and then take the output of the cipher as your digest.

However, there has been a trend in modern hash function design to support arbitrary-length output directly, without the need for additional layers. For example, the cryptographic sponge construction has this feature built in: you absorb the input into the sponge and then squeeze as much output out of it as you want.

Out of the five SHA-3 competition finalists, two — Skein and Keccak — support arbitrary output lengths. Keccak does this by virtue of being a sponge hash; Skein instead internally uses a system very similar to D.W.'s CTR-mode construction, reusing its Threefish tweakable block cipher for both input compression and output generation.

Update: Since this answer was originally written, Keccak was selected as the SHA-3 competition winner and standardized as SHA-3 by NIST. The arbitrary output length variant of Keccak was standardized as two "extendable-output functions" (XOFs) named SHAKE128 and SHAKE256. The numbers 128 and 256 denote the security strength in bits; the internal sponge capacity is twice that, i.e. 256 and 512 bits respectively.
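As a rough illustration of the XOF interface, the following sketch uses `hashlib.shake_256` from Python's standard library to produce a 1000-bit digest. The helper name `shake_hash` and the restriction to whole bytes are assumptions made for this example.

```python
import hashlib

def shake_hash(message: bytes, out_bits: int) -> bytes:
    """Return an out_bits-long SHAKE256 digest (out_bits must be a multiple of 8)."""
    if out_bits <= 0 or out_bits % 8 != 0:
        raise ValueError("out_bits must be a positive multiple of 8")
    # digest(n) squeezes exactly n bytes out of the sponge.
    return hashlib.shake_256(message).digest(out_bits // 8)

digest_1000 = shake_hash(b"example input", 1000)   # 125 bytes
digest_128 = shake_hash(b"example input", 128)     # 16 bytes

print(len(digest_1000) * 8)
print(digest_1000[:16] == digest_128)
```

Note that a shorter SHAKE output is a prefix of a longer one for the same input. If digests of different lengths must be unrelated, one option is to include the requested length in the hashed input.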

Ilmari Karonen
18

Sure. If you want a $b$-bit hash of the message $m$, then use the first $b$ bits of AES-CTR(SHA256($m$)). That'll do the trick.

In other words, compute SHA256($m$) and treat the resulting 256-bit string as a 256-bit AES key. Next, use AES in counter mode (with this key) to generate an unending stream of pseudorandom bits. Take the first $b$ bits from this stream, and call it your hash. (Make sure you don't treat the IV as part of this stream, as the IV won't be pseudorandom. You may need to manually remove the IV first before taking the first $b$ bits.)

Security. This should be secure as long as $b \ge 160$ or so. In particular, a collision attack is expected to take about $2^{\min(b,256)/2}$ steps of computation, given our current knowledge of AES and SHA256. So, as long as you don't choose a value of $b$ that is too small, you should be fine. Choosing a value of $b$ larger than 256 does not give you greater security against collisions, but that's irrelevant: at that point the level of security is already far more than enough for any reasonable application.
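Plugging a few output lengths into the estimate above shows where the bound stops growing (the chosen values of $b$ are only examples):

$$
2^{\min(b,256)/2} =
\begin{cases}
2^{64} & b = 128 \\
2^{80} & b = 160 \\
2^{128} & b = 256 \\
2^{128} & b = 1000
\end{cases}
$$

So a 1000-bit output is no harder to collide than a 256-bit one, because the attacker can instead look for a collision in the 256-bit SHA256 value that keys the cipher.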

D.W.
10

In general, any combination of a (secure) hash function for the input with a (deterministic) pseudorandom number generator for the output will work here. One "state of the art" example is the one given by D.W. (using AES-CTR as the PRNG and SHA-256 as the hash).

Another way is similar to what PBKDF2 does to produce output of the right length: hash the input (or a hash of the input) multiple times, each time with a different prefix, and concatenate the outputs:

output = H(1 || M) || H(2 || M) || H(3 || M) || ...

(One could say that this is a special case of the general case before, at least when H is already a hash of the original message.)
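A minimal Python sketch of the prefix-counter scheme above follows. The function name `expand_hash`, the choice of SHA-256 as H, and the fixed 4-byte big-endian counter encoding are assumptions for illustration; a fixed-width counter is used so that the boundary between counter and message is unambiguous.

```python
import hashlib

def expand_hash(message: bytes, out_len: int) -> bytes:
    """Return out_len bytes of H(1 || M) || H(2 || M) || ..., with H = SHA-256."""
    if out_len < 0:
        raise ValueError("out_len must be non-negative")
    blocks = []
    produced = 0
    counter = 1
    while produced < out_len:
        prefix = counter.to_bytes(4, "big")   # fixed-width counter (assumed encoding)
        block = hashlib.sha256(prefix + message).digest()
        blocks.append(block)
        produced += len(block)
        counter += 1
    # Truncate the last block so the result has exactly the requested length.
    return b"".join(blocks)[:out_len]

digest = expand_hash(b"example input", 125)   # 125 bytes = 1000 bits
print(len(digest) * 8)
```

As the answer notes, collision resistance is still capped by the underlying hash: two messages that collide under H would collide here as well, however long the output is.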

There are some hash functions with an "arbitrary output length" mode, such as Skein (one of the SHA-3 candidates). (This mode of Skein internally works just like the scheme above, but it is hidden inside one standardized primitive, so you don't have to build it yourself.)

The actual winner of the SHA-3 competition, Keccak, also has a mode with arbitrary-length output (it simply squeezes its sponge for as long as needed to produce the required output). This mode was standardized as SHAKE128 and SHAKE256, where the number indicates the security level against collisions, provided the output is at least twice that many bits long.

Paŭlo Ebermann
3

Yes, there are hash-like algorithms that can produce variable-length output without any extra effort. This is what "sponge functions" do. One such sponge construction is Keccak, which is one of the five finalists in the SHA-3 competition.

sellibitze
1

I'm curious about your purpose. Generally, the primary operation involving a message digest is ultimately to compare two digest values. Hashing passwords allows comparing the digest values instead of carrying the super secret password around the systems. Hashing messages allows the sender and receiver to verify the data was correctly received without resending the whole message. Encrypting a hash with a private key allows it to be decrypted with the public key and then compared to verify the signer's identity. Even hashing strings in a database allows an efficient indexed search by comparing and sorting digest values.

In every case of comparison, having the digest values be of equal size is what enables the comparison: digests of unequal size could never compare equal.

If it's a size constraint, a cryptographically secure digest value can always be truncated, with the understanding that there is a corresponding loss of collision resistance. (A non-secure digest algorithm might not have the bit dispersal properties that would make such truncation safe.)

But you've said you want a "larger" digest. When I think about the meaning of digest, to me that is a "summary", which is always smaller than that which it summarizes. What are you hoping to gain by making it larger? Is it a matter of not trusting in the anti-collision properties of 2^256 possibilities? You have piqued my curiosity.

John Deters
0

Since most of the other answers are from before SHA-3 was standardized, I'll add another update: Keccak was chosen for SHA-3. The standard doesn't define any extendable output features for SHA-3 itself, I think because they wanted it to be a drop-in replacement for SHA-2. But it does define two closely related functions, SHAKE128 and SHAKE256, which support extendable output. These are fairly widely available, including in OpenSSL and in Python's standard hashlib module.

The BLAKE family of hashes also supports extendable output, via either BLAKE2X or BLAKE3.

The HKDF construction also supports extendable output, using any underlying hash function. Unfortunately, it's standardized with a maximum output length of 255 times the underlying digest size. It seems like it would've been easy to use 8 bytes instead of 1 byte for the block counter, and I'm not sure why they didn't.
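For reference, here is a compact sketch of HKDF (extract, then expand) written from the RFC 5869 description using only Python's standard `hmac` and `hashlib` modules, with SHA-256 assumed as the underlying hash. The function names are chosen for this example. The single-byte block counter is what produces the 255-block limit mentioned above.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size   # 32 bytes for SHA-256

def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract: condense the input keying material into a pseudorandom key."""
    if not salt:
        salt = bytes(HASH_LEN)            # RFC 5869 default: HashLen zero bytes
    return hmac.new(salt, ikm, hashlib.sha256).digest()

def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand: stretch prk into `length` bytes of output."""
    if length > 255 * HASH_LEN:
        raise ValueError("HKDF output is limited to 255 * HashLen bytes")
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        # T(i) = HMAC(PRK, T(i-1) || info || i), with i encoded as one byte
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

prk = hkdf_extract(b"", b"example input")
digest = hkdf_expand(prk, b"variable-length hash", 125)   # 1000 bits
print(len(digest) * 8, 255 * HASH_LEN)
```

With SHA-256 the cap works out to 255 × 32 = 8160 bytes, which is plenty for a 1000-bit digest but rules out HKDF for very long outputs.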

Jack O'Connor