
According to Wikipedia, a hash function maps digital data of arbitrary size to digital data of fixed size.

For all practical purposes, a hash is a unique signature of a big chunk of data. And I've heard there is such a thing as a collision-free hash.

Other than being able to decompress the data back, arguably the main difference between compressing and hashing is precisely that collision factor. But what if the hash has no collisions?

Why exactly can't we "just" get that "perfect" hash and use it as a compression method instead? Wouldn't it be able to generate way smaller files?

I realize I must be missing something, so this is just my way of trying to understand the underlying difference between hashing and compressing! :)


6 Answers

9

Mathematically speaking, there is no such thing as a collision-free hash. Practically speaking, there is.

Cryptographic hash functions in good standing have no known collisions. That's one of their defining properties. They do have collisions, but there isn't enough computing power on Earth (if not in the whole universe) to find one, given current mathematical knowledge. A SHA-256 value is 256 bits, so we know that there exists a pair of 257-bit strings that have the same hash, but the best-known techniques to find one are out of range of current computing power.

Intuitively speaking, if it's hard to find collisions for a hash, the hash is hard to invert. If there were a known algorithm to invert a hash, then at some point the algorithm would have to decide which of the possible preimages to go for, and we could run it with both decisions to find a collision.

It is possible to use a hash as a compression function. But since there is no way to calculate the original text from the hash, this compression method can only be used when the original text is available. Sounds useless? Not quite. In fact, that's one of the basic reasons to use hashes!

Cryptographic hashes are used when there are two storage or communication mechanisms: one that's secure but supports only a small volume of data, and another that's insecure but supports a large volume of data. Store the hash on the small, secure storage and the actual data on the large, insecure storage. Then, when you need the file, retrieve the data, retrieve the hash, and check the hash. In this way, the secure storage mechanism uses the hash as a compression function; the decompression function makes use of the insecure storage, but guarantees the security of the outcome.

(You'll note that something is lost, however: if the insecure storage is corrupted, this will be detected, but cannot be corrected. The “decompression” mechanism guarantees integrity (if you get the data back, it's the right data) but not availability (you might not be able to get the data back).)
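
As a rough sketch of that pattern (in Python, with plain dicts standing in for the two hypothetical storage mechanisms):

import hashlib

def store(name, data, secure_store, insecure_store):
    # data is bytes; keep only the 32-byte digest on the small, trusted side
    secure_store[name] = hashlib.sha256(data).digest()
    # the bulk of the data goes to the large, untrusted side
    insecure_store[name] = data

def retrieve(name, secure_store, insecure_store):
    data = insecure_store[name]
    if hashlib.sha256(data).digest() != secure_store[name]:
        # integrity without availability: corruption is detected, not repaired
        raise ValueError("data on the insecure storage was modified")
    return data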

Seen another way, a cryptographic hash can be used as a compression mechanism, but this requires that each time a new file is stored, the decompression function is somehow modified to remember the original file content. This is clearly impractical, but it is of theoretical interest — this basically describes a random oracle, which is a sort of idealized version of a cryptographic hash.
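
A toy illustration of that "decompressor with a memory" (Python; the table is the hypothetical global memory that would have to be updated for every new file):

import hashlib

table = {}  # the impractical part: shared, ever-growing memory

def compress(data):
    digest = hashlib.sha256(data).digest()
    table[digest] = data      # "modify the decompression function" for each new file
    return digest             # always 32 bytes, whatever the input size

def decompress(digest):
    return table[digest]      # works only because the table was updated above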

A perfect hash is a different kind of beast: it is mathematically collision-free, but it achieves that by restricting the possible inputs to a finite (usually small) subset of all possible inputs. The decompression function for a perfect hash is usually stored as a table from hash values to the corresponding original data (for example using an array if the hash values are small integers).

5

A perfect hash function computes unique indexes for a predefined finite set of possible inputs. Typically such a function is used to implement a hash table. It is then not necessary to worry about collisions. Normally the set of possible inputs is small and known, such that it is also possible to invert the function (i.e., given the index one can find the input).

Example: If you have the set of strings

("Hello World", "A quick brown fox", "A lazy dog")

then, for example, the function counting the occurrences of the character 'l' in a string would be a perfect hash function for those strings, as it would map the strings to indexes as follows:

"Hello World" -> 3
"A quick brown fox" -> 0
"A lazy dog" -> 1

So you can somehow use this function for compression. However, there are a number of problems:

1) A first potential problem is that in order to decompress you must know the perfect hash function and the set of possible strings. So if you store '3' in a file, you also have to store somewhere the fact that '3' corresponds to "Hello World". This approach can, however, make sense if you have a fixed set of inputs that you use for many files (see the sketch after this list).

2) You will only get compression if you have a small set of large inputs. E.g., if the set of possible inputs is {"a","b",...,"z"}, then the resulting indexes (hashes) will be like {1,...,26} and no compression takes place. So this hashing is not suited for general purpose compression.

3) Mapping every input to the same fixed size index may not be the best idea for compression. General purpose compression methods like Huffman coding also consider the probability of occurrence of the different strings, so that strings that occur more often are mapped to shorter sequences than very rare strings, which gives a better compression ratio.
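
To make point 1) concrete, here is a minimal sketch (Python, assuming both sides share the three strings above) of using the count-of-'l' perfect hash as a "compressor":

strings = ["Hello World", "A quick brown fox", "A lazy dog"]

def perfect_hash(s):
    return s.count("l")               # 3, 0 and 1 for the strings above

table = {perfect_hash(s): s for s in strings}   # the inverse table that must be stored

def compress(s):
    return perfect_hash(s)            # only meaningful for the three known strings

def decompress(index):
    return table[index]

assert decompress(compress("Hello World")) == "Hello World"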

PS: Your question is not really a cryptography question, as perfect hash functions for hash tables are not cryptographic hash functions.

Chris
3

Using a perfect hash in this case is essentially the same as using an index. For a perfect hash to work, both the compressor and decompressor have to know the $N$ possible things that might be "compressed" in the data.

You are better off giving them each a number in $[0,N)$, and using $\lceil \log_2 N \rceil$ bits as your "compressed data".

This is better than a perfect hash because it uses the minimum amount of information possible (which may not be true of a perfect hash), and doesn't give away any more information about the contents than a hash does.
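
A rough sketch of that point (Python; the shared list of items is an assumption for illustration):

import math

items = ["Hello World", "A quick brown fox", "A lazy dog"]   # known to both sides
bits = max(1, math.ceil(math.log2(len(items))))              # 2 bits suffice for 3 items

def compress(s):
    return format(items.index(s), f"0{bits}b")               # "Hello World" -> "00"

def decompress(code):
    return items[int(code, 2)]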

The only way a perfect hash would be better would be if you didn't want other people knowing how many possible items there were. The hash comes from a larger number space, so people can't very easily see if there are only 5 items, or 500 million items.

Alan Wolfe
0

in my head, i thought "hey, that small string that represents the huge one (the hash) looks pretty neat! if it doesn't represent any other huge ones of the same size, then we can say it became, for all practical purposes, a compression of that huge one! but that can't be right, or else we wouldn't need compression algorithms!!".

now i read some of all this back again... and i still don't know exactly what a hash is!

but...

the hash is just an index

looks like the perfect analogy.

the hash tries to index any data in the least amount of bytes necessary.

so the algorithms simply try to give "page numbers" for each possible combination that we would create, while ignoring combinations that are unlikely to exist, so we don't waste page numbers on things that should probably never exist.

a hash could still generate a collision-free string

but it would probably need to be too big for any practical purposes. some say it can't be any smaller than the original string!

vs compression

on another note...

we can easily compress infinite amounts of non-random data into a few bytes.

an "infinite amount of zeroes" could be represented by "0~", for instance.
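
here's a tiny run-length sketch of that idea (python; not the "0~" notation itself, just the same principle that long runs of one symbol shrink to almost nothing):

from itertools import groupby

def rle(s):
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(s))

print(rle("0" * 1000))   # -> "01000": a thousand zeroes in five characters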

the whole issue with compressing lies basically in finding patterns in what appears to be random... (thus the hutter prize, which contributes to A.I. research by simply focusing on compression...)

as for hashing, it doesn't matter. the infinite amount of zeroes would still output the same-sized string as everything else being hashed.

prevent confusion

hashes are related to (and often confused with) checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers. although the concepts overlap to some extent, each one has its own uses and requirements and is designed and optimized differently.

hash functions differ from the concepts above mainly in terms of data integrity: hashing has no intention of keeping the data.

tl;dr;

in a way, we do data compression to get a smaller version of the same data (which can then be decompressed to get back virtually the same data, depending on whether we do it lossless or not).

and we do hashing to make finding data easier (which can easily be confused with decompressing because we go from something small to something huge, but the hash doesn't contain any of the original data).

completely different beasts.

cregox
0

There is a recent and quite enjoyable article by Thomas Pornin called Paradoxical Compression with Verifiable Delay Functions, which I think achieves something like what you suggest.

It is impossible for a function $h : \{0,1\}^* \to \{0,1\}^n$ to avoid collisions. One way to think of collision-resistance is the following. A function $h$ is collision-resistant if I can claim that it (paradoxically) has no collisions, and it's hard for you to prove me wrong.

We can do something similar for compression schemes. It is impossible for a compression scheme to simultaneously satisfy the following properties:

  • compression is lossless: Decompress(Compress($y$)) = $y$ for every string $y$
  • compression makes no string longer: $|$Compress$(y)| \le |y|$ for all $y$
  • compression shrinks at least one string: there is a $y$ such that the previous inequality is strict.
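
(For intuition, here is the standard counting argument for why these three properties cannot coexist; it is not specific to the paper. Since no string gets longer, Compress maps the finite set of strings of length $\le k$ into itself for every $k$; losslessness makes that map injective, hence a bijection on each such set. Comparing the sets for $k$ and $k-1$, the strings of length exactly $k$ must then map onto the strings of length exactly $k$, so every string keeps its length, contradicting the third property.)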

We can define a "paradoxical compression" scheme as one where I can claim that it satisfies these three properties, and it's hard for you to prove me wrong.

We can define things more precisely with the following game: I provide you oracle access to the Compress and Decompress algorithms. I also give you a value $x$ such that Compress($x$) is strictly shorter than $x$. You win the game if you can find any value $y$ such that Decompress(Compress($y$)) $\ne y$, or Compress($y$) is longer than $y$.

This paper shows how to construct paradoxical compression, where this game is hard to win.

The paper has a nice discussion about why paradoxical compression can't be made even stronger. We can't design a paradoxical compression that (apparently) compresses all strings by at least 1 bit. We can only hope for paradoxical compression that nontrivially compresses one string and (apparently) does not increase the length of any input.

Mikero
-2

If you had enough computing power to crack a hash, you could send big files over the internet almost instantly, reducing the need for expensive and powerful network infrastructure. To be worth it, at least time-wise, you would have to crack the hash faster than it would take to download the file. It may be physically, economically and technologically inefficient, or even impossible, to build a smartphone or computer at the size and price of the one you have that can crack hashes the way a supercomputer can (and even every supercomputer on Earth, or in the universe, combined can't). If you found a way to crack hashes instantly, with a portable quantum computer for example, it would be a good way to reduce network load and make the internet almost instantaneous. Or you could find a middle ground: compress files using more than just a hash, perhaps a hash plus many parameters in a text file, to guide the cracking algorithm the right way. In password cracking, for example, it takes a very long time to brute-force a password when you have absolutely no information about its length or character set, but if you knew the length and the character set, the time needed to crack the hash would be significantly reduced. The same thing could be done with larger files or other information. Here is a small example:

Hash and parameters:
sha1: 19f054f1f448ff152f1d586be39a56f179fe80c9
Lowercase letters only: yes
Words in a dictionary: yes
Length: 13

Result: stackexchange
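
A hedged sketch of that "guided cracking" idea (Python; the wordlist path is an assumption for illustration):

import hashlib

target = "19f054f1f448ff152f1d586be39a56f179fe80c9"

def crack(target_hex, candidates):
    for word in candidates:
        word = word.strip()
        if len(word) == 13 and word.islower():     # apply the given parameters
            if hashlib.sha1(word.encode()).hexdigest() == target_hex:
                return word
    return None

# e.g. crack(target, open("/usr/share/dict/words"))   # hypothetical wordlist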

imagnu