
This comes following a discussion with a colleague.

  • My plaintext file plain consists of about 100,000 lines of "all work and no play...". Its size is 2.2 MB.

  • Compressed, it is 5.4 kB.

  • I encrypt the original:

    openssl aes-128-cbc -in plain -out plain.ENC
    

    plain.ENC is marginally bigger than the original, which I would expect.

  • I compress the encrypted copy: gzip plain.ENC. But I observe that the compressed copy is now marginally larger than plain.ENC. (A Python sketch reproducing the whole experiment follows this list.)
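
Here is roughly the same experiment as a Python sketch. It assumes the third-party `cryptography` package for AES-CBC; the numbers above come from the openssl and gzip commands, not from this script.

    # Sketch of the experiment above (assumes the third-party `cryptography` package;
    # any AES-CBC implementation would do).
    import os
    import zlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    line = b"All work and no play makes Jack a dull boy.\n"
    plain = line * 50_000                      # ~2.2 MB of highly repetitive text

    key = os.urandom(16)                       # AES-128 key
    iv = os.urandom(16)
    pad = 16 - len(plain) % 16                 # PKCS#7-style padding to the block size
    padded = plain + bytes([pad]) * pad

    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    cipher = enc.update(padded) + enc.finalize()

    print(len(plain), len(zlib.compress(plain)))    # plaintext compresses enormously
    print(len(cipher), len(zlib.compress(cipher)))  # ciphertext ends up slightly larger, as observed above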

Assuming the entropy of the original file is on the order of ~1000-10000 bits, why is the corresponding ciphertext incompressible? My intuitive notion of the entropy of a string is the minimal number of bits required to produce that string. If the string is the 2.2 MB ciphertext, it was generated from a string with roughly 1000 bits of entropy, encrypted by an AES implementation (probably no more than a few thousand bits), with a key of a few hundred bits. Intuitively, to me, the entropy of the ciphertext should be no more than the sum of the entropies of the strings and procedures that generate it.

So, my question is: Is my intuitive notion completely wrong? If we extended the low-entropy file to be orders of magnitude larger, would we begin to observe compressibility in the ciphertext?

Bill

3 Answers


Well, your definition of entropy is known as Kolmogorov complexity, and it's not so much that it is incorrect as that it is inapplicable to what gzip does.

For example, the value $\pi$ can also be generated by a short program; however, if you attempt to compress a 2.2-Mbyte sample of its binary expansion, you'll find that gzip is not able to compress it either.

What we can tell from that is that gzip doesn't do a good job of estimating Kolmogorov complexity. Now, this is not really a criticism of gzip; it is provably impossible to write a program that computes it (!).

Instead of attempting to wrestle with Kolmogorov complexity, what gzip (and every other compression algorithm out there) relies on to compress the data is heuristics. That is, it has a model of redundancies that appear in real plaintexts (such as repeated substrings), and is able to take advantage of those redundancies to shorten the length. Now, if that model corresponds to the text it is given, gzip can compress quite well; if the model doesn't correspond (and both the encrypted text and the binary expansion of $\pi$ fall in this camp), it can't compress at all.
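
You can see this for yourself with a small sketch (Python's `zlib` uses the same DEFLATE algorithm as gzip; the random bytes below are just a stand-in for ciphertext or the binary expansion of $\pi$):

    # Sketch: DEFLATE compresses data that matches its model of redundancy
    # (repeated substrings, skewed byte frequencies) and gives up on data that
    # doesn't -- even when that data has a short description (low Kolmogorov complexity).
    import os
    import zlib

    repetitive = b"All work and no play makes Jack a dull boy.\n" * 50_000
    random_looking = os.urandom(len(repetitive))   # stand-in for ciphertext / digits of pi

    print(len(zlib.compress(repetitive)))       # tiny: the model fits
    print(len(zlib.compress(random_looking)))   # slightly larger than the input: the model doesn't fit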

And, no, encrypting more data would not make the result any more compressible.

poncho

It won't compress because data encrypted with AES becomes effectively pseudorandom and thus as close to maximum entropy as possible. As you pointed out, the clear text input is low-entropy.

Additionally, entropy can be used as a way to detect clear text (provided the clear text isn't pseudorandom itself). The output entropy of a failed AES decryption remains high unless the key matches. Even when the clear text has zero entropy (all zeroes), the data resulting from a failed AES decryption shows entropy near 4.0, which is pretty cool I think. I believe this is due to the key, the substitution step, and the key schedule.

(unprintable chars replaced with '?')

Entropy Result
3.875   <v????t@?V??????
3.125   plain text yay    <= success!
4.000   ??E???Yx?o?i????
4.000   ?Z?G^e?S?(?0?]J`
4.000   e? ?g?a??;)??? ?
4.000   ?OSF???-o??J????
4.000   ?0w??????????,??
4.000   ???????I^3"??c??
3.875   ??~?~97e_????r>?
3.875   Rk?IN??M???(??Vg
4.000   ?????)u?'?????ak
4.000   ??]9????c|?6???l
4.000   ??w???b??h<???;?
4.000   {#????q???~r?j?y
3.750   ?Q?????XV?n5Q?35

If the clear text is high-entropy, it's harder to detect the right key. One way to compensate is to take the average entropy (nearly 4) and pick out the candidates that are slightly lower than the average, but even then there would be some false positives.
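
Here is a minimal sketch of the per-block measurement, assuming the values above are Shannon entropy in bits per byte computed over each 16-byte block (so the maximum possible for a block of 16 distinct bytes is log2(16) = 4.0):

    # Sketch, assuming the table above shows Shannon entropy in bits per byte
    # over 16-byte blocks; 4.0 is the maximum for a block of 16 distinct bytes.
    import math
    from collections import Counter

    def block_entropy(block: bytes) -> float:
        counts = Counter(block)
        n = len(block)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    print(block_entropy(b"plain text yay\x02\x02"))  # well below 4.0: likely plaintext
    print(block_entropy(bytes(range(16))))           # 4.0: looks like a failed decryption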

That's my bastardization of it. Sorry it's off topic but I couldn't resist.


Under perfect encryption (a one-time pad), the output is $y = x \oplus z$ for input $x$ and key $z$. As long as the key is perfectly random, the entropy of the output is $H(y) = H(z)$, and therefore the output is not compressible.

If the key is only pseudorandom, then in general the output is compressible down to about $\max(H(z), H(x))$.

A single output stream might have a very simple representation, such as all zeros (0000...0), and be easily compressible by tools like zip; but on average, as long as you do not know the key $z$, you still cannot decode (decrypt) the string.
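
A small sketch of this point (my own illustration, treating the one-time pad as the "perfect encryption" case and a short repeating key as the weak case):

    # Sketch: XOR a very low-entropy plaintext with a truly random key (one-time pad)
    # versus a short repeating key. The OTP output inherits the key's entropy and does
    # not compress; the repeating-key output still compresses well.
    import os
    import zlib

    x = b"\x00" * 1_000_000                 # low-entropy plaintext (all zeros)

    z_random = os.urandom(len(x))           # perfectly random key: H(y) = H(z)
    z_repeat = (b"secretkey" * (len(x) // 9 + 1))[:len(x)]   # weak, repeating "key"

    y_otp = bytes(a ^ b for a, b in zip(x, z_random))
    y_rep = bytes(a ^ b for a, b in zip(x, z_repeat))

    print(len(zlib.compress(y_otp)))   # about as large as the input: incompressible
    print(len(zlib.compress(y_rep)))   # tiny: the structure of the key leaks through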

hsaidi