8

Take a list of the first $n$ primes $P_n=\{2,3,5,7,11,\ldots\}$ and convert the sequence into a binary string by concatenating the binary representation of each prime:

$$S_n = 101110111\ldots$$

Compress the string with your favorite compression algorithm (the details don't matter) and let $C(S_n)$ be the size of the string after compression. Compare this to the average $\left < C(\pi S_n) \right >$ over permutations $\pi$ of the string, or to $C(R_N)$ for a random string of 0s and 1s of the same length. Why does it seem to be true that

$$ C(S_n) < \left < C(\pi S_n) \right > \approx C(R_N) $$

implying that primes are compressible!

For example, when $N=2^{21}$, using zlib as the compression algorithm and averaging over 20 trials:

$$ \frac{C(S_n)}{\left < C(\pi S_n) \right >} \approx .761 $$

What is this "redundant data"?
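
For concreteness, here is a minimal sketch of the experiment (using sympy's `primerange` as the prime source and a smaller string than the $2^{21}$ case, purely for illustration):

```python
import random
import zlib
from sympy import primerange

BOUND = 10**5  # primes below this bound; smaller than the question's case, for speed

# S_n: concatenate the binary expansions of the primes
bits = "".join(format(int(p), "b") for p in primerange(2, BOUND))

def compressed_size(bit_string):
    # pack the 0/1 characters into bytes before handing them to zlib
    data = int(bit_string, 2).to_bytes((len(bit_string) + 7) // 8, "big")
    return len(zlib.compress(data, 9))

c_s = compressed_size(bits)

# average over random permutations of the same bits
chars, total, trials = list(bits), 0, 20
for _ in range(trials):
    random.shuffle(chars)
    total += compressed_size("".join(chars))
c_pi = total / trials

# a uniformly random 0/1 string of the same length
c_r = compressed_size("".join(random.choice("01") for _ in range(len(bits))))

print(f"C(S_n) = {c_s}, <C(pi S_n)> = {c_pi:.1f}, C(R_N) = {c_r}")
print(f"C(S_n) / <C(pi S_n)> = {c_s / c_pi:.3f}")
```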

Hooked
  • 6,785
  • They can be compressed far more if you allow your compressed data to be a program. By that I mean, the Hutter Prize allows a program to compress the data, and specifically encourages it, so why not here? And in this case we know the shortest program is roughly a Python program of fewer than 50 lines. – Daniel Donnelly Oct 26 '21 at 09:36

3 Answers

6

It turns out that storage of primes can be compressed by an arbitrarily large factor, though the block length this requires grows enormously (roughly doubly exponentially) with the compression factor.

Note: This has been corrected and made more precise.

Let $p_n$ be the $n$-th prime (with $p_1 = 2$), and let $P_n$ be the product of the first $n$ primes, so that $P_1 = 2, P_2 = 6, P_3 = 30, P_4 = 210$.

If we do a sieve of Eratosthenes, sieving out multiples of the first $n$ primes, all that are left are the first $n$ primes and the numbers relatively prime to $P_n$.

For example, for $n=2$, the numbers left are $2, 3$, and those of the form $6m+1$ or $6m+5$. For $n=3$, the numbers left are $2, 3, 5$ and those of the form $30m+q$ with $q \in \{1,7,11,13,17,19,23,29\}$.

For general $n$, the numbers remaining are of the form $P_nm+q_i$, where the $q_i$ are the numbers from $1$ to $P_n-1$ relatively prime to $P_n$.

To illustrate, I will use the case $n=3$.

Each block of 30 numbers in a range $30m+1$ to $30m+29$ can contain at most $8$ primes, as shown above. Therefore only one bit is needed for each of these $8$ possibilities, indicating whether or not that value is actually prime, so the primes can be stored in $8$ bits per block of $30$ integers, a factor of $\frac{8}{30} \approx 0.267$ compared with one bit per integer.
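
A small illustrative sketch of this $n=3$ scheme (the trial-division primality test and helper names are just for the example):

```python
# One bit per residue class coprime to 30: each block 30m .. 30m+29 costs 8 bits
# (the primes 2, 3, 5 themselves are handled separately).
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)

def is_prime(k):
    """Naive trial division, good enough for the illustration."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

def encode_block(m):
    """8-bit mask marking which residues 30m + q are prime."""
    mask = 0
    for bit, q in enumerate(RESIDUES):
        if is_prime(30 * m + q):
            mask |= 1 << bit
    return mask

def decode_block(m, mask):
    """Recover the primes in 30m .. 30m+29 from the mask."""
    return [30 * m + q for bit, q in enumerate(RESIDUES) if mask >> bit & 1]

# Block m = 1 covers 30..59; its primes are 31, 37, 41, 43, 47, 53, 59.
mask = encode_block(1)
assert decode_block(1, mask) == [31, 37, 41, 43, 47, 53, 59]
```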

Here is what happens for general $n$.

The $P_n$ values in the range from $mP_n$ to $(m+1)P_n-1$ are compressed to $\phi(P_n)$ bits, where $\phi(m)$ is Euler's phi function (though he lets me use it), which counts the number of integers from $1$ to $m$ relatively prime to $m$.

Since $\phi(P_n) = \prod_{i=1}^n (p_i-1) $, the compression factor is $\dfrac{\phi(P_n)}{P_n} =\prod_{i=1}^n \left(1-\dfrac1{p_i}\right) $. This product goes to zero (because $\sum_{i=1}^n \dfrac1{p_i} \sim \ln \ln p_n $ diverges; this is Mertens' theorem), so the amount of compression can be made arbitrarily large, though the block length $P_n$ needed grows extremely fast with $n$.

To see how fast, Mertens' third theorem gives $\dfrac{\phi(P_n)}{P_n} =\prod_{i=1}^n \left(1-\dfrac1{p_i}\right) \sim \dfrac{e^{-\gamma}}{\ln p_n} $, and by the prime number theorem $p_n \sim n \ln n$, so the compression factor is roughly $\dfrac1{\ln n}$. Therefore, each block of $P_n$ values can be represented by about $\dfrac{P_n}{\ln n}$ bits.

Therefore, sieving with the first $n$ primes and storing one bit for each residue relatively prime to $P_n$ compresses the primes by about a factor of $\ln n$.

Therefore the primes can be compressed by an arbitrary amount, though the block length $P_n$ this requires grows faster than exponentially in $n$, since $\ln P_n = \sum_{i=1}^n \ln p_i \sim n \ln n$.
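
As a quick sanity check on how slowly that compression factor shrinks, one can compute $\prod_{i=1}^n \left(1-\frac1{p_i}\right)$ for a few values of $n$ (sketch only; the naive prime generator is just for illustration):

```python
from itertools import islice, takewhile

def prime_gen():
    """Naive trial-division prime generator (fine for a few thousand primes)."""
    found = []
    k = 2
    while True:
        if all(k % p for p in takewhile(lambda q: q * q <= k, found)):
            found.append(k)
            yield k
        k += 1

# running value of prod_{i <= n} (1 - 1/p_i) as n grows
gen, factor, count = prime_gen(), 1.0, 0
for n in (3, 4, 10, 100, 1000, 10000):
    for p in islice(gen, n - count):
        factor *= 1 - 1 / p
    count = n
    print(f"n = {n:>5}:  prod(1 - 1/p_i) = {factor:.4f}")
```

The $n=3$ value matches the $\frac{8}{30}$ above and the $n=4$ value matches the $\frac{8}{35}$ mentioned below; beyond that the decay is only logarithmic.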

A number of years ago, I used this idea with $n=4$, so I got a compression of $\frac{1\cdot 2\cdot 4\cdot 6}{2\cdot 3\cdot 5\cdot 7} =\frac{8}{35} $.

marty cohen
  • 110,450
  • Ah I think I see now: despite the representation (e.g. the variants proposed by Ted), a compression algorithm that relies on a block scheme (like zlib) will naturally fall into a length commensurate with the block. Nice insight - thanks! – Hooked Sep 29 '13 at 22:39
2

The first 23,163,298 primes can be considered compression-friendly: that is the maximum number of primes for which every gap between consecutive primes is <= 255, i.e. fits into a single byte. So for those primes you can store just a single byte per prime (the gap), rather than the prime itself as an 8-byte number, which cuts the memory use by a factor of 8.

I used this fact in my solution for caching primes.
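
An illustrative sketch of the byte-per-gap idea (not the library code itself; sympy's `primerange` merely supplies the primes for the demo):

```python
from sympy import primerange

def encode_gaps(primes):
    """One byte per prime: the gap from the previous prime (starting from 0)."""
    out, prev = bytearray(), 0
    for p in primes:
        gap = p - prev
        if gap > 255:
            raise ValueError("gap does not fit in one byte")
        out.append(gap)
        prev = p
    return bytes(out)

def decode_gaps(data):
    primes, value = [], 0
    for gap in data:
        value += gap
        primes.append(value)
    return primes

primes = list(primerange(2, 10**6))
blob = encode_gaps(primes)
assert decode_gaps(blob) == primes
print(f"{len(primes)} primes stored in {len(blob)} bytes "
      f"(vs {8 * len(primes)} bytes as 64-bit integers)")
```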

UPDATE

This can be extended much further by taking advantage of the fact that all prime gaps are even numbers (except the gap between 2 and 3): bit 0 of the gap is therefore always 0 and can be used to store bit 8, which doubles the supported gap to 511. That extends the scheme to all primes up to 303,371,455,241, the first prime followed by a gap larger than 511.

I implemented the complete 511 compression solution in my library.
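
A hedged sketch of the bit trick (again, not the actual library code): because the gaps between odd primes are even, bit 0 of each gap is always 0, so it can hold bit 8 instead, letting one byte represent any even gap up to 510:

```python
def pack_gap(gap):
    """Pack an even gap (<= 510) into one byte: bit 8 of the gap goes into bit 0."""
    assert gap % 2 == 0 and gap <= 510
    return (gap & 0xFF) | (gap >> 8)

def unpack_gap(byte):
    """Inverse of pack_gap: move bit 0 back up to bit 8."""
    return (byte & 0xFE) | ((byte & 1) << 8)

def encode(primes):
    """Encode primes >= 5 as packed gaps; 2 and 3 are kept implicit."""
    out, prev = bytearray(), 3
    for p in primes:
        if p <= 3:
            continue
        out.append(pack_gap(p - prev))
        prev = p
    return bytes(out)

def decode(data):
    primes, value = [2, 3], 3
    for b in data:
        value += unpack_gap(b)
        primes.append(value)
    return primes

sample = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
assert decode(encode(sample)) == sample
assert unpack_gap(pack_gap(510)) == 510   # the largest gap one byte can now hold
```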

vitaly-t
  • 121
2

Primes are always odd, so the last bit is always 1 (except for the initial prime 2, of course). So there is definitely some redundancy.

Another consideration is that your description scheme (listing all the bits of every number) is not optimal. To describe any subset of $\{1,2,\ldots,n\}$ takes at most $n$ bits, since there are $2^n$ such subsets. We just have one bit corresponding to each number telling us whether that number is in or out of the subset.

For the particular subset you are considering (the primes), under the description scheme you are using (listing the bits in succession), there are about $\frac{n}{\ln n}$ elements (Prime Number Theorem), and each one has about $\log_2{n}$ bits. So the number of bits in your description is approximately $\frac{n}{\ln n}(\log_2{n}) = n (\log_2{e})\approx 1.44 n.$
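
A quick numerical check of that estimate (a sketch; sympy's `primerange` is just a convenient prime source):

```python
from math import e, log2
from sympy import primerange

n = 2**20

# bits used by listing every prime below n in binary, one after another
concat_bits = sum(int(p).bit_length() for p in primerange(2, n))

# bits used by the characteristic vector of the subset of {1, ..., n}
subset_bits = n

print(f"concatenated binary: {concat_bits} bits (~{concat_bits / n:.2f} n)")
print(f"subset bitmap:       {subset_bits} bits (1.00 n)")
print(f"predicted constant log2(e) = {log2(e):.2f}")
```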

Ted
  • 35,732
  • Just tested this idea by removing the last bit of each prime number's binary representation and the ratio is still ~.80, so this doesn't quite answer the question. – Hooked Sep 29 '13 at 06:14
  • See edits for another consideration, probably has much greater effect than the 1 bits at the end. – Ted Sep 29 '13 at 06:31
  • every prime also starts with a $1$, so the first bit is also always $1$. – mercio Sep 29 '13 at 07:24
  • @Ted I've rerun the results with the more efficient representation you've suggested. In this case we still have the same ~80% ratio between the prime sequence and the permuted sequence, but now both sequences compress far better than the random sequence (presumably due to the fact that there are more 0's than 1's). So I really appreciate the effort, but neither explanation describes why the prime sequence compresses better than a permutation of the same sequence. – Hooked Sep 29 '13 at 22:28