What is the maximum number of primes that I can pack in a 30KB text file?

Question

If I store the primes in the trivial way, I can fit roughly 5100 primes, i.e. the primes up to 50k, in a 30KB file. I actually need the primes up to $2^{30}$, but obviously it's not possible to store such a huge list in a file whose size is on the order of a few KB. So my goal is to store as many primes as possible in a 30KB file and then retrieve them in sublinear computation time. Beyond that limit I will have to sieve anyway. I have gone through this answer, but it doesn't fully satisfy me, so it would be great if someone could produce the maximally compressed list together with an implementation of the packing/unpacking algorithm.

I can afford a retrieval/unpacking algorithm which is more than $O(1)$ per prime, but it definitely has to be sublinear, i.e. less than $O(N)$, where $N$ is the size of the list.

Comments are not for extended discussion; this conversation has been moved to chat. — Xander Henderson, Sep 05 '22 at 22:37
If you find it useful, this algorithm allows you to use a large wheel sieve in this way, if modified appropriately, you can store a greater number of prime numbers in a boolean vector and obtain the value of the prime number from the index. — user140242, Oct 08 '22 at 17:38

Oscar Smith · Answer 1 · 2022-09-04T06:18:03.887

7

The simplest solution is to store a bitmask of which numbers are prime. This takes 1 bit per integer, which is enough to determine primality for all numbers up to roughly $2^{18}$.

It is, however, possible to do much better than this by not storing multiples of 2, 3, or 5. This will increase the size of the code a little bit, but will let you cover 3.75x more integers (roughly up to 900,000).

This still isn't optimal since only roughly 1/3rd of these bits are 1s, which implies this data could be further compressed, but this is approaching the limit. I would be very surprised if it's possible to store anything more than the primes up to 2,000,000 or so.
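A minimal sketch of the 2/3/5 scheme described above, assuming one byte per run of 30 integers with one bit for each of the 8 residues that are not divisible by 2, 3 or 5. The names `RESIDUES`, `pack` and `is_prime` are illustrative choices, not something taken from the answer.

```python
from math import isqrt

# Residues modulo 30 not divisible by 2, 3 or 5; each one owns a bit of a byte.
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)
BIT_OF = {r: i for i, r in enumerate(RESIDUES)}


def sieve(limit):
    """Sieve of Eratosthenes: flags[n] == 1 exactly when n is prime (n < limit)."""
    flags = bytearray([1]) * limit
    flags[0:2] = b"\x00\x00"
    for i in range(2, isqrt(limit) + 1):
        if flags[i]:
            flags[i * i::i] = bytes(len(range(i * i, limit, i)))
    return flags


def pack(limit):
    """Pack primality of every n < limit into limit // 30 bytes."""
    if limit % 30:
        raise ValueError("limit must be a multiple of 30")
    flags = sieve(limit)
    packed = bytearray(limit // 30)
    for n in range(7, limit):
        if flags[n]:  # every prime >= 7 has a residue listed in RESIDUES
            packed[n // 30] |= 1 << BIT_OF[n % 30]
    return bytes(packed)


def is_prime(packed, n):
    """O(1) lookup; n must be below 30 * len(packed)."""
    if n in (2, 3, 5):
        return True
    bit = BIT_OF.get(n % 30)
    if n < 2 or bit is None:
        return False
    return bool(packed[n // 30] >> bit & 1)
```

With a full 30KB file (30720 bytes) this layout covers the integers below $30 \cdot 30720 = 921{,}600$, which is the "roughly up to 900,000" figure.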

edited Sep 04 '22 at 06:18

answered Sep 03 '22 at 18:46

Oscar Smith


Wheel factorization has entered chat. – Joshua Sep 04 '22 at 02:00
I think I'm misinterpreting your answer: if it takes 1 bit per integer and you have 30KB, wouldn't you have $30 \cdot 1024 \cdot 8 = 245760$ bits? (And this x3.75 is well over 300k?) – ryan f Sep 04 '22 at 06:06
wow, I'm an idiot. I did the math for a sieve, but forgot how many bits are in a byte. – Oscar Smith Sep 04 '22 at 06:17

ryan f · Answer 2 · 2022-09-06T00:54:33.607

I want to note what happens if you add "Wheel Factorization" to Oscar Smith's answer.

Wheel Factorization

Wheel Factorization requires some basis primes (e.g. $\{2, 3\}$).
We'll call the product of these basis primes $p$ (e.g. $2 \cdot 3 = 6$).

We now organize numbers in a table with $p$ columns, and eliminate the following numbers:

Any number in the first row that is divisible by a basis number (other than the basis numbers themselves):

1     2     3     #     5     #
7     8     9     10    11    12
13    14    15    16    17    18
19    20    21    22    23    24
25    26    27    28    29    30
...

Any number under our basis numbers or eliminated numbers:

1     2     3     #    5      #
7     #     #     #    11     #
13    #     #     #    17     #
19    #     #     #    23     #
25    #     #     #    29     #
...

Wheel Factorization doesn't remove all composites; some, such as 25 in the table above, remain in the list.
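The elimination step can be sketched in a few lines. This assumes the basis primes are given as a tuple; `wheel_columns` and `survivors` are hypothetical helper names used only for illustration.

```python
from math import gcd, prod


def wheel_columns(basis):
    """Return (p, columns): the wheel size and the residues 1..p coprime to p."""
    p = prod(basis)
    return p, [c for c in range(1, p + 1) if gcd(c, p) == 1]


def survivors(basis, limit):
    """Numbers in (p, limit] that the wheel does not eliminate."""
    p, columns = wheel_columns(basis)
    keep = set(columns)
    return [n for n in range(p + 1, limit + 1) if n % p in keep]


print(wheel_columns((2, 3)))   # (6, [1, 5])
print(survivors((2, 3), 30))   # [7, 11, 13, 17, 19, 23, 25, 29]
```

The second line of output still contains 25, which is why the surviving columns need a stored primality bit per entry rather than being assumed prime.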

Packing algorithm

We need to save the first row, but everything after the first row can be compressed nicely:

7   11    Yes Yes    11
13  17 -> Yes Yes -> 11
19  23    Yes Yes    11
25  29    No  Yes    01
...

We also need to save which columns were eliminated, which can again be done bitwise. We'll call this our header:

         7  #  #  #  11  #
                 |
                 v
header = 0  1  1  1   0  1

Unpacking algorithm

To pinpoint a number like $25$, we compute which row & column it's in:

$$r = \lfloor(25 - 1) / 6\rfloor = \color{red}{4}$$ $$c = ((25 - 1) / 6 - r) \cdot 6 = \color{blue}{0}$$

We then check our $c$th ($\color{blue}{0}$th) bit in our header. If it's eliminated (the bit is flipped on to 1), then we know it's not a prime. In this case the bit is 0, meaning we have to look further in the table.

We check the first item in the $r$th ($\color{red}{4}$th) row, and find it's a $0$, finally telling us it's composite:

  11    7   11
  11    13  17
  11    19  23
> 01 -> 25  29
 ..    ...

I've brushed over some details, which aren't really important unless you're actually implementing this:

You need to save the first row for numbers $\leq p$. You'll also need to know how many 0s came before the $c$th bit in the header. In the example there were no 0s before the $c$th bit meaning we had to "check the first item".
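Putting the packing and unpacking steps together, here is a rough sketch under a few simplifying assumptions: bits are kept as Python lists of 0/1 instead of being packed into bytes, primality is decided by trial division, and the function names (`pack`, `lookup`) are made up for this example.

```python
from math import gcd, isqrt, prod


def naive_is_prime(n):
    """Trial division, only used to build the table."""
    return n >= 2 and all(n % d for d in range(2, isqrt(n) + 1))


def pack(basis, rows):
    """Build header, first row and `rows` compressed rows for the given basis."""
    p = prod(basis)
    # header[c] == 1 means column c (numbers congruent to c + 1 mod p) is eliminated.
    header = [0 if gcd(c + 1, p) == 1 else 1 for c in range(p)]
    first_row = [int(naive_is_prime(n)) for n in range(1, p + 1)]
    kept = [c for c in range(p) if header[c] == 0]
    body = [[int(naive_is_prime(r * p + c + 1)) for c in kept]
            for r in range(1, rows + 1)]
    return p, header, first_row, body


def lookup(packed, n):
    """Primality of n >= 1, for n within the packed range."""
    p, header, first_row, body = packed
    r, c = divmod(n - 1, p)           # row and column, as in the example for 25
    if r == 0:
        return bool(first_row[c])     # numbers <= p come from the first row
    if header[c]:
        return False                  # eliminated column
    slot = header[:c].count(0)        # number of 0s before the c-th header bit
    return bool(body[r - 1][slot])


table = pack((2, 3), rows=4)
print(table[1])            # [0, 1, 1, 1, 0, 1]
print(table[3])            # [[1, 1], [1, 1], [1, 1], [0, 1]]
print(lookup(table, 25))   # False
```

Counting the zeros with `header[:c].count(0)` costs $O(p)$ per lookup; a real implementation would likely precompute that count once per column.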

Total size

To save the primality lookup table up to $N$ for a basis set of primes $B = \{b_1, b_2, b_3, \dots, b_n \}$ we need:

Header with $p$ bits, where $p = \prod_{i=1}^{n} b_i$
The first row, or primality of all numbers $\leq p$, which is another $p$ bits.
Everything after the first row:
- Let $k$ be the number of columns that aren't eliminated, or: $$k = |\{m \in \mathbb{N} \setminus \{0\}\ |\ m \leq p \land \forall b \in B\,(m \text{ is not divisible by } b) \}|$$
- Then everything after the first row takes $k \lfloor N/p \rfloor$ bits (since $\lfloor N/p \rfloor$ is the number of rows).

Which totals to: $2p + k\lfloor N/p \rfloor$ bits. We can find how many integers you can fit in 30KB by solving for $N$:

If $B = \{2, 3\}$, then $p = 6, k = 2$
- $N=737,244$ fits in 30KB (245,760 bits).
If $B = \{2, 3, 5, 7, 11\}$, then $p = 2,310, k = 480$
- $N=1,160,486$ fits in 30KB.
If $B = \{2, 3, 5, 7, 11, 13\}$, then $p = 30,030, k= 5,760$
- $N = 960,960$ fits in 30KB.
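The size formula can be checked numerically. The sketch below assumes only whole rows are stored, so it returns the largest multiple of $p$ with $2p + k(N/p) \le 245{,}760$; `capacity` is a hypothetical helper name.

```python
from math import gcd, prod

BUDGET_BITS = 30 * 1024 * 8  # 30KB = 245,760 bits


def capacity(basis, budget_bits=BUDGET_BITS):
    """Return (p, k, N) for the header + first row + compressed rows layout."""
    p = prod(basis)
    k = sum(1 for m in range(1, p + 1) if gcd(m, p) == 1)
    rows = max(0, (budget_bits - 2 * p) // k)  # whole rows that still fit
    return p, k, rows * p


bases = [(2,), (2, 3), (2, 3, 5), (2, 3, 5, 7),
         (2, 3, 5, 7, 11), (2, 3, 5, 7, 11, 13)]
for basis in bases:
    print(basis, capacity(basis))

best = max(bases, key=lambda b: capacity(b)[2])
print("best basis:", best)
```

Because this version rounds down to whole rows, it gives $N = 1{,}159{,}620$ for the basis $\{2, 3, 5, 7, 11\}$, marginally below the figure quoted above, which comes from solving the formula without the floor. The other two values, 737,244 and 960,960, come out the same.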

Conclusion

So the best basis you can get with this particular "Wheel Factorization" approach is $\{2, 3, 5, 7, 11\}$, which covers the integers up to 1,160,486.

Any larger basis will cause $N$ to decrease. This is because the header & first row are uncompressed, meaning the header and first row will eventually become larger than 30KB.

If you move the header and first row into the program, you won't have this limit, but you'll be storing quite a lot in the program. For some perspective, here are some $p,k$ values:

p           k          M = (245760p)/k, rounded down
30030       5760       1281280
510510      92160      1361360
9699690     1658880    1436991
223092870   36495360   1502308
...

You'll be storing $2p$ bits in the program, to store the first $M$ integers. After $p=510,510$, it seems to become impractical.
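For the variant where the header and first row live in the program, the reachable range is just the bit budget scaled by $p/k$. A small sketch for wheels built from consecutive primes, using the fact that the number of surviving columns is $k = \prod (b_i - 1)$; the function name `body_only_capacity` is an assumption of this example.

```python
BUDGET_BITS = 30 * 1024 * 8  # all 245,760 bits go to the compressed rows


def body_only_capacity(basis):
    """Return (p, k, M) with M = floor(BUDGET_BITS * p / k)."""
    p = k = 1
    for b in basis:
        p *= b
        k *= b - 1  # each basis prime b leaves b - 1 of every b columns
    return p, k, BUDGET_BITS * p // k


primes = [2, 3, 5, 7, 11, 13, 17, 19, 23]
for n in range(6, len(primes) + 1):
    p, k, m = body_only_capacity(primes[:n])
    print(f"{p:>10} {k:>9} {m:>8} program bits: {2 * p}")
```

Each extra basis prime $b$ only improves $M$ by a factor of $b/(b-1)$, while the $2p$ bits held in the program grow by a factor of $b$.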

TL;DR: This particular Wheel Factorization approach can only save ~5 integers per bit.

one thing you could do is use a sieve program to generate the header on the fly. — Oscar Smith, Sep 05 '22 at 16:50
@OscarSmith That's a good idea. I was thinking after the first few primes are decompressed, you could create the header for a larger wheel to decompress more. However I'm not sure if there will be enough space left for it to be worthwhile. — ryan f, Sep 06 '22 at 00:21

score 0 · Answer 3 · answered Sep 04 '22 at 03:07


As a compromise between speed and compression, I'd pack 30 integers per byte by marking primes as 30k+(1,7,11,13,17,19,23,29). With 30KB this would give the primes up to about 920,000.

You can do better by using larger "wheel numbers": 210, 2310, 30030, etc.

At some point (I don't know where), one can store the differences from the current prime to the next. Maybe a universal code would help here. I don't know if a universal code would be good in making a prime list.
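One way to explore the gap idea is to count how many bits a universal code would need. The sketch below assumes Elias gamma coding of the halved gaps between consecutive odd primes; the choice of code is an assumption of this example, since the answer does not name a specific one.

```python
from math import isqrt


def primes_below(limit):
    """All primes below limit, via a sieve of Eratosthenes."""
    flags = bytearray([1]) * limit
    flags[0:2] = b"\x00\x00"
    for i in range(2, isqrt(limit) + 1):
        if flags[i]:
            flags[i * i::i] = bytes(len(range(i * i, limit, i)))
    return [n for n in range(limit) if flags[n]]


def gamma_bits(n):
    """Length in bits of the Elias gamma code of a positive integer n."""
    return 2 * (n.bit_length() - 1) + 1


def gap_code_bits(primes):
    """Total bits to gamma-code (gap / 2) between consecutive odd primes."""
    odd = [q for q in primes if q > 2]
    return sum(gamma_bits((b - a) // 2) for a, b in zip(odd, odd[1:]))


limit = 30 * 30720                      # range covered by the 30k + r bitmap
total = gap_code_bits(primes_below(limit))
print(total, "bits for gaps vs", 30720 * 8, "bits for the bitmap")
```

Note that a gap code also loses direct indexing: finding the $n$th prime means decoding from the start, or storing extra checkpoints.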


ttw


The differences method isn't good. I did that math, and the problem is that prime gaps grow quickly enough (in the worst case) that you would have to do some pretty fancy encoding of the gaps to win out. – Oscar Smith Sep 04 '22 at 03:10

I think one has to get to millions of decimal places for the gap method to be worthwhile. That's why I compromised on 8/30 for the compression. – ttw Sep 04 '22 at 03:46

score 0 · Answer 4 · answered Sep 04 '22 at 07:07

I'm assuming the interface you want is to be able to retrieve the $n$th prime given $n$.

The simplest way would be to store an array of primes, three bytes per prime. That's $30 \cdot 1024 / 3 = 10240$ primes that can fit into 30KB.

One very simple improvement is to group the array into blocks of 12 primes:

$$\boxed{\color{red}{2}, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37}, \boxed{\color{red}{41}, 43, 47, \cdots}$$

The first prime in each block, the base, is stored in three bytes. The rest of the primes in the block are stored as a difference from the base in one byte. The block size of 12 was chosen empirically to make the differences fit.

That's 12 primes in 3+11=14 bytes, or $30\cdot 1024 \cdot (12/14) \approx 26330$ primes that fit in 30KB.

Another slight improvement: by removing 2 from the list, all the remaining primes are odd, so the differences from the base will all be even. So instead of storing the difference, we can store half the difference. Now differences of up to 510 will fit in one byte, so we can increase the block size up to 29.

29 primes in 3+28=31 bytes, or $30\cdot 1024 \cdot (29/31) \approx 28730$ primes that fit in 30KB.

Here's a small Python script demonstrating the packing/retrieval code for this.
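The following is an independent sketch of the 29-prime block layout described above, not necessarily the answerer's own script. It assumes big-endian 3-byte bases and 0-based indexing over the odd primes (2 is handled by the caller), and the names `odd_primes`, `pack` and `nth_odd_prime` are illustrative.

```python
from math import isqrt

BLOCK = 29  # primes per block: one 3-byte base plus 28 one-byte half-differences


def odd_primes(count):
    """First `count` odd primes, sieving a doubling range until enough are found."""
    limit = 64
    while True:
        flags = bytearray([1]) * limit
        flags[0:2] = b"\x00\x00"
        for i in range(2, isqrt(limit) + 1):
            if flags[i]:
                flags[i * i::i] = bytes(len(range(i * i, limit, i)))
        found = [n for n in range(3, limit) if flags[n]]
        if len(found) >= count:
            return found[:count]
        limit *= 2


def pack(primes):
    """Encode odd primes as blocks of base (3 bytes) + half-differences (1 byte each)."""
    out = bytearray()
    for start in range(0, len(primes), BLOCK):
        block = primes[start:start + BLOCK]
        base = block[0]
        out += base.to_bytes(3, "big")
        for q in block[1:]:
            half = (q - base) // 2  # both odd, so the difference is even
            if half > 255:
                raise ValueError("difference does not fit in one byte")
            out.append(half)
    return bytes(out)


def nth_odd_prime(packed, n):
    """Return the n-th odd prime (n = 0 gives 3) with O(1) work."""
    block, j = divmod(n, BLOCK)
    offset = block * (BLOCK + 2)  # every full block occupies 3 + 28 = 31 bytes
    base = int.from_bytes(packed[offset:offset + 3], "big")
    if j == 0:
        return base
    return base + 2 * packed[offset + 2 + j]
```

The `ValueError` check in `pack` is there because the block size of 29 is an empirical choice: if some block ever spanned more than 510, the layout would need a smaller block size.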