
Let's say I have a set of 60 numbers from 0-100,000, like [81, 98, 115, 189, 254, ... , 97866, 98441, 99671], all unique and strictly increasing. Would it be mathematically possible to compress this sequence by 80-90%? So far I've tried gzip, which only compresses by about 50%; of the other algorithms I tried, the best one compressed by about 68%, almost 69%. I have not yet tried combining these algorithms to get to 80-90%, but is that mathematically possible?

The limit I am currently trying to reach is a byte representation smaller than the number of elements in my set. So, for example, if this set [81, 98, 115, 189, 254, ... , 97866, 98441, 99671] could be translated to something around 50 bytes in size, which is below 60, that would work great for me.

I also know about delta encoding, i.e. subtracting each value from the one after it, for example [81, 98, 115] → [81, 17, 17], but on its own this doesn't achieve what I am after.
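To be explicit, this is the transformation I mean (a small Python sketch; the function names are mine, just for illustration):

```python
# Delta encoding: keep the first value, then the successive differences.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# Inverse: a running sum restores the original sequence.
def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

assert delta_encode([81, 98, 115]) == [81, 17, 17]
assert delta_decode([81, 17, 17]) == [81, 98, 115]
```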

gushkash

3 Answers


From a counting point of view, there are $\binom{100000}{60}$ such sequences, so by numbering them you could theoretically compress one down to $\log_2\binom{100000}{60}$ bits, but no further. This is roughly 724 bits, i.e. about 90 octets.
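For a quick sanity check of that bound (a Python sketch; `math.comb` computes the binomial coefficient exactly on big integers):

```python
from math import comb, log2

n, k = 100_000, 60
bits = log2(comb(n, k))   # log2 of the number of possible 60-element subsets
print(bits, bits / 8)     # ≈ 724.4 bits ≈ 90.6 octets
```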

Nathaniel

To expand on Nathaniel's answer, you can number the subsets efficiently using the combinatorial number system.

In short, say your subset is $[c_1, c_2, \ldots, c_k]$ with $c_1 < c_2 < \cdots < c_k$ (here $k = 60$). Compute $N = \binom{c_1}{1} + \binom{c_2}{2} + \cdots + \binom{c_k}{k}$ (note that we use the convention that $\binom{n}{m} = 0$ for $n < m$). This number $N$ uniquely represents your subset, and you can simply store it.
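A sketch of the encoding step in Python (the subset must be sorted increasingly; `math.comb` gives exact big-integer results):

```python
from math import comb

def rank(subset):
    # N = C(c_1, 1) + C(c_2, 2) + ... + C(c_k, k)
    return sum(comb(c, i) for i, c in enumerate(subset, start=1))

print(rank([81, 98, 115]))  # 81 + 4753 + 246905 = 251739
```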

Going the other way, we want to recover the subset from the number $N$. Start by finding the maximum element $c_k$: find the largest value $c$ among $k-1, k, k+1, \ldots$ for which $\binom{c}{k} \leq N$; this is the largest element of the subset, i.e. $c_k$. Now replace $N$ with $N - \binom{c_k}{k}$, replace $k$ with $k-1$, and repeat the procedure until $k$ reaches $0$.
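And the decoding step, continuing the sketch above (the inner loop is a plain linear search; a binary search over $c$ would also work):

```python
def unrank(N, k):
    subset = []
    while k > 0:
        c = k - 1                   # smallest candidate: comb(k-1, k) = 0 <= N
        while comb(c + 1, k) <= N:  # advance to the largest c with comb(c, k) <= N
            c += 1
        subset.append(c)            # this is the current largest element, c_k
        N -= comb(c, k)
        k -= 1
    return subset[::-1]             # built from largest to smallest, so reverse

assert unrank(251739, 3) == [81, 98, 115]
```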

To make this efficient time-wise, give some thought to how you compute the binomial coefficients. Perhaps the most hassle-free approach is to compute them on demand with a recursive function plus memoization: that way you reuse some of the work done for previous binomial coefficients when computing a new one, and you don't spend time on coefficients you never end up needing.
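A direct transcription of that idea in Python, shown only to illustrate the memoization pattern (in Python itself, `math.comb` already exists and is the hassle-free choice; note also that for $n$ near 100,000 this naive recursion would exceed Python's default stack depth):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binom(n, m):
    # Pascal's rule with memoization; convention: binom(n, m) = 0 for n < m.
    if m == 0:
        return 1
    if n < m:
        return 0
    return binom(n - 1, m - 1) + binom(n - 1, m)

print(binom(115, 3))  # 246905
```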

Tassle

You can attempt to resolve the sequence into an algebraic form, or use one of the series identities (arithmetic, geometric, harmonic, or statistical series). Curve fitting or interpolation, such as a Fourier series or a Z-transform, might work for periodic structures. I am working on the same type of issue, but I have a trillion-trillion points to compress; my plan is to use a dictionary-lookup type of compression, because my data has repeating partitions. Also, try to find common factors in your data and use them to your advantage.
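As one hedged illustration of the curve-fitting idea (my own toy example, assuming NumPy, not a general method): fit a low-degree polynomial to (index, value) and store just the coefficients plus the integer residuals, which are typically much smaller numbers than the raw values and therefore cheaper to encode:

```python
import numpy as np

values = np.array([81, 98, 115, 189, 254])  # first few values from the question
x = np.arange(len(values))

coeffs = np.polyfit(x, values, deg=2)                   # a few floats to store
pred = np.rint(np.polyval(coeffs, x)).astype(np.int64)  # rounded model values
residuals = values - pred                               # small integers to store

assert (pred + residuals == values).all()               # exact reconstruction
```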