
Let's say I have a set of 60 numbers from 0-100,000, like [81, 98, 115, 189, 254, ... , 97866, 98441, 99671], all unique and strictly increasing. Would it be mathematically possible to compress this sequence by 80-90%? So far I've tried gzip, which only compresses by about 50%; of the other algorithms I tried, the best one compressed by about 68%, almost 69%. I have not yet tried combining these algorithms to get to 80-90%, but is that mathematically possible?

The limit I am currently trying to reach is a byte representation smaller than the number of elements in my set. So, for example, if this set [81, 98, 115, 189, 254, ... , 97866, 98441, 99671] could be translated to something around 50 bytes in size, which is below 60, that would work great for me.

I also know about delta encoding, i.e. subtracting each value from the one after it, for example [81, 98, 115] → [81, 17, 17], but on its own this doesn't achieve what I am after.
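To be explicit, this is the transformation I mean (a small Python sketch; the function names are mine, just for illustration):

```python
# Delta encoding: keep the first value, then the successive differences.
def delta_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# Inverse: a running sum restores the original sequence.
def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

assert delta_encode([81, 98, 115]) == [81, 17, 17]
assert delta_decode([81, 17, 17]) == [81, 98, 115]
```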

gushkash

3 Answers


From a counting point of view, there are $\binom{100000}{60}$ such sequences, so by numbering them you could theoretically compress one down to $\log_2\binom{100000}{60}$ bits, but no further. This is roughly 724 bits, i.e. about 90 octets.
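For a quick sanity check of that bound (a Python sketch; `math.comb` computes the binomial coefficient exactly on big integers):

```python
from math import comb, log2

n, k = 100_000, 60
bits = log2(comb(n, k))   # log2 of the number of possible 60-element subsets
print(bits, bits / 8)     # ≈ 724.4 bits ≈ 90.6 octets
```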

Nathaniel

To expand on Nathaniel's answer, you can number the subsets efficiently using the combinatorial number system.

In short, say your subset is $[c_1, c_2, \ldots, c_k]$ with $c_1 < c_2 < \cdots < c_k$ (here $k = 60$). Compute $N = \binom{c_1}{1} + \binom{c_2}{2} + \cdots + \binom{c_k}{k}$ (note that we use the convention that $\binom{n}{m} = 0$ for $n < m$). This number $N$ uniquely represents your subset, and you can simply store it.
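A sketch of the encoding step in Python (the subset must be sorted increasingly; `math.comb` gives exact big-integer results):

```python
from math import comb

def rank(subset):
    # N = C(c_1, 1) + C(c_2, 2) + ... + C(c_k, k)
    return sum(comb(c, i) for i, c in enumerate(subset, start=1))

print(rank([81, 98, 115]))  # 81 + 4753 + 246905 = 251739
```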

Going the other way, we want to recover the subset from the number $N$. Start by finding the maximum element $c_k$: find the largest value $c$ among $k-1, k, k+1, \ldots$ for which $\binom{c}{k} \leq N$; this is the largest element of the subset, i.e. $c_k$. Now replace $N$ with $N - \binom{c_k}{k}$, replace $k$ with $k-1$, and repeat the procedure until $k$ reaches $0$.
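And the decoding step, continuing the sketch above (the inner loop is a plain linear search; a binary search over $c$ would also work):

```python
def unrank(N, k):
    subset = []
    while k > 0:
        c = k - 1                   # smallest candidate: comb(k-1, k) = 0 <= N
        while comb(c + 1, k) <= N:  # advance to the largest c with comb(c, k) <= N
            c += 1
        subset.append(c)            # this is the current largest element, c_k
        N -= comb(c, k)
        k -= 1
    return subset[::-1]             # built from largest to smallest, so reverse

assert unrank(251739, 3) == [81, 98, 115]
```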

To make this efficient time-wise, give some thought to how you compute the binomial coefficients. Perhaps the most hassle-free approach is to compute them on demand with a recursive function plus memoization: that way you reuse some of the work done for previous binomial coefficients when computing a new one, and you don't spend time on coefficients you never end up needing.
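A direct transcription of that idea in Python, shown only to illustrate the memoization pattern (in Python itself, `math.comb` already exists and is the hassle-free choice; note also that for $n$ near 100,000 this naive recursion would exceed Python's default stack depth):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binom(n, m):
    # Pascal's rule with memoization; convention: binom(n, m) = 0 for n < m.
    if m == 0:
        return 1
    if n < m:
        return 0
    return binom(n - 1, m - 1) + binom(n - 1, m)

print(binom(115, 3))  # 246905
```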

Tassle

You can attempt to resolve the sequence into an algebraic form, or use one of the series identities (arithmetic, geometric, harmonic, or statistical series). Curve fitting or interpolation, such as a Fourier series or a Z-transform, might work for periodic structures. I am working on the same type of issue, but I have a trillion-trillion points to compress; my plan is to use a dictionary-lookup type of compression, because my data has repeating partitions. Also, try to find common factors in your data and use them to your advantage.
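As one hedged illustration of the curve-fitting idea (my own toy example, assuming NumPy, not a general method): fit a low-degree polynomial to (index, value) and store just the coefficients plus the integer residuals, which are typically much smaller numbers than the raw values and therefore cheaper to encode:

```python
import numpy as np

values = np.array([81, 98, 115, 189, 254])  # first few values from the question
x = np.arange(len(values))

coeffs = np.polyfit(x, values, deg=2)                   # a few floats to store
pred = np.rint(np.polyval(coeffs, x)).astype(np.int64)  # rounded model values
residuals = values - pred                               # small integers to store

assert (pred + residuals == values).all()               # exact reconstruction
```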