TL;DR: use maximum likelihood and discrete optimization.
Evaluating candidate models: the maximum likelihood principle
If you have a candidate model, you can evaluate how well it fits the data using the maximum likelihood principle.
If $M$ is a model and $x$ is a string, let $P(x|M)$ denote the probability of outputting string $x$ when $M$ is the true model. Here I assume a generative model that produces $x$ as follows: at each step, it randomly picks one term $g_i \times n_i$ from $M$, appends $n_i$ copies of the string $g_i$ to the output, and repeats until some stopping point (say, stops once we've generated a string of fixed length).
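To make that generative process concrete, here's a minimal sketch in Python. The uniform choice among the terms of $M$ and the truncate-to-length stopping rule are assumptions of mine; the exact details won't matter much for what follows.

```python
import random

def sample_string(model, target_len):
    """Sample a string from the generative model described above.

    `model` is a list of (token, repeat_factor) pairs.  At each step we pick
    one term uniformly at random (an assumption) and append repeat_factor
    copies of its token, stopping once the output reaches target_len
    characters (truncating any overshoot -- also an assumption).
    """
    out, length = [], 0
    while length < target_len:
        token, n = random.choice(model)
        chunk = token * n
        out.append(chunk)
        length += len(chunk)
    return "".join(out)[:target_len]

# Example: the model {01 x 3, 1 x 2}
print(sample_string([("01", 3), ("1", 2)], 20))
```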
Of course in practice we have the reverse problem: we have observed a fixed string $x$, and want to infer $M$. Now we'll treat the observation $x$ as fixed. We define the likelihood of $M$ to be $L(M) = P(x|M)$. If we have observed multiple strings $x_1,\dots,x_m$, then we define the likelihood of $M$ to be $L(M) = P(x_1|M) \times \cdots \times P(x_m|M)$.
The intuition is: models with larger likelihood fit the data better. So, if you have a choice of multiple models, choose the one with the largest likelihood -- that's the one that seems most consistent with the data.
In practice, for computational reasons, we often deal with the log-likelihood, $\log L(M)$. We choose the model whose log-likelihood is largest. Since the log is monotone, this doesn't change anything fundamental.
One caveat: if you're comparing a simple model to a complex model, picking whichever has the larger likelihood introduces a risk of overfitting. The likelihood alone doesn't account for Occam's razor: the principle that, all else being equal, simpler models are more likely to represent the truth. This can be fixed by introducing some kind of regularization, e.g., a penalty that grows with the number of terms in the model.
Finally, note that the likelihood of a model can be computed efficiently using dynamic programming. For each prefix $w$ of $x$, we compute the probability of generating $w$ in terms of the values already computed for shorter prefixes, working from shorter prefixes to longer ones, until we have computed $P(x|M) = L(M)$. If you don't immediately see how to do this computation, ask a separate question; it's a standard dynamic programming exercise. If you're dealing with long strings, you might want to work with log-likelihoods rather than likelihoods, to avoid underflow.
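In case it helps, here's a rough sketch of that dynamic program, under two simplifying assumptions: each term of $M$ is chosen uniformly at random, and $x$ decomposes exactly into whole chunks (no truncated final chunk). Adapt it to whatever generative model you actually have.

```python
def likelihood(x, model):
    """P(x | M) via dynamic programming.

    `model` is a list of (token, repeat_factor) pairs.  D[j] holds the
    probability of generating exactly the prefix x[:j]; each term of the
    model is assumed to be chosen uniformly at random at each step.
    """
    p_term = 1.0 / len(model)                # probability of picking any one term
    chunks = [t * n for (t, n) in model]     # the string each term emits
    D = [0.0] * (len(x) + 1)
    D[0] = 1.0
    for j in range(1, len(x) + 1):
        for chunk in chunks:
            L = len(chunk)
            if L <= j and x[j - L:j] == chunk:
                D[j] += p_term * D[j - L]
    return D[len(x)]

# Example: "01010111" under the model {01 x 3, 1 x 2} has probability 1/4.
print(likelihood("01010111", [("01", 3), ("1", 2)]))
```

For long strings, the same recurrence can be run in log space, combining terms with a log-sum-exp at each step.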
Fixed-length tokens
If all tokens have the same length, it's probably fairly easy to find a good model. Assume we know the length $\ell$ of all tokens in the model; if we don't, we can try each possibility for $\ell$, one at a time, and take the one that yields the best model.
Since we know the length $\ell$, we can divide the string $x$ up into tokens of length $\ell$. In this way we can see the set of all tokens that appear in $x$, say $t_1,\dots,t_k$. Now we know that the model must be of the form
$$M = \{t_1 \times n_1, \dots, t_k \times n_k\}$$
and we merely need to infer the numbers $n_1,\dots,n_k$.
Let's focus on the token $t_1$ and see how to infer $n_1$. We can find all occurrences of $t_1$ in $x$, combine them into maximal runs of contiguous repeats, and let $S_1$ denote the set of run lengths. For instance, if at one place we see $t_1$ repeated 3 times consecutively, and at another place we see $t_1$ repeated 9 times consecutively, then we have $S_1 = \{3,9\}$. At this point we simply take $n_1 = \gcd S_1$: since every run must be built out of whole chunks of $n_1$ consecutive copies of $t_1$, $n_1$ must divide every element of $S_1$, and the gcd is the largest such choice.
We'll of course repeat this for each token $t_i$. We end up with a complete model, as desired.
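Putting the fixed-length case together, a sketch might look like the following (assuming, for simplicity, that $|x|$ is a multiple of $\ell$):

```python
from functools import reduce
from itertools import groupby
from math import gcd

def infer_fixed_length_model(x, ell):
    """Infer a model from x, assuming every token has length ell and that
    len(x) is a multiple of ell.  Returns a dict token -> repeat-factor."""
    tokens = [x[i:i + ell] for i in range(0, len(x), ell)]
    run_lengths = {}  # token -> the set S_i of contiguous-run lengths
    for token, run in groupby(tokens):
        run_lengths.setdefault(token, set()).add(sum(1 for _ in run))
    return {t: reduce(gcd, S) for t, S in run_lengths.items()}

# Example: runs of "01" of lengths 3 and 6, and a run of "11" of length 2.
print(infer_fixed_length_model("01" * 3 + "11" * 2 + "01" * 6, 2))
# -> {'01': 3, '11': 2}
```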
A technical detail: This assumes that each token $t_i$ is listed only once in $M$, with a single repeat-factor $n_i$. In other words, it assumes the model is allowed to look like $\{00 \times 4\}$ but not $\{00 \times 2, 00 \times 3\}$ (the latter has the token $00$ with two different repeat-factors). If you want to consider the latter kind of model, the problem reduces to finding a set of repeat-factors $R_1$ such that every element of $S_1$ can be expressed as a non-negative integer combination of the elements of $R_1$. The optimal solution will depend on the form of regularization you use; without regularization, the optimal solution will always be to simply take $R_1$ to have a single element, $R_1 = \{\gcd S_1\}$. So if you want to consider models where the same token appears twice, you'll need to specify a particular form of regularization (ask a new question). For now, I'll assume such models aren't of interest.
So this shows how to solve the problem, in the easy case where all tokens have the same length.
Variable-length tokens
Handling models where the lengths of the tokens are not all the same looks much more challenging. I can suggest one possible approach, but the best approach will probably depend on the parameter settings you're encountering in practice.
I suggest reducing this to a discrete optimization problem. In particular, I suggest you identify a set of tokens $t_1,\dots,t_k$ that you're confident will be a superset of the ones in the real model, and then use optimization methods to solve for the repeat-factors $n_1,\dots,n_k$ that maximize the likelihood of the model.
In more detail: Fix the set of $t_1,\dots,t_k$. Now the model looks like
$$M = \{t_1 \times n_1, \dots, t_k \times n_k\}$$
where the $t_i$'s are known and the $n_i$'s are unknown (variables). Consequently we can think of the likelihood $L(M)$ as a function of the $n_i$'s: given any candidate values for $n_1,\dots,n_k$, we can compute $L(M)$ using dynamic programming.
So, I'd suggest you use some existing optimization strategy to find $n_1,\dots,n_k \in \mathbb{N}$ that maximize $L(M)$. A natural approach is probably some form of local search, e.g., hillclimbing, hillclimbing with random restarts, or simulated annealing. A suggestion for a set of "local moves" would be to pick a single $n_i$ and change it via one of the following operations: multiply $n_i$ by a small prime number; divide $n_i$ by a small prime divisor of it; set $n_i$ to zero; change $n_i$ from zero to a small number; increment $n_i$; decrement $n_i$.
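To make that concrete, here is one way the local search might look: a plain hill-climber using the move set above. The `log_likelihood` argument is assumed to be the dynamic program from earlier, run in log space, and the particular constants (which small primes, how to revive a zeroed-out token) are arbitrary choices of mine; simulated annealing would replace the greedy acceptance rule with a temperature-dependent one.

```python
SMALL_PRIMES = [2, 3, 5, 7]

def neighbors(ns):
    """Local moves: pick one position i, then multiply/divide n_i by a small
    prime, set it to zero, revive it from zero with a small value, or
    increment/decrement it."""
    for i, n in enumerate(ns):
        moves = {0, n + 1, n - 1}
        for p in SMALL_PRIMES:
            moves.add(n * p)
            if n > 0 and n % p == 0:
                moves.add(n // p)
        if n == 0:
            moves.update(range(1, 5))
        for m in moves:
            if m >= 0 and m != n:
                yield ns[:i] + [m] + ns[i + 1:]

def hillclimb(x, tokens, init_ns, log_likelihood, max_steps=1000):
    """Greedy hill-climbing over the repeat-factors n_1..n_k.  Tokens whose
    repeat-factor is 0 are dropped from the model before scoring."""
    def score(ns):
        model = [(t, n) for t, n in zip(tokens, ns) if n > 0]
        return log_likelihood(x, model) if model else float("-inf")

    best, best_score = list(init_ns), score(init_ns)
    for _ in range(max_steps):
        move = max(neighbors(best), key=score)   # steepest single local move
        if score(move) <= best_score:
            break                                # local optimum reached
        best, best_score = move, score(move)
    return best, best_score
```

Adding random restarts is then just a matter of calling `hillclimb` from several initial guesses and keeping the highest-scoring result.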
How do we find the set $t_1,\dots,t_k$ of tokens? Here a convenient fact is that we don't have to get this set exactly right; it suffices for it to be a superset of the true set of tokens in the actual model. In particular, setting $n_i=0$ is equivalent to removing the token $t_i$ from the model entirely. So, we can choose a larger-than-necessary set of tokens $t_1,\dots,t_k$ and let the optimization routine effectively decide which tokens should be retained and which should be eliminated. One heuristic would be to choose $t_1,\dots,t_k$ to be the set of all bit-strings in a certain range of lengths (e.g., all bit-strings of length 2 or 3). Another heuristic would be to use some kind of filtering condition: use the set of all bit-strings $t$ that appear at least some minimum number of times in $x$. The nice thing is that we can try each of these choices in turn, apply the optimizer to each, get a list of candidate models, and choose the best one (using the maximum-likelihood principle). For instance, it might not be clear how to choose a threshold for the filtering, but we can try multiple values in an exponentially decreasing sequence and keep the best model obtained.
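Both heuristics for the candidate token set might be sketched like this; the particular lengths and threshold are placeholders to be tuned.

```python
from collections import Counter

def candidate_tokens(x, lengths=(1, 2, 3), min_count=None):
    """All substrings of x whose length is in `lengths`, optionally filtered
    to those that appear at least `min_count` times in x."""
    counts = Counter()
    for L in lengths:
        for i in range(len(x) - L + 1):
            counts[x[i:i + L]] += 1
    if min_count is None:
        return sorted(counts)
    return sorted(t for t, c in counts.items() if c >= min_count)
```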
Similarly, it's also possible to come up with heuristics for the initial values of $n_1,\dots,n_k$ to feed to the optimizer (this will help some optimizers converge to a better solution). For instance, for each token $t_i$ and each candidate repeat-factor $r$, you could count the number of times that $t_i$ appears repeated $r$ times in a row, then choose the value of $r$ that has the highest count as the initial guess for $n_i$.
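One reading of that heuristic, as a sketch: count the maximal runs of the token in $x$ and take the most common run length as the initial guess (the fallback value and tie-breaking here are arbitrary choices of mine).

```python
import re
from collections import Counter

def initial_repeat_factor(x, token):
    """Heuristic initial guess for n_i: find every maximal run of consecutive
    copies of `token` in x and return the run length that occurs most often.
    Falls back to 1 if the token never appears."""
    pattern = '(?:' + re.escape(token) + ')+'
    run_lengths = [len(m.group(0)) // len(token) for m in re.finditer(pattern, x)]
    if not run_lengths:
        return 1
    return Counter(run_lengths).most_common(1)[0][0]
```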
How well will this work? I don't know. It will probably depend a lot on the parameters of the problem instances you run into in practice. I would suggest you try it on your data sets, with several different optimization methods, and fiddle with the parameters a bit. If it doesn't work, ask another question where you show us what you've tried, and also show us the typical range of values for the most important parameters: the number of tokens in the model ($k$), the range of lengths of the tokens themselves, the range of values of the repeat-factors $n_i$, and the length of the string $x$.