Probability of Binary Word Matching

Question

Thanks for reading. Would be lovely if somebody could help me out on this (but not just post the answer) but also how you got there. I'm a programmer and I've ran across this problem which I can't solve. Nobody taught me mathematical probability so I really only know the self-taught basics, please keep that in mind.

What is the probability of finding a binary word of length N within another binary word of length M where of course M $\geq$ N?

Under the assumption that N = 1, I've came up with this solution $p = 1-(\frac{1}{ 2})^M$ .. yeah very impressive.

How do I incoorporate the length of the first word N into this or do I have a completely incorrect approach?

[Criteria]

Lets assume that X can be either 1 or 0 then XX has $2^2=4$ combinations {(0,0), (0,1), (1,0), (1,1)}. The probability of finding either 1 or 0 in this word would be 0.75 (so it doesn't matter if it's found in the first or the second character). The probability of finding 11 for example would be 0.25

What matters is that the first word has to exist continually within the second. For instance we match 11 with a word of length 3 (XXX) with 8 combinations, matches would be at (011) (110) and (111) so the probability would be 0.375.

However match(11, 101) this is not allowed due to the zero in between, although (1, 101) is allowed.

Hope I've explained this clear enough without knowing the adequate terminology. P.S. since N > M makes no sense I assume a fraction to appear somewhere by sheer intuition.

@barakmanos in which part? probably a typo I can't find edit think I've corrected it now — Ilhan, Feb 03 '16 at 11:19

score 1 · Answer 1 · answered Feb 03 '16 at 17:06

The probability of a binary word of length $N$ occurring at position $i$ within another binary word of length $M\ge N$, where $1\le i\le M-N+1$, is $$P(N_i)=\frac{1}{2^N}$$ The probability of not finding word $N$ at position $i$ within word $M$ is $$P(\lnot N_i)=1-\frac{1}{2^N}$$ The probability of not finding word $N$ anywhere within word $M$ is $$P(\lnot N)=(1-\frac{1}{2^N})^{M-N+1}$$ Finally, the probability of finding word $N$ somewhere within word $M$ is $$P(N)=1-(1-\frac{1}{2^N})^{M-N+1}$$ Note that, for $N=1$, this equates to $1-(1-\frac{1}{2})^M=1-(\frac{1}{2})^M$ as you found.

Thank you a lot this is what I've been trying to figure out. Cheers! — Ilhan, Feb 03 '16 at 17:13

Toaster · Accepted Answer · 2017-06-01T16:16:05.787

Sorry but I found that the solution presented here isn't correct. I did try to simulate and confirm the formula proposed by user @Logophobic using random trials of a M bits vector.

Here what I did:

For each random realization of a vetor of size M I did a search to find at least one pattern inside the random generated vector.
I repeated the procedure K times and then I counted all matching results and divided them by the number of trials I did executed. This would give for a large number of trials a near approximation of the real value of the probability of having a pattern inside a vector of size M.

I found that simulated results are not exact but have a good precision. For M>30 it would need more trials to avoid larger errors.

I also verified some results by direct brute force all possible 5 bits pattern occurences inside a M bit vector to verify my simulation for M<=15.

Simulation results

M = 5, prob = 0.031612, HitCount = 31612
M = 6, prob = 0.046956, HitCount = 46956
M = 7, prob = 0.062553, HitCount = 62553
M = 8, prob = 0.077884, HitCount = 77884
M = 9, prob = 0.094238, HitCount = 94238
M = 10, prob = 0.109726, HitCount = 109726
M = 11, prob = 0.124648, HitCount = 124648
M = 12, prob = 0.139473, HitCount = 139473
M = 13, prob = 0.153750, HitCount = 153750
M = 14, prob = 0.168943, HitCount = 168943
M = 15, prob = 0.181898, HitCount = 181898
M = 16, prob = 0.196280, HitCount = 196280
M = 17, prob = 0.210253, HitCount = 210253
M = 18, prob = 0.223612, HitCount = 223612
M = 19, prob = 0.237499, HitCount = 237499
M = 20, prob = 0.249648, HitCount = 249648
M = 21, prob = 0.263080, HitCount = 263080
M = 22, prob = 0.274135, HitCount = 274135
M = 23, prob = 0.287609, HitCount = 287609
M = 24, prob = 0.299382, HitCount = 299382
M = 25, prob = 0.311449, HitCount = 311449
M = 26, prob = 0.323036, HitCount = 323036
M = 27, prob = 0.334595, HitCount = 334595
M = 28, prob = 0.346705, HitCount = 346705
M = 29, prob = 0.356554, HitCount = 356554
M = 30, prob = 0.368210, HitCount = 368210

where $K=10^6$ is the number of trials used in each simulation. The probability is simply $p= HitCount * (\frac{1}{2})^K$, where $HitCount$ is the number of times that a pattern is found in a vector.

Formula results

(obtained with user @Logophobic formula)

$1-(1-\frac{1}{2^N})^{M-N+1}$, with $N=5$

Formula: M = 5, prob = 0.031250
Formula: M = 6, prob = 0.061523
Formula: M = 7, prob = 0.090851
Formula: M = 8, prob = 0.119262
Formula: M = 9, prob = 0.146785
Formula: M = 10, prob = 0.173448
Formula: M = 11, prob = 0.199278
Formula: M = 12, prob = 0.224300
Formula: M = 13, prob = 0.248541
Formula: M = 14, prob = 0.272024
Formula: M = 15, prob = 0.294773
Formula: M = 16, prob = 0.316811
Formula: M = 17, prob = 0.338161
Formula: M = 18, prob = 0.358844
Formula: M = 19, prob = 0.378880
Formula: M = 20, prob = 0.398290
Formula: M = 21, prob = 0.417093
Formula: M = 22, prob = 0.435309
Formula: M = 23, prob = 0.452956
Formula: M = 24, prob = 0.470051
Formula: M = 25, prob = 0.486612
Formula: M = 26, prob = 0.502655
Formula: M = 27, prob = 0.518197
Formula: M = 28, prob = 0.533253
Formula: M = 29, prob = 0.547839
Formula: M = 30, prob = 0.561969

So I would like to ask if someone has a formula that matches my simulation results (as I am confident that it is the correct answer for N=5). I also compared my simulations with some brute force simulation to try all possible combinations for M <=18 (larger values are impraticable to simulate).

Brute force

M = 5, Brute Force prob = 0.031250, HitCount = 1
M = 6, Brute Force prob = 0.046875, HitCount = 3
M = 7, Brute Force prob = 0.062500, HitCount = 8
M = 8, Brute Force prob = 0.078125, HitCount = 20
M = 9, Brute Force prob = 0.093750, HitCount = 48
M = 10, Brute Force prob = 0.109375, HitCount = 112
M = 11, Brute Force prob = 0.124512, HitCount = 255
M = 12, Brute Force prob = 0.139404, HitCount = 571
M = 13, Brute Force prob = 0.154053, HitCount = 1262
M = 14, Brute Force prob = 0.168457, HitCount = 2760
M = 15, Brute Force prob = 0.182617, HitCount = 5984
M = 16, Brute Force prob = 0.196533, HitCount = 12880
M = 17, Brute Force prob = 0.210213, HitCount = 27553
M = 18, Brute Force prob = 0.223660, HitCount = 58631

To confirm this results $prob = HitCount*p^M$ where $p=1/2$ (probability of a single bit is uniform).

All simulations are made considering N=5.

Some graphs:

Hello @brain_toasted I've managed to evaluate this eventually on my own and forgot about this post. You are right, there is some mistake in the formula which I've noticed too. I've managed to compute eventually what I wanted initially: "What is the expected length of the longest common substring for two strings of alphabet $q$ and length $n$."
I've managed to evaluate the asymptotic bounds to this:

$log_q(n) < r \leq 2 log_q(n)$

However the solution probably is $r = 2 log_q(n)$. You can derive the probability from this equation. Try plotting it against your data. — Ilhan, Jun 01 '17 at 16:42
I would like to point out that my previous comment was towards your second image. — Ilhan, Jun 01 '17 at 17:23
Hi thanks for your feedback. I think that your answer is a bit different from your original post. You here are saying that you wanted to find the expected length of the longest common substring for two strings, which is different in finding the probability of having some defined bit pattern of size N inside a bitstream of size M (where all bits have the same probability 0.5). — Toaster, Jun 01 '17 at 17:28
Yes the question is different. I never managed to find the answer to this exact question to this date. Your assumptions of the equation from the previous answer being wrong are right though, but I can't seem to find the catch my self either. I accepted your answer as it is more correct than the previous. — Ilhan, Jun 01 '17 at 17:33
Your answer helps anyway as I don't need to review my simulations.
Now I am trying another path to solve this problem. I readed something about the "Distribution Of Longest Run" in which I think may have a clue about the solution of this problem (this may be a recursive series problem). Thanks again @ilhan for your feedback. — Toaster, Jun 01 '17 at 17:40
Now I understood what you said with this document: https://math.stackexchange.com/questions/439557/maximum-run-of-zeros-in-a-n-bit-binary-string?rq=1 — Toaster, Jun 01 '17 at 18:20

Probability of Binary Word Matching

2 Answers2

Simulation results

Formula results

Brute force