
I want to find the probability that a specific substring occurs in a string of random characters.

To simplify the question, use numbers. 5 digits are drawn independently and at random from 1 to 5, so the result can be 12345, 11234, 11111, etc. What is the probability that there are two 1s in a row? Matching cases include 11234, 31112, 11111, 11211, etc.

I think this would be $$(1/5)^{2}\binom{4}{1}=0.16$$ that is, the probability of drawing two 1s, multiplied by the number of ways to place the pair among the 4 available positions.

Then I listed all the possible strings by computer and found that 421 of the 3125 strings match the condition, so the probability should be 0.13472.
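The enumeration mentioned here could look roughly like the following sketch. The function name `count_with_substring` is only illustrative, and brute force is feasible only because $5^5$ is small.

```python
from itertools import product

def count_with_substring(alphabet, length, pattern):
    """Count strings of the given length over `alphabet` that contain `pattern`."""
    return sum(
        pattern in "".join(word)
        for word in product(alphabet, repeat=length)
    )

total = 5 ** 5
hits = count_with_substring("12345", 5, "11")
print(hits, total)  # 421 3125
```

Dividing `hits` by `total` gives the value 0.13472 quoted above.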

How can this value be calculated in a way that also applies to other string lengths and longer substrings, such as finding "ABC" in a string of 32 random alphabetic characters?

drhab
  • What do you mean by "421 cases in a total of 3125 ways that match the condition"? What condition? – 5xum Sep 17 '19 at 09:23
  • @5xum The condition is "there are two 1s in a row". There are 3125 possible 5-digit words using the digits 1-5, and he has found that 421 of them have 11 in them somewhere. – Arthur Sep 17 '19 at 09:30
  • @5xum the condition that 11 is a substring of the string. – drhab Sep 17 '19 at 09:30

2 Answers


A direct solution for your first problem.


For $i=1,\dots,4$ let $A_{i}$ denote the set of strings with $1$ on the spots $i$ and $i+1$.

With inclusion/exclusion and symmetry we find:

$$\begin{align}\left|\bigcup_{i=1}^{4}A_{i}\right| &= 4\left|A_{1}\right|-3\left|A_{1}\cap A_{2}\right|-3\left|A_{1}\cap A_{3}\right|+2\left|A_{1}\cap A_{2}\cap A_{3}\right|+2\left|A_{1}\cap A_{2}\cap A_{4}\right|-\left|A_{1}\cap A_{2}\cap A_{3}\cap A_{4}\right| \\ &= 4\cdot5^{3}-3\cdot5^{2}-3\cdot5^{1}+2\cdot5^{1}+2\cdot5^{0}-1\cdot5^{0} \\ &= 421\end{align}$$

We must be careful here, especially when applying the symmetry.

Note e.g. that $|A_1\cap A_2|\neq|A_1\cap A_3|$.
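To see where the different intersection sizes come from, here is a small sketch that carries out the inclusion/exclusion mechanically. The function name and parameters are only illustrative. For every nonempty set of events it collects the positions forced to hold the symbol $1$, and the remaining positions are free.

```python
from itertools import combinations

def count_by_inclusion_exclusion(n=5, q=5, run=2):
    """Count length-n strings over q symbols that contain `run`
    consecutive copies of one fixed symbol."""
    starts = range(n - run + 1)  # 0-based start positions, one event per start
    total = 0
    for size in range(1, len(starts) + 1):
        for subset in combinations(starts, size):
            # Positions forced to hold the fixed symbol by the chosen events.
            forced = set()
            for start in subset:
                forced.update(range(start, start + run))
            free = n - len(forced)
            total += (-1) ** (size + 1) * q ** free
    return total

print(count_by_inclusion_exclusion())  # 421
```

For $A_1\cap A_2$ three positions are forced, leaving $5^2$ strings, while for $A_1\cap A_3$ four positions are forced, leaving $5^1$.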

How difficult this becomes depends very much on the particular problem at hand.
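For longer strings and patterns, such as "ABC" in 32 random letters, listing the intersections by hand quickly becomes tedious. One possible alternative, which is not the inclusion/exclusion argument above, is a dynamic program over how much of the pattern is currently matched (the idea behind pattern-matching automata). The sketch below assumes independent, uniformly distributed characters and a nonempty pattern over the given alphabet. The names `prob_contains` and `next_state` are only illustrative.

```python
import string
from fractions import Fraction

def prob_contains(pattern, alphabet, n):
    """Probability that a uniform random string of length n over
    `alphabet` contains `pattern` (assumed nonempty) as a substring."""
    k = len(pattern)
    q = len(alphabet)

    def next_state(state, ch):
        # Longest prefix of `pattern` that is a suffix of (matched prefix + ch).
        s = pattern[:state] + ch
        for length in range(min(k, len(s)), -1, -1):
            if s.endswith(pattern[:length]):
                return length
        return 0

    # counts[j]: strings read so far that have not matched yet and whose
    # longest suffix matching a prefix of `pattern` has length j.
    counts = [0] * k
    counts[0] = 1
    for _ in range(n):
        new = [0] * k
        for state, c in enumerate(counts):
            if c == 0:
                continue
            for ch in alphabet:
                t = next_state(state, ch)
                if t < k:  # reaching state k means a match, so drop it
                    new[t] += c
        counts = new

    avoiding = sum(counts)
    return 1 - Fraction(avoiding, q ** n)

print(prob_contains("11", "12345", 5))  # 421/3125
p = prob_contains("ABC", string.ascii_uppercase, 32)
print(float(p))
```

The first call reproduces the count $421$ found above, and the second applies the same routine to the "ABC" case from the question.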

drhab

The general problem can be formulated as follows:

among the strings (words) of length $n$ over the alphabet $\{c_1,c_2, \ldots, c_q \}$, how many contain only runs of at most $r$ consecutive appearances of a given character (e.g. $c_1$)?

A possible solution goes through the following steps

a) First note that it is more convenient to start with the cumulative number up to $r$.

b) The total number of words under consideration is clearly $q^n$.
We partition them into those containing in total $s$ characters $c_1$ and $n-s$ different from $c_1$. $$ q^{\,n} = \sum\limits_{0\, \le \,s\, \le \,n} {\left( \matrix{ n \cr s \cr} \right)1^{\,s} \left( {q - 1} \right)^{\,n - s} } $$

c) Within each class of the partition above, take the words with a given fixed position of the $s$ characters $c_1$.
These will correspond to a binary string: $c_1 \to 1, \; \text{others} \to 0$.
Each binary string will correspond to $\left( {q - 1} \right)^{\,n - s}$ words.
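Steps b) and c) can be checked on a small case. The sketch below uses the arbitrary choice $q=3$, $n=4$ (not taken from the text above), groups all words by their binary string, and confirms that each binary string with $n-s$ zeros collects $(q-1)^{n-s}$ words.

```python
from collections import Counter
from itertools import product

q, n = 3, 4                  # small illustrative values
alphabet = range(1, q + 1)   # the tracked character c_1 is represented by 1

# Map every word to its binary string: c_1 -> 1, others -> 0.
words_per_mask = Counter(
    tuple(1 if ch == 1 else 0 for ch in word)
    for word in product(alphabet, repeat=n)
)

for mask, count in words_per_mask.items():
    zeros = mask.count(0)
    assert count == (q - 1) ** zeros

# The classes together account for all q^n words.
assert sum(words_per_mask.values()) == q ** n
```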

d) In this related post it is explained that the number of binary strings with $s$ ones and $m$ zeros in total, in which every run of consecutive ones has length at most $r$, is given by $$ N_b (s,r,m + 1)\quad \left| {\;0 \leqslant \text{integers }s,m,r} \right.\quad = \sum\limits_{\left( {0\, \leqslant } \right)\,\,k\,\,\left( { \leqslant \,\frac{s}{r+1}\, \leqslant \,m + 1} \right)} { \left( { - 1} \right)^k \binom{m+1}{k} \binom {s + m - k\left( {r + 1} \right) }{s - k\left( {r + 1} \right) } } $$

Also have a look at this other one, and at the links provided therein, for an overview of the subject.
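As a sanity check of the formula for $N_b$, the following sketch evaluates the sum and compares it with a direct enumeration for small parameters. The function names are illustrative. The third argument is the number of bins $m+1$, as in the notation $N_b(s,r,m+1)$.

```python
from itertools import combinations
from math import comb

def n_b(s, r, bins):
    """Binary strings with s ones and bins - 1 zeros in which every
    run of ones has length at most r (closed-form sum)."""
    m = bins - 1
    total = 0
    k = 0
    while k <= bins and k * (r + 1) <= s:
        rest = s - k * (r + 1)
        total += (-1) ** k * comb(bins, k) * comb(rest + m, rest)
        k += 1
    return total

def n_b_brute(s, r, bins):
    """Same count by listing every placement of the ones."""
    m = bins - 1
    count = 0
    for ones in combinations(range(s + m), s):
        bits = ["0"] * (s + m)
        for i in ones:
            bits[i] = "1"
        runs = "".join(bits).split("0")
        if max(len(run) for run in runs) <= r:
            count += 1
    return count

print(n_b(2, 1, 4), n_b_brute(2, 1, 4))  # 6 6
```

The printed case has $s=2$ ones and $m=3$ zeros with no two adjacent ones, which gives $\binom{4}{2}=6$ strings.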

Conclusion

We conclude that the

probability of having up to $r$ consecutive appearances of a given character in a string of length $n$ from an alphabet with $q$ characters

is $$ \bbox[lightyellow] { \eqalign{ & P(c \le r\;\left| {\,n,q} \right.) = {1 \over {q^{\,n} }}\sum\limits_{0\, \le \,s\, \le \,n} {\left( {q - 1} \right)^{\,n - s} N_b (s,r,n - s + 1)} = \cr & = {1 \over {q^{\,n} }}\sum\limits_{0\, \le \,m\, \le \,n} {\left( {q - 1} \right)^{\,m} N_b (n - m,r,m + 1)} = \cr & = {1 \over {q^{\,n} }}\sum\limits_{0\, \le \,m\, \le \,n} {\;\sum\limits_{\left( {0\, \le } \right)\,\,k\,\,\left( { \le \,{{n - m} \over {r + 1}}\, \le \,m + 1} \right)} {\left( {q - 1} \right)^{\,m} \left( { - 1} \right)^k \binom{m + 1}{k} \binom{n - k\left( {r + 1} \right)}{n - m - k\left( {r + 1} \right)}} } \cr} }$$

which just says that, starting from a binary word with a definite positioning of the ones, each zero can be replaced by any of the remaining $q-1$ characters.

Example:

Taking
digits $\{1,2, \cdots, 5\}$, that is $q=5$,
number of extractions (string length) $n =5$,
total number of strings $q^n=3125$,
we get the following table

$$ \begin{array}{*{20}c} r &| & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline {q^{\,n} P(c \le r)} &| & {1024} & {2704} & {3060} & {3116} & {3124} & {3125} \\ \end{array} $$
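The table can be reproduced with a few lines of code. This is only a sketch of the sum over $m$ (number of zeros) and $k$, with an illustrative function name. It returns the count $q^{\,n}P(c \le r)$ rather than the probability itself.

```python
from math import comb

def words_with_max_run_at_most(r, n, q):
    """Number of length-n words over q characters in which the chosen
    character never appears more than r times in a row."""
    total = 0
    for m in range(n + 1):           # m = number of zeros in the binary word
        s = n - m                    # s = number of ones
        inner = 0
        k = 0
        while k <= m + 1 and k * (r + 1) <= s:
            inner += ((-1) ** k * comb(m + 1, k)
                      * comb(n - k * (r + 1), s - k * (r + 1)))
            k += 1
        total += (q - 1) ** m * inner
    return total

print([words_with_max_run_at_most(r, 5, 5) for r in range(6)])
# [1024, 2704, 3060, 3116, 3124, 3125]
```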

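Of course, since this is a cumulative table, to get, for instance, $P(r\le c)$ you compute it as $1-P(c \le r-1)$, while the probability that the maximum run of consecutive appearances of the chosen character is exactly $r$ is given by $P(c \le r)-P(c \le r-1)$. For example, the number of strings with at least two consecutive $1$s is $3125-2704=421$, the count found in the question.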
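As a small illustration of these two conversions, the sketch below starts from the cumulative counts in the table. The helper names `at_least` and `exactly` are only illustrative.

```python
# Cumulative counts q^n * P(c <= r) for r = 0..5, copied from the table.
cumulative = [1024, 2704, 3060, 3116, 3124, 3125]
total = 5 ** 5

def at_least(r):
    """Number of strings whose longest run has length >= r."""
    return total if r == 0 else total - cumulative[r - 1]

def exactly(r):
    """Number of strings whose longest run has length exactly r."""
    return cumulative[r] - (cumulative[r - 1] if r > 0 else 0)

print(at_least(2))                     # 421
print([exactly(r) for r in range(6)])  # [1024, 1680, 356, 56, 8, 1]
```

Dividing any of these counts by $q^n=3125$ gives the corresponding probability.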

G Cab
  • I would like to get at least r consecutive appearances, not up to r. So if r is 4, then 1 to 3 appearances should not count. – youmu_i19 Sep 20 '19 at 05:51
  • @youmu_i19: once you have the number of words "up to r", the total number of words minus that gives the number of words with "at least r+1", and "up to r" minus "up to r-1" gives the number "with exactly r", etc. – G Cab Sep 20 '19 at 09:51
  • @youmu_i19: I expanded the answer, also adding an example. – G Cab Sep 20 '19 at 19:16
  • Does m mean the number of characters that are not the selected one, i.e. the remaining q−1 characters? – youmu_i19 Sep 23 '19 at 04:56
  • @youmu_i19: with $m$ I denoted the number of zeros in the binary string, where of course $s+m=n$. Taking $n$ fixed, you can work in $(n,s)$ or $(n,m)$. Each zero in the binary word can be replaced by any of the remaining $q-1$ characters in the corresponding $q$-character string. – G Cab Sep 23 '19 at 07:02