
Consider a binary string of length $n$. An edit operation is a single-character insertion, deletion, or substitution. Given a string $S$, my question concerns the number of distinct strings that can be made by performing a single edit operation on $S$.

Let us write $f(S)$ for the number of distinct strings that can be made by performing a single edit operation on $S$.

For example, if $S = 1111011010$, then $f(S) = 28$.
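For small cases the definition can be checked directly by brute force. Here is a minimal Python sketch; the names `single_edit_neighbors` and `f` are illustrative choices, not from any library, and the alphabet is assumed to be `"01"`.

```python
def single_edit_neighbors(s, alphabet="01"):
    """Return the set of distinct strings exactly one edit away from s."""
    results = set()
    n = len(s)
    # Substitutions: replace position i with a different character.
    for i in range(n):
        for c in alphabet:
            if c != s[i]:
                results.add(s[:i] + c + s[i + 1:])
    # Deletions: remove the character at position i.
    for i in range(n):
        results.add(s[:i] + s[i + 1:])
    # Insertions: put c in front of position i (i == n appends at the end).
    for i in range(n + 1):
        for c in alphabet:
            results.add(s[:i] + c + s[i:])
    return results


def f(s):
    """Number of distinct strings reachable from s by a single edit."""
    return len(single_edit_neighbors(s))


print(f("1111011010"))  # 28
```

Substitutions, deletions and insertions produce strings of lengths $n$, $n-1$ and $n+1$ respectively, so the three kinds of result never collide with each other; the set only removes duplicates within one kind.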

Let $X$ be a random variable representing a random binary string of length $n$, with the bits chosen uniformly and independently. My question is: what is

$$\mathbb{E}(f(X))\;?$$

2 Answers


Substitutions are easy – we get $n$ different substitution results.

For insertions and deletions, we need the expected number of changes between $0$ and $1$, that is, of positions where two adjacent bits differ. There are $n-1$ potential change locations, and each is a change with probability $\frac12$, so the expected number of changes is $\frac{n-1}2$. The number of runs is one more than the number of changes, so the expected number of runs is $\frac{n+1}2$.
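The run count can be sanity-checked by enumerating all $2^n$ strings for small $n$. This is only a sketch in Python; `count_runs` and `expected_runs` are hypothetical helper names, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import groupby, product


def count_runs(bits):
    """Number of maximal blocks of equal consecutive symbols."""
    return sum(1 for _ in groupby(bits))


def expected_runs(n):
    """Exact mean number of runs over all 2**n binary strings of length n."""
    total = sum(count_runs(bits) for bits in product("01", repeat=n))
    return Fraction(total, 2 ** n)


# Compare with (n + 1) / 2 for a few small lengths.
for n in range(1, 9):
    assert expected_runs(n) == Fraction(n + 1, 2)
```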

The result of a deletion is determined by the run in which we delete, so the expected number of deletion results is $\frac{n+1}2$.

We can count insertions separately according to whether they change the number of runs. If they don't, they just increment the length of some run, and we again expect $\frac{n+1}2$ of these. If they do increase the number of runs, that's because they insert the bit that differs from its neighbours at one of the $n+1$ insertion locations that isn't a change location (inside a run, or at either end of the string). We expect $\frac{n-1}2$ change locations, so we expect $n+1-\frac{n-1}2=\frac{n+3}2$ such insertions.

Thus, in total we have

$$ \mathbb E(f(X))=n+\frac{n+1}2+\frac{n+1}2+\frac{n+3}2=\frac52(n+1)\;. $$
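As a check on the closed form, the expectation can be computed exactly for small $n$ by averaging over every string. The following is a minimal, self-contained Python sketch; `f` and `expected_f` are illustrative names, and the brute-force definition of `f` is an assumption about how one might implement the count, not part of the argument above.

```python
from fractions import Fraction
from itertools import product


def f(s):
    """Count distinct strings one edit away from the binary string s."""
    out = set()
    for i in range(len(s)):
        flipped = "1" if s[i] == "0" else "0"
        out.add(s[:i] + flipped + s[i + 1:])  # substitution at position i
        out.add(s[:i] + s[i + 1:])            # deletion at position i
    for i in range(len(s) + 1):
        for c in "01":
            out.add(s[:i] + c + s[i:])        # insertion before position i
    return len(out)


def expected_f(n):
    """Exact value of E(f(X)) for a uniform random string of length n."""
    total = sum(f("".join(bits)) for bits in product("01", repeat=n))
    return Fraction(total, 2 ** n)


# Compare with 5(n + 1) / 2 for small lengths.
for n in range(1, 11):
    assert expected_f(n) == Fraction(5 * (n + 1), 2)
```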

joriki

It may be useful to know that for a random string of length $n$

  • It has $n$ characters
  • The expected number of groups of identical characters is $\frac{n+1}2$
  • The expected number of adjacent pairs of identical characters is $\frac{n-1}2$
  • The number of ends is $2$

So for different types of edits:

  • The number of possible substitutions is $n$
  • The expected number of shrinkages of a group of identical characters is $\frac{n+1}2$
  • The expected number of expansions of a group of identical characters with the same character is $\frac{n+1}2$
  • The expected number of insertions of a different character into a pair of identical characters is $\frac{n-1}2$
  • The number of possible insertions of a different character at the beginning or end is $2$

making the expected number of distinct results of a single edit $n+\frac{n+1}2+\frac{n+1}2+\frac{n-1}2+2 = \frac{5(n+1)}{2}$.
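The same breakdown also works string by string, before taking expectations. A rough Python sketch follows; `edit_counts_by_type` and the dictionary keys are made-up names for illustration, and a non-empty binary string is assumed.

```python
from itertools import groupby


def edit_counts_by_type(s):
    """Split the single-edit count for a non-empty binary string by edit type."""
    n = len(s)
    runs = sum(1 for _ in groupby(s))  # groups of identical characters
    equal_pairs = sum(1 for a, b in zip(s, s[1:]) if a == b)  # adjacent equal pairs
    return {
        "substitutions": n,
        "shrink_a_group": runs,
        "expand_a_group": runs,
        "split_an_identical_pair": equal_pairs,
        "different_character_at_an_end": 2,
    }


counts = edit_counts_by_type("1111011010")
print(counts)
print(sum(counts.values()))  # 28
```

For $S = 1111011010$ this gives $10 + 6 + 6 + 4 + 2 = 28$, which agrees with the value of $f(S)$ quoted in the question.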

Henry
  • Thank you for this beautifully clean solution. I find your solution a little easier to follow as well. –  Dec 21 '19 at 12:10