
In other words, what is the likelihood that a recognizer of a given regular language will accept a uniformly random string of length $n$?

 

If there is only a single non-terminal $S$, then there are only two kinds of rules:

  1. Intermediate rules of the form $ S \to \sigma S $.
  2. Terminating rules of the form $ S \to \sigma $.

Such a grammar can then be rewritten in shorthand with exactly two rules, as follows:

$$\begin{cases} S \enspace \to \enspace \{\sigma, \tau, \dots\}\, S = Τ S\\ S \enspace \to \enspace \{\sigma', \tau', \dots\} = Τ'\\ \end{cases} \qquad (Τ, Τ' \subset \Sigma) $$

So we simply choose a symbol from $Τ$ (this is a capital Tau) at every position except the last one, which we choose from $Τ'$. The likelihood in question is therefore:

$$ d = \frac {\lvert Τ\rvert^{n - 1} \lvert Τ' \rvert} {\lvert\Sigma\rvert^n} $$

I will call an instance of such a language $L_1$.
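
To make this concrete, here is a quick Python check of an example $L_1$. The alphabet and the sets $Τ$ and $Τ'$ below are arbitrary made-up examples; the sketch compares the formula against brute-force enumeration:

```python
from itertools import product

# Example instance of L1: the first n-1 symbols come from T, the last from T_prime.
# These particular sets are arbitrary illustrations, not part of the question.
sigma = {"a", "b", "c"}        # the alphabet
T = {"a", "b"}                 # symbols allowed by the intermediate rules  S -> s S
T_prime = {"c", "a"}           # symbols allowed by the terminating rules   S -> s

def in_L1(word):
    """Membership test for the example L1: loop on T, terminate on T_prime."""
    return all(s in T for s in word[:-1]) and word[-1] in T_prime

n = 6
accepted = sum(1 for w in product(sigma, repeat=n) if in_L1(w))
formula = len(T) ** (n - 1) * len(T_prime)

print(accepted, formula)              # both are 2**5 * 2 = 64
print(accepted / len(sigma) ** n)     # the density d
```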

 

If there are two non-terminals, the palette widens:

  1. Looping rules of the form $ S \to \sigma S $.
  2. Alternating rules of the form $ S \to \sigma A $.
  3. Terminating rules of the form $ S \to \sigma $.
  4. Looping rules of the form $ A \to \sigma A $.
  5. Alternating rules of the form $ A \to \sigma S $.
  6. Terminating rules of the form $ A \to \sigma $.

In shorthand: $$\begin{cases} S \enspace \to \enspace Τ_{SS}\, S \\ S \enspace \to \enspace Τ_{SA}\, A \\ S \enspace \to \enspace Τ_{S\epsilon} \\ A \enspace \to \enspace Τ_{AA}\, A \\ A \enspace \to \enspace Τ_{AS}\, S \\ A \enspace \to \enspace Τ_{A\epsilon} \\ \end{cases} \qquad (Τ_{SS}, Τ_{SA}, Τ_{S\epsilon}, Τ_{AA}, Τ_{AS}, Τ_{A\epsilon} \subset \Sigma) $$

Happily, we may deconstruct this complicated language into words of simpler languages of the $L_1$ kind, by pairing each looping rule with either an alternating or a terminating shorthand rule. This gives us four languages that I will intuitively denote $L_{1S}, L_{1S\epsilon}, L_{1A}, L_{1A\epsilon}$. I will also write $L^n$ to mean the set of all sentences of $L$ that are exactly $n$ symbols long.
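
Applying the $L_1$ count to each of these (assuming $L_{1S}$ pairs the looping set $Τ_{SS}$ with the alternating set $Τ_{SA}$, $L_{1S\epsilon}$ pairs $Τ_{SS}$ with $Τ_{S\epsilon}$, and likewise for the $A$ versions), the sizes at length $m$ come out as:

$$ \lvert L_{1S}^{m} \rvert = \lvert Τ_{SS}\rvert^{m - 1} \lvert Τ_{SA}\rvert, \quad \lvert L_{1S\epsilon}^{m} \rvert = \lvert Τ_{SS}\rvert^{m - 1} \lvert Τ_{S\epsilon}\rvert, \quad \lvert L_{1A}^{m} \rvert = \lvert Τ_{AA}\rvert^{m - 1} \lvert Τ_{AS}\rvert, \quad \lvert L_{1A\epsilon}^{m} \rvert = \lvert Τ_{AA}\rvert^{m - 1} \lvert Τ_{A\epsilon}\rvert $$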

So, the sentences of this present language (let us call it $L_2$) consist of $k$ alternating words of $L_{1S}$ and $L_{1A}$, of lengths $m_1, \dots, m_k$ with $\sum_{i = 1}^{k} m_i = n$, starting with a word of $L_{1S}^{m_1}$ and ending on a word of $L_{1S\epsilon}^{m_k}$ if $k$ is odd, or of $L_{1A\epsilon}^{m_k}$ if $k$ is even.

To compute the number of such sentences, we may start with the set $\{P\}$ of integer partitions of $n$, then from each partition $P = \langle m_1\dots m_k \rangle$ compute the following numbers:

  1. The number $p = \binom{k}{q_1,\, q_2,\, \dots}$ of distinct permutations of the constituent words, where $Q = \langle q_1, q_2, \dots \rangle$ counts how many times each distinct integer occurs in $P$. For instance, for $n = 5$ and $P = \langle 2, 2, 1 \rangle$, $Q = \langle 2, 1 \rangle$ and $p = \frac{3!}{2! \times 1!} = 3$.

  2. The product $r$ of the numbers of words of each length $m_i \in P$, given that the first word comes from $L_{1S}$, the second from $L_{1A}$, and so on (and accounting for the last word being of a slightly different form):

    $$ r = \prod_{\substack{i < k \\ i\ \text{odd}}}\lvert L_{1S}^{m_i} \rvert \times \prod_{\substack{i < k \\ i\ \text{even}}}\lvert L_{1A}^{m_i} \rvert \times \begin{cases} \lvert L_{1S\epsilon}^{m_k} \rvert &\text{if $k$ is odd}\\ \lvert L_{1A\epsilon}^{m_k} \rvert &\text{if $k$ is even}\\ \end{cases} $$

If my thinking is right, the sum of $p \times r$ over the partitions of $n$ is the number of sentences of $L_2$ of length $n$, but this is a bit difficult for me.
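
As a sanity check, here is a rough brute-force counter in Python. The symbol sets are again arbitrary made-up examples, not derived from anything above; it counts $\lvert L_2^n \rvert$ directly by simulating the shorthand rules, so the partition sum can be compared against its output:

```python
from itertools import product

# Arbitrary example instance of L2 over a three-letter alphabet.
sigma = {"a", "b", "c"}
T_SS, T_SA, T_Se = {"a"}, {"b"}, {"c"}      # rules with S on the left
T_AA, T_AS, T_Ae = {"b"}, {"c"}, {"a"}      # rules with A on the left

def in_L2(word):
    """Simulate the grammar: track which non-terminals can still be expanded."""
    reachable = {"S"}
    for i, s in enumerate(word):
        last = (i == len(word) - 1)
        nxt = set()
        for nt in reachable:
            loops, alts, term = (
                (T_SS, T_SA, T_Se) if nt == "S" else (T_AA, T_AS, T_Ae)
            )
            if last and s in term:
                return True            # a terminating rule finishes the word
            if s in loops:
                nxt.add(nt)            # looping rule keeps the same non-terminal
            if s in alts:
                nxt.add("A" if nt == "S" else "S")
        reachable = nxt
    return False

n = 7
count = sum(1 for w in product(sigma, repeat=n) if in_L2(w))
print(count, count / len(sigma) ** n)   # |L2^n| and the corresponding likelihood
```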

 

My questions:

  • Is this the right way of thinking?
  • Can it be carried onwards to regular grammars of any complexity?
  • Is there a simpler way?
  • Is there prior art on this topic?
Ignat Insarov

1 Answer


If I understand the question, your problem is the following:

Given a regular language $L$ over alphabet $\Sigma$ and a positive integer $n$, compute the probability that a word chosen uniformly at random from $\Sigma^n$ will be in $L$.

That's equivalent to computing $|L \cap \Sigma^n|$, i.e., the cardinality of the language $L \cap \Sigma^n$, and then dividing by $|\Sigma|^n$. Note that $L \cap \Sigma^n$ is regular, so your problem amounts to counting the number of words in a finite regular language. This is a standard, well-studied problem; see https://cstheory.stackexchange.com/q/8200/5038, https://cstheory.stackexchange.com/q/32473/5038, Why isn't it simple to count the number of words in a regular language?, and Counting the number of words accepted by an acyclic NFA. Here are some results:

  • If the language $L$ is specified as a DFA or as an unambiguous regexp, then the problem can be solved in polynomial time (a small dynamic-programming sketch for the DFA case follows this list).

  • If the language $L$ is specified as an NFA and $n$ is specified in unary, the problem is $\#P$-complete. Thus, there is an exponential-time algorithm, but you should not expect a polynomial-time algorithm.

  • If the language $L$ is specified as an NFA and $n$ is specified in binary, the problem is $PSPACE$-complete. Thus, you should not expect a polynomial-time algorithm.
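
For the DFA case, the count is a single dynamic-programming pass over the transition table. Here is a minimal Python sketch; the example DFA is made up (it accepts strings over $\{a, b\}$ with an even number of $a$'s):

```python
from collections import Counter

def count_accepted(delta, start, accepting, n):
    """Count the words of length n accepted by a DFA.

    delta maps (state, symbol) -> state.  counts[q] holds the number of
    length-i words that drive the DFA from the start state to state q.
    """
    counts = Counter({start: 1})
    for _ in range(n):
        nxt = Counter()
        for (state, symbol), target in delta.items():
            nxt[target] += counts[state]
        counts = nxt
    return sum(counts[q] for q in accepting)

# Made-up example DFA: states 0 and 1 track the parity of the number of 'a's.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
n = 10
accepted = count_accepted(delta, start=0, accepting={0}, n=n)
print(accepted, accepted / 2 ** n)   # 512 words out of 1024, probability 0.5
```

This runs in $O(n \cdot |Q| \cdot |\Sigma|)$ time, in line with the polynomial bound above.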

There are approximation algorithms that you might be able to use in practice to estimate the probability you're seeking (using a SAT solver as a subroutine).

D.W.