
In other words, what is the likelihood that a recognizer of a given regular language will accept a uniformly random string of length $n$?

 

If there is only a single non-terminal $S$, then there are only two kinds of rules:

  1. Intermediate rules of the form $ S \to \sigma S $.
  2. Terminating rules of the form $ S \to \sigma $.

Such a grammar can then be rewritten in shorthand with exactly two rules, as follows:

$$\begin{cases} S \enspace \to \enspace \{\sigma, \tau, \dots\}\, S = Τ S\\ S \enspace \to \enspace \{\sigma', \tau', \dots\} = Τ'\\ \end{cases} \qquad (Τ, Τ' \subset \Sigma) $$

So we simply choose a symbol from $Τ$ (this is a capital Tau) at every position except the last one, which we choose from $Τ'$. The likelihood in question is therefore:

$$ d = \frac {\lvert Τ\rvert^{n - 1} \lvert Τ' \rvert} {\lvert\Sigma\rvert^n} $$

I will call an instance of such a language $L_1$.
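
To make this concrete, here is a quick Python check of an example $L_1$. The alphabet and the sets $Τ$ and $Τ'$ below are arbitrary made-up examples; the sketch compares the formula against brute-force enumeration:

```python
from itertools import product

# Example instance of L1: the first n-1 symbols come from T, the last from T_prime.
# These particular sets are arbitrary illustrations, not part of the question.
sigma = {"a", "b", "c"}        # the alphabet
T = {"a", "b"}                 # symbols allowed by the intermediate rules  S -> s S
T_prime = {"c", "a"}           # symbols allowed by the terminating rules   S -> s

def in_L1(word):
    """Membership test for the example L1: loop on T, terminate on T_prime."""
    return all(s in T for s in word[:-1]) and word[-1] in T_prime

n = 6
accepted = sum(1 for w in product(sigma, repeat=n) if in_L1(w))
formula = len(T) ** (n - 1) * len(T_prime)

print(accepted, formula)              # both are 2**5 * 2 = 64
print(accepted / len(sigma) ** n)     # the density d
```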

 

If there are two non-terminals, the palette widens:

  1. Looping rules of the form $ S \to \sigma S $.
  2. Alternating rules of the form $ S \to \sigma A $.
  3. Terminating rules of the form $ S \to \sigma $.
  4. Looping rules of the form $ A \to \sigma A $.
  5. Alternating rules of the form $ A \to \sigma S $.
  6. Terminating rules of the form $ A \to \sigma $.

In shorthand: $$\begin{cases} S \enspace \to \enspace Τ_{SS}\, S \\ S \enspace \to \enspace Τ_{SA}\, A \\ S \enspace \to \enspace Τ_{S\epsilon} \\ A \enspace \to \enspace Τ_{AA}\, A \\ A \enspace \to \enspace Τ_{AS}\, S \\ A \enspace \to \enspace Τ_{A\epsilon} \\ \end{cases} \qquad (Τ_{SS}, Τ_{SA}, Τ_{S\epsilon}, Τ_{AA}, Τ_{AS}, Τ_{A\epsilon} \subset \Sigma) $$

Happily, we may deconstruct this complicated language into words of simpler languages of the $L_1$ kind, by pairing each looping rule with either an alternating or a terminating shorthand rule. This gives us four languages that I will intuitively denote $L_{1S}, L_{1S\epsilon}, L_{1A}, L_{1A\epsilon}$. I will also write $L^n$ to mean the set of all sentences of $L$ that are exactly $n$ symbols long.
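
Applying the $L_1$ count to each of these (assuming $L_{1S}$ pairs the looping set $Τ_{SS}$ with the alternating set $Τ_{SA}$, $L_{1S\epsilon}$ pairs $Τ_{SS}$ with $Τ_{S\epsilon}$, and likewise for the $A$ versions), the sizes at length $m$ come out as:

$$ \lvert L_{1S}^{m} \rvert = \lvert Τ_{SS}\rvert^{m - 1} \lvert Τ_{SA}\rvert, \quad \lvert L_{1S\epsilon}^{m} \rvert = \lvert Τ_{SS}\rvert^{m - 1} \lvert Τ_{S\epsilon}\rvert, \quad \lvert L_{1A}^{m} \rvert = \lvert Τ_{AA}\rvert^{m - 1} \lvert Τ_{AS}\rvert, \quad \lvert L_{1A\epsilon}^{m} \rvert = \lvert Τ_{AA}\rvert^{m - 1} \lvert Τ_{A\epsilon}\rvert $$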

So, the sentences of this present language (let us call it $L_2$) consist of $k$ alternating words of $L_{1S}$ and $L_{1A}$, of lengths $m_1, \dots, m_k$ with $\sum_{i = 1}^{k} m_i = n$, starting with a word of $L_{1S}^{m_1}$ and ending on a word of $L_{1S\epsilon}^{m_k}$ if $k$ is odd, or of $L_{1A\epsilon}^{m_k}$ if $k$ is even.

To compute the number of such sentences, we may start with the set $\{P\}$ of integer partitions of $n$, then from each partition $P = \langle m_1\dots m_k \rangle$ compute the following numbers:

  1. The number $p = \binom{k}{q_1,\, q_2,\, \dots}$ of distinct permutations of the constituent words, where $Q = \langle q_1, q_2, \dots \rangle$ counts how many times each distinct integer occurs in $P$. For instance, for $n = 5$ and $P = \langle 2, 2, 1 \rangle$, $Q = \langle 2, 1 \rangle$ and $p = \frac{3!}{2! \times 1!} = 3$.

  2. The product $r$ of the numbers of words of each length $m_i \in P$, given that the first word comes from $L_{1S}$, the second from $L_{1A}$, and so on (and accounting for the last word being of a slightly different form):

    $$ r = \prod_{\substack{i < k \\ i\ \text{odd}}}\lvert L_{1S}^{m_i} \rvert \times \prod_{\substack{i < k \\ i\ \text{even}}}\lvert L_{1A}^{m_i} \rvert \times \begin{cases} \lvert L_{1S\epsilon}^{m_k} \rvert &\text{if $k$ is odd}\\ \lvert L_{1A\epsilon}^{m_k} \rvert &\text{if $k$ is even}\\ \end{cases} $$

If my thinking is right, the sum of $p \times r$ over the partitions of $n$ is the number of sentences of $L_2$ of length $n$, but this is a bit difficult for me.
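
As a sanity check, here is a rough brute-force counter in Python. The symbol sets are again arbitrary made-up examples, not derived from anything above; it counts $\lvert L_2^n \rvert$ directly by simulating the shorthand rules, so the partition sum can be compared against its output:

```python
from itertools import product

# Arbitrary example instance of L2 over a three-letter alphabet.
sigma = {"a", "b", "c"}
T_SS, T_SA, T_Se = {"a"}, {"b"}, {"c"}      # rules with S on the left
T_AA, T_AS, T_Ae = {"b"}, {"c"}, {"a"}      # rules with A on the left

def in_L2(word):
    """Simulate the grammar: track which non-terminals can still be expanded."""
    reachable = {"S"}
    for i, s in enumerate(word):
        last = (i == len(word) - 1)
        nxt = set()
        for nt in reachable:
            loops, alts, term = (
                (T_SS, T_SA, T_Se) if nt == "S" else (T_AA, T_AS, T_Ae)
            )
            if last and s in term:
                return True            # a terminating rule finishes the word
            if s in loops:
                nxt.add(nt)            # looping rule keeps the same non-terminal
            if s in alts:
                nxt.add("A" if nt == "S" else "S")
        reachable = nxt
    return False

n = 7
count = sum(1 for w in product(sigma, repeat=n) if in_L2(w))
print(count, count / len(sigma) ** n)   # |L2^n| and the corresponding likelihood
```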

 

My questions:

  • Is this the right way of thinking?
  • Can it be carried onwards to regular grammars of any complexity?
  • Is there a simpler way?
  • Is there prior art on this topic?
Ignat Insarov

1 Answer


If I understand the question, your problem is the following:

Given a regular language $L$ over alphabet $\Sigma$ and a positive integer $n$, compute the probability that a word chosen uniformly at random from $\Sigma^n$ will be in $L$.

That's equivalent to computing $|L \cap \Sigma^n|$, i.e., the cardinality of the language $L \cap \Sigma^n$, and then dividing by $|\Sigma|^n$. Note that $L \cap \Sigma^n$ is regular, so your problem amounts to counting the number of words in a finite regular language. This is a standard, well-studied problem; see https://cstheory.stackexchange.com/q/8200/5038, https://cstheory.stackexchange.com/q/32473/5038, Why isn't it simple to count the number of words in a regular language?, and Counting the number of words accepted by an acyclic NFA. Here are some results:

  • If the language $L$ is specified as a DFA or as an unambiguous regexp, then the problem can be solved in polynomial time (a small dynamic-programming sketch for the DFA case follows this list).

  • If the language $L$ is specified as an NFA and $n$ is specified in unary, the problem is $\#P$-complete. Thus, there is an exponential-time algorithm, but you should not expect a polynomial-time algorithm.

  • If the language $L$ is specified as an NFA and $n$ is specified in binary, the problem is $PSPACE$-complete. Thus, you should not expect a polynomial-time algorithm.
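
For the DFA case, the count is a single dynamic-programming pass over the transition table. Here is a minimal Python sketch; the example DFA is made up (it accepts strings over $\{a, b\}$ with an even number of $a$'s):

```python
from collections import Counter

def count_accepted(delta, start, accepting, n):
    """Count the words of length n accepted by a DFA.

    delta maps (state, symbol) -> state.  counts[q] holds the number of
    length-i words that drive the DFA from the start state to state q.
    """
    counts = Counter({start: 1})
    for _ in range(n):
        nxt = Counter()
        for (state, symbol), target in delta.items():
            nxt[target] += counts[state]
        counts = nxt
    return sum(counts[q] for q in accepting)

# Made-up example DFA: states 0 and 1 track the parity of the number of 'a's.
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
n = 10
accepted = count_accepted(delta, start=0, accepting={0}, n=n)
print(accepted, accepted / 2 ** n)   # 512 words out of 1024, probability 0.5
```

This runs in $O(n \cdot |Q| \cdot |\Sigma|)$ time, in line with the polynomial bound above.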

There are approximation algorithms that you might be able to use in practice to estimate the probability you're seeking (using a SAT solver as a subroutine).

D.W.