I am studying constrained coding for composite DNA (based on arXiv:2501.10645), where each composite symbol represents a mixture of nucleotides. To enforce a maximum runlength $\ell$, they model valid sequences using a transition matrix over states of length $\ell$. However, the state space grows exponentially as $\left | \Sigma \right |^{\ell}$, where $\Sigma$ is the composite alphabet.
Runlength-limited (RLL) codes are used in DNA-based storage to prevent long homopolymer runs, which lead to sequencing and synthesis errors. An RLL constraint with parameter $\ell$, as above, restricts the longest run within a sequence to length at most $\ell$. This limits ambiguous signal regions in technologies like Illumina or Nanopore, reducing indel errors without affecting the chemical stability of the DNA. The capacity computed below is the asymptotic growth rate (analogous to Shannon capacity) of the number of allowed sequences.
So, I worked through two examples that show how this construction becomes impractical for large $\ell$.
Case $\ell= 1$:
$\bullet$ Alphabet: $\Sigma_{1}= \left\{ \texttt{A}, \texttt{T}, \texttt{G}, \texttt{C}, \texttt{M} \right\}$ with $\texttt{M}= \texttt{A}/\texttt{C}$ mix (as in the given paper).
$\bullet$ Forbidden substrings: all runs of length $2$, where $\texttt{M}$ counts as extending a run of $\texttt{A}$ or of $\texttt{C}$, i.e., $\mathcal{F}= \left\{ \texttt{A}^{2}, \texttt{T}^{2}, \texttt{C}^{2}, \texttt{G}^{2}, \texttt{M}^{2}, \texttt{AM}, \texttt{MA}, \texttt{CM}, \texttt{MC} \right\}$.
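$\bullet$ Transition matrix: $5\times 5$ adjacency matrix, with rows and columns ordered $\texttt{A}, \texttt{T}, \texttt{C}, \texttt{G}, \texttt{M}$ (the order that matches $\mathcal{F}$ and the code further down):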
$$\mathcal{A}= \begin{bmatrix} 0 & 1 & 1 & 1 & 0\\ 1 & 0 & 1 & 1 & 1\\ 1 & 1 & 0 & 1 & 0\\ 1 & 1 & 1 & 0 & 1\\ 0 & 1 & 0 & 1 & 0\\ \end{bmatrix}$$
$\phantom{\bullet}$ The entry $\left( \mathcal{A}^{n} \right)_{i, j}$ counts the number of valid paths of length $n$ from state $i$ to state $j$.
$\bullet$ Perron–Frobenius theorem: For large $n$, the growth of $\mathcal{A}^{n}$ is dominated by the largest real eigenvalue $\lambda_{\max}$. So the total number of $\ell$-RLL sequences of length $n$, $\left| \mathcal{C}_{n} \right|$, is approximately:
$$\left| \mathcal{C}_{n} \right|\approx c\lambda_{\max}^{n}\Rightarrow\mathbf{cap}_{\ell; \Sigma}= \lim\limits_{n\rightarrow\infty}\frac{\log_{2}\left| \mathcal{C}_{n} \right|}{n}= \log_{2}\lambda_{\max}\approx \log_{2}3.323\approx 1.733$$
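As a sanity check on the $\ell= 1$ numbers, here is a minimal sketch that counts valid sequences exactly with a small dynamic program and uses the ratio $\left| \mathcal{C}_{n+1} \right| / \left| \mathcal{C}_{n} \right|$ as an estimate of $\lambda_{\max}$. The helper name `count_sequences` and the choice $n= 40$ are mine, and I am assuming the rows and columns of $\mathcal{A}$ are ordered A, T, C, G, M.

```python
from math import log2

# Adjacency matrix from above; rows/columns assumed to be ordered A, T, C, G, M.
A = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0],
]


def count_sequences(n):
    """Exact number of valid sequences of length n (n >= 1)."""
    ends = [1] * 5  # ends[j] = number of valid sequences ending in symbol j
    for _ in range(n - 1):
        ends = [sum(ends[i] * A[i][j] for i in range(5)) for j in range(5)]
    return sum(ends)


# The ratio of consecutive counts converges to lambda_max.
ratio = count_sequences(41) / count_sequences(40)
print(f"|C_2| = {count_sequences(2)}")  # 25 pairs minus 9 forbidden = 16
print(f"lambda_max ~ {ratio:.3f}, capacity ~ {log2(ratio):.3f} bits/symbol")
```

The counts are exact Python integers, so the only approximation is truncating the limit at a finite $n$.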
Case $\ell= 2$:
$\bullet$ Alphabet: $\Sigma_{1}= \left\{ \texttt{A}, \texttt{T}, \texttt{G}, \texttt{C}, \texttt{M} \right\}$ with $\texttt{M}= \texttt{A}/\texttt{C}$ mix (as in the given paper).
$\bullet$ Forbidden substrings: Runs of length $3$ ($\mathcal{F} = \left \{ \texttt{AAA}, \texttt{TTT}, \texttt{CCC}, \texttt{GGG},
\texttt{AAM}, \texttt{AMA}, \texttt{MAA},
\texttt{CCM}, \texttt{CMC}, \texttt{MCC},
\texttt{MMM}, \texttt{AMM}, \texttt{MAM}, \texttt{MMA},
\texttt{CMM}, \texttt{MCM}, \texttt{MMC} \right \}$).
$\bullet$ Transition matrix: $5^{2}\times 5^{2}$ adjacency matrix.
$\bullet$ Perron–Frobenius eigenvalue (computed with the following code):
```python
import itertools

import numpy as np

alphabet = ['A', 'T', 'C', 'G', 'M']
l = 2
forbidden_substrings = [
    'AAA', 'TTT', 'CCC', 'GGG',
    'AAM', 'AMA', 'MAA', 'CCM',
    'CMC', 'MCC', 'MMM', 'AMM',
    'MAM', 'MMA', 'CMM', 'MCM', 'MMC'
]

# Adjacency matrix: states are the last l symbols; appending a symbol is
# allowed when the resulting (l+1)-window is not forbidden.
states = [''.join(p) for p in itertools.product(alphabet, repeat=l)]
state_to_idx = {s: i for i, s in enumerate(states)}
trans_matrix = np.zeros((len(states), len(states)), dtype=int)
for i, state in enumerate(states):
    for char in alphabet:
        new_seq = state + char
        if new_seq not in forbidden_substrings:
            new_state = new_seq[-l:]
            j = state_to_idx[new_state]
            trans_matrix[i, j] = 1

# Power iteration
v = np.ones(len(states))
for _ in range(1000):
    v = trans_matrix @ v
    v /= np.linalg.norm(v)
# v has unit norm, so the Rayleigh quotient v.(A v) estimates the eigenvalue
# (taking np.max(A v) would scale it by the largest entry of v).
lambda_max = v @ (trans_matrix @ v)
capacity = np.log2(lambda_max)
print(f"Capacity (ℓ=2): {capacity:.3f} bits/symbol")  # Capacity: 2.170
```
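The two forbidden lists above seem to follow one pattern: a window of length $\ell+1$ is forbidden when all of its symbols fall into a single "run class". Below is a rough sketch that generates the constraint for any $\ell$ under that reading and estimates $\lambda_{\max}$ by iterating a normalized count vector instead of storing the matrix. The run classes $\{\texttt{A},\texttt{M}\}$, $\{\texttt{C},\texttt{M}\}$, $\{\texttt{T}\}$, $\{\texttt{G}\}$ are my inference from the lists, not something I took from the paper, and the names `RUN_CLASSES`, `is_forbidden` and `growth_rate` are just illustrative.

```python
from itertools import product
from math import log2

ALPHABET = "ATCGM"
# Assumed run classes (inferred from the forbidden lists above):
# M, an A/C mix, extends runs of A and runs of C.
RUN_CLASSES = [set("AM"), set("CM"), set("T"), set("G")]


def is_forbidden(window):
    """True if every symbol of the window lies in a single run class."""
    return any(set(window) <= cls for cls in RUN_CLASSES)


def growth_rate(ell, steps=200):
    """Estimate lambda_max by power iteration on per-state sequence counts."""
    states = list(product(ALPHABET, repeat=ell))
    weights = {s: 1.0 / len(states) for s in states}  # normalized to sum 1
    rate = 0.0
    for _ in range(steps):
        new_weights = dict.fromkeys(states, 0.0)
        for state, w in weights.items():
            for ch in ALPHABET:
                window = state + (ch,)
                if not is_forbidden(window):
                    new_weights[window[1:]] += w
        rate = sum(new_weights.values())  # growth factor of this step
        weights = {s: w / rate for s, w in new_weights.items()}
    return rate


for ell in (1, 2):
    print(f"ell={ell}: states={5 ** ell}, capacity ~ {log2(growth_rate(ell)):.3f}")
```

For $\ell= 1$ and $\ell= 2$ this should reproduce the capacities quoted above (about 1.733 and 2.170). It still loops over all $\left| \Sigma \right|^{\ell}$ states, so it only avoids storing the matrix and does not remove the exponential growth I am asking about.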
Question: Is there an efficient way to compute the largest real eigenvalue (Perron–Frobenius eigenvalue) of the transition matrix for an $\ell$-RLL constraint without explicitly constructing the full $\left | \Sigma \right |^{\ell}\times\left | \Sigma \right |^{\ell}$ matrix?
In particular:
- Can this eigenvalue be estimated using methods such as the permanent, the characteristic polynomial, or other symbolic techniques (e.g., generating functions, spectral graph theory)?
- Among possible composite DNA alphabets (e.g., $\Sigma= \left\{ \texttt{A}, \texttt{T}, \texttt{G}, \texttt{C}, \texttt{M}, \texttt{K} \right\}$), which alphabet yields the highest capacity under an $\ell$-RLL constraint?
Thank you very much for your valuable insights. They will deepen my understanding and help advance my research career in this field.