I am studying constrained coding for composite DNA (based on arXiv:2501.10645), where each composite symbol represents a mixture of nucleotides. To enforce a maximum runlength $\ell$, they model valid sequences using a transition matrix over states of length $\ell$. However, the state space grows exponentially as $\left | \Sigma \right |^{\ell}$, where $\Sigma$ is the composite alphabet.

Runlength-limited (RLL) codes are used in DNA-based storage to prevent long homopolymer runs, which lead to sequencing and synthesis errors. An $\ell$-RLL constraint, with $\ell$ as above, restricts the longest run within a sequence to be at most $\ell$. This limits ambiguous signal regions in technologies like Illumina or Nanopore, reducing indel errors without affecting the chemical stability of the DNA. The capacity computed below is the asymptotic growth rate (like Shannon capacity) of the number of allowed sequences.

So I studied two examples that show why this construction becomes impractical for large $\ell$.

Case $\ell= 1$:
$\bullet$ Alphabet: $\Sigma_{1}= \left\{ \texttt{A}, \texttt{T}, \texttt{G}, \texttt{C}, \texttt{M} \right\}$ with $\texttt{M}= \texttt{A}/\texttt{C}$ mix (as in the given paper).
$\bullet$ Forbidden substrings: runs of length $2$, $\mathcal{F}= \left\{ \texttt{A}^{2}, \texttt{T}^{2}, \texttt{C}^{2}, \texttt{G}^{2}, \texttt{M}^{2}, \texttt{AM}, \texttt{MA}, \texttt{CM}, \texttt{MC} \right\}$.
$\bullet$ Transition matrix: $5\times 5$ adjacency matrix, with rows and columns ordered $\texttt{A}, \texttt{T}, \texttt{C}, \texttt{G}, \texttt{M}$:
$$\mathcal{A}= \begin{bmatrix} 0 & 1 & 1 & 1 & 0\\ 1 & 0 & 1 & 1 & 1\\ 1 & 1 & 0 & 1 & 0\\ 1 & 1 & 1 & 0 & 1\\ 0 & 1 & 0 & 1 & 0\\ \end{bmatrix}$$
$\phantom{\bullet}$ The entry $\left( \mathcal{A}^{n} \right)_{i, j}$ counts the number of valid paths of length $n$ from state $i$ to state $j$.
$\bullet$ Perron–Frobenius theorem: For large $n$, the growth of $\mathcal{A}^{n}$ is dominated by the largest real eigenvalue $\lambda_{\max}$. So the total number of $\ell$-RLL sequences of length $n$, $\left| \mathcal{C}_{n} \right|$, is approximately:
$$\left| \mathcal{C}_{n} \right|\approx c\lambda_{\max}^{n}\Rightarrow\mathbf{cap}_{\ell; \Sigma}= \lim\limits_{n\rightarrow\infty}\frac{\log_{2}\left| \mathcal{C}_{n} \right|}{n}= \log_{2}\lambda_{\max}\approx \log_{2}3.323\approx 1.733$$

Case $\ell= 2$:
$\bullet$ Alphabet: $\Sigma_{1}= \left\{ \texttt{A}, \texttt{T}, \texttt{G}, \texttt{C}, \texttt{M} \right\}$ with $\texttt{M}= \texttt{A}/\texttt{C}$ mix (as in the given paper).
$\bullet$ Forbidden substrings: Runs of length $3$ ($\mathcal{F} = \left \{ \texttt{AAA}, \texttt{TTT}, \texttt{CCC}, \texttt{GGG}, \texttt{AAM}, \texttt{AMA}, \texttt{MAA}, \texttt{CCM}, \texttt{CMC}, \texttt{MCC}, \texttt{MMM}, \texttt{AMM}, \texttt{MAM}, \texttt{MMA}, \texttt{CMM}, \texttt{MCM}, \texttt{MMC} \right \}$).
$\bullet$ Transition matrix: $5^{2}\times 5^{2}$ adjacency matrix.
$\bullet$ Perron–Frobenius eigenvalue (computed with the following code):

    import itertools
    import numpy as np

    alphabet = ['A', 'T', 'C', 'G', 'M']
    l = 2
    forbidden_substrings = [
        'AAA', 'TTT', 'CCC', 'GGG',
        'AAM', 'AMA', 'MAA', 'CCM', 'CMC', 'MCC',
        'MMM', 'AMM', 'MAM', 'MMA', 'CMM', 'MCM', 'MMC',
    ]

    # Adjacency matrix: a state is the last l symbols; appending char moves
    # state -> (state + char)[-l:] unless the new window is forbidden.
    states = [''.join(p) for p in itertools.product(alphabet, repeat=l)]
    state_to_idx = {s: i for i, s in enumerate(states)}
    trans_matrix = np.zeros((len(states), len(states)), dtype=int)

    for i, state in enumerate(states):
        for char in alphabet:
            new_seq = state + char
            if new_seq not in forbidden_substrings:
                new_state = new_seq[-l:]
                j = state_to_idx[new_state]
                trans_matrix[i, j] = 1

    # Power iteration: v converges to the Perron eigenvector.
    v = np.ones(len(states))
    for _ in range(1000):
        v = trans_matrix @ v
        v /= np.linalg.norm(v)
    # v has unit norm, so the Rayleigh quotient v.(Av) estimates lambda_max
    # (np.max(trans_matrix @ v) would instead be scaled by the largest entry of v).
    lambda_max = v @ (trans_matrix @ v)
    capacity = np.log2(lambda_max)

    print(f"Capacity (ℓ=2): {capacity:.3f} bits/symbol")  # Capacity: 2.170
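
As a cross-check on both cases, here is a minimal sketch that rebuilds the transition matrix for a given $\ell$ and takes the spectral radius directly with `np.linalg.eigvals` instead of power iteration. It assumes the constraint can be read as "no $\ell+1$ consecutive symbols drawn from a single run set", with the run sets $\{\texttt{A},\texttt{M}\}$, $\{\texttt{C},\texttt{M}\}$, $\{\texttt{T}\}$, $\{\texttt{G}\}$; that reading reproduces the two forbidden lists above, but the names `RUN_SETS`, `is_forbidden`, `transition_matrix` and `capacity` are my own and not from the paper.

```python
import itertools
import numpy as np

ALPHABET = ["A", "T", "C", "G", "M"]
# Assumption: M mixes A and C, so it extends both A-runs and C-runs.
RUN_SETS = [set("AM"), set("CM"), set("T"), set("G")]


def is_forbidden(word):
    """True if every symbol of `word` lies in one single run set."""
    return any(all(ch in s for ch in word) for s in RUN_SETS)


def transition_matrix(l):
    """Dense |Sigma|^l x |Sigma|^l adjacency matrix over length-l states."""
    states = ["".join(p) for p in itertools.product(ALPHABET, repeat=l)]
    index = {s: i for i, s in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    for s in states:
        for ch in ALPHABET:
            word = s + ch  # window of length l + 1
            if not is_forbidden(word):
                A[index[s], index[word[-l:]]] = 1
    return A


def capacity(l):
    """log2 of the spectral radius of the transition matrix."""
    lam = np.max(np.abs(np.linalg.eigvals(transition_matrix(l))))
    return np.log2(lam)


for l in (1, 2):
    print(f"l={l}: capacity = {capacity(l):.3f} bits/symbol")
```

If the assumption holds, this should agree with the values 1.733 and 2.170 quoted above. It still builds the full matrix, so it is only practical for small $\ell$.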


Question: Is there an efficient way to compute the largest real eigenvalue (Perron–Frobenius eigenvalue) of the transition matrix for an $\ell$-RLL constraint without explicitly constructing the full $\left | \Sigma \right |^{\ell}\times\left | \Sigma \right |^{\ell}$ matrix?
In particular:

  1. Can this eigenvalue be estimated using methods such as the permanent, the characteristic polynomial, or other symbolic techniques (e.g., generating functions, spectral graph theory)?
  2. Among possible composite DNA alphabets (e.g., $\Sigma= \left\{ \texttt{A}, \texttt{T}, \texttt{G}, \texttt{C}, \texttt{M}, \texttt{K} \right\}$), which alphabet yields the highest channel capacity under an $\ell$-RLL constraint?

Thank you very much for your valuable insights. They will deepen my understanding and help advance my research career in this field.

Dang Dang
  • I don't know whether it is possible to find the largest eigenvalue without constructing the full matrix. Nevertheless, you can find lower and upper bounds. For example, Proposition 3.1.2 in Spectra of Graphs by Andries E. Brouwer and Willem H. Haemers says that for strongly connected graphs the largest eigenvalue lies between $\overline{k}$ and $k_{max}$, the average degree and the maximum degree. Since your graphs are sparse, this is not the best bound, but maybe you will find something closer to your interest. – user_sion Apr 21 '25 at 09:47
  • @user_sion Yes, I also find that degree-based bounds are an effective way to estimate the spectral radius. I hope to soon find a way to formulate my problem more simply. Thank you for your dedicated comment! – Dang Dang Apr 21 '25 at 10:43
  • https://cs.stackexchange.com/q/171782/755 – D.W. Apr 24 '25 at 04:20
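
The degree bounds mentioned in the first comment can be checked numerically on the $\ell = 1$ matrix from the question. The sketch below assumes that matrix is copied correctly; it is symmetric, so the bound "average degree $\le \lambda_{\max} \le$ maximum degree" applies directly. The variable names are illustrative only.

```python
import numpy as np

# 5x5 adjacency matrix for l = 1, copied from the question.
A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0],
])

degrees = A.sum(axis=1)       # out-degree of each state
avg_deg = degrees.mean()      # lower bound on lambda_max
max_deg = degrees.max()       # upper bound on lambda_max
lam = np.max(np.linalg.eigvalsh(A))  # eigvalsh is valid because A is symmetric

print(f"{avg_deg:.3f} <= {lam:.3f} <= {max_deg}")
```

Here the bounds are 3.2 and 4, which bracket the value 3.323 from the question but pin down the capacity only loosely.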

1 Answer

Note that the matrix for $\ell = 2$ has multiple equal rows and columns (for example corresponding to $\mathrm{TM}$, $\mathrm{MT}$, $\mathrm{CT}$ and $\mathrm{TC}$).

Repeated rows and columns can be removed in the following way:

  1. First, apply a similarity transformation by an invertible matrix $P$: the matrices $A$ and $P^{-1}AP$ have the same multiset of eigenvalues. By selecting an appropriate $P$ we can turn all but one of the repeated rows (columns) into zeros.
  2. Remove the zero rows and columns: every zero cross of a row and a column sharing a diagonal element corresponds to an eigenvalue $0$. Removing such a zero cross doesn't affect the other eigenvalues.
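
One concrete way to carry out this merging numerically is sketched below. It is not necessarily the exact $P$ meant in step 1: it factors $A = SR$, where $R$ holds the distinct rows of $A$ and $S$ is the 0/1 matrix assigning each state to its row class, and then forms $B = RS$. The matrices $SR$ and $RS$ share their nonzero eigenvalues, so the spectral radius is preserved. The run sets and all function names are assumptions for illustration; for $\ell = 2$, for instance, the rows of $\texttt{TM}$ and $\texttt{GM}$ coincide.

```python
import itertools
import numpy as np

ALPHABET = "ATCGM"
RUN_SETS = [set("AM"), set("CM"), set("T"), set("G")]  # assumed run sets


def full_matrix(l):
    """Full |Sigma|^l x |Sigma|^l transition matrix (for comparison only)."""
    states = ["".join(p) for p in itertools.product(ALPHABET, repeat=l)]
    index = {s: i for i, s in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    for s in states:
        for ch in ALPHABET:
            w = s + ch
            if not any(all(c in rs for c in w) for rs in RUN_SETS):
                A[index[s], index[w[-l:]]] = 1
    return A


def merge_equal_rows(A):
    """Factor A = S @ R with R the distinct rows of A, and return R @ S."""
    R, inverse = np.unique(A, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # guard against shape differences across NumPy versions
    S = np.zeros((A.shape[0], R.shape[0]))
    S[np.arange(A.shape[0]), inverse] = 1  # state i belongs to class inverse[i]
    return R @ S


def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))


A = full_matrix(2)
B = merge_equal_rows(A)
print(A.shape, "->", B.shape)
print(spectral_radius(A), spectral_radius(B))
```

For $\ell = 2$ this shrinks the $25\times 25$ matrix to $12\times 12$, below the 17 forbidden substrings. The sketch starts from the full matrix only to make the comparison visible; the merge step could be repeated if the reduced matrix still had equal rows.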

In this way one can reduce the size of the matrix down to the number of forbidden substrings (or even fewer) without constructing the full matrix.
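
To illustrate skipping the full matrix altogether, here is a hedged sketch that builds a small state graph directly. It assumes, as the forbidden lists for $\ell = 1, 2$ suggest, that the constraint is "no more than $\ell$ consecutive symbols from any one of the run sets $\{\texttt{A},\texttt{M}\}$, $\{\texttt{C},\texttt{M}\}$, $\{\texttt{T}\}$, $\{\texttt{G}\}$". A state then only needs to record the current run length inside each set. The names `step`, `reduced_matrix` and `capacity` are hypothetical.

```python
import numpy as np

ALPHABET = "ATCGM"
RUN_SETS = [set("AM"), set("CM"), set("T"), set("G")]  # assumed run sets


def step(state, ch, l):
    """Append ch; state[k] is the current run length inside RUN_SETS[k]."""
    new = tuple(r + 1 if ch in s else 0 for r, s in zip(state, RUN_SETS))
    return new if max(new) <= l else None  # None: some run would exceed l


def reduced_matrix(l):
    """Adjacency matrix over reachable run-length states (found by search)."""
    start = (0,) * len(RUN_SETS)  # empty sequence; a transient state
    index, frontier, edges = {start: 0}, [start], []
    while frontier:
        state = frontier.pop()
        for ch in ALPHABET:
            nxt = step(state, ch, l)
            if nxt is None:
                continue
            if nxt not in index:
                index[nxt] = len(index)
                frontier.append(nxt)
            edges.append((index[state], index[nxt]))
    B = np.zeros((len(index), len(index)))
    for i, j in edges:
        B[i, j] += 1
    return B


def capacity(l):
    lam = np.max(np.abs(np.linalg.eigvals(reduced_matrix(l))))
    return np.log2(lam)


for l in (1, 2, 10):
    print(l, reduced_matrix(l).shape, f"{capacity(l):.3f}")
```

Under this assumption the number of states is $\ell^{2} + 4\ell + 1$ (13 for $\ell = 2$, including the transient start state, which only adds an eigenvalue $0$) instead of $5^{\ell}$, and the spectral radius matches the full matrix because both graphs generate the same set of allowed sequences.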

Smylic