
MSA

The context for my problem is multiple sequence alignment (MSA), specifically column entropy. So basically:

  • finding $H$ for sequences like MKR--KK-RR---RRM, given the 1-letter code for amino acids, and
  • the change in entropy when a new sequence is added to the alignment (the column is extended by one symbol); see the sketch after this list.
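
For concreteness, here is a minimal sketch of the frequency-based computation (the interpretation I call (1) below); I count the gap as a symbol here, though conventions for gaps vary:

```python
from collections import Counter
from math import log2

def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of a column, from observed symbol frequencies."""
    counts = Counter(column)          # e.g. {'M': 2, 'K': 3, 'R': 5, '-': 6}
    n = len(column)
    # Sum -p(x) * log2(p(x)) over the symbols actually observed in the column.
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(column_entropy("MKR--KK-RR---RRM"))  # ~1.88 bits (p = 2/16, 3/16, 5/16, 6/16)
```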

Question

I can't quite convince myself whether $p(x)$ in the entropy [definition below] should be interpreted as:

  1. the frequency of a particular symbol [from the set of observed symbols] (M, R, K, -), as opposed to
  2. the probability of observing a symbol at any position [given the amino-acid alphabet of 20 symbols].

In the context of MSA, people generally use the former approach (1), but to me it seems that it fails to account for an important aspect: the underlying "alphabet", as the definition calls it.

Example. So if I have two sequences of the same length, $S_{1} :=$ K-K-P-RM and $S_{2} :=$ KK-RR-MM, where one features four symbols out of 20 and the other only three, it seems wrong to compare their "entropies" $H$ because the sets over which the events are distributed are not the same: $\mathcal{X}_{1}=\{K,P,R,M\} \neq \mathcal{X}_{2}=\{K,R,M\}$. Yet that's what happens when we add "column entropies" (see the worked computation below). Furthermore, I know that people also sometimes account for stereochemical properties of amino acids when calculating $H$; in that case some amino acids become more likely to occur than others, as opposed to being equiprobable.
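
To make this concrete, computing $H$ under approach (1) from the observed residue frequencies (ignoring gaps, since the sets above exclude them) gives

$$H(S_1) = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + 3\cdot\tfrac{1}{5}\log_2\tfrac{1}{5}\right) \approx 1.92 \text{ bits}, \qquad H(S_2) = -3\cdot\tfrac{1}{3}\log_2\tfrac{1}{3} = \log_2 3 \approx 1.58 \text{ bits},$$

two numbers computed over different supports.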

I think I might be conflating some things; maybe someone can state the difference between them for me. In particular:

  • How does the choice of alphabet affect the entropy of a sequence if not all symbols of the alphabet are observed in the sequence (as per the definition)?
  • If it doesn't, is there an extended definition of entropy that accounts for this?
  • Am I correct to say that the underlying distributions are different for sequences with different numbers of distinct symbols?

Shannon Entropy

For reference, the Wikipedia definition of entropy $H$:

Given a discrete random variable $X$, which takes values $x$ in the alphabet $\mathcal{X}$ and is distributed according to $p:\mathcal{X}\to[0,1]$:

$$\mathrm{H}(X) := -\sum_{x\in\mathcal{X}} p(x)\log p(x) = \mathbb{E}[-\log p(X)],$$

where $\Sigma$ denotes the sum over the variable's possible values, with the convention that $0 \log 0 = 0$, so symbols of probability zero contribute nothing.

Related answers that didn't help: Answer 1, Answer 2

  • Try reading the first part of this paper by Hartley. It gives an intuitive construction of the entropy formula. – Joako May 07 '23 at 00:07
  • (2) is probably correct; it depends on your expected alphabet. The underlying distribution doesn't depend on the number of distinct symbols observed in your sequence; it depends on your prior distribution before you look at the sequence. Usually this will be $1/N$ if you have $N$ symbols, but symbols might be over- or under-weighted based on prior knowledge. – Eric May 07 '23 at 01:43
  • Of course, if the first symbol is $X$, then the second symbol might heavily depend on that (i.e., be more likely to be $X$), which should then be included in your entropy calculation. – Eric May 07 '23 at 01:44

1 Answer


it seems wrong to compare their "entropies" $H$ because the sets over which the events are distributed are not the same: $\mathcal{X}_1=\{K,P,R,M\} \neq \mathcal{X}_2=\{K,R,M\}$.

You've misunderstood the probability space over which $p$ is defined. The column entropy of a particular index in an MSA is calculated over the distribution of residues at that index across the MSA. In other words, let's represent your MSA by an $m \times n$ matrix $S$ whose rows are the sequences $S_i$, each of length $n$. You can think of $S$ as defining $n$ probability distributions $p_j$ over the (single!) alphabet $\mathcal{X}$ of all 20 amino acid symbols, where

$$p_j(x) := \frac{|\{i \in [m] : S_{i,j} = x\}|}{m}$$

Now to compute the Shannon entropy of the $j$th column, $p_j$ would be the probability mass function $p$ that you use in the definition you provided and $\mathcal{X}$ would be the alphabet. Also, this is indeed what's going on in the github link you provided. Taken from a comment earlier on in the code:

$H$ ranges from $0$ (only one base/residue is present at that position) to $4.322$ (all $20$ residues are equally represented at that position)
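
A minimal sketch of that computation (my own illustration, not the linked code), where a symbol that never occurs in the column simply gets $p_j(x) = 0$:

```python
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino-acid symbols

# Hypothetical toy MSA: m = 4 aligned sequences (rows), n = 5 columns.
msa = ["KMRRK",
       "KMRRM",
       "KKRRM",
       "KMRRM"]

def column_entropy(msa: list[str], j: int) -> float:
    """Shannon entropy (bits) of column j, with p_j(x) = |{i : S[i][j] == x}| / m."""
    m = len(msa)
    h = 0.0
    for x in ALPHABET:                                # iterate over the full alphabet
        p = sum(1 for row in msa if row[j] == x) / m
        if p > 0:                                     # 0 * log 0 = 0: unobserved symbols add nothing
            h -= p * log2(p)
    return h

print(column_entropy(msa, 0))  # 0.0: only K is present in column 0
print(column_entropy(msa, 1))  # ~0.81 bits: p(M) = 3/4, p(K) = 1/4
print(log2(20))                # ~4.322: the maximum, all 20 residues equiprobable
```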

  • Thank you, your paragraph makes it very explicit; it should be in every MSA doc. This still eludes me, even outside the MSA context: how do the basic definition of entropy above and the pmf you provide, $p_{j}(x)$, account for the size of the alphabet in any way? Say my alignment (not just one column) as a whole features only 10 out of 20 AAs. The pmf only concerns itself with the $x$s observed in the $j$th column. At the aggregate level, $p_{j}(x)\log p_{j}(x)$ is undefined for a symbol that doesn't appear in $j$. Yet the sum index still has us iterating over all $x\in\mathcal{X}$. – rtviii May 07 '23 at 19:29
  • $p_j(x)$ isn't undefined for an $x$ that doesn't appear in column $j$ - it's simply $0$. – Johnny May 07 '23 at 21:15
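
A quick numerical check of that last point (my own sketch): padding the pmf with zero-probability symbols leaves $H$ unchanged, since each $p(x) = 0$ term contributes nothing under the convention $0 \log 0 = 0$:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes are skipped (0 * log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

observed = [1/3, 1/3, 1/3]        # support {K, R, M} only
padded = observed + [0.0] * 17    # the same pmf viewed over all 20 amino acids

print(entropy(observed), entropy(padded))  # both ~1.585 (= log2 3)
```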