MSA
The context for my problem is multiple sequence alignment, column entropy. So basically:
- finding $H$ for sequences like `MKR--KK-RR---RRM` (given the 1-letter code for amino acids, with `-` denoting a gap), and
- the change in entropy when a new sequence is added to the alignment (the column sequence is extended by one symbol).
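To make the computation I have in mind concrete, here is a minimal Python sketch (my own helper, not from any library) that treats $p(x)$ as the observed frequency of each symbol in a column:

```python
from collections import Counter
from math import log2

def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of one alignment column,
    using observed symbol frequencies as p(x)."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# The column from above: symbols M, K, R and the gap '-'.
print(column_entropy("MKR--KK-RR---RRM"))  # ≈ 1.883 bits
```

Only symbols that actually occur in the column enter the sum here, which is exactly the interpretation I am unsure about.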
Question
I can't quite convince myself how $p(x)$ in the entropy [definition below] should be interpreted:
- as the frequency of a particular symbol [from the set of observed symbols] (M, R, K, -), as opposed to
- the probability of observing a symbol in any position [given the amino acid alphabet of 20 symbols].
In the context of MSA, people generally use the former approach, but to me it seems like it doesn't account for an important aspect: the underlying "alphabet", as the definition calls it.
Example. Suppose I have two sequences of the same length, $S_{1} :=$ K-K-P-RM and $S_{2} :=$ KK-RR-MM, where $S_{1}$ features 4 symbols out of 20 but $S_{2}$ features only 3. It seems wrong to compare their "entropies" $H$, because the sets over which the events are distributed are not the same: $\mathcal{X}_{1}=\{K,P,R,M\} \neq \mathcal{X}_{2}=\{K,R,M\}$. Yet that's what happens when we add "column entropies". Furthermore, I know that people also sometimes account for the stereochemical properties of amino acids when calculating $H$ (some amino acids then become more likely to occur than others, as opposed to being equiprobable).
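To illustrate my confusion numerically, here is a small sketch (my own code; the alphabet string and function name are just illustrative) that computes $H$ both over the observed symbols and over the full 20-letter alphabet plus gap. With the usual convention $0 \cdot \log 0 = 0$, the unobserved symbols seem to contribute nothing:

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard 1-letter codes

def entropy_over(symbols, column):
    """Entropy (bits) summing over a fixed symbol set,
    with zero-probability terms treated as contributing 0."""
    counts = Counter(column)
    n = len(column)
    h = 0.0
    for s in symbols:
        p = counts.get(s, 0) / n
        if p > 0:
            h -= p * log2(p)
    return h

s1, s2 = "K-K-P-RM", "KK-RR-MM"
# Observed-symbol alphabet vs. full alphabet (incl. gap):
print(entropy_over(set(s1), s1), entropy_over(AMINO_ACIDS + "-", s1))
print(entropy_over(set(s2), s2), entropy_over(AMINO_ACIDS + "-", s2))
# Each pair prints the same value, since unobserved symbols add 0.
```

So numerically the two sums coincide, which is part of why I can't tell whether the choice of alphabet is supposed to matter at all.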
I think I might be conflating some things; maybe someone can state the difference between the two interpretations for me. In particular:
- How does the choice of alphabet affect the entropy of a sequence if not all symbols of the alphabet are observed in the sequence (as per the definition)?
- If it does not, what is the extended entropy definition that would account for things like that?
- Am I correct to say that the underlying distributions are different for sequences with different numbers of distinct symbols?
Shannon Entropy
For reference, the Wikipedia definition of entropy $H$:
Given a discrete random variable $X$, which takes values $x$ in the alphabet $\mathcal{X}$ and is distributed according to ${\displaystyle p:{\mathcal {X}}\to [0,1]}$:
$${\displaystyle \mathrm {H} (X):=-\sum _{x\in {\mathcal {X}}}p(x)\log p(x)=\mathbb {E} [-\log p(X)],}$$
where $\sum$ denotes the sum over the variable's possible values.