Calculating Shannon Entropy for DNA sequence?

Question

I'm following the formula on http://www.shannonentropy.netmark.pl/calculate to calculate the Shannon entropy of a string of nucleotides [nt]. Since there are 4 nt, I assigned each of them an equal probability P(nt) = 0.25. The equation I'm using is -sum([Pr(x)*log2(Pr(x)) for all x in X]), where X is the DNA sequence (e.g. ATCG).

So my question is this: in Shannon entropy, MUST the probabilities be based solely on the sequence itself, or can they be predetermined (i.e. nt_set = {A, T, C, G} and each P(nt) = 0.25)?

If I used predetermined probabilities, would that still be entropy and if not, what would I be calculating?

leonbloy · Accepted Answer · 2015-08-24T20:32:01.293

In Shannon Entropy, MUST the probability be based solely on the sequence itself or can the probabilities be predetermined

Quite the contrary (if I understand you right): the probabilities must be predetermined. More precisely, the Shannon entropy is defined in terms of a probabilistic model; it assumes that the probabilities are known. Hence, it does not make much sense to speak of the entropy of a particular sequence, but rather of the entropy of a source that emits that kind of sequence (in a probabilistic sense).

In your case, if you assume that you have 4 symbols, and that they are equiprobable and independent, then the entropy is 2 bits per symbol.
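As a quick check of the 2 bits figure, here is a minimal Python sketch that evaluates the entropy of a source from an assumed probability model. The function name `source_entropy` and the skewed example distribution are illustrative assumptions, not part of the question.

```python
import math

def source_entropy(probs):
    """Shannon entropy (bits per symbol) of a source with independent symbols.

    `probs` maps each symbol to its assumed probability under the model.
    """
    if not math.isclose(sum(probs.values()), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Symbols with p == 0 are skipped: p*log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# The model from the question: four equiprobable nucleotides.
uniform = {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25}

# A hypothetical non-uniform model, for comparison only.
skewed = {"A": 0.5, "T": 0.25, "C": 0.125, "G": 0.125}

print(source_entropy(uniform))  # 2.0
print(source_entropy(skewed))   # 1.75
```

Note that the result depends only on the model, not on any particular sequence: any string emitted by the uniform source is assigned the same 2 bits per symbol.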

score 0 · Answer 2 · edited Apr 23 '17 at 10:43

In communication theory, the element probabilities for a particular sequence of message elements may be calculated from the sequence itself; for four possible elements, for example, $P_i$ will not generally be equal to 0.25. Shannon entropy is then calculated to determine the information content of that particular sequence and therefore, in a sense, its possible complexity. Determining starting and stopping points is often problematic in serial data streams. Decreases and increases in Shannon entropy may mark points in the message where simple sequences (e.g. a low-information sequence that marks the start of an information block) separate more complex sequences that contain a possible underlying message.
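The approach described in this answer could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: symbols are treated as independent, $P_i$ is estimated as the relative frequency of each element in the sequence, and the names `empirical_entropy` and `windowed_entropy`, the window length, and the example sequence are all made up for the example.

```python
import math
from collections import Counter

def empirical_entropy(seq):
    """Entropy in bits per symbol, with P_i estimated from seq itself."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    counts = Counter(seq)
    # P_i = c / n, and -P_i * log2(P_i) is written as P_i * log2(n / c).
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def windowed_entropy(seq, window):
    """Empirical entropy of every length-`window` slice, one symbol per step."""
    return [empirical_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]

# Made-up sequence: a low-information run followed by a more varied stretch.
seq = "AAAAAAAAACGTTGCA"
profile = windowed_entropy(seq, 8)

print(empirical_entropy("ATCG"))  # 2.0
print(profile[0], profile[-1])    # 0.0 2.0
```

In the windowed profile, the entropy rises from 0 bits over the run of `A` to 2 bits over the final window, which is the kind of change that could be used to locate a boundary between simple and complex stretches.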
