
I am planning to give a lesson on Shannon entropy to high school seniors who are likely to pursue scientific studies at university next year. As part of the lesson, I want to provide a practical example of calculating entropy manually.

I thought about using the famous scene from The Shining where Jack Torrance repeatedly types the sentence "All work and no play makes Jack a dull boy." My idea is to treat each word as a symbol and calculate the entropy of the text by determining the probability of each word occurring ("all", "work", "and", etc.). The goal is to show how repeated patterns decrease the overall entropy of the text, since redundancy shows up as low Shannon entropy.
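
Concretely, the manual calculation I have in mind looks something like the short Python sketch below (I would normalize case and strip the period so that "boy." and "boy" count as the same symbol):

    from collections import Counter
    from math import log2

    # The repeated page, as Jack types it: case normalized, period stripped.
    page = ("All work and no play makes Jack a dull boy. " * 30).lower().replace(".", "")
    words = page.split()

    counts = Counter(words)   # how often each word occurs
    total = len(words)

    # Shannon entropy in bits per word: H = -sum of p * log2(p) over the words
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    print(f"{h:.3f} bits per word")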

  1. What do you think of this approach? Do you have any suggestions to make it more engaging or effective?

  2. If I were to calculate the entropy based on characters instead of words, how would the results differ? Would this method better illustrate the concept of Shannon entropy, or is the word-based approach more suitable for high school students?

Thank you in advance for your insights and suggestions!


Mark
  • FYI, I believe questions like this one are better suited for the Mathematics Educators SE site. – John Omielan Jun 01 '24 at 21:21
  • How are you assigning probabilities? Frequency of word use? Tough, since there are dependencies. – lulu Jun 01 '24 at 21:24
  • You could do letter counts. – lulu Jun 01 '24 at 21:25
  • I think this is far too subtle of a case to use in an introductory example. As @lulu points out, assigning probabilities is the first step for any information-theoretic analysis and that's highly contextual - in natural language, this is not at all easy to do. I'll write a full answer later, but in short I suggest that you make sure that your examples are strongly contextualized so that the application immediately and obviously resolves the potentially-deep interpretational issues. – user3716267 Jun 02 '24 at 01:47
  • Thank you all! Yes, I was thinking of assigning probabilities based on the frequency of word usage... The thing is, this class is studying Kubrick's films and the literature that inspired them, so I liked the idea of calculating entropy for this example. – Mark Jun 02 '24 at 10:35

1 Answer


Natural language is a tempting example for information theory - after all, it is the most familiar form of code for almost everyone. But for a first introduction to the concept, it's deceptively complicated. Natural language statements contain much more information than meets the eye.

Teaching Shannon entropy to high school students is particularly hard because they're unlikely to yet have the mathematical fundamentals to really understand why entropy "works." It's fairly easy to describe entropy loosely as the amount of "uncertainty" or "redundancy" in a random variable or signal. However, there are many such measures we could use - why is this funky sum-of-logarithms the "correct" one?

I have a fairly specific opinion on this, which I'll borrow from an earlier post of mine on Cross Validated:

Given some (discrete) random variable X, we may ask ourselves what the "average probability" of the outcomes of X is. To do this properly, we need to take a geometric mean, weighted by the probabilities themselves. Unfortunately, geometric means are obnoxious - conveniently for us, though, the logarithm of the geometric mean is the arithmetic mean of the logarithms.

Viewed this way, information theory really is "just" probability theory, except we've taken logarithms to turn multiplication into addition.
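
Spelled out: for a discrete $X$ taking values with probabilities $p_1, \dots, p_n$, the probability-weighted geometric mean of the outcome probabilities is

$$\bar{p} = \prod_{i=1}^{n} p_i^{\,p_i},$$

and taking the negative logarithm turns that product into Shannon's sum:

$$H(X) = -\log_2 \bar{p} = -\sum_{i=1}^{n} p_i \log_2 p_i.$$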

There are a number of other ways to "single out" Shannon's formula as the "natural" measure of information (see: convexity), but to me this one is the most-intuitively-accessible. Even so, most high school students are unlikely to be familiar with the geometric mean, much less have any intuition for it, and so this can be really hard to get across.

Since students are already likely to struggle to understand the motivation for the entropy formula, it's doubly hard for them to deal with ambiguities in interpreting its output. The above intuition is pretty hard to teach a high school student even when describing the simple case of a weighted coin-flip (or die toss), where there is only one possible way to assign probabilities and it is plainly obvious what the "information" we've calculated actually corresponds to in terms of the values of the signal.
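
For concreteness, the coin numbers are easy to work by hand. A fair coin gives

$$H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit},$$

while a coin weighted $9\!:\!1$ gives

$$H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47 \text{ bits},$$

and students can check that the entropy keeps shrinking as the coin gets more lopsided.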

But natural language is (clearly) not just a simple coin-flip or die-toss. If we naively count character frequencies over a representative corpus, the information content we calculate using Shannon's formula is from the point of view of a generating process that looks more like The Library of Babel than natural language:

[Image: a block of uniformly random characters - "if all characters are flatly-weighted, you get this"]
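
(If you want to show this live in class, a few lines of Python will generate the gibberish; the alphabet here - lowercase letters plus a space - is just an arbitrary choice for illustration.)

    import random
    import string

    # A memoryless, flat unigram model: every character is drawn independently
    # and uniformly. Maximal per-character entropy, and nothing like English.
    alphabet = string.ascii_lowercase + " "
    print("".join(random.choice(alphabet) for _ in range(80)))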

The idea that information is contextualized in this way - that it is "from the point of view" of some generating process - is itself crucial to understanding and using entropy in practice. But if you start off with one of the most-complicated cases imaginable when first introducing the concept, this can easily be misunderstood or overlooked entirely.

I suggest a phased approach: start with a weighted coin toss or some other trivial base case when first introducing the concept. Once the basic idea has been established, you can work through calculating entropies for strings of text with increasing amounts of detail/memory (i.e., as a Markov process of increasing order) so you can illustrate how we use this (finitary) theory to approximately describe signals that may carry infinite information. Without this foundation, it's very hard to understand what the entropy value means.
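
As a rough sketch of that progression in Python (the entropy helper is just for the sketch, and the probabilities are plug-in frequency estimates from the text itself - which is exactly the kind of modelling choice worth discussing with the class):

    from collections import Counter
    from math import log2

    def entropy(counts):
        """Shannon entropy, in bits, of a frequency table."""
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())

    text = ("all work and no play makes jack a dull boy " * 50).strip()

    # Order 0: characters modelled as independent draws.
    h0 = entropy(Counter(text))

    # Order 1: expected entropy of the next character given the previous one.
    pairs = Counter(zip(text, text[1:]))
    prev = Counter(text[:-1])
    n = sum(prev.values())
    h1 = sum(
        (prev[a] / n)
        * entropy(Counter({b: c for (x, b), c in pairs.items() if x == a}))
        for a in prev
    )

    print(f"order 0: {h0:.2f} bits/char, order 1: {h1:.2f} bits/char")

The order-1 estimate comes out well below the order-0 one for this text, and pushing the order higher drives the estimate toward zero for this periodic string - which is where the redundancy of the repeated sentence finally becomes visible.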

How much "hard" mathematics you do at each step depends on your students and the amount of time you have, but I strongly suggest painting the full picture - simple to complex - and really taking time to talk about the interpretational issues. Interpretation here is more important than the math.

user3716267