How to measure information using entropy

Which text gives us more information – the full body of work of Shakespeare, or 884,647 random words written by 1000 monkeys?
To answer this, we talk with our good friend Alice, and ask her to send both of these texts to her close friend Bob. It should seem reasonable that the more information “dense” a text is, the harder it is to send it. Since the time of Shakespeare, we invented the internet over which we can only send messages over the simple 0\1 alphabet so Alice’s job is to translate the text to this 0\1 alphabet. If she manages to find a good translation with fewer 0\1 bits, the easier her job is. If the original text contained a lot of redundancy, namely many parts of the text don’t contain “new” information, then we expect Alice could shorten the message a lot. Conversely, the more information the message has, the harder it is to compress it.

So now, Alice needs to chose for each English letter $\sigma$ a word $w(\sigma)$ over $\{0,1\}$ and then the message $\sigma_1 \sigma_2 ... \sigma_n$ is coded to be $w(\sigma_1)...w(\sigma_n)$ . This leaves us with the question of what is the best that we can hope to gain this way?

For simplicity, lets assume that the Englih alphabet only contains $\Sigma=\{a,b,c,d\}$ . A simple coding is the mapping $a\mapsto 00, b\mapsto 01, c\mapsto 10, d\mapsto 11$ , so for example we have

$M=cabdbaabadacaaba \mapsto 10,00,01,11,01,00,00,01,00,11,00,10,00,00,01,00.$

When Bob sees the word over $\{0,1\}$ (without the commas), he can decode it by dividing it to pairs of 0\1 and then decode each pair. In this case we needed $2n$ bits for a legth- $n$ message (and in this example, total of 32 bits). Can we do better than that?

Prefix free codes:

At this point Alice notes that she uses a lot of $a$ ‘s and thinks that maybe she can use a shorter coding for $a$ , for example $w(a)=0$ , and keep the rest of the letters with the same coding. The problem then is that when Bob sees $0$ he doesn’t know if it belong to an $a$ or it is the beginning of $b$ (which is mapped to 01). In order to help Bob decode the message, Alice decides to use a prefix free code – no coding of one letter is a prefix of a coding of another letter. This way, once Bob reads the coding of $w(\sigma)$ for some $\sigma \in \Sigma$ he knows that this coding must belong to $\sigma$ and not part of a coding of a different letter. The best way to visualize such a code is using a binary tree, for example:

The coding of a letter $\sigma$ is the string on the path leading from the root of the tree to the leaf labeled with $\sigma$ . So in the tree above we have that $a\mapsto 0, b\mapsto 10, c\mapsto 110$ and $d\mapsto 111$ . Now the coding will be

$M=cabdbaabadacaaba \mapsto w(M)=1100101111000100111011000100.$

You should check that you can decode the coded message above back to the original message. Letting $l(\sigma)=|w(\sigma)|$ be the length of the coding of $\sigma$ and $M_\sigma$ be the number of occurrences of $\sigma$ in the message, we get that the length of the coded message is $|w(M)|=\sum l(\sigma) M_\sigma$ . In this example we have

$M_a = 8, M_b=4, M_c=M_d=2$
$l(a) = 1, l(b)=2, l(c)=l(d)=3$ ,

and therefore the number of bits used is $1\cdot 8 + 2\cdot 4 + 3\cdot 2 + 3\cdot 2 = 28$ , which is less than the 32 bits in Alice’s previous attempt.

Before, we used exactly two bits per letter. How many do we have now? Well, the computation is simply $\frac{|w(M)|}{|M|}$ which in our example is $\frac{28}{16}=\frac{7}{4}$ which is less than the 2 bits per letter in the previous coding. Note also that we can write this as

$\frac{|w(M)|}{|M|} = \sum l(\sigma) \frac{M_\sigma}{|M|}$ ,

where $p_\sigma:=\frac{M_\sigma}{|M|}$ is exactly the probability of seeing $\sigma$ in a random place in $M$ which is independent of the coding that Alice chooses. In the example above we get

$l(a)=1, p_a = \frac{1}{2},$

$l(b)=2, p_b= \frac{1}{4}$ ,

$l(c)=l(d)=3, p_c=p_d=\frac{1}{8}$ .

This let us compute the bits per letter directly from the coding scheme, instead of having to code the entire message and then count the number of bits there.

The best prefix code:

Can we do better than that by choosing a different coding? In the case above, the answer is no, but how can such a claim be proven? To do that, we first need to ask ourselves what conditions do the lengths $l(\sigma)$ must satisfy for any given prefix code. Clearly, the lengths cannot be too small. For example, if $l(a)=l(b)=1$ , than the tree must look like

and in particular there is no room for other letters. Similarly, we cannot have $l(a)=1, l(b)=l(c)=l(d)=2$ . To get some intuition – if all the leaves had height $l$ , then we would had at most $2^l$ leaves. You should think of that as giving each leaf of height $l$ the weight $2^{-l}$ and asking that the total weight would be at most $1$ . The next claim shows that this is true even when the leaves have different heights.

Claim: Let $l_1, ..., l_n \geq 0$ be integers. Then there is a binary tree $T$ with leaves $\sigma_1,...,\sigma_n$ at heights $l(\sigma_i) = l_i$ if and only if $\sum 2^{-l_i}\leq 1$ .

Proof: Assume first that all the $l_i$ are equal to $l$ . In this case, the claim there is a binary tree with $n$ leaves at height $l$ if and only if $n\cdot 2^{-l}\leq 1$ . The claim is now proved by noting that the maximum number of leaves of height $l$ in a binary tree is $2^l$ .

For the more general claim, we note that given a binary tree, we can always change a leaf of height $l$ to an internal node and give it two children leaves of height $l+1$ .

Moreover, these two new leaves contribute the same weight as the original leaf, since

$2^{-l(\sigma_i)}=\frac{1}{2}2^{-l(\sigma_i)}+\frac{1}{2}2^{-l(\sigma_i)}=2^{-l(\sigma_{i,1})}+2^{-l(\sigma_{i,2})}$ .

If we are now given a binary tree with leaves at heights $l_1, ..., l_n$ , then we can always split a leaf at height $l_i$ into two leaves of height $l_i+1$ without changing the total weight. Doing this process several times until all the leaves are at the same height will show that their total weight must be smaller than 1, and therefore $\sum 2^{-l_i} \leq 1$ .

The other direction is proved similarly, and I will leave it as an exercise. $\square$

Returning back to Alice’s problem, we want to minimize the length of the coded message $\sum M_\sigma l(\sigma)$ given the condition $\sum 2^{-l(\sigma_i)} \leq 1$ where the $l(\sigma_i)$ are positive integers. A minimization problem with integer coefficients can be quite hard (though this one is not too difficult), so we relax the problem and ask only that $l_i >0$ . A simple Lagrange multipliers argument will show that the best that we can hope for is when $l_i = \log_2(\frac{1}{p_i})$ . Doing a little sanity check – note that if a letter appears many times in $M$ , then $p_i$ is going to be large and therefore $l_i$ is going to be small.

If the $p_i$ are all powers of $2$ , then $\log_2(p_i^{-1})$ are all integers, so we solved the minimization problem with integer lengths as well – which is the case with our example above. Otherwise we can try to round the logarithms up or down to get an integer solution. In general, there is a way to find the best solution for such coding (with integer lengths!) which is called Huffman’s coding where the ideas there are very similar to the claim that we proved above.

Considering the number of bits per letter leads us to the definition of entropy.

Definition: Let $\bar{p}=(p_1,...,p_n)$ be a probability vector. The entropy of $\bar{p}$ is then defined to be $H(\bar{p})= \sum p_i \log(p_i^{-1})$ , where we take $0 \log(0^{-1}) :=0$ .

Thus, with this notation, the best bits per letter that we can hope for is $H(\bar{p})$ . In our example above, we have $\bar{p} = (1/2, 1/4, 1/8, 1/8)$ , so that the best bits per letter we can hope for is

$\frac{1}{2} log(2) + \frac{1}{4} log(4) + \frac{1}{8} log(8) + \frac{1}{8} log(8) = \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{3}{8} = \frac{7}{4}$

which is what we got in Alice’s second coding.

A single letter is not enough:

After enough time Alice decides that spending $H(\bar{p})$ bits per letter is too much and she wonders if she can reduce it even further. The main idea in the previous coding is that if we have some information about the message (e.g. how many times each letter appears), then we can construct a coding that uses this information in order to get good bound on the bits per letter.

With this in mind, instead of looking on how many times each letter appears, we can consider instead how many times each pairs of letters appear in the message. For example, in the message $aabababb$ , the letters $a$ and $b$ appear each half of the time, so in a sense we don’t have any extra information. On the other hand, if we consider it as a word over $\{aa,ab,ba,bb\}$ , then the number of times that we see $aa,ab,bb$ and $ba$ is $1,0,1$ and $2$ . This is not uniform anymore, namely it gives us some information that we can work with.

We can now use the same idea from before, just for pairs of letters. We choose a prefix code for the new alphabet $\Sigma^2=\{\sigma \tau \mid \sigma,\tau \in \Sigma\}$ , i.e. lengths $l_{\sigma \tau}$ such that $\sum_{\sigma,\tau} 2^{-l(\sigma \tau)}\leq 1$ to get a coded message of length $\sum M_{\sigma \tau} l_{\sigma \tau}$ . Note that now $p_{\sigma \tau} = \frac{M_{\sigma \tau}}{|M|/2}$ , so that the number of bit per letter is

$\frac{1}{|M|} \sum M_{\sigma \tau} l_{\sigma \tau}=\frac{1}{2}\sum p_{\sigma \tau} l_{\sigma \tau}.$

While it is not exactly clear from the formulas, this can only improve our bits per letter, since any coding that relies on one letter information can be done using two letters information. In particular, in our original example, this will reduce the $\frac{7}{8}$ bits per letter to ${5}{4}$ .

If we can do two letters, then we can also do three letters, and $m$ letters, where the bits per letter is (up to integer rounding) at most

$\frac{1}{m} \sum_{w\in \Sigma^m} p_w l_w$ .

In a sense, we would like to take the limit of such codings, however we need to be careful. We ignored it up until now, but Alice also has to tell Bob how to encode her message, namely the map from words of length $m$ to their coding. If $m$ is relatively small compared to $n$ , then this extra information doesn’t add to much to the message that Alice sends. But if it is too large (e.g. $n=m$ ), then this extra information is too large and changes the whole computation.

Returning to our original question, the text of Shakespeare’s “A midsummer night’s dream” has 63 types of characters (so for the uniform coding you will need about $6\sim \log_2(63)$ bits per letter, but if you compute the 1-letter entropy, you will get that you only need around 4.687 bits per letter. In particular, a uniform typing monkey (now sold in stores near you) will produce a text with more “information”, though probably useless information. While it is a bit counter intuitive, you can think of this as follows: If you already read 1000 words from Shakespeare, there is a good chance that you can guess the 1001 word, so it carries less information, while every word the monkey writes is a new surprise.

To infinity and beyond:

I will finish with a more advanced section which shows how the entropy definition so far leads to entropy definition for more general systems.

Up until now we had a single message and we used a preprocess to determine the probabilities of letters, pair of letters etc, in order to construct a good coding. Usually we will have a lot of messages to code and we want to think of them as coming from some “process” (for example, Shakespeare’s mind) and construct the code according to this process.

In order to formalize this notion we consider the space

$\Sigma^\mathbb{N}:= \{ \sigma_0 \sigma_1 \sigma_2 ... \mid \;\sigma_i \in \Sigma\},$

namely all the infinite (one sided) words over the finite alphabet $\Sigma$ . The “process” is going to be a measure on this space, which basically tells us that we are more likely to see some (infinite) messages over others. For a finite word $w$ over $\Sigma$ and an integer $k$ we will denote by $C_{k,w}\subseteq \Sigma^\mathbb{N}$ all the infinite words which contain the word $w$ beginning from the index $k$ . For example, $C_{2,aa}$ contain words which look like $**aa****...$ where $*$ can be any letter from $\Sigma$ . We call such sets cylinders. The topology defined by these sets on $\Sigma^\mathbb{N}$ is exactly the product topology, though we will not use this fact here.

Let $\mu$ be a measure on $\Sigma^\mathbb{N}$ which assigns a measure to each of the cylinder sets. We think of this measure as the probability of seeing a certain finite subword in a message. For example, the probability that the message will start with $\sigma$ is $\mu(C_{0,\sigma})$ . The probability to start with some letter and then $\sigma \tau$ is $\mu(C_{1,\sigma\tau})$ , etc.

Finally, if we have a lot of messages coming from the same source, we expect that the probability of seeing $\sigma$ in the 100 place, is the same as seeing it in the 1000 place. There is no special reason why $\sigma$ should appear more in the $n$ -th location than in the $m$ -th location for two indices $m,n$ . The only caveat is maybe for the first few letters, but we will ignore this. To formalize this notion we define the shift left function $S:\Sigma^\mathbb{N} \to \Sigma^\mathbb{N}$ by $S(a_0 a_1 a_2 ....)=(a_1 a_2 a_3 ...)$ . The property above is then written as $\mu(C_{i,w})=\mu(C_{j,w})$ for any finite word $w$ and any two indices $i,j$ . More generally, it means that $\mu$ is shift invariant, namely $\mu(\Omega)=\mu(S^{-1}(\Omega))$ , where we note that $S^{-1}(C_{i,w})=C_{i+1,w}$ .

Because of this shift invariance , we should think of the probability of seeing some $\sigma$ in the first $N$ letters as the probability of seeing it in the first letter, namely $\mu(C_{0,\sigma})$ . To get some intuition regarding this statement, let $f:\Sigma^\mathbb{N} \to \mathbb{R}$ be the function defined by $f(x)=1$ if the first letter is $\sigma$ and zero otherwise, or in other words this is the characteristic function of $C_{0,\sigma}$ . Note that $x$ has $\sigma$ as its k-th letter exactly if $f(S^k(x))=1$ . Thus, the probability of seeing $\sigma$ in the first $N$ letters is $\frac{1}{N}\sum_{i=0}^{N-1} f(S^i(x))$ . If we are lucky, then this average is close to the integral of $f$ , namely $\int f(x) dx = \mu(C_{0,\sigma})$ . This is indeed the case when $\mu$ is simple enough (ergodic), so we can use the measures $\mu(C_{0,w})$ as our replacements for the $p_w$ from before.

Thus, the first step entropy is going to be

$-\sum_{\sigma \in \Sigma} \mu(C_{0,\sigma}) \ln (\mu(C_{0,\sigma})).$

Similarly, the $m$ step entropy is going to be

$-\frac{1}{m}\sum_{w \in \Sigma^m} \mu(C_{0,w}) \ln (\mu(C_{0,w})).$

As we want a general definition, we can no longer use the “decomposition” of elements in our space into letters. For example, maybe we want to compress the messages using the ideas in the message, or any other way which doesn’t directly rely on their spelling. The abstract formulation is instead of looking on $C_{0,\sigma_1},...,C_{0,\sigma_n}$ which is just a finite partition of the space (since every message begin with some letter from $\Sigma$ ), we can instead look on a general partition $\mathcal{P}:=\{P_1,...,P_n\}$ where $\Sigma^\mathbb{N} = \bigsqcup_1^n P_i$ . Then the first step entropy is defined by

$H(\mathcal{P}):=-\sum_1^n \mu(P_i) \ln(\mu(P_i)).$

How do we define the $m$ -step? Well, the message begins with $\sigma\tau$ if it begins with $\sigma$ and after one shift left the new message begins with $\tau$ . In other words $C_{0,\sigma \tau}= C_{0,\sigma}\cap S^{-1} C_{0,\tau}$ . Note that we only used our original partition and the shift map. In general, we define $\mathcal{P}^m$ to be the partition with atoms of the form $P_{i_0} \cap S^{-1} P_{i_1}\cap S^{-2} P_{i_2} \cdots \cap S^{-(m-1)} P_{i_{m-1}}$ , and the $m$ -step entropy is

$\frac{1}{m}H(\mathcal{P}^m):=-\frac{1}{m}\sum_{P\in \mathcal{P}^m} \mu(P)\ln(\mu(P)).$

Finally, one can show that there is a limit to this entropies as $m\to \infty$ , so the final definition of the entropy of $\mu$ with respect to the shift action $S$ and the partition $\mathcal{P}$ is

$h_\mu(S,\mathcal{P}) = \lim_{m\to \infty} \frac{1}{m}H(\mathcal{P}^m).$