How to measure information using entropy

Which text gives us more information – the full body of work of Shakespeare, or 884,647 random words written by 1000 monkeys?
To answer this, we consider the following problem: Alice has a message over some finite alphabet $\Sigma$ and she wants to send it to Bob (for example, the text of “A midsummer night’s dream”). Since the time of Shakespeare we invented the internet which Alice uses, so she can use only 0 and 1, and she wants her 0\1 coded message to be as short as possible so it would reach Bob as fast as possible. If the original message contained a lot of redundancy, namely many parts of the text don’t contain “new” information, then we expect that Alice can shorten the message a lot. Conversely, the more information the message has, the harder it is to compress it.

The algorithm that Alice tries is choosing for each letter $\sigma \in \Sigma$ a word $w(\sigma)$ over $\{0,1\}$ and then the message $\sigma_1 \sigma_2 ... \sigma_n$ is coded to be $w(\sigma_1)...w(\sigma_n)$. What is the best that we can hope to gain this way?

For simplicity, lets assume that $\Sigma=\{a,b,c,d\}$. The simplest coding is the mapping $a\mapsto 00, b\mapsto 01, c\mapsto 10, d\mapsto 11$, so for example we have

$M=cabdbaabadacaaba \mapsto 10,00,01,11,01,00,00,01,00,11,00,10,00,00,01,00.$

Bob sees the word over $\{0,1\}$ (without the commas) and needs to decode it back to $\{a,b,c,d\}$. He does it by reading each time two bits and translating them back to the corresponding letter in $\Sigma$. Thus, for the original message of size $n$ we used $2n$ bits, 2 bits per letter.

Prefix free codes:

At this point Alice notes that she uses a lot of $a$‘s and thinks that maybe she can use a shorter coding for $a$, for example $w(a)=0$ and keeps the rest of the letters with the same coding. The problem then is that when Bob sees $0$ he doesn’t know if it belong to an $a$ or it is the beginning of $b$. In order to help Bob decode the message, Alice decides to use a prefix free code – no coding of one letter is a prefix of a coding of another letter. Thus, once Bob reads the coding of $w(\sigma)$ for some $\sigma \in \Sigma$ he know that this coding must belong to $\sigma$ and not part of a coding of a different letter. The best way to visualize such a code is using a binary tree, for example:

The coding of a letter $\sigma$ is the string on the path leading from the root of the tree to the leaf labeled with $\sigma$. So in the tree above we have that $a\mapsto 0, b\mapsto 10, c\mapsto 110$ and $d\mapsto 111$. Now the coding will be

$M=cabdbaabadacaaba \mapsto w(M)=1100101111000100111011000100.$

You should check that you can decode the coded message above back to the original message. Letting $l(\sigma)=|w(\sigma)|$ be the length of the coding of $\sigma$ and $M_\sigma$ be the number of occurrences of $\sigma$ in the message, we get that the length of the coded message is $|w(M)|=\sum l(\sigma) M_\sigma$, and the number of bits used per letter is $\frac{|w(M)|}{|M|} = \sum l(\sigma) \frac{M_\sigma}{|M|}$. Note that $p_\sigma:=\frac{M_\sigma}{|M|}$ is exactly the probability of seeing $\sigma$ in a random place in $M$. In the example above we get that $M_a = 8, M_b=4, M_c=M_d=2$ , hence

$l(a)=1, p_a = \frac{1}{2} ; l(b)=2, p_b= \frac{1}{4} ; l(c)=l(d)=3, p_c=p_d=\frac{1}{8}$.

The length of the coded message is $1\cdot 8 + 2\cdot 4 + 3\cdot 2 + 3\cdot 2 = 28$ which is less than the $32$ bits we needed in the first coding, and moreover the bits per letter is $\frac{28}{16}=\frac{7}{4}$ which is less than the $2$ bits per letter originally.

The best prefix code:

Can we do better than that by choosing a different coding? In the case above the answer is no. In order to show that, we first need to ask ourselves what conditions do the lengths $l(\sigma)$ must satisfy for any given prefix code. In other words, given these length, can we build a prefix code which has these lengths? Clearly, the lengths cannot be too small. For example, if $l(a)=l(b)=1$, than the tree must look like

and in particular there is no room for other letters. Similarly, we cannot have $l(a)=1, l(b)=l(c)=l(d)=2$. To get some intuition – if all the leaves had height $l$, then we would had at most $2^l$ leaves. You should think of that as giving each leaf of height $l$ the weight $2^{-l}$ and asking that the total weight would be at most $1$. The next claim shows that this is true even when the leaves have different heights.

Claim: Let $T$ be a binary tree with leaves $\sigma_1,...,\sigma_n$ of heights $l(\sigma_1),...,l(\sigma_n)$. Then $\sum 2^{-l(\sigma_i)}\leq 1$.

Proof: Let $\max{l(\sigma_i)}=l$. The simplest case is when all the leaves have the same height $l$. Since there are $2^l$ leaves in the full binary tree of height $l$, we get that $\sum 2^{-l(\sigma_i)} = n 2^{-l} \leq 2^l 2^{-l} = 1.$

If there is some leaf $\sigma_i$ of length $, we can make it into an internal node by adding two leaves $\sigma_{i,1}, \sigma_{i,2}$ to it:

These two new leaves contribute the same weight as the original leaf, since $2^{-l(\sigma_i)}=\frac{1}{2}2^{-l(\sigma_i)}+\frac{1}{2}2^{-l(\sigma_i)}=2^{-l(\sigma_{i,1})}+2^{-l(\sigma_{i,2})}$. Thus, we see that it is enough to prove the claim for the second tree. We can continue doing that process as long as there are leaves of height $, and when they are all of height $l$ we can use the previous argument, thus completing the proof. $\square$

It is not hard to show that the converse is true as well – given positive integers $l(\sigma_1),...,l(\sigma_n)$ which satisfy $\sum 2^{-l(\sigma_i)}\leq 1$, we can find a corresponding binary tree with leaves at these heights (and thus constructing a prefix code). We are now left with the problem of minimizing the length of the coded message $\sum M_\sigma l(\sigma)$ given the condition $\sum 2^{-l(\sigma_i)} \leq 1.$ Of course we would also want that the $l_i:=l(\sigma_i)$ would be positive integers. A minimization problem with integer coefficients can be quite hard (though this one is not too difficult), so we relax the problem and ask only that $l_i >0$. A simple Lagrange multipliers argument will show that the best that we can hope for is when $l_i = \log_2(\frac{1}{p_i})$. In particular, if a letter appears many times in $M$, then $p_i$ is going to be large and therefore $l_i$ is going to be small.

If the $p_i$ are all powers of $2$, then $\log_2(p_i^{-1})$ are all integers, so we solved the minimization problem with integer lengths as well – which is the case with our example above. Otherwise we can try to round the logarithms up or down to get an integer solution.

Considering the number of bits per letter leads us to the definition of entropy.

Definition: Let $\bar{p}=(p_1,...,p_n)$ be a probability vector. The entropy of $\bar{p}$ is then defined to be $H(\bar{p})= \sum p_i \log(p_i^{-1})$, where we take $0 \log(0^{-1}) :=0$.

Thus, with this notation, the best bits per letter that we can hope for is $H(\bar{p})$.

A single letter is not enough:

After enough time Alice decides that spending $H(\bar{p})$ bits per letter is too much and she wonders if she can reduce it even further. The main idea in the previous coding is that if we have some information about the message (e.g. how many times each letter appears), then we can construct a coding that uses this information in order to get good bound on the bits per letter.

With this in mind, instead of looking on how many times each letter appears, we can consider instead how many times each pairs of letters appear in the message. For example, in the message $aabababb$, the letters $a$ and $b$ appear each half of the time, so in a sense we don’t have any extra information. On the other hand, if we consider it as a word over $\{aa,ab,ba,bb\}$, then the number of times that we see $aa,ab,bb$ and $ba$ is $1,0,1$ and $2$. This is not uniform anymore, namely it gives us some information.

We can now use the same idea from before, just for pairs of letter. We choose a prefix code for the new alphabet $\Sigma^2=\{\sigma \tau \mid \sigma,\tau \in \Sigma\}$, i.e. lengths $l_{\sigma \tau}$ such that $\sum_{\sigma,\tau} 2^{-l(\sigma \tau)}\leq 1$ to get a coded message of length $\sum M_{\sigma \tau} l_{\sigma \tau}$.  Note that now $p_{\sigma \tau} = \frac{M_{\sigma \tau}}{|M|/2}$ so that the number of bit per letter is

$\frac{1}{|M|} \sum M_{\sigma \tau} l_{\sigma \tau}=\frac{1}{2}\sum p_{\sigma \tau} l_{\sigma \tau}.$

While it is not exactly clear from the formulas, note that this can only improve our bits per letter, since any coding that relies on one letter information can be done using two letters information.

If we can do two letters, then we can also do three letters, and $m$ letters. Under the assumption that $n$ is very big with respect to $m$, we see that we can get the bits per letter to be (up to integer rounding) at most

$\frac{1}{m} \sum_{w\in \Sigma^m} p_w l_w$.

Returning to our original question, the text of Shakespeare’s “A midsummer night’s dream” has 63 types of characters (so for the uniform coding you will need about $6\sim \log_2(63)$ bits per letter, but if you compute the 1-letter entropy, you will get that you only need around 4.687 bits per letter. In particular, a uniform typing monkey (now sold in stores near you) will produce a text with more “information”, though probably useless information. While it is a bit counter intuitive, you can think of this as follows: If you already read 1000 words from Shakespeare, there is a good chance that you can guess the 1001 word, so it carries less information, while every word the monkey writes is a new surprise.

To infinity and beyond:

I will finish with a more advanced section which shows how the entropy definition so far leads to entropy definition for more general systems.

Up until now we had a single message and we used a preprocess to determine the probabilities of letters, pair of letters etc, in order to construct a good coding. Usually we will have a lot of messages to code and we want to think of them as coming from some “process” (for example, Shakespeare’s mind) and construct the code according to this process.

In order to formalize this notion we consider the space

$\Sigma^\mathbb{N}:= \{ \sigma_0 \sigma_1 \sigma_2 ... \mid \;\sigma_i \in \Sigma\},$

namely all the infinite (one sided) words over the finite alphabet $\Sigma$. The “process” is going to be a measure on this space, which basically tells us that we are more likely to see some (infinite) messages over others. For a finite word $w$ over $\Sigma$ and an integer $k$ we will denote by $C_{k,w}\subseteq \Sigma^\mathbb{N}$ all the infinite words which contain the word $w$  beginning from the index $k$. For example, $C_{2,aa}$ contain words which look like $**aa****...$ where $*$ can be any letter from $\Sigma$. We call such sets cylinders. The topology defined by these sets on $\Sigma^\mathbb{N}$ is exactly the product topology, though we will not use this fact here.

Let $\mu$ be a measure on $\Sigma^\mathbb{N}$ which assigns a measure to each of the cylinder sets. We think of this measure as the probability of seeing a certain finite subword in a message. For example, the probability that the message will start with $\sigma$ is $\mu(C_{0,\sigma})$. The probability to start with some letter and then $\sigma \tau$ is $\mu(C_{1,\sigma\tau})$, etc.

Finally, if we have a lot of messages coming from the same source, we expect that the probability of seeing $\sigma$ in the 100 place, is the same as seeing it in the 1000 place. There is no special reason why $\sigma$ should appear more in the $n$-th location than in the $m$-th location for two indices $m,n$. The only caveat is maybe for the first few letters, but we will ignore this. To formalize this notion we define the shift left function $S:\Sigma^\mathbb{N} \to \Sigma^\mathbb{N}$ by $S(a_0 a_1 a_2 ....)=(a_1 a_2 a_3 ...)$. The property above is then written as  $\mu(C_{i,w})=\mu(C_{j,w})$ for any finite word $w$ and any two indices $i,j$. More generally, it means that $\mu$ is shift invariant, namely $\mu(\Omega)=\mu(S^{-1}(\Omega))$, where we note that $S^{-1}(C_{i,w})=C_{i+1,w}$.

Because of this shift invariance , we should think of the probability of seeing some $\sigma$ in the first $N$ letters as the probability of seeing it in the first letter, namely $\mu(C_{0,\sigma})$. To get some intuition regarding this statement, let $f:\Sigma^\mathbb{N} \to \mathbb{R}$ be the function defined by $f(x)=1$ if the first letter is $\sigma$ and zero otherwise, or in other words this is the characteristic function of $C_{0,\sigma}$. Note that $x$ has $\sigma$ as its k-th letter exactly if $f(S^k(x))=1$. Thus, the probability of seeing  $\sigma$ in the first $N$ letters is $\frac{1}{N}\sum_{i=0}^{N-1} f(S^i(x))$. If we are lucky, then this average is close to the integral of $f$, namely $\int f(x) dx = \mu(C_{0,\sigma})$. This is indeed the case when $\mu$ is simple enough (ergodic), so we can use the measures $\mu(C_{0,w})$ as our replacements for the $p_w$ from before.

Thus, the first step entropy is going to be

$-\sum_{\sigma \in \Sigma} \mu(C_{0,\sigma}) \ln (\mu(C_{0,\sigma})).$

Similarly, the $m$ step entropy is going to be

$-\frac{1}{m}\sum_{w \in \Sigma^m} \mu(C_{0,w}) \ln (\mu(C_{0,w})).$

As we want a general definition, we can no longer use the “decomposition” of elements in our space into letters. For example, maybe we want to compress the messages using the ideas in the message, or any other way which doesn’t directly rely on their spelling. The abstract formulation is instead of looking on $C_{0,\sigma_1},...,C_{0,\sigma_n}$ which is just a finite partition of the space (since every message begin with some letter from $\Sigma$), we can instead look on a general partition $\mathcal{P}:=\{P_1,...,P_n\}$ where $\Sigma^\mathbb{N} = \bigsqcup_1^n P_i$. Then the first step entropy is defined by

$H(\mathcal{P}):=-\sum_1^n \mu(P_i) \ln(\mu(P_i)).$

How do we define the $m$-step? Well, the message begins with $\sigma\tau$ if it begins with $\sigma$ and after one shift left the new message begins with $\tau$. In other words $C_{0,\sigma \tau}= C_{0,\sigma}\cap S^{-1} C_{0,\tau}$. Note that we only used our original partition and the shift map. In general, we define $\mathcal{P}^m$ to be the partition with atoms of the form $P_{i_0} \cap S^{-1} P_{i_1}\cap S^{-2} P_{i_2} \cdots \cap S^{-(m-1)} P_{i_{m-1}}$, and the $m$-step entropy is

$\frac{1}{m}H(\mathcal{P}^m):=-\frac{1}{m}\sum_{P\in \mathcal{P}^m} \mu(P)\ln(\mu(P)).$

Finally, one can show that there is a limit to this entropies as $m\to \infty$, so the final definition of the entropy of $\mu$ with respect to the shift action $S$ and the partition $\mathcal{P}$ is

$h_\mu(S,\mathcal{P}) = \lim_{m\to \infty} \frac{1}{m}H(\mathcal{P}^m).$

This entry was posted in Entropy and tagged , , . Bookmark the permalink.