Which text gives us more information – the full body of work of Shakespeare, or 884,647 random words written by 1000 monkeys?
To answer this, we consider the following problem: Alice has a message over some finite alphabet and she wants to send it to Bob (for example, the text of “A midsummer night’s dream”). Since the time of Shakespeare we invented the internet which Alice uses, so she can use only 0 and 1, and she wants her 0\1 coded message to be as short as possible so it would reach Bob as fast as possible. If the original message contained a lot of redundancy, namely many parts of the text don’t contain “new” information, then we expect that Alice can shorten the message a lot. Conversely, the more information the message has, the harder it is to compress it.
The algorithm that Alice tries is choosing for each letter a word over and then the message is coded to be . What is the best that we can hope to gain this way?
For simplicity, lets assume that . The simplest coding is the mapping , so for example we have
Bob sees the word over (without the commas) and needs to decode it back to . He does it by reading each time two bits and translating them back to the corresponding letter in . Thus, for the original message of size we used bits, 2 bits per letter.
Prefix free codes:
At this point Alice notes that she uses a lot of ‘s and thinks that maybe she can use a shorter coding for , for example and keeps the rest of the letters with the same coding. The problem then is that when Bob sees he doesn’t know if it belong to an or it is the beginning of . In order to help Bob decode the message, Alice decides to use a prefix free code – no coding of one letter is a prefix of a coding of another letter. Thus, once Bob reads the coding of for some he know that this coding must belong to and not part of a coding of a different letter. The best way to visualize such a code is using a binary tree, for example:
The coding of a letter is the string on the path leading from the root of the tree to the leaf labeled with . So in the tree above we have that and . Now the coding will be
You should check that you can decode the coded message above back to the original message. Letting be the length of the coding of and be the number of occurrences of in the message, we get that the length of the coded message is , and the number of bits used per letter is . Note that is exactly the probability of seeing in a random place in . In the example above we get that , hence
The length of the coded message is which is less than the bits we needed in the first coding, and moreover the bits per letter is which is less than the bits per letter originally.
The best prefix code:
Can we do better than that by choosing a different coding? In the case above the answer is no. In order to show that, we first need to ask ourselves what conditions do the lengths must satisfy for any given prefix code. In other words, given these length, can we build a prefix code which has these lengths? Clearly, the lengths cannot be too small. For example, if , than the tree must look like
and in particular there is no room for other letters. Similarly, we cannot have . To get some intuition – if all the leaves had height , then we would had at most leaves. You should think of that as giving each leaf of height the weight and asking that the total weight would be at most . The next claim shows that this is true even when the leaves have different heights.
Claim: Let be a binary tree with leaves of heights . Then .
Proof: Let . The simplest case is when all the leaves have the same height . Since there are leaves in the full binary tree of height , we get that
If there is some leaf of length , we can make it into an internal node by adding two leaves to it:
These two new leaves contribute the same weight as the original leaf, since . Thus, we see that it is enough to prove the claim for the second tree. We can continue doing that process as long as there are leaves of height , and when they are all of height we can use the previous argument, thus completing the proof.
It is not hard to show that the converse is true as well – given positive integers which satisfy , we can find a corresponding binary tree with leaves at these heights (and thus constructing a prefix code). We are now left with the problem of minimizing the length of the coded message given the condition Of course we would also want that the would be positive integers. A minimization problem with integer coefficients can be quite hard (though this one is not too difficult), so we relax the problem and ask only that . A simple Lagrange multipliers argument will show that the best that we can hope for is when . In particular, if a letter appears many times in , then is going to be large and therefore is going to be small.
If the are all powers of , then are all integers, so we solved the minimization problem with integer lengths as well – which is the case with our example above. Otherwise we can try to round the logarithms up or down to get an integer solution.
Considering the number of bits per letter leads us to the definition of entropy.
Definition: Let be a probability vector. The entropy of is then defined to be , where we take .
Thus, with this notation, the best bits per letter that we can hope for is .
A single letter is not enough:
After enough time Alice decides that spending bits per letter is too much and she wonders if she can reduce it even further. The main idea in the previous coding is that if we have some information about the message (e.g. how many times each letter appears), then we can construct a coding that uses this information in order to get good bound on the bits per letter.
With this in mind, instead of looking on how many times each letter appears, we can consider instead how many times each pairs of letters appear in the message. For example, in the message , the letters and appear each half of the time, so in a sense we don’t have any extra information. On the other hand, if we consider it as a word over , then the number of times that we see and is and . This is not uniform anymore, namely it gives us some information.
We can now use the same idea from before, just for pairs of letter. We choose a prefix code for the new alphabet , i.e. lengths such that to get a coded message of length . Note that now so that the number of bit per letter is
While it is not exactly clear from the formulas, note that this can only improve our bits per letter, since any coding that relies on one letter information can be done using two letters information.
If we can do two letters, then we can also do three letters, and letters. Under the assumption that is very big with respect to , we see that we can get the bits per letter to be (up to integer rounding) at most
Returning to our original question, the text of Shakespeare’s “A midsummer night’s dream” has 63 types of characters (so for the uniform coding you will need about bits per letter, but if you compute the 1-letter entropy, you will get that you only need around 4.687 bits per letter. In particular, a uniform typing monkey (now sold in stores near you) will produce a text with more “information”, though probably useless information. While it is a bit counter intuitive, you can think of this as follows: If you already read 1000 words from Shakespeare, there is a good chance that you can guess the 1001 word, so it carries less information, while every word the monkey writes is a new surprise.
To infinity and beyond:
I will finish with a more advanced section which shows how the entropy definition so far leads to entropy definition for more general systems.
Up until now we had a single message and we used a preprocess to determine the probabilities of letters, pair of letters etc, in order to construct a good coding. Usually we will have a lot of messages to code and we want to think of them as coming from some “process” (for example, Shakespeare’s mind) and construct the code according to this process.
In order to formalize this notion we consider the space
namely all the infinite (one sided) words over the finite alphabet . The “process” is going to be a measure on this space, which basically tells us that we are more likely to see some (infinite) messages over others. For a finite word over and an integer we will denote by all the infinite words which contain the word beginning from the index . For example, contain words which look like where can be any letter from . We call such sets cylinders. The topology defined by these sets on is exactly the product topology, though we will not use this fact here.
Let be a measure on which assigns a measure to each of the cylinder sets. We think of this measure as the probability of seeing a certain finite subword in a message. For example, the probability that the message will start with is . The probability to start with some letter and then is , etc.
Finally, if we have a lot of messages coming from the same source, we expect that the probability of seeing in the 100 place, is the same as seeing it in the 1000 place. There is no special reason why should appear more in the -th location than in the -th location for two indices . The only caveat is maybe for the first few letters, but we will ignore this. To formalize this notion we define the shift left function by . The property above is then written as for any finite word and any two indices . More generally, it means that is shift invariant, namely , where we note that .
Because of this shift invariance , we should think of the probability of seeing some in the first letters as the probability of seeing it in the first letter, namely . To get some intuition regarding this statement, let be the function defined by if the first letter is and zero otherwise, or in other words this is the characteristic function of . Note that has as its k-th letter exactly if . Thus, the probability of seeing in the first letters is . If we are lucky, then this average is close to the integral of , namely . This is indeed the case when is simple enough (ergodic), so we can use the measures as our replacements for the from before.
Thus, the first step entropy is going to be
Similarly, the step entropy is going to be
As we want a general definition, we can no longer use the “decomposition” of elements in our space into letters. For example, maybe we want to compress the messages using the ideas in the message, or any other way which doesn’t directly rely on their spelling. The abstract formulation is instead of looking on which is just a finite partition of the space (since every message begin with some letter from ), we can instead look on a general partition where . Then the first step entropy is defined by
How do we define the -step? Well, the message begins with if it begins with and after one shift left the new message begins with . In other words . Note that we only used our original partition and the shift map. In general, we define to be the partition with atoms of the form , and the -step entropy is
Finally, one can show that there is a limit to this entropies as , so the final definition of the entropy of with respect to the shift action and the partition is