## How to measure information using entropy

Which text gives us more information – the full body of work of Shakespeare, or 884,647 random words written by 1000 monkeys?
To answer this, we talk with our good friend Alice and ask her to send both of these texts to her close friend Bob. It seems reasonable that the more information "dense" a text is, the harder it is to send. Since the time of Shakespeare we have invented the internet, over which we can only send messages in the simple 0/1 alphabet, so Alice's job is to translate the text into this 0/1 alphabet. The fewer 0/1 bits her translation uses, the easier her job is. If the original text contains a lot of redundancy, namely many parts of it carry no "new" information, then we expect that Alice can shorten the message considerably. Conversely, the more information the message has, the harder it is to compress.

So now, Alice needs to choose for each English letter $\sigma$ a word $w(\sigma)$ over $\{0,1\}$, and then the message $\sigma_1 \sigma_2 \cdots \sigma_n$ is coded as $w(\sigma_1)\cdots w(\sigma_n)$. This leaves us with the question: what is the best we can hope to gain this way?

For simplicity, let's assume that the English alphabet only contains $\Sigma=\{a,b,c,d\}$. A simple coding is the mapping $a\mapsto 00, b\mapsto 01, c\mapsto 10, d\mapsto 11$, so for example we have

$M=cabdbaabadacaaba \mapsto 10,00,01,11,01,00,00,01,00,11,00,10,00,00,01,00.$

When Bob sees the word over $\{0,1\}$ (without the commas), he can decode it by dividing it into pairs of 0/1 bits and then decoding each pair. In this case we needed $2n$ bits for a length-$n$ message (in this example, a total of 32 bits). Can we do better than that?
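As a quick sanity check, this fixed-length coding is easy to sketch in Python (the names `encode` and `decode` are mine, not part of the text):

```python
# Alice's fixed-length coding over Sigma = {a, b, c, d}: two bits per letter.
CODE = {"a": "00", "b": "01", "c": "10", "d": "11"}
DECODE = {v: k for k, v in CODE.items()}

def encode(message):
    return "".join(CODE[ch] for ch in message)

def decode(bits):
    # Bob splits the bit string into pairs and decodes each pair.
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

M = "cabdbaabadacaaba"
bits = encode(M)
print(len(bits))  # → 32, i.e. 2n bits for the length-16 message
```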

### Prefix free codes:

At this point Alice notes that she uses a lot of $a$‘s and thinks that maybe she can use a shorter coding for $a$, for example $w(a)=0$, keeping the same coding for the rest of the letters. The problem then is that when Bob sees $0$ he doesn’t know whether it belongs to an $a$ or is the beginning of a $b$ (which is mapped to 01). In order to help Bob decode the message, Alice decides to use a prefix free code – no coding of one letter is a prefix of the coding of another letter. This way, once Bob reads the coding $w(\sigma)$ of some $\sigma \in \Sigma$, he knows that it must belong to $\sigma$ and cannot be part of the coding of a different letter. The best way to visualize such a code is using a binary tree, for example:

The coding of a letter $\sigma$ is the string on the path leading from the root of the tree to the leaf labeled with $\sigma$. So in the tree above we have that $a\mapsto 0, b\mapsto 10, c\mapsto 110$ and $d\mapsto 111$. Now the coding will be

$M=cabdbaabadacaaba \mapsto w(M)=1100101111000100111011000100.$

You should check that you can decode the coded message above back to the original message. Letting $l(\sigma)=|w(\sigma)|$ be the length of the coding of $\sigma$ and $M_\sigma$ be the number of occurrences of $\sigma$ in the message, we get that the length of the coded message is $|w(M)|=\sum l(\sigma) M_\sigma$. In this example we have

$M_a = 8, M_b=4, M_c=M_d=2$
$l(a) = 1, l(b)=2, l(c)=l(d)=3$,

and therefore the number of bits used is $1\cdot 8 + 2\cdot 4 + 3\cdot 2 + 3\cdot 2 = 28$, which is less than the 32 bits in Alice’s previous attempt.
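To see the prefix-free property in action, here is a Python sketch of this coding; thanks to prefix-freeness, Bob can decode greedily, reading bits until they match a codeword (the function names are mine):

```python
# Alice's prefix-free code, following the tree above.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(message):
    return "".join(CODE[ch] for ch in message)

def decode(bits):
    # Greedy decoding: accumulate bits until they match a codeword,
    # emit the letter, and start over. This works precisely because
    # no codeword is a prefix of another.
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

M = "cabdbaabadacaaba"
print(encode(M))       # → 1100101111000100111011000100
print(len(encode(M)))  # → 28 bits, down from 32
```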

Before, we used exactly two bits per letter. How many do we use now? The computation is simply $\frac{|w(M)|}{|M|}$, which in our example is $\frac{28}{16}=\frac{7}{4}$, less than the 2 bits per letter of the previous coding. Note also that we can write this as

$\frac{|w(M)|}{|M|} = \sum l(\sigma) \frac{M_\sigma}{|M|}$,

where $p_\sigma:=\frac{M_\sigma}{|M|}$ is exactly the probability of seeing $\sigma$ in a random place in $M$ which is independent of the coding that Alice chooses. In the example above we get

$l(a)=1, p_a = \frac{1}{2},$

$l(b)=2, p_b= \frac{1}{4}$,

$l(c)=l(d)=3, p_c=p_d=\frac{1}{8}$.

This lets us compute the bits per letter directly from the coding scheme, instead of having to code the entire message and then count the number of bits there.
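In code, this computation needs only the lengths and the probabilities, not the message itself (using exact fractions to avoid rounding):

```python
from fractions import Fraction

# Bits per letter computed directly from the scheme: sum of l(sigma) * p_sigma.
lengths = {"a": 1, "b": 2, "c": 3, "d": 3}
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}

bits_per_letter = sum(lengths[s] * probs[s] for s in lengths)
print(bits_per_letter)  # → 7/4
```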

### The best prefix code:

Can we do better than that by choosing a different coding? In the case above the answer is no, but how can such a claim be proven? To do that, we first need to ask ourselves what conditions the lengths $l(\sigma)$ must satisfy for any given prefix code. Clearly, the lengths cannot be too small. For example, if $l(a)=l(b)=1$, then the tree must look like

and in particular there is no room for other letters. Similarly, we cannot have $l(a)=1, l(b)=l(c)=l(d)=2$. To get some intuition – if all the leaves had height $l$, then we would have at most $2^l$ leaves. You should think of this as giving each leaf of height $l$ the weight $2^{-l}$ and asking that the total weight be at most $1$. The next claim shows that this is true even when the leaves have different heights.

Claim: Let $l_1, ..., l_n \geq 0$ be integers. Then there is a binary tree $T$ with leaves $\sigma_1,...,\sigma_n$ at heights $l(\sigma_i) = l_i$ if and only if $\sum 2^{-l_i}\leq 1$.

Proof: Assume first that all the $l_i$ are equal to $l$. In this case the claim states that there is a binary tree with $n$ leaves at height $l$ if and only if $n\cdot 2^{-l}\leq 1$, which follows by noting that the maximum number of leaves of height $l$ in a binary tree is $2^l$.

For the more general claim, we note that given a binary tree, we can always change a leaf of height $l$ to an internal node and give it two children leaves of height $l+1$.

Moreover, these two new leaves contribute the same weight as the original leaf, since

$2^{-l(\sigma_i)}=\frac{1}{2}2^{-l(\sigma_i)}+\frac{1}{2}2^{-l(\sigma_i)}=2^{-l(\sigma_{i,1})}+2^{-l(\sigma_{i,2})}$.

If we are now given a binary tree with leaves at heights $l_1, ..., l_n$, then we can always split a leaf at height $l_i$ into two leaves of height $l_i+1$ without changing the total weight. Doing this several times, until all the leaves are at the same height, reduces us to the first case, which shows that the total weight is at most 1, and therefore $\sum 2^{-l_i} \leq 1$.

The other direction is proved similarly, and I will leave it as an exercise. $\square$
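For the curious, the exercise direction has a concrete constructive answer: whenever $\sum 2^{-l_i}\leq 1$ (a condition known as Kraft's inequality), assigning codewords greedily to the lengths in increasing order produces a prefix-free code with exactly those lengths. A Python sketch of this standard construction (the function name is mine):

```python
from fractions import Fraction

def kraft_code(lengths):
    """Given lengths with sum 2^{-l_i} <= 1, return prefix-free codewords."""
    assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1
    codewords, code, prev = [], 0, 0
    for l in sorted(lengths):
        # Extend the current code value to length l, take it as the next
        # codeword, then advance; the Kraft condition guarantees we never
        # run out of codewords or create a prefix collision.
        code <<= (l - prev)
        codewords.append(format(code, f"0{l}b"))
        code += 1
        prev = l
    return codewords

print(kraft_code([1, 2, 3, 3]))  # → ['0', '10', '110', '111']
```

Note that on the lengths $1,2,3,3$ of our example this recovers exactly the tree code from before.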

Returning to Alice’s problem, we want to minimize the length of the coded message $\sum M_\sigma l(\sigma)$ given the condition $\sum 2^{-l(\sigma)} \leq 1$, where the lengths $l(\sigma)$ are positive integers. A minimization problem with integer variables can be quite hard (though this one is not too difficult), so we relax the problem and ask only that the $l_i$ be positive reals. A simple Lagrange multipliers argument shows that the best we can hope for is $l_i = \log_2(\frac{1}{p_i})$. As a little sanity check – note that if a letter appears many times in $M$, then $p_i$ is large and therefore $l_i$ is small.

If the $p_i$ are all powers of $2$, then the $\log_2(p_i^{-1})$ are all integers, so we have solved the minimization problem with integer lengths as well – which is the case in our example above. Otherwise we can try rounding the logarithms up or down to get an integer solution. In general, there is an algorithm called Huffman coding which finds the best such code (with integer lengths!), and its ideas are very similar to the claim that we proved above.
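Here is a minimal sketch of Huffman coding in Python, using a heap of (probability, partial code) entries; representing partial trees as dictionaries from letters to codewords is my choice of presentation, not the standard tree formulation:

```python
import heapq
from itertools import count

def huffman(probs):
    """Repeatedly merge the two least likely entries; prepending 0/1 to the
    two merged codes is exactly joining two subtrees under a new root."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

code = huffman({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
lengths = {s: len(w) for s, w in code.items()}
print(lengths)  # → {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

On our example, where the probabilities are powers of $2$, it recovers the optimal lengths $\log_2(p_i^{-1})$.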

Considering the number of bits per letter leads us to the definition of entropy.

Definition: Let $\bar{p}=(p_1,...,p_n)$ be a probability vector. The entropy of $\bar{p}$ is then defined to be $H(\bar{p})= \sum p_i \log(p_i^{-1})$, where we take $0 \log(0^{-1}) :=0$.

Thus, with this notation, the best bits per letter that we can hope for is $H(\bar{p})$. In our example above, we have $\bar{p} = (1/2, 1/4, 1/8, 1/8)$, so that the best bits per letter we can hope for is

$\frac{1}{2} \log_2(2) + \frac{1}{4} \log_2(4) + \frac{1}{8} \log_2(8) + \frac{1}{8} \log_2(8) = \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{3}{8} = \frac{7}{4}$

which is what we got in Alice’s second coding.
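The entropy itself is a one-liner; checking it on our example:

```python
from math import log2

def entropy(p):
    """H(p) = sum of p_i * log2(1/p_i), with the convention 0*log(1/0) := 0."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

print(entropy([1 / 2, 1 / 4, 1 / 8, 1 / 8]))  # → 1.75
```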

### A single letter is not enough:

After enough time Alice decides that spending $H(\bar{p})$ bits per letter is too much, and she wonders if she can reduce it even further. The main idea in the previous coding is that if we have some information about the message (e.g. how many times each letter appears), then we can construct a coding that uses this information in order to get a good bound on the bits per letter.

With this in mind, instead of looking at how many times each letter appears, we can consider how many times each pair of letters appears in the message. For example, in the message $aabababb$, the letters $a$ and $b$ each appear half of the time, so in a sense we don’t have any extra information. On the other hand, if we consider it as a word over $\{aa,ab,ba,bb\}$, then the number of times we see $aa, ab, bb$ and $ba$ is $1, 0, 1$ and $2$ respectively. This is no longer uniform, namely it gives us some information that we can work with.

We can now use the same idea as before, just for pairs of letters. We choose a prefix code for the new alphabet $\Sigma^2=\{\sigma \tau \mid \sigma,\tau \in \Sigma\}$, i.e. lengths $l_{\sigma \tau}$ such that $\sum_{\sigma,\tau} 2^{-l_{\sigma \tau}}\leq 1$, to get a coded message of length $\sum M_{\sigma \tau} l_{\sigma \tau}$, where $M_{\sigma\tau}$ counts the occurrences of the pair $\sigma\tau$. Note that now $p_{\sigma \tau} = \frac{M_{\sigma \tau}}{|M|/2}$, so the number of bits per letter is

$\frac{1}{|M|} \sum M_{\sigma \tau} l_{\sigma \tau}=\frac{1}{2}\sum p_{\sigma \tau} l_{\sigma \tau}.$

While it is not exactly clear from the formulas, this can only improve our bits per letter, since any coding that relies on one-letter information can also be realized using two-letter information. In our original example the pairs of $M$ are $ca, bd, ba, ab, ad, ac, aa, ba$, so $p_{ba}=\frac{1}{4}$ and each of the other six pairs has probability $\frac{1}{8}$; coding $ba$ with 2 bits and the rest with 3 bits reduces the $\frac{7}{4}$ bits per letter to $\frac{1}{2}\left(\frac{1}{4}\cdot 2+6\cdot\frac{1}{8}\cdot 3\right)=\frac{11}{8}$.

If we can do two letters, then we can also do three letters, and $m$ letters, where the bits per letter is (up to integer rounding) at most

$\frac{1}{m} \sum_{w\in \Sigma^m} p_w l_w$.
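We can check these numbers empirically. The sketch below (names are mine) splits a message into blocks of length $m$, estimates the block probabilities, and computes the resulting bits per letter:

```python
from collections import Counter
from math import log2

def block_entropy_rate(M, m):
    """Split M into length-m blocks and return (1/m) * H of the block
    distribution, i.e. the entropy bound on bits per letter."""
    blocks = [M[i:i + m] for i in range(0, len(M), m)]
    counts = Counter(blocks)
    n = len(blocks)
    return sum(c / n * log2(n / c) for c in counts.values()) / m

M = "cabdbaabadacaaba"
print(block_entropy_rate(M, 1))  # → 1.75
print(block_entropy_rate(M, 2))  # → 1.375
```

On the running example, moving from single letters to pairs drops the bound from $7/4$ to $11/8$ bits per letter.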

In a sense, we would like to take the limit of such codings; however, we need to be careful. We ignored this until now, but Alice also has to tell Bob how to decode her message, namely the map from words of length $m$ to their codings. If $m$ is relatively small compared to $n$, then this extra information doesn’t add too much to the message that Alice sends. But if it is too large (e.g. $m=n$), then this extra information dominates and changes the whole computation.

Returning to our original question, the text of Shakespeare’s “A midsummer night’s dream” has 63 types of characters (so the uniform coding needs about $6 \approx \log_2(63)$ bits per letter), but if you compute the 1-letter entropy, you will find that you only need around 4.687 bits per letter. In particular, a uniform typing monkey (now sold in stores near you) will produce a text with more “information”, though probably useless information. While it is a bit counterintuitive, you can think of this as follows: if you have already read 1000 words of Shakespeare, there is a good chance that you can guess the 1001st word, so it carries less information, while every word the monkey writes is a new surprise.
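You can play with this yourself: the sketch below computes the 1-letter entropy of any string (the sample line is from the play; reproducing the 4.687 figure quoted above would require running it on the full text):

```python
from collections import Counter
from math import log2

def char_entropy(text):
    """One-letter entropy of a text, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return sum(c / n * log2(n / c) for c in counts.values())

text = "the course of true love never did run smooth"
# Entropy never exceeds log2 of the number of distinct characters,
# which is what the uniform coding would pay.
print(len(set(text)), "characters,",
      round(char_entropy(text), 3), "bits per letter")
```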

### To infinity and beyond:

I will finish with a more advanced section which shows how the definition of entropy so far leads to a definition of entropy for more general systems.

Up until now we had a single message, and we preprocessed it to determine the probabilities of letters, pairs of letters, etc., in order to construct a good coding. Usually we will have many messages to code, and we want to think of them as coming from some “process” (for example, Shakespeare’s mind) and construct the code according to this process.

In order to formalize this notion we consider the space

$\Sigma^\mathbb{N}:= \{ \sigma_0 \sigma_1 \sigma_2 ... \mid \;\sigma_i \in \Sigma\},$

namely all the infinite (one sided) words over the finite alphabet $\Sigma$. The “process” is going to be a measure on this space, which basically tells us that we are more likely to see some (infinite) messages than others. For a finite word $w$ over $\Sigma$ and an integer $k$, we denote by $C_{k,w}\subseteq \Sigma^\mathbb{N}$ the set of all infinite words which contain the word $w$ beginning at index $k$. For example, $C_{2,aa}$ contains the words which look like $**aa****...$, where each $*$ can be any letter from $\Sigma$. We call such sets cylinders. The topology defined by these sets on $\Sigma^\mathbb{N}$ is exactly the product topology, though we will not use this fact here.

Let $\mu$ be a measure on $\Sigma^\mathbb{N}$ which assigns a measure to each of the cylinder sets. We think of this measure as the probability of seeing a certain finite subword in a message. For example, the probability that the message starts with $\sigma$ is $\mu(C_{0,\sigma})$. The probability that it starts with some letter followed by $\sigma \tau$ is $\mu(C_{1,\sigma\tau})$, etc.

Finally, if we have many messages coming from the same source, we expect that the probability of seeing $\sigma$ in the 100th place is the same as seeing it in the 1000th place – there is no special reason why $\sigma$ should appear more in the $n$-th location than in the $m$-th location for two indices $m,n$. (The only caveat is perhaps the first few letters, but we will ignore this.) To formalize this notion we define the left shift $S:\Sigma^\mathbb{N} \to \Sigma^\mathbb{N}$ by $S(a_0 a_1 a_2 ...)=(a_1 a_2 a_3 ...)$. The property above is then written as $\mu(C_{i,w})=\mu(C_{j,w})$ for every finite word $w$ and any two indices $i,j$. More generally, it means that $\mu$ is shift invariant, namely $\mu(\Omega)=\mu(S^{-1}(\Omega))$, where we note that $S^{-1}(C_{i,w})=C_{i+1,w}$.

Because of this shift invariance, we should think of the frequency of a letter $\sigma$ among the first $N$ letters as the probability of seeing it in the first letter, namely $\mu(C_{0,\sigma})$. To get some intuition for this statement, let $f:\Sigma^\mathbb{N} \to \mathbb{R}$ be the characteristic function of $C_{0,\sigma}$, i.e. $f(x)=1$ if the first letter of $x$ is $\sigma$ and zero otherwise. Note that $x$ has $\sigma$ as its $k$-th letter exactly when $f(S^k(x))=1$. Thus, the frequency of $\sigma$ among the first $N$ letters of $x$ is $\frac{1}{N}\sum_{k=0}^{N-1} f(S^k(x))$. If we are lucky, this average is close to the integral of $f$, namely $\int f \, d\mu = \mu(C_{0,\sigma})$. This is indeed the case when $\mu$ is simple enough (ergodic), so we can use the measures $\mu(C_{0,w})$ as our replacements for the $p_w$ from before.
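To see this averaging in action, here is a small simulation (entirely my construction, not from the text) with the simplest shift-invariant measure: an i.i.d. (Bernoulli) measure on $\{a,b,c,d\}^\mathbb{N}$ with the probabilities of the running example. Such a measure is ergodic, so the frequency of $\sigma$ in a long prefix should be close to $\mu(C_{0,\sigma})$:

```python
import random

random.seed(0)
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
N = 100_000

# Sample the first N letters of a mu-random infinite word, mu i.i.d.
x = random.choices(list(probs), weights=list(probs.values()), k=N)

# (1/N) * sum_{k<N} f(S^k(x)) for f the characteristic function of C_{0,a}.
freq_a = x.count("a") / N
print(round(freq_a, 2))  # close to mu(C_{0,a}) = 1/2
```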

Thus, the first step entropy is going to be

$-\sum_{\sigma \in \Sigma} \mu(C_{0,\sigma}) \ln (\mu(C_{0,\sigma})).$

Similarly, the $m$-step entropy is going to be

$-\frac{1}{m}\sum_{w \in \Sigma^m} \mu(C_{0,w}) \ln (\mu(C_{0,w})).$

As we want a general definition, we can no longer use the “decomposition” of elements of our space into letters. For example, maybe we want to compress the messages using the ideas in them, or in any other way which doesn’t directly rely on their spelling. The abstract formulation: instead of looking at $C_{0,\sigma_1},...,C_{0,\sigma_n}$, which is just a finite partition of the space (since every message begins with some letter from $\Sigma$), we can look at a general partition $\mathcal{P}:=\{P_1,...,P_n\}$ where $\Sigma^\mathbb{N} = \bigsqcup_1^n P_i$. The first step entropy is then defined by

$H(\mathcal{P}):=-\sum_1^n \mu(P_i) \ln(\mu(P_i)).$

How do we define the $m$-step entropy? Well, a message begins with $\sigma\tau$ if it begins with $\sigma$ and, after one left shift, the new message begins with $\tau$. In other words, $C_{0,\sigma \tau}= C_{0,\sigma}\cap S^{-1} C_{0,\tau}$. Note that we only used our original partition and the shift map. In general, we define $\mathcal{P}^m$ to be the partition whose atoms have the form $P_{i_0} \cap S^{-1} P_{i_1}\cap S^{-2} P_{i_2} \cdots \cap S^{-(m-1)} P_{i_{m-1}}$, and the $m$-step entropy is

$\frac{1}{m}H(\mathcal{P}^m):=-\frac{1}{m}\sum_{P\in \mathcal{P}^m} \mu(P)\ln(\mu(P)).$

Finally, one can show that these entropies have a limit as $m\to \infty$, so the final definition of the entropy of $\mu$ with respect to the shift action $S$ and the partition $\mathcal{P}$ is

$h_\mu(S,\mathcal{P}) = \lim_{m\to \infty} \frac{1}{m}H(\mathcal{P}^m).$
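As a concrete check of this limit, the sketch below (my construction, not from the text) uses a shift-invariant Markov measure on $\{a,b\}^\mathbb{N}$: a stationary two-state Markov chain, for which $\mu(C_{0,w})=\pi_{w_0}\prod_i P_{w_i w_{i+1}}$, and for which $\frac{1}{m}H(\mathcal{P}^m)$ is known to decrease to the chain's entropy rate $-\sum_{i,j}\pi_i P_{ij}\ln P_{ij}$:

```python
from itertools import product
from math import log

# A stationary two-state Markov chain: pi is the stationary distribution
# of the transition matrix P (check: pi @ P == pi).
pi = {"a": 2 / 3, "b": 1 / 3}
P = {"a": {"a": 0.75, "b": 0.25}, "b": {"a": 0.5, "b": 0.5}}

def mu(w):
    """Measure of the cylinder C_{0,w} under the Markov measure."""
    p = pi[w[0]]
    for s, t in zip(w, w[1:]):
        p *= P[s][t]
    return p

def step_entropy(m):
    """(1/m) * H(P^m): the atoms of P^m are the cylinders C_{0,w}, |w|=m."""
    return -sum(mu(w) * log(mu(w)) for w in product("ab", repeat=m)) / m

rate = -sum(pi[i] * P[i][j] * log(P[i][j]) for i in "ab" for j in "ab")
vals = [step_entropy(m) for m in (1, 2, 4, 8)]
print([round(v, 4) for v in vals], "→", round(rate, 4))
```

The printed sequence is decreasing and approaches the entropy rate, illustrating the limit $h_\mu(S,\mathcal{P})$.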
