Codes, trees and the prefix property

Issue 10
January 2000


NB: UNDER CONSTRUCTION.

Part 2: Information Theory and Entropy

Introduction

In part 1 of this three-part article, we considered what a code is, and why the prefix property of codes is important.

In this part, we'll be introducing the field of Information Theory and discovering the important property of entropy.

Information

Information Theory is a field of mathematics concerned with ...

The fundamental basis of information theory is, unsurprisingly, the concept of information. In everyday life, we use the word "information" in quite a loose way: "I'm just off to the library to look up some information about parrots". In the world of probability theory and statistics, however, information is a precisely-defined quantity.

Given some event $x$ that occurs with probability $Pr(x)$, the information $I$ conveyed by some message $M$ stating that $x$ has occurred is defined as:

  \[  I(M) = - \log (Pr(x))  \]    

This definition was first suggested by R.V.L. Hartley of Bell Laboratories in 1928[1] and extended by Claude Shannon, also of Bell, in 1948[2].

As the definition suggests, the actual unit of information is arbitrary: it varies with the logarithmic base used to calculate $log(p)$. The most common bases are 2 (giving a unit of bits) and $e$ (giving a unit of natural bits or nits).

An example

Consider rolling two ordinary dice and adding the two numbers rolled together to produce a sum $S$. Obviously, the sum $S$ can have any value between 2 (1+1) and 12 (6+6).

Now, you send a message $M_ S$ to your friend saying which of its possible values $S$ has taken. How much information $I(M_ S)$ does this note contain? From the definition, it contains $I(M_ S) = -\log _2(Pr(S))$ bits.

Table 1 below shows the relative probability of obtaining each sum $S$ on a given pair of rolls, and the amount of information in bits that the relevant message $M_ S$ contains. Graphs 1a and 1b plot the probability and information data respectively as histograms.

Event $S$ Probability $Pr(S)$ Information $I(M_ S)$
2 1/36 5.17
3 2/36 4.17
4 3/36 3.58
5 4/36 3.17
7 5/36 2.85
7 6/36 2.58
8 5/36 2.85
9 4/36 3.17
10 3/36 3.58
11 2/36 4.17
12 1/36 5.17
[an error occurred while processing this directive]
Graph 1a - Probability distribution for S.

Graph 1a - Probability distribution for S.

Graph 1b - Information in bits contained in a message conveying S.

Graph 1b - Information in bits contained in a message conveying S.

What does it mean?

Notice the interesting relationship between graphs 1a and 1b. Where the probability is high, the information is low, and where the probability is low, the information is high. This suggests that information is a measure of how surprising a particular event is.

If an event is very likely, discovering that it has occurred is not particularly surprising (for example, being told on a winter's day that it is cold outside). A message that tells you about this likely event contains, by definition, not very much information.

If an event is very unlikely, discovering that it has occurred is extremely surprising (for example, being told on a winter's day that it is so hot outside that the asphalt is melting). A message that tells you about this unlikely event contains, by definition, a lot of information.

Subjectivity

Interestingly, because information is measured using the probability assigned to particular events, it must be a subjective quantity.

The amount of information provided by a particular message depends on who the receiver of that message actually is. It depends on just what that receiver's prior beliefs are about the likelihood of the events or conditions the message describes: the probability $Pr(X)$ that the receiver thinks is correct for any $X$.

Similarly, the amount of information needed to convey a particular fact to a given receiver varies with that receiver's prior beliefs about the likelihood of that fact. In a way, this is quite intuitive: you'd expect it to take more information to convince you of something you think is completely absurd than it would to convince you of something you're already pretty sure is correct.

Entropy

Another important concept in information theory, introduced by Shannon, is that of expected information or entropy, $H$.

Given some discrete random variable $x$ with a set of possible values (or distribution) $\{ x_1 ... x_ n\} $ each with probability $p_ i = Pr(x_ i)$, the entropy $H$ of $x$ is :

  \[  H(x) = - \sum _ i p_ i \log (p_ i)  \]    

In other words, if $M$ is a message describing which particular value $x_ i$ is taken by $x$ in a given trial, then $H(x)$ represents the average (or expected) amount of information contained in $M$. Entropy can also be said to represent the degree of uncertainty: the higher the entropy is, the less it is possible to predict what the value of the variable will be.

For example, imagine we have three dice. The first die is unbiased: there is an equal probability of 1/6 that any of the six numbers is rolled, giving a probability distribution as follows:

  $\displaystyle  Pr(X=1) = 1/6; Pr(X=2) = 1/6; Pr(X=3) = 1/6;  $    
  $\displaystyle Pr(X=4) = 1/6; Pr(X=5) = 1/6; Pr(X=6) = 1/6;  $    

The second die is slightly biased: even numbers are twice as likely to be rolled as odd numbers, giving a probability distribution as follows:

  $\displaystyle  Pr(X=1) = 1/9; Pr(X=2) = 2/9; Pr(X=3) = 1/9;  $    
  $\displaystyle Pr(X=4) = 2/9; Pr(X=5) = 1/9; Pr(X=6) = 2/9;  $    

The third die is more strongly biased, with 1 the most likely and 6 least likely and a probability distribution as follows:

  $\displaystyle  Pr(X=1) = 6/20; Pr(X=2) = 5/20; Pr(X=3) = 4/20;  $    
  $\displaystyle Pr(X=4) = 3/20; Pr(X=5) = 2/20; Pr(X=6) = 1/20;  $    

Tables 2a-2c below show what happens for each of the three dice when we let $X$ be the outcome of a roll and calculate $-log_2(Pr(X))$ and $-Pr(X) log_2 (Pr(X)$ for each possible value of $X$, then calculate the entropy by summing $-Pr(X) log_2 (Pr(X))$ over all values of $X$:

$X$ $Pr(X)$ $-log_2(Pr(X))$ $-Pr(X) log_2(Pr(X))$
1 1/6 2.58 0.430
2 1/6 2.58 0.430
3 1/6 2.58 0.430
4 1/6 2.58 0.430
5 1/6 2.58 0.430
6 1/6 2.58 0.430
Entropy 2.58
[an error occurred while processing this directive]
Die 2 - slightly biased
$X$ $Pr(X)$ $-log_2(Pr(X))$ $-Pr(X) log_2(Pr(X))$
1 1/9 3.17 0.352
2 2/9 2.17 0.482
3 1/9 3.17 0.352
4 2/9 2.17 0.482
5 1/9 3.17 0.352
6 2/9 2.17 0.482
Entropy 2.50
[an error occurred while processing this directive]
Die 3 - more biased
$X$ $Pr(X)$ $-log_2(Pr(X))$ $-Pr(X) log_2(Pr(X))$
1 6/20 1.74 0.522
2 5/20 2.00 0.500
3 4/20 2.32 0.464
4 3/20 2.74 0.411
5 2/20 3.32 0.332
6 1/20 4.32 0.216
Entropy 2.45
[an error occurred while processing this directive]

As you can see, the unbiased die has the highest entropy, and the most biased die has the lowest entropy. This is what we expect: it is harder to predict the outcome of rolling an unbiased die than it is to predict the outcome of rolling a biased die. (In the extreme case, consider a die that is so biased it rolls a 1 in 999 cases out of 1000: we can almost always predict the outcome correctly just by guessing "1", and the entropy of this die will be very small.)

Entropy can be calculated for continuous as well as for discrete probability distributions.

[maths]For a continuous, real-valued random variable $z$ with a {\em probability density function} (or PDF) $f(z)$ such that \begin{displaymath} Pr(a

References

[1] R.V.L. Hartley: Transmission of Information.
Bell System Technical Journal v.7 no.3, July 1928, pp. 535-563.
[2] C.E. Shannon: A Mathematical Theory of Communication.
Bell System Technical Journal v.27 no.3/4, July/October 1928, pp. 379-423/623-656.

On to Part 3

In the next part of this article, we'll be exploring how the mathematical property of entropy can be used to help us develop the notion of an optimal code, the long-run best-performing code we can make for any particular situation.

About the author

Kona Macphee is a member of the Plus Magazine editorial team.