Codes, trees and the prefix property
Issue 10NB: UNDER CONSTRUCTION.
Part 2: Information Theory and Entropy
Introduction
In part 1 of this three-part article, we considered what a code is, and why the prefix property of codes is important.
In this part, we'll be introducing the field of Information Theory and discovering the important property of entropy.
Information
Information Theory is a field of mathematics concerned with ...
The fundamental basis of information theory is, unsurprisingly, the concept of information. In everyday life, we use the word "information" in quite a loose way: "I'm just off to the library to look up some information about parrots". In the world of probability theory and statistics, however, information is a precisely-defined quantity.
Given some event
that occurs with probability
, the information
conveyed by some message
stating that
has occurred is defined as:
![]() |
This definition was first suggested by R.V.L. Hartley of Bell Laboratories in 1928[1] and extended by Claude Shannon, also of Bell, in 1948[2].
As the definition suggests, the actual unit of information is arbitrary: it varies with the logarithmic base used to calculate
. The most common bases are 2 (giving a unit of bits) and
(giving a unit of natural bits or nits).
An example
Consider rolling two ordinary dice and adding the two numbers rolled together to produce a sum
. Obviously, the sum
can have any value between 2 (1+1) and 12 (6+6).
Now, you send a message
to your friend saying which of its possible values
has taken. How much information
does this note contain? From the definition, it contains
bits.
Table 1 below shows the relative probability of obtaining each sum
on a given pair of rolls, and the amount of information in bits that the relevant message
contains. Graphs 1a and 1b plot the probability and information data respectively as histograms.
|
||||||||||||||||||||||||||||||||||||
Graph 1a - Probability distribution for S. |
||||||||||||||||||||||||||||||||||||
Graph 1b - Information in bits contained in a message conveying S. |
What does it mean?
Notice the interesting relationship between graphs 1a and 1b. Where the probability is high, the information is low, and where the probability is low, the information is high. This suggests that information is a measure of how surprising a particular event is.
If an event is very likely, discovering that it has occurred is not particularly surprising (for example, being told on a winter's day that it is cold outside). A message that tells you about this likely event contains, by definition, not very much information.
If an event is very unlikely, discovering that it has occurred is extremely surprising (for example, being told on a winter's day that it is so hot outside that the asphalt is melting). A message that tells you about this unlikely event contains, by definition, a lot of information.
Subjectivity
Interestingly, because information is measured using the probability assigned to particular events, it must be a subjective quantity.
The amount of information provided by a particular message depends on who the receiver of that message actually is. It depends on just what that receiver's prior beliefs are about the likelihood of the events or conditions the message describes: the probability
that the receiver thinks is correct for any
.
Similarly, the amount of information needed to convey a particular fact to a given receiver varies with that receiver's prior beliefs about the likelihood of that fact. In a way, this is quite intuitive: you'd expect it to take more information to convince you of something you think is completely absurd than it would to convince you of something you're already pretty sure is correct.
Entropy
Another important concept in information theory, introduced by Shannon, is that of expected information or entropy,
.
Given some discrete random variable
with a set of possible values (or distribution)
each with probability
, the entropy
of
is :
![]() |
In other words, if
is a message describing which particular value
is taken by
in a given trial, then
represents the average (or expected) amount of information contained in
. Entropy can also be said to represent the degree of uncertainty: the higher the entropy is, the less it is possible to predict what the value of the variable will be.
For example, imagine we have three dice. The first die is unbiased: there is an equal probability of 1/6 that any of the six numbers is rolled, giving a probability distribution as follows:
![]() |
|||
![]() |
The second die is slightly biased: even numbers are twice as likely to be rolled as odd numbers, giving a probability distribution as follows:
![]() |
|||
![]() |
The third die is more strongly biased, with 1 the most likely and 6 least likely and a probability distribution as follows:
![]() |
|||
![]() |
Tables 2a-2c below show what happens for each of the three dice when we let
be the outcome of a roll and calculate
and
for each possible value of
, then calculate the entropy by summing
over all values of
:
![]() |
![]() |
![]() |
![]() |
| 1 | 1/6 | 2.58 | 0.430 |
| 2 | 1/6 | 2.58 | 0.430 |
| 3 | 1/6 | 2.58 | 0.430 |
| 4 | 1/6 | 2.58 | 0.430 |
| 5 | 1/6 | 2.58 | 0.430 |
| 6 | 1/6 | 2.58 | 0.430 |
| Entropy | 2.58 | ||
| Die 2 - slightly biased | |||
![]() |
![]() |
![]() |
![]() |
| 1 | 1/9 | 3.17 | 0.352 |
| 2 | 2/9 | 2.17 | 0.482 |
| 3 | 1/9 | 3.17 | 0.352 |
| 4 | 2/9 | 2.17 | 0.482 |
| 5 | 1/9 | 3.17 | 0.352 |
| 6 | 2/9 | 2.17 | 0.482 |
| Entropy | 2.50 | ||
| Die 3 - more biased | |||
![]() |
![]() |
![]() |
![]() |
| 1 | 6/20 | 1.74 | 0.522 |
| 2 | 5/20 | 2.00 | 0.500 |
| 3 | 4/20 | 2.32 | 0.464 |
| 4 | 3/20 | 2.74 | 0.411 |
| 5 | 2/20 | 3.32 | 0.332 |
| 6 | 1/20 | 4.32 | 0.216 |
| Entropy | 2.45 | ||
As you can see, the unbiased die has the highest entropy, and the most biased die has the lowest entropy. This is what we expect: it is harder to predict the outcome of rolling an unbiased die than it is to predict the outcome of rolling a biased die. (In the extreme case, consider a die that is so biased it rolls a 1 in 999 cases out of 1000: we can almost always predict the outcome correctly just by guessing "1", and the entropy of this die will be very small.)
Entropy can be calculated for continuous as well as for discrete probability distributions.
[maths]For a continuous, real-valued random variable $z$ with a {\em probability density function} (or PDF) $f(z)$ such that \begin{displaymath} Pr(a
References
[1] R.V.L. Hartley: Transmission of Information.
Bell System Technical Journal v.7 no.3, July 1928, pp. 535-563.
[2] C.E. Shannon: A Mathematical Theory of Communication.
Bell System Technical Journal v.27 no.3/4, July/October 1928, pp. 379-423/623-656.

![\[ I(M) = - \log (Pr(x)) \]](/MI/5d83c7f0e49f91ffef2db7556e5a89d1/images/img-0005.png)



![\[ H(x) = - \sum _ i p_ i \log (p_ i) \]](/MI/54af74dd075f39154eee14ada6c78e01/images/img-0005.png)









