Codes, trees and the prefix property

January 2000


Part 3: Optimal codes


In part 1 of this three-part article, we considered what a code is, and why the prefix property of codes is important.

In part 2, we introduced the field of Information Theory and discovered the important property of entropy.

In this part, we'll be examining the notion of an optimal code, and showing how entropy can help us to specify and develop such a code.

Optimal Codes

The concept of entropy is closely related to the idea of an optimal code.

What is a code? To put it simply, a code is simply a regularised way of representing data. The information on a computer hard disk is stored in a code based on sequences of 1s and 0s, which are then decoded into program instructions, characters in text files, etc. Different kinds of codes suit different needs. Sometimes we want a very efficient code that stores data as concisely as possible (for example, in storing large image files). Sometimes, however, we want a less efficient code that has special properties allowing us to correct and detect errors (for example, when a distant space probe is sending back a faint and noisy signal. See Coding theory: the first 50 years in Issue 3 for more about error-correcting codes.)

To keep it simple, let's look at a binary code, or a code constructed of an alphabet consisting of only two symbols, 0 and 1, like that used in storing data on a disk. Initially, we'll consider an intuitive explanation of optimality.

A binary code is said to be optimal if the number of bits of information conveyed by any message in that code is equal to the length of the message in binary digits (which, confusingly, are often also described as bits).

For example, consider constructing a code that is a series of code words, or strings of binary digits, each of which is used to represent the occurrence of a particular event.

In an optimal code, more likely events will have shorter codes (e.g. 01), while less likely events will have proportionately longer codes (e.g. 1001010). The relative lengths of the code words will be in the same proportion as the relative probabilities of the events. Why is this? Because the number of binary digits and the number of bits of information contained in a particular code word are the same if the code is optimal, and messages describing likely events contain less information (and should therefore have shorter code words) than messages describing unlikely events.

Event $X_ i$ $Pr(X_ i)$ $-\log _2 (Pr(X_ i))$ Code word
$x_0$ 1/2 1 001
$x_1$ 1/4 2 010
$x_2$ 1/8 3 011
$x_3$ 1/16 4 100
$x_4$ 1/16 4 101
[an error occurred while processing this directive]
Event $X_ i$ $Pr(X_ i)$ $-\log _2 (Pr(X_ i))$ Code word
$x_0$ 1/2 1 1
$x_1$ 1/4 2 01
$x_2$ 1/8 3 001
$x_3$ 1/16 4 0000
$x_4$ 1/16 4 0001
[an error occurred while processing this directive]
(Note, in passing, that valid codes must have the prefix property, namely that no word in the code is a prefix of any other word in the code, i.e. that no longer code word begins with same sequence of digits that makes up any of the the shorter code words. See our further explanation of the prefix property for more details.)

Consider the two coding schemes in tables 2a and 2b above. In the first scheme, we have taken a "naive" approach: we've gone ahead and assigned sequential 3-digit code words to the five events. While these events have different probabilities, they are represented by code words that all have the same length: the code cannot therefore be optimal, since the number of bits of information conveyed by each code word varies (because of the different probabilities), and yet the number of binary digits in each code word is the same.

In the second scheme, we've assigned code words of different lengths to the different events. The relative lengths of the code words are in proportion with the information conveyed by each codeword: for example, event $x_0$ is eight times as likely as event $x_5$, its codeword therefore containing only one-quarter as much information as $x_5$'s codeword, and only having one-quarter as many binary digits (one, as opposed to four).

Noting the correspondence between the number of code word binary digits and the number of bits of information conveyed by the code word is an intuitive way of defining an optimal code. However, a more precise definition of an optimal code is in terms of the expected code length.

Consider a set of mutually exclusive events $X$ with each event $x_ i$ having probability $p_ i$, and some code $C$ which supplies a code word $c_ i$ for each event $x_ i$.

Each code word $c_ i$ has a length $l_ i$, and the expected or average length of a one-word message in code $C$ can be expressed as:

  \[  E(l) = \sum _{i} p_ i \,  l_ i  \]    

An optimal code is one which minimises $E(l)$. In the case of a binary code, it can be shown that $E(l)$ is minimised if $l_ i = -\log _2 (p_ i)$:

INSERT PROOF!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This confirms the previous assertion that optimal codes are codes for which the amount of information conveyed by a message in that code is equal to the length of the message.

Substituting the minimising expression $l_ i = -\log _2 p_ i$ into the expression for $E(l)$ gives:

  \[  E(l) = - \sum _{i} p_ i \log _2 p_ i  \]    

which corresponds to the entropy, $H(X)$.

What this implies is that the lower bound on expected message length corresponds with the expected information content of the message; messages cannot on average be any shorter than the amount of information they convey. In an optimal code, the average message length and the average information content per message will be the same; in a non-optimal code, the average message length will be greater.

Information and Messages

Let's now look at the message length that's required to describe a particular event in an optimal code.

If a receiver has a good theory or model that leads them to expect that event, then an optimally-coded message describing the event will be shorter (since it contains less information) than for a receiver who has a bad model that leads them not to expect that event.

For example, consider two receivers living in Melbourne, both of whom have the prior knowledge that it is currently summer. The first has experienced many Melbourne summers, and she has a good model (or set of prior expectations) about what kind of weather is likely in summer. The second has lived all her life in a windowless building airconditioned to a uniform 21 C and constant humidity (apart from occasional air-conditioning breakdowns), and has no clear idea of what "summer" implies.

We wish to convey to both receivers, neither of whom have been outside today or have any explicit knowledge of today's weather conditions, the following data:

  • (a) The temperature is higher than 21 C,
  • (b) it is sunny,
  • (c) the humidity is low, and
  • (d) the level of UV radiation is high.
all of which are reasonably typical of summer in Melbourne.

Now, let's say an optimally-encoded message describing these data for each receiver can be constructed. Note that the optimal code in this context is {\em subjective}: since the perceived likelihood of particular events or circumstances will vary with the receiver's model, therefore the optimal code for these events will also vary according to that model.

The message to the first receiver (who, with her good model of Melbourne summers, will think all of the above data items fairly likely) will be comparatively short. The message to the second receiver (who may well, from her previous experience of summer indoors, think all of the above fairly unlikely), will be longer. The first message contains less information (since the receiver already strongly suspected its contents to be true), and therefore, when encoded in an optimal code, produced a shorter message.

For a more detailed explanation, consider table {\ref{tbl:weather}}. This table gives the prior probability that each receiver assigns to each data item. For example, the first receiver believes that on average, 8 times out of 10 the humidity will be low, (and, of course, the converse outcome that 2 times out of 10 the humidity will {\em not} be low). As already mentioned, the first receiver's prior probabilities represent a more accurate set of beliefs about typical Melbourne summer conditions than do the second receiver's.

Now, as already discussed, the length of the code word required to transmit a given data item to a receiver, in an optimal binary code, is $-\log _2 Pr(data_ i)$, where $Pr(data_ i)$ is the prior probability assigned by the receiver to that data item.

Therefore, for each of the four data items (a)–(d), the length of the optimal binary code word for that event for each receiver can be calculated. The results are given in table ????. As expected, the code words are shorter for the first receiver, who has more realistic prior beliefs. The length of the total message, which conveys all four data items, is the sum of the code lengths for the individual items. For the first receiver, this length is 1.2 binary digits, and for the second receiver it is 7.7 binary digits. In other words, since the code is optimal, the first receiver gains 1.2 bits of information, and the second receiver gains 7.7 bits of information.


About the author

Kona Macphee is a member of the Plus Magazine editorial team.