It's almost miraculous to think that the most evocative song or exciting movie can be captured in a sober string of 0s and 1s, to be stored and processed by computers. Zeroes and ones are the backbone of information technology — but where within those strings does the information they represent really lie?
Where is the information in this?
It's a question people have been asking themselves for a surprisingly long time. Back in the 1940s the mathematician Claude Shannon realised that information is only informative when it's telling you something unexpected, and decided to measure the information in strings of symbols in terms of how predictable they are (see here to find out more). In the 1960s mathematicians drew on the fact that many words don't necessarily mean a lot of content, and decided to measure information in terms of how much you can compress it. The Kolmogorov complexity of a string of 0s and 1s is the length of the shortest computer program that can produce it.
Complexity might indeed be a good measure of information: the more complex something is, the more information it contains. But Kolmogorov complexity has a curious feature. Suppose your string consists of a thousand zeroes. A computer program that outputs that string only needs to say "print 0; 1000 times", so it's pretty short, leading to a low Kolmogorov complexity. This makes sense, as the string is very basic.
Now imagine a completely random string of a thousand zeroes of ones; one that doesn't exhibit any pattern at all. A program that outputs that string has no choice but to say "print:" and then quote the entire string. There is no pattern here to exploit to make the program any shorter. Long random strings have high (in fact maximal) Kolmogorov complexity. (Shannon's measure of information similarly assigns a maximal value to patternlessness.) But does this make sense? Patternless strings may be hard to describe, but don't we usually associate complexity with some sort of structure, albeit a complex one?
"Kolmogorov complexity is misnamed because really it just measures patternlessness," says Scott Aaronson, a computer scientist at the Massachusetts Institute of Technology. "If we are actually trying to measures complexity, we need something that is small for very simple strings and also small for completely patternless strings. Then there should be some third class of strings for which that measure is not small."
As an example, look at the string
Does it contain a pattern? After staring at it for a while, you'll probably spot that it's made out of double bits 00 and 11 - you never see a 0 or a 1 occur alone. The order in which the 00s and 11s occur, however, seems random. Perhaps a measure of complexity and/or information should evaluate the amount of structure within this string, the doubling of bits, ignoring the random part. That way, completely random strings would have a low complexity value, as they have no structure at all, and so would strings with a very simple structure. Strings that contain more complex patterns would have a higher value.
Any pattern in this?
This is exactly the idea behind a concept called sophistication, which emerged in the 1980s. The idea is to look at programs that can produce a given string, taking some other information as input. In the example above, the input could be the string 001110101, and the program could be "double each bit". It's a short program, reflecting the fact that the structure in this case (each bit appears double) is simple. A more complex structure (say a random string made up of the triples 010 and 101) would lead to a longer program.
Loosely speaking, the sophistication of a string s is the length of the shortest program that outputs s given some random string as input, and then halts no matter (no matter the value of the input). The input reflects the random, and in a sense uninteresting, part of the string, while the program reflects its structure. How do we decide whether the input string is "random"? We can use Kolmogorov complexity to do this: we require that the Kolmogorov complexity of the input sting be not much smaller than its length (in technical language, such a string is called algorithmically random).
Is anything sophisticated?
This seems neat, but it poses an awkward question. As we've seen, there are strings with low sophistication: those with simple structure and completely random ones. But are there any strings to which our new measure assigns a high value? Are there strings with a lot of sophistication? Strangely, mathematicians have only been able to find few such strings, and these are highly contrived, arising from esoteric logical arguments. They are not random, but because they involve intricate logical problems, their structure is incredibly hard to describe.
"This is unsatisfactory," says Aaronson. "If every string you ever come across in real life [as opposed to abstract maths] is unsophisticated, then maybe we're setting our standards too high. The trouble is, if a string is going to arise in real life, then you can ask 'how did it arise'? Presumably it arose by some combination of natural laws and the injection of some random component. If the law part is simple, and we like to think of the laws of physics as simple, and the random part is not being counted, then how are you ever going to get [high] sophistication?"
The equations governing planetary motion are quite simple. Artist's impression courtesy NASA.
As an example, imagine recording the time the Sun reaches a particular point in the sky each day. The resulting string of numbers may exhibit a little randomness due to small measuring errors, but it's going to have a lot of structure, given by the comparatively simple equations that govern planetary motion. The string (once it's encoded in binary) will have low sophistication because a program could compute it using those simple equations. "The very fact that a string was generated by a process we can understand, logically implies that the string is unsophisticated," says Aaronson.
This brings us right up to date with current research. Computer scientists are busy refining notions of complexity and sophistication to make them more amenable to real-life measurements. One approach is not to ask for the shortest computer program that outputs a string given some input, but for the shortest computer program that does so in some reasonable amount of time. "A huge number of things are abstractly computable, but it would take longer than the age of the Universe to compute them," explains Aaronson. Such monstrous programs are useless in practical terms, so we might want to exclude them when computing sophistication. Although they take ages to run, they might actually be very short in length, so excluding them could only ever increase the sophistication of a given string. This may create those elusive high-sophistication strings.
This idea has already been fairly well developed for Kolmogorov complexity. "I think the right answer to our question is to study sophistication [in a similar way]. Unfortunately that theory really has not been well developed yet," says Aaronson.
When engineers first started thinking about information in the 1920s, the idea of treating it using maths, rather than psychology, was completely new. Ralph Hartley's pioneering 1928 paper, Transmission of information has an entire section entitled Elimination of psychological factors. Today, we're much more comfortable with thinking of information in such sober terms. But with so much research still gong on, it seems that its real secret still hasn't been cracked.
About this article
Marianne Freiberger is co-editor of Plus. She would like to thank Scott Aaronson, a computer scientist at the Massachusetts Institute of Technology, for a very useful conversation about information. And she recommends the book Information Theory: A Tutorial Introduction by James V. Stone as a mathematical introduction to the subject.