Shakespeare? He's in my DNA

If you think you have memory problems, spare a thought for archivists of the future. The volume of digital information in the world is currently estimated at around three zettabytes, or 3000 billion billion bytes. And the number is growing. All that information, or at least the parts of it that are worth keeping, needs to be stored somewhere. Hard discs are expensive and require power to run, while other storage materials, such as magnetic tape, don't last very long before they degrade. This is a particular problem in the life sciences, where genetic data has caused an information explosion.

Can we store information in DNA?

But help may be at hand from the very molecule that is partly to blame for the problem: DNA. For not only can scientists read DNA sequences from biological samples, they can also "write" them. In the lab they can produce strands of DNA corresponding to particular strings of nucleotides, denoted by the letters A, G, T and C. So if you encode your information in terms of these four letters, you could theoretically store it in DNA.

"We already know that DNA is a robust way to store information because we can extract it from bones of woolly mammoths, which date back tens of thousands of years, and make sense of it," explains Nick Goldman of the EMBL-European Bioinformatics Institute (EMBL-EBI). "It's also incredibly small, dense and does not need any power for storage, so shipping and keeping it is easy."

The problem with this idea is that scientists can currently only produce short strings of DNA and that it's easy to make errors when you read and write them, especially when there are repetitions of the same nucleotide letter. This throws up a mathematical challenge: how to turn digital files into bite-sized sequences of As, Gs, Ts, and Cs without repetitions, which can then be re-assembled and decoded to give back the original information.

Goldman and his colleague Ewan Birney have come up with a method that does just this. First they use a mathematical algorithm called a Huffman code to turn a digital file (which is usually represented as a sequence of 0s and 1s) into a new sequence of 0s, 1s and 2s. The algorithm is a bit tedious to write down, but the encoding is straightforward work for a computer.
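To get a feel for this step, here is a minimal sketch of a Huffman code over the three symbols 0, 1 and 2. It is not the researchers' actual code table, which the article doesn't give. It assumes the code is built from the byte frequencies of the file being encoded, and the function names `ternary_huffman`, `encode` and `decode` are made up for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def ternary_huffman(data: bytes) -> dict:
    """Build a prefix-free code mapping each byte value in `data` to a string of trits.

    Assumption: the code is derived from the input's own byte frequencies.
    """
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol would otherwise get an empty code word.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never has to compare dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freq.items()]
    # Merging three nodes at a time only ends in a single root if the
    # number of leaves is odd, so pad with a zero-weight dummy if needed.
    if len(heap) % 2 == 0:
        heap.append((0, next(tiebreak), {None: ""}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for trit in "012":  # the three lightest nodes become siblings
            weight, _, codes = heapq.heappop(heap)
            total += weight
            for sym, code in codes.items():
                merged[sym] = trit + code
        heapq.heappush(heap, (total, next(tiebreak), merged))
    codes = heap[0][2]
    codes.pop(None, None)  # discard the dummy leaf, if one was added
    return codes


def encode(data: bytes, codes: dict) -> str:
    """Replace every byte by its code word of 0s, 1s and 2s."""
    return "".join(codes[b] for b in data)


def decode(trits: str, codes: dict) -> bytes:
    """Read trits left to right, emitting a byte whenever a code word completes."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for t in trits:
        current += t
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

Because no code word is the start of another, a decoder can read the trits from left to right without needing separators between them.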

Previous letter written	Next character to encode
	0	1	2
A	C	G	T
C	G	T	A
G	T	A	C
T	A	C	G

The researchers then turn the 0, 1, 2 string into an A, G, T, C string using a simple but neat recipe. The first character of the 0, 1, 2 string is encoded using the first row of the table above. So a 0 turns into a C, a 1 into a G, and a 2 into a T. For the next character you choose the row corresponding to the letter you've just written down. So if you have just written down a C (meaning the first character in the original string was a 0), you choose the C row to translate the next character, and so on. Since the row corresponding to each letter doesn't contain the letter itself, you end up with a string containing no repeating letters.
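The recipe is short enough to try out in a few lines of Python. The sketch below uses the table exactly as given and starts from the A row, as described above. The function names `trits_to_dna` and `dna_to_trits` are invented here for illustration.

```python
# Row = previous letter written, column = next character (0, 1 or 2) to encode.
TABLE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits: str, start_row: str = "A") -> str:
    """Encode a string of 0s, 1s and 2s as DNA letters with no repeats."""
    prev = start_row  # the first character uses the first (A) row
    out = []
    for t in trits:
        prev = TABLE[prev][int(t)]  # look up in the row of the last letter
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna: str, start_row: str = "A") -> str:
    """Undo trits_to_dna by finding each letter's column in the previous letter's row."""
    prev = start_row
    out = []
    for letter in dna:
        out.append(str(TABLE[prev].index(letter)))
        prev = letter
    return "".join(out)


print(trits_to_dna("0120"))  # CTGT
```

Here the 0 becomes a C, the 1 is read off the C row to give a T, the 2 is read off the T row to give a G, and the final 0 is read off the G row to give a T. Decoding simply runs the same lookup in reverse.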

Goldman and Birney then break the resulting sequence into segments of length 100, with each segment overlapping the previous one by 75 letters. This means that each DNA letter is contained in four different segments. "That way, you would have to have the same error on four different fragments for [the method] to fail – and that would be very rare," says Birney. To each segment they append a few bits of extra information, also encoded in As, Gs, Ts and Cs, to tell a decoder how to put the pieces back together again. After a couple of other manipulations to protect against errors, the segments are ready to be turned into real DNA in the lab.
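The overlap arithmetic can be checked with a small sketch: segments of 100 letters that overlap by 75 start every 25 letters. The names `overlapping_segments` and `coverage` are hypothetical. The article doesn't say how the two ends of the sequence are handled, so this version simply stops at the last full-length segment, which leaves the letters near the ends covered fewer than four times.

```python
def overlapping_segments(seq: str, length: int = 100, step: int = 25):
    """Yield (start, fragment) for every full window of `length` letters, `step` apart."""
    for start in range(0, len(seq) - length + 1, step):
        yield start, seq[start:start + length]


def coverage(seq_len: int, length: int = 100, step: int = 25) -> list:
    """Count how many segments contain each position of the sequence."""
    counts = [0] * seq_len
    for start in range(0, seq_len - length + 1, step):
        for i in range(start, start + length):
            counts[i] += 1
    return counts


counts = coverage(200)
print(counts[100])  # 4: a letter in the middle sits in four segments
print(counts[0])    # 1: the very first letter sits in only one
```

A letter at position 100 of a 200-letter sequence falls inside the segments starting at 25, 50, 75 and 100, which is where the "four different segments" comes from.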

The method does seem to work. Goldman and Birney encoded digital files containing, among other things, all of Shakespeare's sonnets and an mp3 of Martin Luther King's "I have a dream" speech, and sent the code to a US company which produced the corresponding DNA strings. When Goldman and Birney received the DNA samples back by mail, they were able to decipher the original files with 100% accuracy.

"We've created a code that's error tolerant using a molecular form we know will last in the right conditions for 10,000 years, or possibly longer," says Goldman. "As long as someone knows what the code is, you will be able to read it back if you have a machine that can read DNA." The method still needs perfecting though, so don't buy your DNA reader just yet.