plus.maths.org

      Shakespeare? He's in my DNA

      31 January, 2013

      If you think you have memory problems, spare a thought for archivists of the future. The volume of digital information in the world is currently estimated at around three zettabytes, that is, 3000 billion billion bytes. And the number is growing. All that information, or at least the parts of it that are worth keeping, needs to be stored somewhere. Hard discs are expensive and require power to run, while other storage materials, such as magnetic tape, don't last very long before they degrade. This is a particular problem in the life sciences, where genetic data has caused an information explosion.

      Can we store information in DNA?

      But help may be at hand from the very molecule that is partly to blame for the problem: DNA. For not only can scientists read DNA sequences from biological samples, they can also "write" them. In the lab they can produce strands of DNA corresponding to particular strings of nucleotides, denoted by the letters A, G, T and C. So if you encode your information in terms of these four letters, you could theoretically store it in DNA.

      "We already know that DNA is a robust way to store information because we can extract it from bones of woolly mammoths, which date back tens of thousands of years, and make sense of it," explains Nick Goldman of the EMBL-European Bioinformatics Institute (EMBL-EBI). "It's also incredibly small, dense and does not need any power for storage, so shipping and keeping it is easy."

      The problem with this idea is that scientists can currently only produce short strings of DNA and that it's easy to make errors when you read and write them, especially when there are repetitions of the same nucleotide letter. This throws up a mathematical challenge: how to turn digital files into bite-sized sequences of As, Gs, Ts, and Cs without repetitions, which can then be re-assembled and decoded to give back the original information.

      Goldman and his colleague Ewan Birney have come up with a method that does just this. First they use a mathematical algorithm called a Huffman code to turn a digital file (which is usually represented as a sequence of 0s and 1s) into a new sequence of 0s, 1s and 2s. The coding algorithm is a bit tedious to write down, but the encoding is straightforward work for a computer.
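
The article does not spell out the Huffman code the researchers used, so the following is only a minimal sketch of the general idea: a textbook base-3 Huffman construction that gives frequent bytes short strings of 0s, 1s and 2s. The function names (`ternary_huffman_code`, `encode_bytes`, `decode_trits`) and the choice to build the code from the byte frequencies of the file itself are assumptions made for illustration, not details taken from Goldman and Birney's method.

```python
import heapq
from collections import Counter
from itertools import count


def ternary_huffman_code(data: bytes) -> dict[int, str]:
    """Build a base-3 Huffman code for the byte values occurring in `data`."""
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    # Heap entries are (weight, tie_breaker, node). A leaf node is a byte
    # value, a dummy leaf is None, an internal node is a tuple of 3 children.
    heap = [(w, next(tie), sym) for sym, w in Counter(data).items()]
    # A full ternary tree needs (number of leaves - 1) to be even, so pad
    # with zero-weight dummy leaves where necessary.
    while (len(heap) - 1) % 2 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    # Repeatedly merge the three lightest nodes into one internal node.
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(3)]
        weight = sum(c[0] for c in children)
        heapq.heappush(heap, (weight, next(tie), tuple(c[2] for c in children)))

    codes: dict[int, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):
            for digit, child in enumerate(node):
                walk(child, prefix + str(digit))
        elif node is not None:
            # A file with a single distinct byte still needs a non-empty code.
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


def encode_bytes(data: bytes, codes: dict[int, str]) -> str:
    """Translate each byte into its string of 0s, 1s and 2s."""
    return "".join(codes[b] for b in data)


def decode_trits(trits: str, codes: dict[int, str]) -> bytes:
    """Invert encode_bytes; this works because Huffman codes are prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = bytearray(), ""
    for t in trits:
        current += t
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

For example, `encode_bytes(data, ternary_huffman_code(data))` returns a string over the characters 0, 1 and 2, and `decode_trits` recovers the original bytes as long as the same code table is available to the decoder.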

      Previous letter written | Next character to encode
                              |   0     1     2
      ------------------------+--------------------
      A                       |   C     G     T
      C                       |   G     T     A
      G                       |   T     A     C
      T                       |   A     C     G

      The researchers then turn the 0, 1, 2 string into an A, G, T, C string using a simple but neat recipe. The first character of the 0, 1, 2 string is encoded using the first row (the A row) of the table above. So a 0 turns into a C, a 1 into a G, and a 2 into a T. For the next character you choose the row corresponding to the letter you've just written down. So if you have just written down a C (meaning the first character in the original string was a 0), you choose the C row to translate the next character, and so on. Since the row corresponding to each letter doesn't contain the letter itself, you end up with a string in which no letter appears twice in a row.
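
The recipe above can be sketched in a few lines of Python. The lookup table is copied from the article's table; the function names `trits_to_dna` and `dna_to_trits` and the `start_row` parameter are hypothetical names chosen for this illustration.

```python
# Row = previous letter written; position in the string = next character (0, 1 or 2).
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits: str, start_row: str = "A") -> str:
    """Encode a string of 0s, 1s and 2s as DNA letters with no immediate repeats."""
    previous = start_row  # the first character uses the first (A) row
    bases = []
    for t in trits:
        previous = NEXT_BASE[previous][int(t)]
        bases.append(previous)
    return "".join(bases)


def dna_to_trits(dna: str, start_row: str = "A") -> str:
    """Invert trits_to_dna by looking up each letter in the row of its predecessor."""
    previous = start_row
    trits = []
    for base in dna:
        trits.append(str(NEXT_BASE[previous].index(base)))
        previous = base
    return "".join(trits)


print(trits_to_dna("0012"))  # CGAT
print(dna_to_trits("CGAT"))  # 0012
```

Even a run of identical characters such as `"0000"` comes out as `"CGTA"`, because each row of the table omits its own letter.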

      Goldman and Birney then break the resulting sequence into segments of length 100, with each segment overlapping the previous one by 75 letters, so that each new segment starts 25 letters after the one before. This means that each DNA letter is contained in four different segments. "That way, you would have to have the same error on four different fragments for [the method] to fail, and that would be very rare," says Birney. To each segment they append a few bits of extra information, also encoded in As, Gs, Ts and Cs, to tell a decoder how to put the pieces back together again. After a couple of other manipulations to protect against errors, the segments are ready to be turned into real DNA in the lab.
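
A rough sketch of the segmentation step, assuming only what the article states: segments of 100 letters, each starting 25 letters after the previous one. The function names and the requirement that the sequence length fits the step exactly are simplifications introduced here; the extra indexing information and the other error-protection steps are not modelled.

```python
def overlapping_segments(sequence: str, length: int = 100, step: int = 25) -> list[str]:
    """Cut `sequence` into windows of `length` letters that start every `step` letters."""
    if len(sequence) < length or (len(sequence) - length) % step != 0:
        # Simplifying assumption for this sketch: no padding of awkward lengths.
        raise ValueError("sequence length must be length + a multiple of step")
    return [sequence[i:i + length] for i in range(0, len(sequence) - length + 1, step)]


def coverage(seq_len: int, length: int = 100, step: int = 25) -> list[int]:
    """Count, for every position, how many segments contain it."""
    counts = [0] * seq_len
    for start in range(0, seq_len - length + 1, step):
        for pos in range(start, start + length):
            counts[pos] += 1
    return counts


dna = "ACGT" * 75                      # a 300-letter stand-in sequence
segments = overlapping_segments(dna)
print(len(segments))                   # 9
print(segments[1][:75] == segments[0][25:])  # True: neighbours share 75 letters
print(coverage(len(dna))[150])         # 4
```

In this simplified version, letters within 75 positions of either end of the sequence fall into fewer than four segments; every other letter is covered exactly four times.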

      The method does seem to work. Goldman and Birney encoded digital files containing, among other things, all of Shakespeare's sonnets and an mp3 of Martin Luther King's "I have a dream" speech, and sent the code to a US company which produced the corresponding DNA strings. When Goldman and Birney received the DNA samples back by mail, they were able to decipher the original files with 100% accuracy.

      "We've created a code that's error tolerant using a molecular form we know will last in the right conditions for 10,000 years, or possibly longer," says Goldman. "As long as someone knows what the code is, you will be able to read it back if you have a machine that can read DNA." The method still needs perfecting though, so don't buy your DNA reader just yet.

      Read more about...
      encryption
      DNA
      medicine and health

      Plus is part of the family of activities in the Millennium Mathematics Project.
      Copyright © 1997 - 2025. University of Cambridge. All rights reserved.
