Good-looking gibberish

      9 February, 2015

      How would you approximate the English language? In the 1940s the mathematician Claude Shannon asked himself just this question. Given a machine that can produce strings of letters, how would you set it up so that the strings it produces resemble a real English sentence as closely as possible?

Claude Shannon.

      A naive approach is to imagine a keyboard that contains all possible letters and punctuation marks, as well as a space bar, and get the machine to stab at the keyboard at random, with each key having the same probability of being hit. Shannon produced a message using essentially this method and it reads as follows (it appears Shannon only allowed capitals):

      XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.

      That's not a very good attempt at a meaningful sentence.
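
To make this concrete, here is a minimal sketch in Python of that random-typing setup. The alphabet (26 capital letters plus a space) and the message length are illustrative choices, not Shannon's exact machine.

```python
import random
import string

# Every symbol is equally likely, like hitting keys completely at random.
# The alphabet (capitals plus a space) and length are illustrative choices.
ALPHABET = string.ascii_uppercase + " "

def random_typing(length=80):
    return "".join(random.choice(ALPHABET) for _ in range(length))

print(random_typing())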

The next step up is to choose the letters, and other symbols, with the frequencies they actually have in the English language. For example, according to Shannon, the letter E appears with frequency 0.12 (12% of the letters in a text will be an E) and the letter W with frequency 0.02. So you could set your machine up to hit the E key with probability 0.12, the W key with probability 0.02, and so on. A message generated in this way already looks a tiny bit like a real language, though not exactly English:

      OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
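
A sketch of this frequency-weighted version, with one assumption spelled out: rather than using a published frequency table, the probabilities here are estimated by counting symbols in whatever sample text you supply.

```python
import random
from collections import Counter

# Symbols are drawn independently, with probabilities equal to their
# observed frequencies in a sample text.
def frequency_weighted(sample_text, length=80):
    text = "".join(c for c in sample_text.upper() if c.isalpha() or c == " ")
    counts = Counter(text)
    symbols, weights = zip(*counts.items())
    return "".join(random.choices(symbols, weights=weights, k=length))

# Stand-in sample; a longer English text gives more realistic frequencies.
print(frequency_weighted("THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"))
```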

      To do even better, you could make the probability of a symbol appearing dependent on the previous letter. This reflects the fact that certain two-letter combinations — such as TH — are much more common than others — such as BD. Using this approach Shannon generated the sequence

      ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
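
A sketch of how such a machine can be set up in practice: the conditional probabilities are estimated by recording which symbol follows each context in a sample text, and the same table-based idea extends to longer contexts (set order=2 for the two-previous-letters version discussed next). The file name is a placeholder for any long plain-text English source.

```python
import random
from collections import defaultdict

# Markov approximation for letters: the next symbol is drawn according to how
# often it follows the previous `order` symbols in a sample text.
def build_table(sample_text, order=1):
    text = " ".join(sample_text.upper().split())
    table = defaultdict(list)
    for i in range(len(text) - order):
        table[text[i:i + order]].append(text[i + order])
    return table

def generate(table, order=1, length=100):
    out = list(random.choice(list(table)))        # start from a random context
    for _ in range(length):
        followers = table.get("".join(out[-order:]))
        if not followers:                         # dead end: pick a fresh context
            out.extend(random.choice(list(table)))
            continue
        out.append(random.choice(followers))
    return "".join(out)

sample = open("sample.txt").read()  # placeholder path: any long English text
print(generate(build_table(sample, order=1), order=1))  # previous letter only
print(generate(build_table(sample, order=2), order=2))  # two previous letters
```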

      Similarly, you could make the probabilities of hitting the various symbols depend, not just on the previous letter, but on the two previous letters:

      IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.

At first sight this does look like real English, and there are even real words in there, but it still makes no sense whatsoever. At this point you might decide to get your machine to pick whole words from a dictionary, rather than individual letters from a keyboard. As before, you could get it to pick words independently of each other, but with probabilities that reflect their frequencies in the English language:

      REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TOOF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

      Or you could make the probability of picking a word dependent on the previous word in a way that reflects real English:

      THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

That's not bad! Shannon seems to have stopped there, but imagine continuing in this vein, making a word's probability dependent on the two, three, four or more previous words. At some point you'd probably produce something that reads quite well.
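
The word-level versions can be sketched in the same spirit; again the word frequencies and word-to-word transitions are estimated from a sample text, and the file name is just a placeholder.

```python
import random
from collections import defaultdict

# Word-level approximations: pick whole words instead of letters.
def independent_words(words, length=30):
    # Uniformly picking an entry of the word list is equivalent to sampling
    # each word with probability proportional to its frequency in the text.
    return " ".join(random.choice(words) for _ in range(length))

def bigram_words(words, length=30):
    # Each word is drawn according to how often it follows the previous word.
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    word = random.choice(words[:-1])
    out = [word]
    for _ in range(length):
        followers = table[word]
        word = random.choice(followers) if followers else random.choice(words[:-1])
        out.append(word)
    return " ".join(out)

words = open("sample.txt").read().upper().split()  # placeholder path
print(independent_words(words))
print(bigram_words(words))
```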

      Shannon published these examples in his 1948 landmark paper A mathematical theory of communication. His aim was to make communication via media such as telegraphy, TV and radio more efficient. The examples here might seem a bit silly, but the overall theory he developed became so influential that Shannon has become known as the father of information theory.

Another concept, which is related but different, involves a monkey randomly typing on a typewriter with each key having the same probability of being hit — that's the situation we had in the first example above. However, rather than producing one short string, you give the monkey an infinite amount of time to type away. The idea is that eventually the monkey will produce something meaningful, such as the complete works of Shakespeare. To find out just how long you would have to wait for that, see Infinite monkey business.


      Comments

      Anonymous

      13 February 2015


      I saw some software that worked like this several years ago. While it generated output that was essentially meaningless, it was easily recognizable which author was used for input if the author had a distinct writing style.


      Anonymous

      21 March 2015


      Great article. Reminds me of the SCIgen "utility". Though, by now, I wonder how sophisticated these programs have become.
      Link: http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-…


      Read more about...

      Information theory
      statistics
      mathematics and language