Understanding uncertainty: ESP and the significance of significance

Kevin McConway

This article has been adapted from material on the Understanding Uncertainty website.


Is there such a thing as extra-sensory perception? Image: Boston.

In March 2011 the highly respected Journal of Personality and Social Psychology published a paper by the distinguished psychologist Daryl Bem, of Cornell University in the USA. The paper reports a series of experiments which, Bem claims, provide evidence for some types of extra-sensory perception (ESP). These can occur only if the generally accepted laws of physics are not all true. That's a pretty challenging claim. And the claim is based largely on the results of a very common (and very commonly misunderstood) statistical procedure called significance testing. Bem's experiments provide an excellent way into looking at how significance testing works and at what's problematic about it.

Bem's experiments and what he did with the data

Bem's article reports the results of nine different experiments, but for our purposes it's enough to look at Experiment 2 alone. This is based on well-established psychological knowledge about perception. Images that are flashed up on a screen for an extremely short time, so short that the conscious mind does not register that they have been seen at all, can still affect how an experimental subject behaves. Such images are said to be presented subliminally, or to be subliminal images. For instance, people can be trained to choose one item rather than another by presenting them with a pleasant, or at any rate not unpleasant (neutral), subliminal image after they have made the "correct" choice and a very unpleasant one after they have made the "wrong" choice.

Bem, however, did something rather different in Experiment 2. As in a standard experiment of this sort, his participants had to choose between two closely matched pictures (projected clearly on a screen, side by side). Then they were presented with a neutral subliminal image if they had made the "correct" choice, and an unpleasant subliminal image if they had made the "wrong" choice. The process was then repeated with a different pair of pictures to choose between. Each participant made their choice 36 times, and there were 150 participants in all.

The ESP controversy

Bem's paper sparked a considerable amount of debate. The Journal of Personality and Social Psychology recognised that the statistical aspects are crucial and took the unusual step of publishing an editorial alongside Bem's paper, explaining their reasons for publishing it, and also including in the same journal another paper [ http://dl.dropbox.com/u/1018886/Bem6.pdf ] by the Dutch psychologist Eric-Jan Wagenmakers criticising Bem's work. Bem and colleagues provided a response to Wagenmakers, who in turn responded [ http://dl.dropbox.com/u/1018886/ClarificationsForBemUttsJohnson.pdf ]. Several other researchers and commentators have joined in too. This one will run and run. The New York Times has also published two articles on the matter.

But the new feature of Bem's experiment was that when the participants made their choice between the two pictures in each pair, nobody, neither the participants nor the experimenters, could know which was the "correct" choice. The "correct" choice was determined by a random mechanism, after the participant had chosen a picture.

If the experiment was working as designed, and if the laws of physics relating to causes and effects are as we understand them, then the subliminal images could have no effect at all on the participants' choices of picture. This is because at the time they made their choice there was no "correct" image to choose. Which image was "correct" was determined afterwards. Therefore, given the way the experiment was designed (I have not explained every detail) one would expect each participant to be "correct" in their choice half the time, on average. Because of random variability, some would get more than 50% right, some would get less, but on average, people would make the right choice 50% of the time.
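To get a feel for what "on average" means here, the following is a minimal sketch of the situation in which chance alone is operating. It assumes that every one of the 36 choices is an independent fair coin flip for each of the 150 participants; the function name and the seed are purely illustrative and have nothing to do with Bem's own software.

```python
import random

PARTICIPANTS = 150   # number of participants in Experiment 2
TRIALS = 36          # choices made by each participant


def simulate_hit_rates(rng):
    """Return one simulated proportion of "correct" choices per participant,
    assuming each choice is right with probability exactly 0.5."""
    rates = []
    for _ in range(PARTICIPANTS):
        hits = sum(rng.random() < 0.5 for _ in range(TRIALS))
        rates.append(hits / TRIALS)
    return rates


rng = random.Random(2011)  # arbitrary seed, only so the run is repeatable
rates = simulate_hit_rates(rng)
average = sum(rates) / len(rates)

print(f"lowest individual hit rate:  {min(rates):.1%}")
print(f"highest individual hit rate: {max(rates):.1%}")
print(f"average over participants:   {average:.1%}")
```

Individual participants scatter quite widely around 50%, while the average over all 150 stays much closer to it. That is why a small departure in the average can still be worth a second look.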

What Bem found was that the average percentage of correct choices, across his 150 participants, was not 50%. It was slightly higher: 51.7%.

There are several possible explanations for this finding, including the following:

  1. The rate was higher than 50% just because there is random variability, both in the way people respond and in the way the "correct" image was selected. That is, nothing very interesting happened.
  2. The rate was higher than 50% because the laws of cause and effect are not as we understand them conventionally, and somehow the participants could know something about which picture was "correct" before the random system had decided which was "correct".
  3. The rate was higher than 50% because there was something wrong with the experimental setup and the participants could get an idea about which picture was "correct" when they made their choice, without the laws of cause and effect being broken.
  4. More subtly, these results are not typical, in the sense that actually more experiments were done than are reported in the paper, and the author chose to report the results that were favourable to the hypothesis that something happened that casts doubt on the laws of cause and effect, and not to report the others. Or perhaps more and more participants kept being added to the experiment until the results happened to look favourable to that hypothesis.
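The last possibility in explanation 4, adding participants until the results look favourable, can be illustrated with a small simulation. This is only a sketch under assumptions of my own: chance alone is operating, participants arrive in batches of ten (360 choices), a one-sided test based on the normal approximation is run after each batch, and "favourable" means a p value below 0.05. It is not a description of how Bem's experiment was actually run.

```python
import math
import random


def one_sided_p(hits, trials):
    """Approximate P(at least this many hits) if the true hit rate is 50%,
    using the normal approximation to the binomial distribution."""
    z = (hits - 0.5 * trials) / math.sqrt(0.25 * trials)
    return 0.5 * math.erfc(z / math.sqrt(2))


def study_looks_significant(rng, peek, batches=15, batch_trials=360):
    """Simulate one study in which chance alone is operating."""
    hits = trials = 0
    for _ in range(batches):
        # The number of 1 bits in a random bit string is a fair-coin hit count.
        hits += bin(rng.getrandbits(batch_trials)).count("1")
        trials += batch_trials
        if peek and one_sided_p(hits, trials) < 0.05:
            return True  # stop early and report a "significant" result
    return one_sided_p(hits, trials) < 0.05


def false_positive_rate(peek, studies=4000, seed=1):
    """Proportion of simulated chance-only studies that end up "significant"."""
    rng = random.Random(seed)
    return sum(study_looks_significant(rng, peek) for _ in range(studies)) / studies


fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"test once, after all 150 participants: {fixed_rate:.1%}")
print(f"test after every batch, stop if p<0.05: {peeking_rate:.1%}")
```

With a single test at the end, roughly one chance-only study in twenty looks significant, as the 0.05 threshold intends. Checking after every batch and stopping at the first favourable result makes that happen noticeably more often, even though nothing but chance is at work.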

I won't consider all these in detail. Instead I'll concentrate on how and why Bem decided that point 1 was not a likely explanation.

Bem carried out a significance test. Actually he ran several different significance tests, making slightly different assumptions in each case, but they all led to more or less the same conclusion, so I'll discuss only the simplest. The resulting p value was 0.009. Because this value is small, he concluded that Explanation 1 was not appropriate and that the result was statistically significant.

This is a standard statistical procedure which is very commonly used. But what does it actually mean and what is this p value?

What's a p value?

All significance tests involve a null hypothesis, which (typically) is a statement that nothing very interesting has happened. In a test comparing the effects of two drugs the usual null hypothesis would be that, on average, the drugs do not differ in their effects. In Bem's Experiment 2 the null hypothesis is that Explanation 1 is true: the true average proportion of "correct" answers is 50% and any difference from 50% that is observed is simply due to random variability.

The p value for the test is found as follows. One assumes that the null hypothesis is really true. One then calculates the probability of observing the data that were actually observed, or something more extreme, under this assumption. That probability is the p value. So in this case, Bem used standard methods to calculate the probability of getting an average proportion correct of 51.7%, or greater, on the assumption that all that was going on was chance variability. He found this probability to be 0.009. (I'm glossing over a small complication here, that in this case the definition of "more extreme" depends on the variability of the data as well as their average, but that's not crucial to the main ideas.)
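The recipe can be sketched in a few lines of code. This is a simplified stand-in and not Bem's own calculation: it pools all 150 × 36 = 5400 choices and treats them as independent fair coin flips, whereas Bem's test also took account of the variability between participants. The count of 2792 "correct" choices is reconstructed from the rounded figure of 51.7%, so it is approximate too. For both reasons the answer should come out small, in the same region as 0.009, but it should not be expected to match it exactly.

```python
from math import comb

PARTICIPANTS = 150
TRIALS_EACH = 36
OBSERVED_RATE = 0.517   # average proportion "correct" reported in the article

total_trials = PARTICIPANTS * TRIALS_EACH              # 5400 choices in all
observed_hits = round(OBSERVED_RATE * total_trials)    # about 2792 (reconstructed)


def upper_tail_p(hits, trials):
    """Exact P(X >= hits) when X is Binomial(trials, 0.5),
    that is, the chance of a result at least this extreme under the null."""
    term = comb(trials, hits)          # number of outcomes with exactly `hits`
    favourable = 0
    for k in range(hits, trials + 1):
        favourable += term
        # C(n, k + 1) = C(n, k) * (n - k) / (k + 1), an exact integer step
        term = term * (trials - k) // (k + 1)
    return favourable / 2 ** trials    # all 2**trials outcomes equally likely


p_value = upper_tail_p(observed_hits, total_trials)
print(f"{observed_hits} hits out of {total_trials} choices")
print(f"P(at least that many hits | chance alone) = {p_value:.4f}")
```

Note what has and has not been calculated. The null hypothesis is assumed throughout, and the number that comes out is a statement about the data, given that assumption.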

The p value is not the probability that the laws of physics don't apply.

Well, 0.009 is quite a small probability. So we have two possibilities here. Either the null hypothesis really is true but nevertheless an unlikely event has occurred. Or the null hypothesis isn't true. Since unlikely events do not occur often, we should at least entertain the possibility that the null hypothesis isn't true. Other things being equal (which they usually aren't), the smaller the p value is the more doubt it casts on the null hypothesis. How small the p value needs to be in order for us to conclude that there's something really dubious about the null hypothesis (and hence that, in the jargon, the result is statistically significant and the null hypothesis is rejected) depends on circumstances. Sometimes the values of 0.05 or 0.01 are used as boundaries and a p value less than that would be considered a significant result.

This line of reasoning, though standard in statistics, is not at all easy to get one's head round. In an experiment like this one often sees the p value interpreted as follows:

(a) "The p value is 0.009. Given that we've got these results, the probability that chance alone is operating is 0.009." That is WRONG.

The correct way of putting it is

(b) "The p value is 0.009. Given that chance alone is operating, the probability of getting results like these is 0.009."

(That's not quite the whole picture, because it doesn't include the part about "results at least as extreme as these", but it's close enough for most purposes.)

So the difference between (a) and (b) is that the "given" part and the part that has probability of 0.009 are swapped round.

It may well not be obvious why that matters. The point is that the answers to the two questions "Given A, what's the probability of B?" and "Given B, what's the probability of A?" might be quite different. An example is to imagine that you're picking a random person off the street in London. Given that the person is a Member of (the UK) Parliament, what's the probability that they are a British citizen? Well, that probability would be high.

What about the other way round? Given that this random person is a British citizen, what's the probability that they are an MP? I hope it's clear to you that that probability would be very low. The great majority of the British citizens in London are not MPs. So it's fairly obvious (I hope!) that swapping the "given" part and the probability changes things.
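Putting some numbers on the MP example makes the asymmetry plain. Every count in the sketch below is invented purely for illustration and none is a real statistic; the only point is that the same group of people (MPs who are British citizens) is divided by two very different totals.

```python
# All four counts are hypothetical, chosen only to illustrate the argument.
people_on_street = 1_000_000     # everyone you might pick at random
british_citizens = 700_000       # of whom this many are British citizens
mps = 50                         # and this many are MPs
mps_who_are_citizens = 49        # MPs who are also British citizens


def conditional(joint_count, given_count):
    """P(A given B) = (number who are both A and B) / (number who are B)."""
    return joint_count / given_count


p_citizen_given_mp = conditional(mps_who_are_citizens, mps)
p_mp_given_citizen = conditional(mps_who_are_citizens, british_citizens)

print(f"P(British citizen | MP) = {p_citizen_given_mp:.2f}")
print(f"P(MP | British citizen) = {p_mp_given_citizen:.5f}")
```

The numerator is the same in both calculations. Only the "given" group changes, and that alone moves the answer from nearly 1 to nearly 0.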

Nevertheless, Bem is interpreting his significance test in the commonly used way when he deduces, from a p value of 0.009, that the result is significant and that there may well be more going on than simply the effects of chance. But, just because this is common, that doesn't mean it's always correct.

Out of the nine experiments that Bem reports, he found significant p values in all but one (where he seems to be taking "significant" as meaning "p less than 0.05" - the values range from 0.002 to 0.039, with the one he regards as non-significant having a p value of 0.096). Conventionally, these would indeed be taken as Bem interprets them, pointing in the direction that most of the null hypotheses are probably not true, which in the case of these experiments means that there is considerable evidence of the laws of physics not applying everywhere. Can that really be the case?

This is what I'll look at in the second part of this article.

About the author

Kevin McConway is Professor of Applied Statistics at the Open University. As well as teaching statistics and health science, and doing research on the statistics of ecology and evolution and on decision making, he works in promoting public engagement in statistics and probability. Kevin is an occasional contributor to the Understanding Uncertainty website.