Stop taking the p

By 
Marianne Freiberger

"Lies, damned lies, and statistics" — considering the deluge of data the modern world is showering us with, this famous quote by Benjamin Disraeli seems all the more relevant. Science, as well as policy making, rely on statistical analyses of this data wealth, which makes them vulnerable to statistical mischief. Voices criticising some standard statistical practices have become louder over recent years, so last month the American Statistical Association (ASA) decided to take a stand. In an unusual move the ASA issued guidelines on how to use a particular statistical tool.

P value

The curve represents the probability of observing particular data points (the x-axis represents the possible data points and the y-axis the probabilities). The tails of the curve represent very unlikely observations. The area shaded in green is the probability of observing a value equal to, or more extreme than, the data point marked in red. Image: Repapetilto and Chen-Pan Liao, CC BY-SA 3.0.

The ASA's bone of contention is something called the p-value. To get an idea of what this is, consider an example: imagine a medical researcher who has tested a particular drug (say to lower blood pressure) in a clinical trial. The result is looking good and appears to indicate that the drug has worked (eg the patients' blood pressure has gone down on average). But of course it's possible that the observations were just a fluke: the patients may have got better by chance, or because of some other unknown factor. However, making some theoretical assumptions, it's possible to calculate the probability of observing the result that has been observed (or something even more pronounced) assuming the drug was not actually effective. That probability is called the p-value. If the p-value is small, so the observations made are unlikely for a non-effective drug, then that lends some support to the idea that the drug does work.

P-values are used all over the place: from medicine to the social sciences, and from particle physics to economics. So what's the problem? If it took you a while to understand what a p-value actually is, then you have a clue to one of the difficulties: p-values are easy to misunderstand or misinterpret. People use p-values to compare a null hypothesis (eg the drug is not effective) to the observations they have made. A small p-value of, say 0.05, means there's only a 5% chance of making the observation if the null hypothesis is true. It's tempting to conclude that, therefore, the null hypothesis only has a 5% chance of being correct, which gives the alternative hypothesis (eg the drug is effective) a 95% chance of being true.

But unfortunately, this kind of reasoning is false. The p-value is not the probability of any hypothesis being true or false, it only measures the compatibility between the null hypothesis and the observations. A small p-value means that the observations aren't very compatible with the null hypothesis, but it doesn't tell you why they are not compatible: it might be down to chance, or to underlying factors you're not aware of. Conversely, a large p-value means your observations are compatible with the null hypothesis, but this doesn't mean the null hypothesis is true: other hypotheses might be equally, or even more, compatible with your data.

Data

Science, policy making and business decisions are based on statistical analyses of data.

The 0.05 figure is another problematic issue. In many sciences, including medicine, it is often used as a magic threshold. If your p-value is less than 0.05, then the null hypothesis is rejected and the complementary hypothesis accepted. But why 0.05 rather than, say, 0.01 or 0.1? That's just the problem. It's an arbitrary convention, so critics say, and represents a "long-worrisome circularity": universities teach the 0.05 figure to their students because the scientific community uses it, and the scientific community uses it because that's what they were taught at university. (Other sciences, notably particle physics, do use different thresholds for their p-values, see here to find out more.)

But regardless of what threshold is used, basing a scientific conclusion, or even a policy decision, on whether a p-value passes such a threshold is itself a crude practice, says the ASA. "A conclusion does not necessarily become 'true' on one side of the divide, and 'false' on the other." The use of p-values as a license to make a scientific claim "leads to a considerable distortion of the scientific process".

Which brings us to the question of context. A p-value by itself only offers one small piece of evidence for or against a hypothesis. Any scientific conclusion needs to be supported by other evidence, and researchers need to be clear about how they conducted their study, how good their measurements were, and what underlying assumptions they used. In particular, researchers shouldn't cherry-pick the results they publish according to whether they come with the right kind of p-value (a practice also known as p-hacking). "Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted, and how those analyses (including p-values) were selected for reporting," says the ASA report.

This is a rough summary of the ASA's grievances with the humble p-value (you can read the full report here). At least one scientific journal, Basic and Applied Social Psychology, already banned the use of p-values some time ago, and we will see if the ASA report has any further consequences. But it's worth remembering: it's not the p-value itself that is being criticised, but the fact that it is sometimes used as the primary piece of evidence on which a conclusion is based. As the ASA report says, "No single index should substitute for scientific reasoning."

Comments

Randall Munroe's wonderful webcomic xkcd has a rather nice take on this (at http://xkcd.com/882/)