
If I flip a coin 8 times, I expect to get 4 heads and 4 tails, because a fair coin has a 50:50 chance of landing on each side. But if I get 5 heads and 3 tails, I don't immediately assume the coin is unfair. There is natural variation in the results of a random event, which means we don't always get exactly what we expect.
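To see just how much variation is normal, here is a short Python sketch (my own illustration, not taken from any statistics library) that works out the exact probability of each possible number of heads in 8 flips of a fair coin:

```python
from math import comb

n = 8    # number of flips
p = 0.5  # probability of heads for a fair coin

# P(exactly k heads) = C(n, k) * p^k * (1 - p)^(n - k)
for k in range(n + 1):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"{k} heads: {prob:.3f}")
```

Running this shows that 4 heads, the single most likely outcome, occurs only about 27% of the time, while 5 heads and 3 tails turns up about 22% of the time. So 5 heads is hardly evidence of an unfair coin.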
So, how different do the results need to be before we start to suspect something fishy is going on? Similarly, and more importantly, how do we know if a new medical treatment produces better outcomes for patients than an existing one? Or if an educational intervention increases standardised test scores? Or if the proportion of people supporting a particular political party has changed? Or if immigration rates are correlated with economic output (and in which direction)?
One way to answer questions like these is by using a hypothesis test. In a hypothesis test, we state two hypotheses. One is the null hypothesis, H0, in which we assume there is no change or difference from the status quo: we assume that the coin is fair, that the new medical treatment is no better than the existing one, that the support for a political party has not changed, and so on. The second hypothesis is the alternative hypothesis, H1, in which we assert that the quantity we are interested in has increased, decreased or just changed.

Our starting assumption is that the null hypothesis is true, and we only revise that assumption in the face of evidence to the contrary. There is no way we can prove which hypothesis is true, certainly not in the sense that we consider a mathematical theorem to be proven. So we also need to specify a significance level: a measure of how strong the evidence must be before we reject the null hypothesis in favour of the alternative. For this reason, hypothesis tests are also called significance tests, and the results of such tests may be described as statistically significant or not.
The significance level chosen most often is 5%, which means that we only reject the null hypothesis if the probability of seeing the result that we have observed in our data, or something more extreme, is 5% or less, assuming that the null hypothesis is true. For example, we would only reject the assumption that our coin is fair if we had tossed so many heads that the probability of this happening with a fair coin was 5% or less.
Similarly, when testing a new medical treatment (or an educational intervention, or the support for a political party, etc), we only reject the null hypothesis (that there is no change) if the probability of observing the results we did, or something more extreme, is 5% or less under the null hypothesis. In these real-world examples the probability isn't as straightforward to calculate as for the coin example, but it is nevertheless possible. (The probability of seeing the result that we have observed in our data, or something more extreme, assuming that the null hypothesis is true is called the p-value.)
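For the coin, the p-value can be computed exactly. Here is a minimal Python sketch (the function name is my own, purely illustrative) that adds up the probabilities of the observed number of heads and everything more extreme:

```python
from math import comb

def p_value_heads(n, k):
    """One-sided p-value: the probability of seeing k or more heads
    in n flips of a fair coin (our null hypothesis)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

for k in range(5, 9):
    print(f"{k} heads out of 8: p = {p_value_heads(8, k):.4f}")
```

With 8 flips, 6 heads gives p ≈ 0.14, nowhere near enough to reject the null hypothesis at the 5% level, but 7 heads gives p ≈ 0.035, which is.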
The 5% significance level is just an arbitrary convention, based on the judgement that an event with only a 5% probability of occurring is sufficiently unlikely. In effect, a significance level of 5% means that when the null hypothesis is true, we can expect to reject it, wrongly, 5% of the time. In other words, if we run tests at the 5% level on effects that don't actually exist, about 1 in 20 of them will still produce a "significant" result purely by chance. How can we be more confident of our results? We could use a lower significance level like 1% (somewhat counterintuitively, we would describe a significant result at the 1% level as being more significant). But then we reduce the likelihood of detecting a difference which is actually there, so it is a bit of a balancing act.
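You can watch this happen with a quick simulation (again a sketch of my own): flip a perfectly fair coin over and over, run the test each time, and count how often a true null hypothesis gets wrongly rejected.

```python
import random

def rejects_fair_coin(n=8, threshold=7):
    """Flip a fair coin n times and 'reject' the null hypothesis
    if we see threshold or more heads (p ≈ 0.035 for 7 out of 8)."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads >= threshold

trials = 100_000
false_positives = sum(rejects_fair_coin() for _ in range(trials))
print(f"Wrongly rejected a true null in {false_positives / trials:.2%} of tests")
```

The rate comes out around 3.5% rather than exactly 5%, because with only 8 flips the possible p-values jump in discrete steps, but the point stands: a fixed fraction of tests on a perfectly fair coin will flag it as unfair purely by chance.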
This underlying uncertainty is inevitable in hypothesis tests, and our conclusions must reflect it. Whatever level of significance we have used, we can only report that the evidence was, or was not, strong enough for us to reject the null hypothesis in favour of the alternative; we cannot prove or disprove either hypothesis. And our conclusion rests on an arbitrary, predetermined level of significance, so a different choice of level could lead to a different conclusion. Hypothesis tests are therefore far from perfect, and many scientists argue for a different approach, but as they continue to be widely used in the natural and social sciences, they are worth getting your head around. Perhaps your ultimate decision will be the extent to which you allow hypothesis tests to inform your decision-making.
About this article
Kristin Coldwell is a secondary maths teacher and general maths enthusiast who splits her time between the classroom and work in maths outreach and promotion.