The leaning tower of PISA?

According to the latest PISA study, Britain's teenagers have dropped out of the top 20 in reading, maths and science. That's in a ranking of 65 economies from around the world. Media reactions were predictable (worse than Estonia?!) and so were Michael Gove's (it's the last government's fault), but some were a little more critical. What can such a ranking really tell us, and what information lies buried behind the headlines?

In a very interesting documentary broadcast on BBC Radio 4 last week, David Spiegelhalter had a closer look at the study, which was published by the Organisation for Economic Co-operation and Development (OECD). Spiegelhalter is a statistician and Winton Professor for the Public Understanding of Risk at the University of Cambridge. And since a lot of statistics goes into making a good study, from designing it right to evaluating its results, it's interesting to hear a statistician's opinion.

The main challenge for a global study such as PISA, spanning so many different cultures and political and economic systems, is to make sure you are comparing like with like. Test questions have to make as much sense to a person in Britain as they do to a person in Nigeria, and measure students' true ability rather than simply reflect what they have been focussing on in their classes. Some critics argue that this has rendered some PISA questions slightly vacuous. In maths, for example, questions apparently focus on interpreting graphs and other visual representations of data, in the hope that this will render them independent of cultural background and of whether syllabi focus on algebra, geometry, or both. Understanding visual information is a very important part of maths, but it's only one part. The other components, so critics argue, are simply not captured by the questions.

The assumption that questions are equally difficult for people in different countries is fundamental to the OECD's analysis of the results, and this, according to Spiegelhalter, is a major flaw. As he describes on his blog (the statistics weren't elaborated on in the programme), not all students tested by PISA answer all of the questions. In fact, the statistician Svend Kreiner calculated that in the 2006 study around half of all students didn't answer any reading questions at all, and only 10% of all students who were tested answered all reading questions.

The PISA study deals with this by generating "plausible scores" for students who haven't answered all the questions. This may sound alarming at first, but the point is that there is information available on each student from their answers to other questions. Assuming a statistical model (called a Rasch model), the plausible scores are generated from what's known about the students. To put this into context, in the 2006 study five reading scores each for around half of the students (those who didn't answer any reading questions) were "plausible" rather than real.
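To get a feel for what generating plausible scores involves, here is a minimal sketch in Python. It is not the OECD's actual procedure, which is considerably more elaborate. It assumes the simplest form of the Rasch model, in which the chance of a correct answer depends only on the gap between a student's ability and a question's difficulty. The function names, the standard normal prior on ability and the grid of candidate abilities are all illustrative assumptions.

```python
import math
import random

def rasch_probability(ability, difficulty):
    """Rasch model: chance that a student answers a question correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def posterior_on_grid(responses, difficulties, grid):
    """Posterior weights over candidate abilities, given the answers we do have.

    Assumes a standard normal prior on ability (an illustrative choice).
    """
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for correct, b in zip(responses, difficulties):
            p = rasch_probability(theta, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def draw_plausible_values(responses, difficulties, n_draws=5, seed=0):
    """Draw several 'plausible' abilities instead of reporting one estimate."""
    grid = [i / 10.0 for i in range(-40, 41)]  # candidate abilities from -4 to 4
    weights = posterior_on_grid(responses, difficulties, grid)
    rng = random.Random(seed)
    return rng.choices(grid, weights=weights, k=n_draws)

# Hypothetical student: right on an easy and a medium question, wrong on a hard one.
values = draw_plausible_values([True, True, False], [-1.0, 0.0, 1.0])
```

The five draws differ from one another, and that spread is the point: it carries the uncertainty about the student's ability forward into any later calculation, instead of hiding it behind a single number.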

Spiegelhalter does not object to this method per se: using plausible values can be fine, as long as you keep track of the extra potential for error this introduces. What he is sceptical about is the underlying assumption, namely that questions are equally difficult for people in different countries. In fact, Kreiner has found that this isn't the case. To a lay person this would put an end to it: plausible scores generated from a flawed model are no longer plausible. Statisticians, however, are aware that all models are to some extent over-simplified, so they go on to check whether the uncertainty that arises from this is acceptable. In this case that's tricky. Kreiner says, and Spiegelhalter agrees, that "the effect of using plausible values generated by a flawed model is unknown". So their analysis throws a lot of doubt on the final country ranking.
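A toy calculation shows why the equal-difficulty assumption matters. The Python sketch below takes two hypothetical countries whose students have exactly the same ability, and a single question that is harder in one country than in the other. All the numbers are invented for illustration and are not taken from PISA or from Kreiner's analysis. Reading ability off the success rate while assuming a common difficulty makes the second country look weaker than it is.

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch model: chance that a student answers a question correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def implied_ability(proportion_correct, assumed_difficulty):
    """Invert the Rasch curve: the ability that would produce this success
    rate on a question of the assumed difficulty."""
    odds = proportion_correct / (1.0 - proportion_correct)
    return assumed_difficulty + math.log(odds)

true_ability = 0.0          # identical in both hypothetical countries
assumed_difficulty = 0.0    # what the analysis assumes everywhere
difficulty_country_a = 0.0  # the assumption holds here
difficulty_country_b = 0.5  # the same question is harder here (made up)

p_a = rasch_probability(true_ability, difficulty_country_a)
p_b = rasch_probability(true_ability, difficulty_country_b)

estimate_a = implied_ability(p_a, assumed_difficulty)
estimate_b = implied_ability(p_b, assumed_difficulty)
```

In this toy set-up the estimate for country B is too low by exactly the unmodelled gap in difficulty. In a real study, with many questions, missing answers and plausible values layered on top, the size of the distortion is much harder to pin down.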

Exactly how that final ranking is produced from real and plausible scores isn't entirely clear. "The PISA methodology is complex and rather opaque," according to Spiegelhalter. But this aside, there is also the question of how to interpret the results. What is actually being assessed? If it's the secondary school system, then there is the problem that people come out of primary schools with different levels of understanding. So really you should try to measure how much that understanding has been improved by secondary school (the "value added", to use that horrible phrase) rather than just measuring understanding full stop. If it's the whole education system that's being assessed, then you still have the problem that in some countries, like some top-ranking Asian ones, students receive a lot of tuition out of school and the cultural attitude towards education is different. PISA results really reflect whole cultures, so knee-jerk reactions directed against the education system seem too narrow. And, as the Radio 4 programme pointed out, politicians are likely to use PISA results to argue for whatever educational reform they had in mind anyway.

On the bright side, and as was argued by representatives of the OECD, the PISA study produces much more data than the simple-minded country ranking reveals. For example, it has found that British teenagers are happier at school than their contemporaries from some top-ranking countries. South Korea, near the top of the performance table, is at the bottom in terms of student happiness. Since there's more to life than reading, maths and science, it stands to reason that this happiness will play a role in people's futures. What we really need is a very long study that follows people from their school days long into their future. The OECD says it's doing this, but we'll have to wait a while to see the results.

You can read more about league tables and associated problems in Spiegelhalter's Plus article. He also runs an excellent website called Understanding uncertainty, which contains the blog mentioned above.