The school research lead - understanding p-values, statistical significance and avoiding misconceptions — Evidence-based school leadership and management: A practical guide

A major challenge for aspiring evidence-informed teachers is knowing when to trust the experts. It would be easy to assume that just because you have come across a particular interpretation of a concept or idea in a number of different places - book, peer-reviewed article or blog - it is correct. Unfortunately, if you did this, you could well be making a mistake. For example, in recent weeks I have come across three examples - Churches and Dommett (2016), Firth (2018) and Ashman (2018) - where the meaning of p-values and statistical significance would appear to have been misinterpreted. Furthermore, as Gorard et al. (2017) state, such mistakes are not uncommon. So to help aspiring school research leads and evidence-informed teachers spot where p-values and statistical significance have been misinterpreted I will:

Explain what is meant by the terms p-values and statistical significance
Identify a number of common misconceptions about p-values and statistical significance
Show how the work of Churches and Dommett, Firth and Ashman all fall foul of some of these misconceptions and misinterpretations
Examine some of the implications for evidence-informed teachers.

And to help me do this I’m going to draw upon the work of Greenland et al. (2016) and the American Statistical Association’s statement on p-values (Wasserstein and Lazar, 2016).

P values and statistical significance

When seeking to understand these terms there are a number of major problems and, as Greenland et al. (2016) state: ‘There are no interpretations of these concepts, which are at once simple, intuitive, correct, and foolproof’ (p337). Greenland et al. go on to illustrate their point by providing twenty-five examples of common misconceptions and misinterpretations of these terms, to which even professional academics are prone. Nevertheless, the American Statistical Association seeks to informally define a p-value as: the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

The smaller the p-value, the more unlikely are our results if the null hypothesis (and test assumptions) hold true. Conversely, the larger the p-value, the less surprising are our results, given that the null hypothesis (and test assumptions) hold true. In other words, as Greenland et al. state: ‘The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the null hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation’ (p340).
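To make the ASA definition concrete, here is a minimal sketch of one way a p-value can be calculated: an exact permutation test for the difference in mean scores between two small groups. The scores, the group names and the function name are invented for illustration and do not come from any of the studies discussed here. The 'specified statistical model' in this sketch is the assumption that the group labels are unrelated to the scores, so that every way of splitting the pooled scores into two groups was equally likely.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Model assumed: group labels are unrelated to scores, so every split
    of the pooled scores into groups of these sizes is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Count splits whose difference is equal to or more extreme
        # than the one actually observed (small tolerance for floats).
        if diff >= observed - 1e-12:
            as_extreme += 1
        splits += 1
    return as_extreme / splits


# Hypothetical test scores, for illustration only.
intervention = [12, 15, 14, 18, 16]
comparison = [11, 13, 10, 14, 12]
print(permutation_p_value(intervention, comparison))
```

Note what the number returned is: the proportion of possible splits that give a difference at least as large as the observed one, assuming the model holds. It is not the probability that the model, or the null hypothesis, is true.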

Statistical Significance

Put very simply, a result is often deemed to be statistically significant if the p-value is less than or equal to 0.05, although the threshold for statistical significance can be set at a more stringent level, for example p less than or equal to 0.01.

Interpreting p values and statistical significance – guidance from the American Statistical Association

Given the difficulties in interpreting p-values and statistical significance, the American Statistical Association (Wasserstein and Lazar, 2016) has provided some guidance on how to avoid some common mistakes. This guidance is summarised in six principles:

P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
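The fifth principle can be illustrated with a short sketch. It holds the size of the effect fixed and changes only the number of pupils per group, using a simple two-sided z-test that assumes the standard deviation is known. The function name and all of the numbers (a 2-point difference, a standard deviation of 10, the sample sizes) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference between two group means.

    Simplifying assumptions: equal group sizes, a known common standard
    deviation, and a normal model for the difference in means.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The effect is identical in every row; only the sample size changes.
for n in (10, 50, 250, 1000):
    print(n, two_sided_p(mean_diff=2.0, sd=10.0, n_per_group=n))
```

The same 2-point difference produces a large p-value in a small study and a very small p-value in a large one, so the p-value on its own says nothing about how big or how important the effect is.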

Some common misinterpretations – Churches, Dommett, Firth and Ashman

I will now look at how Churches and Dommett, Firth, and Ashman have - in my view - all misinterpreted either p-values or statistical significance.

Richard Churches and Eleanor Dommett - In their book Teacher-Led Research: Designing and implementing randomised controlled trials and other forms of experimental research, they include the following definitions within their glossary of terms:

p-value – Probability value – that is the probability that the result may have occurred by chance (e.g p = 0.001 – a 1 in 1000 probability that the result may have happened by chance) Also known as the significance level.

Significance – The probability that a change in score may have occurred by chance. A threshold for significance (alpha) is set at the start of piece of research. This is never less stringent than 0.05 ….

Unfortunately, according to the ASA, both of these statements are incorrect. First, the p-value is a measure of the consistency of the results with a particular statistical model, with all the assumptions behind the model being maintained. Second, the p-value is not the probability that the data were produced by random chance alone, as it also depends on the accuracy of the assumptions underpinning the statistical model. Third, the definition of significance conflates scientific significance with statistical significance.
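One way to see why a p-value cannot be read as 'the probability that the result happened by chance' is to work through some simple arithmetic. The sketch below asks what share of statistically significant results would be false positives. The answer depends on how often the interventions being tested genuinely have no effect and on the power of the studies, neither of which a p-value knows anything about. The function name and the figures used (a 0.05 threshold, 80 per cent power, and the two proportions of true nulls) are assumptions made purely for illustration.

```python
def share_of_significant_results_that_are_false(alpha, power, prior_null):
    """Proportion of significant results that come from true nulls.

    alpha:      significance threshold (false positive rate under the null)
    power:      chance of a significant result when the effect is real
    prior_null: assumed proportion of tested interventions with no effect
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)


# Hypothetical scenarios, for illustration only.
for prior_null in (0.5, 0.9):
    share = share_of_significant_results_that_are_false(0.05, 0.8, prior_null)
    print(prior_null, round(share, 3))
```

With the same 0.05 threshold, the share of significant results that are false positives is about 6 per cent when half of the tested interventions have no effect, and 36 per cent when nine in ten have no effect. The threshold stays the same while the chance that a given result is a fluke changes a great deal.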

Jonathan Firth - Firth, J. (2018). The Application of Spacing and Interleaving Approaches in the Classroom. Impact. 1. 2.

In a recent edition of Impact, Jonathan Firth uses p-values and statistical significance when reporting on the application of spacing and interleaving in the classroom, in a study that used an opportunity sample of 31 school pupils between 16 and 17 years of age.

The mean percentages of correct answers on the end-of-task test for the interleaved and blocked conditions are shown in Figure 4. A between-subjects ANOVA was carried out. This analysis revealed a significant main effect of spacing (performance in the spaced condition being worse than the massed condition, with mean scores of 12.25 vs 9.45, p = .002), while interleaving did not have a significant main effect. Importantly, there was also a significant (p = .009) interaction between the two variables (spacing vs interleaving), indicating that interleaving had a mediating or protective effect against the difficulties caused by spacing (see Figure 5).

The findings demonstrated that spacing had a harmful effect on the immediate test, while the main effect of interleaving was neutral. The results fit with the idea that these are ‘desirable difficulties’, with the potential to impede learning in the short term.

Again, according to the ASA, there are errors in both paragraphs. Statistical significance does not demonstrate whether a scientifically or substantively important relation has been detected. Nor is statistical significance a property of the phenomenon being studied; rather, it is a product of the consistency between the data and what would have been expected under the specified statistical model. In other words, the map is not the territory.

Greg Ashman - Ashman (2018) The Article That England’s Chartered College Will Not Print. Filling the Pail.

In a blogpost which criticises the EEF’s approach to both meta-cognition and meta-analysis, Greg Ashman also falls foul of the problems of interpreting p-values and statistical significance.

If we focus only on the randomised controlled trials conducted by the EEF, the case for meta-cognition and self-regulation seems weak at best. Of the seven studies, only two appear to have statistically significant results. In three of the other studies, the results are not significant and in two more, significance was not even calculated. This matters because a test of statistical significance tells us how likely we would be to collect this particular set of data if there really was no effect from the intervention. If results are not statistically significant then they could well have arisen by chance.

Again, using the ASA’s guidance, there are a number of errors in this statement. First, statistical significance, or rather the lack of it, does not tell us that there was no effect from the intervention. It only tells us that the data were not unusually inconsistent with the specified statistical model. Second, whether or not the results are statistically significant, it does not follow that the results have, or have not, arisen by chance. A p-value is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself. In other words, it is a statement about the results of the study relative to a particular statistical model.
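To illustrate why a lack of statistical significance is not evidence of no effect, the sketch below estimates how often a study would reach p less than or equal to 0.05 when the intervention really does work. It uses a two-sided z-test on a standardised effect size with equal group sizes. The function name, the effect size of 0.2 and the sample sizes are illustrative assumptions and are not taken from the EEF trials.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(effect_size, n_per_group, alpha=0.05):
    """Chance of a statistically significant result when the effect is real.

    Simplifying assumptions: two equal groups, a standardised effect size
    (difference in means divided by a known SD), and a normal model.
    """
    normal = NormalDist()
    critical_z = normal.inv_cdf(1 - alpha / 2)
    # Expected z statistic when the true effect equals effect_size.
    shift = effect_size * sqrt(n_per_group / 2)
    upper_tail = 1 - normal.cdf(critical_z - shift)
    lower_tail = normal.cdf(-critical_z - shift)
    return upper_tail + lower_tail


# Hypothetical study sizes, for illustration only.
for n in (50, 200, 800):
    print(n, round(power_two_sided_z(effect_size=0.2, n_per_group=n), 2))
```

Under these assumptions, a study with 50 pupils per group would detect a genuine effect of 0.2 standard deviations less than one time in five. A non-significant result from such a study is therefore exactly what we would expect to see most of the time even if the intervention works.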

Where does this leave us?

First, p-values and statistical significance are slippery concepts, which take time and effort to even begin to understand, never mind master. Indeed, you may need to forget what you have already learnt at university on undergraduate or postgraduate courses.

Second, misuse of p-values and statistical significance is not uncommon, so it is something you have to watch out for when reading quantitative research reports. Keep the ASA principles to hand so you can check whether they are being misapplied. You don’t have to understand how something works (though it helps) to be able to spot its misuse.

Third, just because you come across something in a variety of formats (book, peer-reviewed article or blog) and from a variety of authors (researchers, university academics or school teachers) does not mean it is correct.

Fourth, I am not making comments about the personal integrity of any of the authors I have criticised. These comments should be seen as ‘business not personal’ and are a genuine attempt to increase the research literacy of teachers and school leaders. Being an evidence-informed teacher or school leader is hard enough when you are using the right, never mind the wrong, tools.

And finally, it’s worth remembering the words of Greenland et al. (2016) who state: ‘In closing, we note that no statistical method is immune to misinterpretation and misuse, but prudent users of statistics will avoid approaches especially prone to serious abuse. In this regard, we join others in singling out the degradation of P values into ‘‘significant’’ and ‘‘nonsignificant’’ as an especially pernicious statistical practice’ (p348).

References

Ashman, G. (2018). The Article That England’s Chartered College Will Not Print. Filling the Pail. https://gregashman.wordpress.com/2018/04/17/the-article-that-englands-chartered-college-will-not-print/. 21 April, 2018.

Churches, R. and Dommett, E. (2016). Teacher-Led Research: Designing and Implementing Randomised Controlled Trials and Other Forms of Experimental Research. London. Crown House Publishing.

Firth, J. (2018). The Application of Spacing and Interleaving Approaches in the Classroom. Impact. 1. 2.

Gorard, S., See, B. and Siddiqui, N. (2017). The Trials of Evidence-Based Education. London. Routledge.

Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S. and Altman, D. (2016). Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations. European Journal of Epidemiology. 31. 4. 337-350.

Wasserstein, R. and Lazar, N. (2016). The ASA’s Statement on P-Values: Context, Process, and Purpose. The American Statistician. 70. 2. 129-133.