Recently I had the good fortune to have an article published in the latest edition of the Chartered College of Teaching’s journal Impact, in which I briefly discussed the merits and demerits of meta-analyses (Jones, 2018). In that article I leaned heavily on the work of Adrian Simpson (2017), who raises a number of technical arguments against the use of meta-analysis. Since then, however, a blog post written by Kay, Higgins and Vaughan (2018) has been published on the Evidence for Learning website, which seeks to address the issues Simpson’s original article raised about the inherent problems associated with meta-analyses. In this post, Adrian Simpson responds to the counter-arguments made on the Evidence for Learning website.
Magic or reality: your choice, by Professor Adrian Simpson, Durham University
There are many comic collective nouns whose humour contains a grain of truth. My favourites include "a greed of lawyers", "a tun of brewers" and, appropriately here, "a disputation of academics". Disagreement is the lifeblood of academia and an essential component of intellectual advancement, even if that is annoying for those looking to academics for advice.
Kay, Higgins and Vaughan (2018, hereafter KHV) recently published a blog post attempting to defend the use of effect size to compare the effectiveness of educational interventions, responding to critiques (Simpson, 2017; Lovell, 2018a). Some of KHV’s post is easily dismissed as factually incorrect. For example, Gene Glass did not create effect size: Jacob Cohen wrote about it in the early 1960s. Nor is the Toolkit methodology applied consistently: at least one strand (setting and streaming) is based only on results for low attainers, while other strands are not similarly restricted (quite apart from the fact that most studies in that strand are about within-class grouping!).
However, this response to KHV is not about extending the chain of point and counter-point; rather, it asks teachers and policy makers to check the arguments for themselves. Decisions about using precious educational resources need to lie with you, not with feuding faculty. The faculty need to state their arguments as clearly as possible, but readers need to check them: if I appeal to a simulation to illustrate the impact of range restriction on effect size (which I do in Simpson, 2017), can you repeat it – does it support the argument? If KHV claim that the EEF Teaching and Learning Toolkit uses ‘padlock ratings’ to address the concern about comparing and combining effect sizes from studies with different control treatments, read the padlock rating criteria – do they discuss equal control treatments anywhere? Dig down and choose a few studies that underpin the Toolkit ratings – do the control groups in different studies have the same treatment?
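To take up that first invitation concretely, here is a minimal simulation sketch – my own illustrative numbers, not the simulation reported in Simpson (2017) – showing how restricting the range of a sample changes the effect size even when the intervention’s raw benefit is identical for every pupil:

```python
# A minimal, illustrative sketch (not the simulation in Simpson, 2017):
# the intervention gives every pupil exactly the same raw gain, yet the
# effect size depends on how widely the sample's attainment is spread.
import random
import statistics

random.seed(1)
RAW_GAIN = 5.0  # assumed identical benefit, in test-score points

def cohens_d(treated, control):
    """Standardised mean difference using a pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    s1, s2 = statistics.stdev(treated), statistics.stdev(control)
    pooled = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled

def simulate(restricted):
    # Attainment scaled like a standardised test: mean 100, standard deviation 15.
    pupils = [random.gauss(100, 15) for _ in range(10_000)]
    if restricted:
        # Keep only a narrow band of lower attainers (a restricted range).
        pupils = [p for p in pupils if 85 <= p <= 100]
    random.shuffle(pupils)
    half = len(pupils) // 2
    control = pupils[:half]
    treated = [p + RAW_GAIN for p in pupils[half:]]
    return cohens_d(treated, control)

print("Full range:       d =", round(simulate(restricted=False), 2))
print("Restricted range: d =", round(simulate(restricted=True), 2))
```

Running this, the restricted-range effect size comes out several times larger than the full-range one, purely because the spread of the sample – the denominator – has shrunk.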
So, in the remainder of this post, I invite you to test our arguments: are my analogies deceptive or helpful? Re-reading KHV’s post, do their points address the issues or are they spurious?
KHV’s definition of effect size shows it is a composite measure. The effectiveness of the intervention is one component, but so is the effectiveness of the control treatment, the spread of the sample of participants, the choice of measure, etc. It is possible to use a composite measure as a proxy for one component factor, but only provided the ‘all other things equal’ assumption holds.
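That composite structure is visible in the standard formula for a standardised mean difference (written here in generic textbook notation rather than KHV’s own presentation): the numerator involves both the intervention and the control treatment, the denominator is the spread of the sample, and every term depends on the chosen measure.

```latex
d = \frac{\bar{X}_{\text{intervention}} - \bar{X}_{\text{control}}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}}
```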
In the podcast (Lovell, 2018a) I illustrated the ‘all other things equal’ assumption by analogy: when is the weight of a cat a proxy for its age? KHV didn’t like this, so I’ll use another: clearly the thickness of a plank of wood is a component of its cost, but when can the cost of a plank be a proxy for its thickness? I can reasonably conclude that one plank of wood is thicker than another on the basis of their relative cost only if all other components impinging on cost are equal (e.g. length, width, type of wood, the timberyard’s pricing policy), and I can reasonably conclude that one timberyard on average produces thicker planks than another on the basis of relative average cost only if those other components are distributed equally at both timberyards. Without this strong assumption holding, drawing a conclusion about relative thickness on the basis of relative cost is a misleading category error.
In the same way, we can draw conclusions about relative effectiveness of interventions on the basis of relative effect size only with ‘all other things equal’; and we can compare average effect sizes as a proxy for comparing the average effectiveness of types of interventions only with ‘all other things equal’ in distribution.
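To see why the ‘in distribution’ qualifier matters, here is another minimal sketch, again with purely illustrative numbers of my own: every study in both strands tests an intervention with exactly the same raw benefit, but the strands differ in how many of their studies use a tightly-spread (restricted or narrowly-focused) measure, and their average effect sizes diverge anyway.

```python
# Illustrative sketch: identical true benefit in every study, but the two
# strands differ in the mix of measures their studies happen to use.
import statistics

RAW_GAIN = 5.0     # assumed identical raw benefit in every study
BROAD_SD = 15.0    # spread on a broad, standardised measure
NARROW_SD = 5.0    # spread on a narrow or restricted measure

def effect_size(sd):
    # With equal spread in both groups, the pooled SD is just that spread.
    return RAW_GAIN / sd

# Strand A: 8 of its 10 studies use the narrow measure; strand B: only 2 do.
strand_a = [effect_size(NARROW_SD)] * 8 + [effect_size(BROAD_SD)] * 2
strand_b = [effect_size(NARROW_SD)] * 2 + [effect_size(BROAD_SD)] * 8

print("Strand A average effect size:", round(statistics.mean(strand_a), 2))
print("Strand B average effect size:", round(statistics.mean(strand_b), 2))
# Strand A's average is roughly twice strand B's, even though the
# interventions are, by construction, equally effective.
```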
So, when you are asked to conclude that one intervention is more effective than another because one study resulted in a larger effect size, check if ‘all other things equal’ holds (equal control treatment, equal spread of sample, equal measure and so on). If not, you should not draw the conclusion.
When the Teaching and Learning Toolkit invites you to draw the conclusion that the average intervention in one area is more effective than the average intervention in another because its average effect size is larger, check if ‘all other things equal’ holds for distributions of controls, samples and measures. If not, you should not draw the conclusion.
Don’t rely on disputatious dons: dig into the detail of the studies and the meta-analyses. Does ‘feedback’ use proximal measures in the same proportion as ‘behavioural interventions’? Does ‘phonics’ use restricted ranges in the same proportion as ‘digital technologies’? Does ‘metacognition’ use the same measures as ‘parental engagement’? Is it true that the Toolkit relies on ‘robust and appropriate comparison groups’, and would that anyway be enough to confirm the ‘all other things equal’ assumption?
KHV describe my work as ‘bad news’ because it destroys the magic of effect size. ‘Bad news’ may be a badge of honour, worn with the same ironic pride with which decent journalists wear autocrats’ ‘fake news’ labels. I agree it can feel a little cruel to wipe away the enchantment of a magic show; one may think to oneself, ‘isn’t it kinder to let them go on believing this is real, just for a little longer?’ Yet educational policy making may be one of those times when we have to choose between rhetoric and reason, or between magic and reality. Check the arguments for yourself and make your own choice: are effect sizes a magical beginning of an evidence adventure, or a category error misdirecting teachers’ effort and resources?
References
Kay, J., Higgins, S. & Vaughan, T. (2018). The magic of meta-analysis. http://evidenceforlearning.org.au/news/the-magic-of-meta-analysis/ (accessed 28/5/2018).
Lovell, O. (2018a). ERRR #017: Adrian Simpson critiquing the meta-analysis. Education Research Reading Room Podcast. http://www.ollielovell.com/errr/adriansimpson/ (accessed 25/5/2018).
Simpson, A. (2017). The misdirection of public policy: Comparing and combining standardised effect sizes. Journal of Education Policy, 32(4), 450–466.