Wednesday, 6 June 2018

Guest post - Meta-analysis: Magic or Reality, by Professor Adrian Simpson

Recently I had the good fortune to have an article published in the latest edition of the Chartered College of Teaching’s journal Impact, in which I briefly discussed the merits and demerits of meta-analyses (Jones, 2018). In that article I leant heavily on the work of Adrian Simpson (2017), who raises a number of technical arguments against the use of meta-analysis. Since then, however, Kay, Higgins and Vaughan (2018) have published a blog post on the Evidence for Learning website which seeks to address the issues raised in Simpson’s original article about the inherent problems associated with meta-analyses. In this post Adrian Simpson responds to the counter-arguments raised on the Evidence for Learning website.

Magic or reality: your choice, by Professor Adrian Simpson, Durham University

There are many comic collective nouns whose humour contains a grain of truth. My favourites include "a greed of lawyers", "a tun of brewers" and, appropriately here, "a disputation of academics". Disagreement is the lifeblood of academia and an essential component of intellectual advancement, even if that is annoying for those looking to academics for advice. 

Kay, Higgins and Vaughan (2018, hereafter KHV) recently published a blog post attempting to defend using effect size to compare the effectiveness of educational interventions, responding to critiques (Simpson, 2017; Lovell, 2018a). Some of KHV is easily dismissed as factually incorrect. For example, Gene Glass did not create effect size (Jacob Cohen wrote about it in the early 1960s), and the toolkit methodology is not applied consistently: at least one strand [setting and streaming] is based only on results for low attainers while other strands are not similarly restricted (that is quite apart from most studies in the strand being about within-class grouping!).

However, this response to KHV is not about extending the chain of point and counter-point, but about asking teachers and policy makers to check arguments for themselves: decisions about using precious educational resources need to lie with you, not with feuding faculty. The faculty need to state their arguments as clearly as possible, but readers need to check them: if I appeal to a simulation to illustrate the impact of range restriction on effect size (which I do in Simpson, 2017), can you repeat it – does it support the argument? If KHV claim the EEF Teaching and Learning Toolkit uses ‘padlock ratings’ to address the concern about comparing and combining effect sizes from studies with different control treatments, read the padlock rating criteria – do they discuss equal control treatments anywhere? Dig down and choose a few studies that underpin the Toolkit ratings – do the control groups in different studies have the same treatment?

So, in the remainder of this post, I invite you to test our arguments: are my analogies deceptive or helpful? Re-reading KHV’s post, do their points address the issues or are they spurious?

KHV’s definition of effect size shows it is a composite measure. The effectiveness of the intervention is one component, but so is the effectiveness of the control treatment, the spread of the sample of participants, the choice of measure etc. It is possible to use a composite measure as a proxy for one component factor, but only provided the ‘all other things equal’ assumption holds.
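For readers who want to check the ‘composite measure’ point numerically, here is a minimal sketch in Python. It is not taken from KHV or from Simpson (2017): it assumes effect size means the standardised mean difference (difference in group means divided by the pooled standard deviation), and every score and gain in it is hypothetical. The intervention adds exactly the same raw gain in all three comparisons; only the spread of the sample or the control treatment changes.

```python
from math import sqrt
from statistics import mean, stdev


def standardised_mean_difference(treatment, control):
    """Difference in group means divided by the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_variance = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_variance)


def shifted(scores, gain):
    """Return the scores with the same raw gain added to every participant."""
    return [score + gain for score in scores]


# Hypothetical baseline scores: same mean (60), different spread.
wide_sample = [40, 50, 60, 70, 80]
narrow_sample = [55, 58, 60, 62, 65]

INTERVENTION_GAIN = 5    # assumed raw gain from the intervention, identical everywhere
ACTIVE_CONTROL_GAIN = 3  # assumed raw gain from a more effective control treatment

# Same intervention, untreated control, wide spread of participants.
d_wide = standardised_mean_difference(
    shifted(wide_sample, INTERVENTION_GAIN), wide_sample
)
# Same intervention, untreated control, restricted spread of participants.
d_narrow = standardised_mean_difference(
    shifted(narrow_sample, INTERVENTION_GAIN), narrow_sample
)
# Same intervention, wide spread, but the control group also receives something useful.
d_active_control = standardised_mean_difference(
    shifted(wide_sample, INTERVENTION_GAIN), shifted(wide_sample, ACTIVE_CONTROL_GAIN)
)

print(f"wide sample, untreated control:   {d_wide:.2f}")
print(f"narrow sample, untreated control: {d_narrow:.2f}")
print(f"wide sample, active control:      {d_active_control:.2f}")
```

With these made-up numbers the three effect sizes come out at roughly 0.32, 1.31 and 0.13, even though the intervention itself is identical in each case. Try changing the samples and gains to see which components move the number.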

In the podcast I illustrated the ‘all other things equal’ assumption by analogy: when is the weight of a cat a proxy for its age? KHV didn’t like this, so I’ll use another: clearly the thickness of a plank of wood is a component of its cost, but when can the cost of a plank be a proxy for its thickness? I can reasonably conclude that one plank of wood is thicker than another plank on the basis of their relative cost only if all other components impinging on cost are equal (e.g. length, width, type of wood, timberyard’s pricing policy) and I can reasonably conclude that one timberyard on average produces thicker planks than another on the basis of relative average cost only if those other components are distributed equally at both timberyards. Without this strong assumption holding, drawing a conclusion about relative thickness on the basis of relative cost is a misleading category error.

In the same way, we can draw conclusions about relative effectiveness of interventions on the basis of relative effect size only with ‘all other things equal’; and we can compare average effect sizes as a proxy for comparing the average effectiveness of types of interventions only with ‘all other things equal’ in distribution.

So, when you are asked to conclude that one intervention is more effective than another because one study resulted in a larger effect size, check if ‘all other things equal’ holds (equal control treatment, equal spread of sample, equal measure and so on). If not, you should not draw the conclusion.

When the Teaching and Learning Toolkit invites you to draw the conclusion that the average intervention in one area is more effective than the average intervention in another because its average effect size is larger, check if ‘all other things equal’ holds for distributions of controls, samples and measures. If not, you should not draw the conclusion.
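The same check can be sketched for averages. The Python below is an illustration under stated assumptions, not the Toolkit’s actual procedure: the two ‘areas’, the study standard deviations and the raw gain are all hypothetical, and it takes a simple unweighted mean of effect sizes where real meta-analyses typically weight studies. Every study in both areas has an intervention with the same raw gain; the areas differ only in what proportion of their studies use restricted-range samples.

```python
from statistics import mean

RAW_GAIN = 5.0  # assumed: every study's intervention adds the same raw gain


def effect_size(raw_gain, sample_sd):
    """Standardised effect size: raw gain divided by the sample standard deviation."""
    return raw_gain / sample_sd


def average_effect_size(study_sds):
    """Unweighted mean effect size across studies (a simplification)."""
    return mean(effect_size(RAW_GAIN, sd) for sd in study_sds)


# Hypothetical areas: a restricted-range study has SD 5, a full-range study SD 15.
area_a_sds = [5.0] * 8 + [15.0] * 2  # 80% restricted-range studies
area_b_sds = [5.0] * 2 + [15.0] * 8  # 20% restricted-range studies

average_a = average_effect_size(area_a_sds)
average_b = average_effect_size(area_b_sds)

print(f"Area A average effect size: {average_a:.2f}")
print(f"Area B average effect size: {average_b:.2f}")
```

Area A’s average effect size is nearly double Area B’s here, although by construction no intervention in either area is more effective than any other. The gap comes entirely from the different distribution of sample spreads.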

Don’t rely on disputatious dons: dig into the detail of the studies and the meta-analyses. Does ‘feedback’ use proximal measures in the same proportion as ‘behavioural interventions’? Does ‘phonics’ use restricted ranges in the same proportion as ‘digital technologies’? Does ‘metacognition’ use the same measures as ‘parental engagement’? Is it true that the toolkit relies on ‘robust and appropriate comparison groups’, and would that anyway be enough to confirm the ‘all other things equal’ assumption?

KHV describe my work as ‘bad news’ because it destroys the magic of effect size. ‘Bad news’ may be a badge of honour to wear with the same ironic pride as decent journalists wear autocrats’ ‘fake news’ labels. However, I agree it can feel a little cruel to wipe away the enchantment of a magic show; one may think to oneself ‘isn’t it kinder to let them go on believing this is real, just for a little longer?’ However, educational policy making may be one of those times when we have to choose between rhetoric and reason, or between magic and reality. Check the arguments for yourself and make your own choice: are effect sizes a magical beginning of an evidence adventure, or a category error misdirecting teachers’ effort and resources?

Kay, J., Higgins, S. & Vaughan, T. (2018). The magic of meta-analysis. (Accessed 28/5/2018.)

Simpson, A. (2017). The misdirection of public policy: Comparing and combining standardised effect sizes. Journal of Education Policy, 32(4), 450-466.

Lovell, O. (2018a). ERRR #017. Adrian Simpson critiquing the meta-analysis. Education Research Reading Room Podcast. (Accessed 25/5/2018.)


  1. This can not be correct. If this held true this means EEF Toolkit and Visible Learning were incorrect. These tools are used by teachers across our world. This must be wrong.

  2. An excellent post Adrian which reinforces the need to examine the credibility of the arguments presented in reports of meta-analyses. The plank analogy is used very effectively to illustrate that effect size is a composite measure in which the effectiveness of the intervention is only one component. The task of checking that ‘all other things are equal’ will however be challenging for many teachers.