In a recently published paper, Simpson (2017) argues that ranking educational interventions by combining effect sizes from meta-analyses and meta-meta-analyses is fundamentally flawed. The assumption that statistical summaries of effect size provide an estimate of the impact of an educational intervention is shown to be false. Furthermore, effect sizes are open to researcher manipulation. As such, league tables of the effectiveness of interventions (Hattie, 2008) are potentially hierarchies of how open each intervention is to manipulation through research design. Consequently, such league tables provide little guidance for educators at national, school or classroom level.
The rest of this post will consist of the following:
- A brief introduction to effect sizes
- A brief summary of Simpson (2017)
- Other considerations vis a vis meta-analyses and meta-meta-analyses (Wiliam, 2016)
- The implications for school leaders and teachers of using hierarchies of intervention effect sizes (Hattie, 2008)
Effect sizes: A brief introduction
Put simply, an effect size is a way of estimating the strength or magnitude of a phenomenon. An effect size can be the result of an intervention, identified by comparing two groups: one group that received the intervention and a control group that did not. Alternatively, it can be used to measure the strength of the relationship between variables. For our purposes, we will focus on effect sizes that compare the difference between two groups, estimated using the following calculation.
Effect Size = [(Mean of experimental group) - (Mean of the control group)] / Standard deviation
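As a minimal sketch of this calculation, the snippet below computes a standardised mean difference (often called Cohen's d), which divides the difference in group means by a pooled standard deviation. The function name `cohens_d` and the test scores are my own illustrative assumptions, not taken from Simpson's paper, and other variants of the denominator exist.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    # The pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical test scores, invented purely for illustration.
treatment_scores = [62, 70, 68, 75, 66, 71]
control_scores = [60, 65, 63, 70, 61, 66]
print(round(cohens_d(treatment_scores, control_scores), 2))
```

Note that the same difference in means produces a larger effect size whenever the scores are less spread out, which is why the choices discussed in the rest of this post matter.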
Assumptions underpinning meta-analysis and meta-meta-analysis
Simpson (2017) argues there are two key assumptions associated with meta-analysis and meta-meta-analysis. First, a larger effect size is associated with greater educational significance. Second, two or more different studies of the same intervention can have their effect sizes combined to give a meaningful estimate of the intervention’s educational importance. However, Simpson identifies three reasons why these assumptions do not hold: different comparator groups, range restriction, and measure design.
Why these assumptions do not hold
Unequal Comparator Groups
Say we are combining the effect sizes of a couple of studies on the impact of written feedback. In one study, the results of a group of pupils who receive written feedback are compared with the results of pupils who receive verbal feedback. Let’s say that gives us an effect size of 0.6. In a second study, the results of pupils who receive written feedback are compared with those of pupils who receive only group feedback, giving an effect size of 0.4. We may be tempted to average the two effect sizes to find the effect size of written feedback, in this case 0.5. However, that would not give us an accurate estimate of the effect size of providing written feedback, because each study measures written feedback against a different alternative. An accurate estimate would require a study in which pupils who receive written feedback are compared with pupils who receive no feedback whatsoever. As such, it is simply not possible to accurately combine studies which have used different types of comparator groups.
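To make the comparator problem concrete, here is a rough sketch. It assumes, purely for illustration, that each type of feedback has a "true" effect relative to no feedback and that these effects simply subtract; all of the numbers and names are hypothetical. Two quite different scenarios produce exactly the same pair of published effect sizes (0.6 and 0.4), so the average of 0.5 cannot tell us how written feedback compares with no feedback.

```python
# Hypothetical "true" effects of each type of feedback relative to no feedback.
scenarios = [
    {"written": 0.7, "verbal": 0.1, "group": 0.3},
    {"written": 1.0, "verbal": 0.4, "group": 0.6},
]


def observed_comparisons(effects):
    """Return the two study results and their average under the additive assumption."""
    vs_verbal = effects["written"] - effects["verbal"]  # study 1: written vs verbal
    vs_group = effects["written"] - effects["group"]    # study 2: written vs group
    return vs_verbal, vs_group, (vs_verbal + vs_group) / 2


for effects in scenarios:
    vs_verbal, vs_group, average = observed_comparisons(effects)
    print(
        "written vs none:", effects["written"],
        "| study 1:", round(vs_verbal, 2),
        "| study 2:", round(vs_group, 2),
        "| average:", round(average, 2),
    )
```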
Range Restriction
This time we are going to run the same two studies, but in this example we are going to restrict the range of pupils used in one of them. In the first study, only highly attaining pupils are included. In the second study, the pupils involved are drawn from the whole ability range. For at least two reasons, this may lead to a change in the effect size of receiving written feedback. First, restricting the range removes from the study pupils who may not know how to respond to the feedback. Second, it may well be that highly attaining pupils have less ‘head-room’ to demonstrate the impact of either type of feedback. The consequence is that the range of pupils included in a study influences both the measured impact of an intervention and its effect size. As such, unless a meta-analysis combines studies which use the same range of pupils, the combined effect size is unlikely to be an accurate estimate of the ‘true’ effect size of the intervention.
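The simulation below is a rough sketch of one way range restriction can move an effect size. It assumes every pupil gains the same raw amount from the intervention (so it deliberately ignores the ‘head-room’ issue), that ability is normally distributed, and that the restricted study admits only pupils above a cut-off. The function names, the cut-off and all of the numbers are hypothetical. Because the selected pupils' scores are less spread out, the same raw gain is divided by a smaller standard deviation.

```python
import random
from math import sqrt
from statistics import mean, stdev


def standardised_effect(treated, control):
    """Difference in means over a pooled SD (groups assumed to be of similar size)."""
    pooled_sd = sqrt((stdev(treated) ** 2 + stdev(control) ** 2) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def run_study(rng, n_pupils, raw_gain, min_ability=None):
    """Simulate one study; pupils below min_ability (if given) are excluded."""
    treated, control = [], []
    for _ in range(n_pupils):
        ability = rng.gauss(0, 1)
        if min_ability is not None and ability < min_ability:
            continue  # excluded by the study's selection criterion
        score = ability + rng.gauss(0, 0.5)  # test score = ability + measurement noise
        if rng.random() < 0.5:
            treated.append(score + raw_gain)  # identical raw gain for every treated pupil
        else:
            control.append(score)
    return standardised_effect(treated, control)


rng = random.Random(1)
d_whole_range = run_study(rng, 40000, raw_gain=0.3)
d_high_attainers = run_study(rng, 40000, raw_gain=0.3, min_ability=1.0)
print(round(d_whole_range, 2), round(d_high_attainers, 2))
```

Under these assumptions the restricted study reports the larger effect size even though the intervention does exactly the same thing in both. A real ceiling effect could push the result in the other direction, which is precisely why the two numbers should not be averaged.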
Measure Design
Finally, we are going to look at the impact of measure design on effect sizes. Simpson (2017) argues that researchers can directly influence effect size through the choices they make about how to measure the effect. First, if the measure used is specifically focussed on the effect of the intervention, this will lead to an increase in effect size. For example, suppose you are undertaking an intervention to improve algebra scores. You could use a measure specifically designed to ‘measure’ algebra, or you could use a measure of general mathematical competence which includes an element of algebra. In this situation, the effect size for the former will be greater than for the latter, due to the precision of the measure used. Second, the researcher could increase the number of test items. Simpson states that, for a relatively well designed test, having two questions instead of one increases the effect size by 20%, and using twenty questions can lead to a doubling of the effect size. Simulations suggest that increasing the number of questions used to measure the effectiveness of an intervention may lead to effect size inflation of 400%.
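One way to see why more test items can inflate an effect size is the simple model sketched below. It is my own illustration rather than Simpson's simulation: it assumes a pupil's test score is the average of several items, each equal to the pupil's ability plus independent random noise, and the parameter values are hypothetical. Averaging more items cancels out more of the noise, so the scores are less spread out and the same raw gain becomes a larger standardised effect.

```python
from math import sqrt


def expected_effect_size(raw_gain, ability_sd, item_noise_sd, n_items):
    """Expected standardised effect when the score is the mean of n_items noisy items."""
    # Averaging n_items independent items divides the noise variance by n_items.
    score_sd = sqrt(ability_sd ** 2 + item_noise_sd ** 2 / n_items)
    return raw_gain / score_sd


# Hypothetical values: the intervention itself (raw_gain) never changes.
for n_items in (1, 2, 20):
    d = expected_effect_size(raw_gain=0.3, ability_sd=1.0, item_noise_sd=2.0, n_items=n_items)
    print(n_items, "items ->", round(d, 3))
```

In this model the effect size keeps rising with the number of items but can never exceed the raw gain divided by the spread in ability. How quickly it rises depends entirely on how noisy each item is, so the exact percentages will differ from those Simpson reports.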
There are other considerations regarding the limitations of effect sizes and meta-analysis. Wiliam (2016) identifies four other limitations of effect sizes. First, the intensity and duration of the intervention will have an impact on the resulting effect size. Second, there is the file drawer problem: we don’t know how many similar interventions have been carried out which did not generate statistically significant results and, as a result, have not been published. Polanin et al. (2016), reviewing 383 meta-analyses, found that published research yielded larger effect sizes than unpublished studies, which provides evidence of publication bias, i.e. a phenomenon where studies with large and/or statistically significant effects have a greater chance of being published than studies with small or null effects. Third, there is the age dependence of effect size. All other things being equal, the older the pupils, the smaller the effect size, which is a result of greater diversity in the population of older pupils compared to younger pupils. Finally, Wiliam raises the issue of the generalisability of the studies. One of the problems of trying to calculate the overall effect size of an intervention is that much of the published research is undertaken by psychology professors in laboratories on their own undergraduate students. These students will have little in common with, say, Key Stage 2 or Key Stage 3 pupils, which substantially limits the generalisability of the findings.
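The file drawer problem can be illustrated with a small simulation. This is a sketch under assumptions of my own: every study tests the same intervention with the same modest true effect, each study is small, and only studies whose result clears an approximate significance threshold are ‘published’. The sample sizes, the true effect and the threshold formula are hypothetical choices for illustration, not figures from Wiliam or Polanin et al.

```python
import random
from math import sqrt
from statistics import mean, stdev


def study_effect_size(rng, n_per_group, true_effect):
    """Simulate one small two-group study and return its standardised effect size."""
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
    pooled_sd = sqrt((stdev(treated) ** 2 + stdev(control) ** 2) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


rng = random.Random(0)
n_per_group = 30
all_effects = [study_effect_size(rng, n_per_group, true_effect=0.2) for _ in range(2000)]

# Rough significance cut-off: 1.96 approximate standard errors above zero.
threshold = 1.96 * sqrt(2 / n_per_group)
published = [d for d in all_effects if d > threshold]

print("all studies:      ", round(mean(all_effects), 2))
print("published studies:", round(mean(published), 2))
```

Every published study in this sketch reports an effect well above the true value of 0.2, simply because the studies that landed closer to the truth stayed in the file drawer. A meta-analysis that can only see the published studies inherits that inflation.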
So what are the implications for teachers and school leaders who wish to use Hattie’s hierarchy of the educational significance of interventions?
For a start, as Simpson (2017) argues, league tables of effect sizes may reflect openness to the manipulation of outcomes through research design. In other words, Hattie’s hierarchy may not reflect the educational significance of interventions but rather the sensitivity of each intervention to measurement. As such, if teachers or school leaders use Hattie’s league table of intervention effectiveness to choose which interventions to prioritise, they are probably looking at the wrong hierarchy.
Second, if teachers and school leaders wish to use effect sizes generated by research to help prioritise interventions, then it is necessary to look at the original research. And when aggregating studies, make sure you are looking at studies which use the same type of comparator groups, range of pupils, and measurement design.
Third, it requires teachers and school leaders to commit to on-going professional development and engagement with research output. With that in mind, the recent announcement by the Chartered College of Teaching that members will be able to access research which is currently behind paywalls could not be more timely.
*In this section I’m pushing the boundaries of both my understanding of the impact of measure design on effect size and my ability to communicate the core message. I hope I have made my explanations as simple as possible, but not simpler.
HATTIE, J. 2008. Visible learning: A synthesis of over 800 meta-analyses relating to achievement, Routledge.
POLANIN, J. R., TANNER-SMITH, E. E. & HENNESSY, E. A. 2016. Estimating the difference between published and unpublished effect sizes: A meta-review. Review of Educational Research, 86, 207-236.
SIMPSON, A. 2017. The misdirection of public policy: Comparing and combining standardised effect sizes. Journal of Education Policy.
WILIAM, D. 2016. Leadership for teacher learning, West Palm Beach, Learning Sciences International.