Small class sizes for improving student achievement in primary and secondary schools: a systematic review

May 3, 2023, Campbell systematic reviews

Small class sizes linked to better student achievement in primary and secondary schools

AI simplified

Abstract

The evidence suggests a small positive effect on reading achievement from reducing class size.

A total of 127 studies from 41 countries were included, analyzing various populations of students from kindergarten to grade 12.
The overall effect size for reading achievement was positive and statistically significant, with a weighted average of 0.11.
In contrast, the effect on mathematics achievement was negative and statistically insignificant, with a weighted average of -0.03.
Among studies analyzing data from the STAR experiment, all indicated positive effects for both reading and mathematics, but these were still small.
There is a 53% chance that a student in a smaller class will score higher in reading compared to a student in a larger class.
The findings suggest that while reducing class size may have some positive impact, the effects are modest and may not justify the associated costs.

UNLABELLED: This Campbell systematic review examines the impact of class size on academic achievement. The review summarises findings from 148 reports from 41 countries. Ten studies were included in the meta-analysis. Included studies concerned children in grades kindergarten to 12 (or the equivalent in European countries) in general education. The primary focus was on measures of academic achievement. All study designs that used a well-defined control group were eligible for inclusion. A total of 127 studies, consisting of 148 papers, met the inclusion criteria. These 127 studies analysed 55 different populations from 41 different countries. A large number of studies (45) analysed data from the Student Teacher Achievement Ratio (STAR) experiment which was for class size reduction in grade K-3 in the US in the eighties. However only ten studies, including four of the STAR programme, could be included in the meta-analysis. Overall, the evidence suggests at best a small effect on reading achievement. There is a negative, but statistically insignificant, effect on mathematics. For the non-STAR studies the primary study effect sizes for reading were close to zero but the weighted average was positive and statistically significant. There was some inconsistency in the direction of the primary study effect sizes for mathematics and the weighted average effect was negative and statistically non-significant. The STAR results are more positive, but do not change the overall finding. All reported results from the studies analysing STAR data indicated a positive effect of smaller class sizes for both reading and maths, but the average effects are small.

PLAIN LANGUAGE SUMMARY: Small class size has at best a small effect on academic achievement

Reducing class size is seen as a way of improving student performance. But larger class sizes help control education budgets. The evidence suggests at best a small effect on reading achievement. There is a negative, but statistically insignificant, effect on mathematics, so it cannot be ruled out that some children may be adversely affected.

What is this review about?

Increasing class size is one of the key variables that policy makers can use to control spending on education. But the consensus among many in education research is that smaller classes are effective in improving student achievement which has led to a policy of class size reductions in a number of US states, the UK, and the Netherlands. This policy is disputed by those who argue that the effects of class size reduction are only modest and that there are other more cost-effective strategies for improving educational standards. Despite the important policy and practice implications of the topic, the research literature on the educational effects of class-size differences has not been clear. This review systematically reports findings from relevant studies that measure the effects of class size on academic achievement.

What are the main findings of this review?

What studies are included?

Included studies concerned children in grades kindergarten to 12 (or the equivalent in European countries) in general education. The primary focus was on measures of academic achievement. All study designs that used a well-defined control group were eligible for inclusion. A total of 127 studies, consisting of 148 papers, met the inclusion criteria. These 127 studies analysed 55 different populations from 41 different countries. A large number of studies (45) analysed data from the Student Teacher Achievement Ratio (STAR) experiment which was for class size reduction in grade K-3 in the US in the eighties. However only ten studies, including four of the STAR programme, could be included in the meta-analysis.

What are the main results?

Overall, the evidence suggests at best a small effect on reading achievement. There is a negative, but statistically insignificant, effect on mathematics. For the non-STAR studies the primary study effect sizes for reading were close to zero but the weighted average was positive and statistically significant. There was some inconsistency in the direction of the primary study effect sizes for mathematics and the weighted average effect was negative and statistically non-significant. The STAR results are more positive, but do not change the overall finding. All reported results from the studies analysing STAR data indicated a positive effect of smaller class sizes for both reading and maths, but the average effects are small.

What do the findings of this review mean?

There is some evidence to suggest that there is an effect of reducing class size on reading achievement, although the effect is very small. There is no significant effect on mathematics achievement, though the average is negative meaning a possible adverse impact on some students cannot be ruled out. The overall reading effect corresponds to a 53 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population of larger classes. This is a very small effect. Class size reduction is costly. The available evidence points to no or only very small effect sizes of small classes in comparison to larger classes. Moreover, we cannot rule out the possibility that small classes may be counterproductive for some students. It is therefore crucial to know more about the relationship between class size and achievement in order to determine where money is best allocated.

How up-to-date is this review?

The review authors searched for studies published up to February 2017. This Campbell systematic review was published in 2018.

EXECUTIVE SUMMARY/ABSTRACT

BACKGROUND: Increasing class size is one of the key variables that policy makers can use to control spending on education. Reducing class size to increase student achievement is an approach that has been tried, debated, and analysed for several decades. Despite the important policy and practice implications of the topic, the research literature on the educational effects of class-size differences has not been clear. The consensus among many in education research that smaller classes are effective in improving student achievement has led to a policy of class size reductions in a number of U.S. states, the United Kingdom, and the Netherlands. This policy is disputed by those who argue that the effects of class size reduction are only modest and that there are other more cost-effective strategies for improving educational standards.

OBJECTIVES: The purpose of this review is to systematically uncover relevant studies in the literature that measure the effects of class size on academic achievement. We will synthesize the effects in a transparent manner and, where possible, we will investigate the extent to which the effects differ among different groups of students such as high/low performers, high/low income families, or members of minority/non-minority groups, and whether timing, intensity, and duration have an impact on the magnitude of the effect.

SEARCH METHODS: Relevant studies were identified through electronic searches of bibliographic databases, internet search engines and hand searching of core journals. Searches were carried out to February 2017. We searched to identify both published and unpublished literature. The searches were international in scope. Reference lists of included studies and relevant reviews were also searched.

SELECTION CRITERIA: The intervention of interest was a reduction in class size. We included children in grades kindergarten to 12 (or the equivalent in European countries) in general education. The primary focus was on measures of academic achievement. All study designs that used a well-defined control group were eligible for inclusion. Studies that utilized qualitative approaches were not included.

DATA COLLECTION AND ANALYSIS: The total number of potential relevant studies constituted 8,128 hits. A total of 127 studies, consisting of 148 papers, met the inclusion criteria and were critically appraised by the review authors. The 127 studies analysed 55 different populations from 41 different countries.

A large number of studies (45) analysed data from the STAR experiment (class size reduction in grade K-3) and its follow up data.

Of the 82 studies not analysing data from the STAR experiment, only six could be used in the data synthesis. Fifty-eight studies could not be used in the data synthesis as they were judged to have too high risk of bias either due to confounding (51), other sources of bias (4) or selective reporting of results (3). Eighteen studies did not provide enough information enabling us to calculate an effect size and standard error or did not provide results in a form enabling us to use it in the data synthesis.

Meta-analysis was used to examine the effects of class size on student achievement in reading and mathematics. Random effects models were used to pool data across the studies not analysing STAR data. Pooled estimates were weighted using inverse variance methods, and 95% confidence intervals were estimated. Effect sizes were measured as standardised mean differences (SMD). It was only possible to perform a meta-analysis by the end of the treatment year (end of the school year).

Four of the studies analysing STAR data provided effect estimates that could be used in the data synthesis. The four studies differed in terms of both the chosen comparison condition and decision rules in selecting a sample for analysis. Which of these four studies' effect estimates should be included in the data synthesis was not obvious as the decision rule (concerning studies using the same data set) as described in the protocol could not be used. Contrary to usual practice we therefore report the results of all four studies and do not pool the results with the studies not analysing STAR data except in the sensitivity analysis. We took into consideration the ICC in the results reported for the STAR experiment and corrected the effect sizes and standard errors using ρ = 0.22. No adjustment due to clustering was necessary for the studies not analysing STAR data.

Sensitivity analysis was used to evaluate whether the pooled effect sizes were robust across components of methodological quality, in relation to inclusion of a primary study result with an unclear sign, inclusion of effect sizes from the STAR experiment and to using a one-student reduction in class size in studies using class size as a continuous variable.

RESULTS: All studies not analysing STAR data reported outcomes by the end of the treatment (end of the school year) only. The STAR experiment was a four-year longitudinal study with outcomes reported by the end of each school year. The experiment was conducted to assess the effectiveness of small classes compared with regular-sized classes and of teachers' aides in regular-sized classes on improving cognitive achievement in kindergarten and in the first, second, and third grades. The goal of the STAR experiment was to have approximately 100 small classes with 13-17 students (S), 100 regular classes with 22-25 students (R), and 100 regular with aide classes with 22-25 students (RA).

Of the six studies not analysing STAR, only five were used in the meta-analysis as the direction of the effect size in one study was unclear. The studies were from the USA, the Netherlands and France; one was an RCT and five were NRS. The grades investigated spanned kindergarten to grade 3, and one study investigated grade 10. The sample sizes varied; the smallest study investigated 104 students and the largest study investigated 11,567 students. The class size reductions varied from a minimum of one student in four studies, a minimum of seven students in another study to a minimum of 8 students in the last study.

All outcomes were scaled such that a positive effect size favours the students in small classes, i.e. when an effect size is positive, a class size reduction improves the students' achievement.

Primary study effect sizes for reading lay in the range -0.08 to 0.14. Three of the study-level effects were statistically non-significant. The weighted average was positive and statistically significant. The random effects weighted standardised mean difference was 0.11 (95% CI 0.05 to 0.16) which may be characterised as small. There is some inconsistency in the direction of the effect sizes between the primary studies. Primary study effect sizes for mathematics lay in the range -0.41 to 0.11. Two of the study-level effects were statistically non-significant. The weighted average was negative and statistically non-significant. The random effects weighted standardised mean difference was -0.03 (95% CI -0.22 to 0.16). There is some inconsistency in the direction as well as the magnitude of the effect sizes between the primary studies.

All reported results from the four studies analysing STAR data indicated a positive effect favouring the treated; all of the study-level effects were statistically significant. The study-level effect sizes for reading varied between 0.17 and 0.34 and the study-level effect sizes for mathematics varied between 0.15 and 0.33.

There were no appreciable changes in the results when we included the extremes of the range of effect sizes from the STAR experiment. The reading outcome lost statistical significance when the effect size from the primary study reporting a result with an unclear direction was included with a negative sign and when the results from the studies using class size as a continuous variable were included with a one student reduction in class size instead of a standard deviation reduction in class size. Otherwise, there were no appreciable changes in the results.

AUTHORS' CONCLUSIONS: There is some evidence to suggest that there is an effect of reducing class size on reading achievement, although the effect is very small. We found a statistically significant positive effect of reducing the class size on reading. The effect on mathematics achievement was not statistically significant, thus it is uncertain if there may be a negative effect.

The overall reading effect corresponds to a 53 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population of larger classes. The overall effect on mathematics achievement corresponds to a 49 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population of larger classes.

Class size reduction is costly and the available evidence points to no or only very small effect sizes of small classes in comparison to larger classes. Taking the individual variation in effects into consideration, we cannot rule out the possibility that small classes may be counterproductive for some students. It is therefore crucial to know more about the relationship between class size and achievement and how it influences what teachers and students do in the classroom in order to determine where money is best allocated.
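
The pooling step described in the executive summary above (inverse-variance weights, a random effects model, 95% confidence intervals) can be sketched as follows. This is a minimal illustration, not the review authors' code. The review does not say which between-study variance estimator was used, so the DerSimonian-Laird estimator is an assumption here, and the effect sizes and standard errors are invented for the example.

```python
import math


def random_effects_pool(effects, std_errors):
    """Pool study effect sizes with a random effects model.

    Assumption: between-study variance (tau^2) is estimated with the
    DerSimonian-Laird method; the review does not name its estimator.
    Returns (pooled effect, (CI lower, CI upper), tau^2).
    """
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    weights = [1 / se ** 2 for se in std_errors]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q measures how much the studies disagree.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random effects weights add tau^2 to each study's variance.
    re_weights = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se_pooled = math.sqrt(1 / sum(re_weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    return pooled, ci, tau2


# Hypothetical study-level SMDs and standard errors (not the review's data).
hypothetical_effects = [-0.08, 0.02, 0.05, 0.12, 0.14]
hypothetical_ses = [0.10, 0.06, 0.08, 0.03, 0.04]

pooled, ci, tau2 = random_effects_pool(hypothetical_effects, hypothetical_ses)
print(f"pooled SMD = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), tau^2 = {tau2:.4f}")
```

Studies with smaller standard errors receive more weight, which is how a weighted average can be statistically significant even when several individual study effects are not.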

Key numbers

0.11

Increase in Reading Achievement

Weighted average effect size from included studies.

-0.03

Decrease in Mathematics Achievement

Weighted average effect size from included studies.

53%

Probability of Benefit in Reading

Probability-of-benefit statistic for reading.
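
The 53% figure appears to follow from the pooled effect size through a standard conversion. The sketch below assumes normally distributed scores with equal variances in both groups, in which case the probability that a randomly chosen small-class student outscores a randomly chosen larger-class student is Φ(SMD / √2). The review text here does not state the formula it used, so treat this as an illustrative reconstruction.

```python
from statistics import NormalDist


def probability_of_superiority(smd: float) -> float:
    """P(random treated score > random comparison score).

    Assumes both score distributions are normal with equal variance.
    """
    return NormalDist().cdf(smd / 2 ** 0.5)


reading = probability_of_superiority(0.11)   # pooled reading SMD
maths = probability_of_superiority(-0.03)    # pooled mathematics SMD
print(f"reading: {reading:.2f}, mathematics: {maths:.2f}")
```

An effect size of zero gives exactly 50%, so 53% for reading sits only slightly above a coin toss.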

Full Text

What this is

This systematic review examines the effects of class size on academic achievement across various studies.
It includes 127 studies from 41 countries, focusing on children in grades K-12.
The review aims to clarify the impact of smaller class sizes on reading and mathematics performance.

Essence

Reducing class size has a small positive effect on reading achievement but no significant effect on mathematics. The evidence suggests that smaller classes may not be cost-effective.

Key takeaways

Smaller class sizes lead to a statistically significant positive effect on reading achievement, with an average effect size of 0.11.
Mathematics achievement shows a negative effect, with an average effect size of -0.03, which is statistically non-significant.
The STAR experiment results indicate a positive effect for both reading and mathematics, but the overall findings suggest only modest benefits from class size reductions.

Caveats

Many of the included studies were judged to have too high a risk of bias to be used in the data synthesis, so only a small number of studies (ten) contributed to the meta-analysis.
The geographical coverage of the synthesis is narrow: the non-STAR studies in the meta-analysis came from the USA, France, and the Netherlands, which may limit generalizability.

Definitions

Standardised Mean Difference (SMD): A statistical measure used to quantify the effect size across studies, indicating the difference in means relative to the standard deviation.
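
As a rough illustration of the definition above, the following sketch computes an SMD as the difference in group means divided by the pooled standard deviation (Cohen's d). The scores are invented, and the review may have used a different variant, for example one with a small-sample correction.

```python
import math
from statistics import mean, variance


def standardised_mean_difference(treated, comparison):
    """Difference in means divided by the pooled standard deviation.

    A positive value favours the treated group (here: small classes).
    """
    n1, n2 = len(treated), len(comparison)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(comparison)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(comparison)) / math.sqrt(pooled_var)


# Hypothetical reading scores, for illustration only.
small_class_scores = [52, 55, 61, 58, 49, 60]
larger_class_scores = [50, 54, 59, 57, 48, 58]

smd = standardised_mean_difference(small_class_scores, larger_class_scores)
print(f"SMD = {smd:.2f}")
```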

1 Background

1.1 THE PROBLEM, CONDITION OR ISSUE

Increasing class size is one of the key variables that policy makers can use to control spending on education. The average class size at the lower secondary level is 23 students in OECD countries, but there are significant differences, ranging from over 32 in Japan and Korea to 19 or below in Estonia, Iceland, Luxembourg, Slovenia and the United Kingdom (OECD, 2012). On the other hand, reducing class size to increase student achievement is an approach that has been tried, debated, and analysed for several decades. Between 2000 and 2009, many countries invested additional resources to decrease class size (OECD, 2012).

Despite the important policy and practice implications of the topic, the research literature on the educational effects of class‐size differences has not been clear. A large part of the research on the effects of class size has found that smaller class sizes improve student achievement (for example Finn & Achilles, 1999; Konstantopoulos, 2009; Molnar et al., 1999; Schanzenbach, 2007). The consensus among many in education research that smaller classes are effective in improving student achievement has led to a policy of class size reductions in a number of U.S. states, the United Kingdom, and the Netherlands. This policy is disputed by those who argue that the effects of class size reduction are only modest and that there are other more cost‐effective strategies for improving educational standards (Hattie, 2005; Hedges, Laine, & Greenwald, 1994; Rivkin, Hanushek, & Kain, 2005). There is no consensus in the literature as to whether class size reduction can pass a cost‐benefit test (Dustmann, Rajah & van Soest, 2003; Dynarski, Hyman & Schanzenbach, 2011; Finn, Gerber & Boyd‐Zaharias, 2005; Muenning & Woolf, 2007).

As it is costly to reduce class size, it is important to consider the types of students who might benefit most from smaller class sizes and to consider the timing, intensity, and duration of class size reduction as well. Low socioeconomic status is strongly associated with low school performance. Results from the Programme for International Student Assessment (PISA) point to the fact that most of the students who perform poorly in PISA are from socio‐economically disadvantaged backgrounds (OECD, 2010). Across OECD countries, a student from a more socio‐economically advantaged background outperforms a student from an average background by about one year's worth of education in reading, and by even more in comparison to students with low socio‐economic background. Results from PISA also show that some students with low socioeconomic status excel in PISA, demonstrating that overcoming socio‐economic barriers to academic achievement is indeed possible (OECD, 2010).

Smaller class size has been shown to be more beneficial for students from socioeconomically disadvantaged backgrounds (Biddle & Berliner, 2002). Evidence from the Tennessee STAR randomised controlled trial showed that minority students, students living in poverty, and students who were educationally disadvantaged benefitted the most from reduced class size (Finn, 2002; Word et al., 1994). Further, evidence from a controlled, though not randomised, trial, Wisconsin's Student Achievement Guarantee in Education (SAGE) program, showed that students from minority and low‐income families benefitted the most from reduced class size (Molnar et al., 1999). Thus, rather than implementing costly universal class size reduction policies, it may be more economically efficient to target schools with high concentrations of socioeconomically disadvantaged students for class size reductions.

In the case of the timing of class size reduction, the question is: when does class size reduction have the largest effect? Ehrenberg, Brewer, Gamoran and Willms (2001) hypothesized that students educated in small classes during the early grades may be more likely to develop working habits and learning strategies that enable them to better take advantage of learning opportunities in later grades. According to Bascia and Fredua‐Kwarteng (2008), researchers agree that class size reduction is most effective in the primary grades. Biddle and Berliner (2002) likewise conclude that empirical research shows class size reduction to be most effective in the early grades, and the evidence from both STAR and SAGE backs this conclusion up (Finn, Gerber, Achilles, & Boyd‐Zaharias, 2001; Smith, Molnar, & Zahorik, 2003). Of course, there is still the possibility that smaller classes may also be advantageous at later strategic points of transition, for example, in the first year of secondary education. Research evidence on this possibility is, however, needed.

For intensity, the question is: how small does a class have to be in order to optimize the advantage? For example, large gains are attainable when class size is below 20 students (Biddle & Berliner, 2002; Finn, 2002) but gains are also attainable if class size is not below 20 students (Angrist & Lavy, 2000; Borland, Howsen & Trawick, 2005; Fredrikson, Öckert & Oosterbeek, 2013; Schanzenbach, 2007). It has been argued that the impact of class size reduction of different sizes and from different baseline class sizes is reasonably stable and more or less linear when measured per student (Angrist & Pischke, 2009, see page 267; Schanzenbach, 2007). Other researchers argue that the effect of class size is not only non‐linear but also non‐monotonic, implying that an optimal class size exists (Borland, Howsen & Trawick, 2005). Thus, the question of whether the size of reduction and initial class size matters for the magnitude of gain from small classes is still an open question.

Finally, researchers agree that the length of the intervention (number of years spent in small classes) is linked with the sustainability of benefits (Biddle & Berliner, 2002; Finn, 2002; Grissmer, 1999; Nye, Hedges & Konstantopoulos, 1999) whereas the evidence on whether more years spent in small classes leads to larger gains in academic achievement is mixed (Biddle & Berliner, 2002; Egelson, Harman, Hood & Achilles, 2002; Finn, 2002; Krueger, 1999). How long a student should remain in a small class before eventually returning to a class of regular size is an unanswered question.

1.2 THE INTERVENTION

The intervention in this systematic review is a reduction in class size. What constitutes a reduced class size? This seemingly simple issue has confounded the understanding of outcomes of the research and it is one of the reasons there is disagreement about whether class size reduction works (Graue, Hatch, Rao & Oen, 2007).

Two terms are used to describe the intervention, class size and student‐teacher ratio, and it is important to distinguish between these two terms. The first, class size, focuses on reducing group size and, hence, is operationalized as the number of students a teacher instructs in a classroom at a point in time. For this definition, a reduced number of students are assigned to a class in the belief that teachers will then develop an in‐depth understanding of student learning needs through more focused interactions, better assessment, and fewer disciplinary problems. These mechanisms are based on the dynamics of a smaller group (Ehrenberg et al., 2001). The second term is student‐teacher ratio and is often used as a proxy for class size, defined as a school's total student enrollment divided by the number of its full time teachers.

From this perspective, lowering the ratio of students to teachers provides enhanced opportunities for learning. The concept of using student‐teacher ratios as a proxy for class size is based on a view of teachers as units of expertise and is less focused on the student‐teacher relationship. Increasing the relative units of expertise available to students increases learning, but does not rely on particular teacher‐student interactions (Graue et al., 2007).

Although class size and student‐teacher ratio are related, they involve different assumptions about how a reduction changes the opportunities for students and teachers. In addition, the discrepancy between the two can vary depending on teachers' roles and the amount of time teachers spend in the classroom during the school day.
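
A small arithmetic sketch shows how the two measures can diverge. It rests on a deliberately simplified, hypothetical model: every student is in a class for the whole school day, each class has exactly one teacher at a time, and teachers spend a fixed share of the day in the classroom. The school figures are invented.

```python
def student_teacher_ratio(enrolment: int, full_time_teachers: int) -> float:
    """School-level ratio: total enrolment divided by full-time teachers."""
    return enrolment / full_time_teachers


def average_class_size(enrolment: int, full_time_teachers: int,
                       share_of_day_teaching: float) -> float:
    """Average number of students in a classroom at a point in time.

    Hypothetical simplification: all students are in class all day and
    each class is taught by one teacher at a time.
    """
    classes_running = full_time_teachers * share_of_day_teaching
    return enrolment / classes_running


# Hypothetical school: 300 students, 15 teachers who teach 80% of the day.
ratio = student_teacher_ratio(300, 15)
class_size = average_class_size(300, 15, 0.8)
print(f"student-teacher ratio: {ratio:.1f}, average class size: {class_size:.1f}")
```

In this example the school reports a ratio of 20 students per teacher while a typical classroom holds 25 students, which illustrates why the ratio can understate actual class size.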

In this review, the intervention is class size reduction. Studies only considering average class size measured as student‐teacher ratio at school level (or higher levels) will not be eligible. Neither will studies where the intervention is the assignment of an extra teacher (or teaching assistants or other adults) to a class be eligible. The assignment of additional teachers (or teaching assistants or other adults) to a classroom is not the same as reducing the size of the class, and this review focuses exclusively on the effects of class size in the sense of number of students in a classroom.

1.3 HOW THE INTERVENTION MIGHT WORK

Smaller classes allow teachers to adapt their instruction to the needs of individual students. For example, teachers' instruction can be more easily adapted to the development of the individual students. The concept of adaptive education refers to instruction that is adapted to meet the individual needs and abilities of students (Houtveen, Booij, de Jong & van de Grift, 1999). With adaptive education, some students receive more time, instruction, or help from the teacher than other students.

Research has shown that in smaller classes, teachers have more time and opportunity to give individual students the attention they need (Betts & Shkolnik, 1999; Blatchford & Mortimore, 1994; Bourke, 1986; Molnar et al., 1999; Molnar et al., 2000; Smith & Glass, 1980). Additionally, less pressure may be placed upon the physical space and resources within the classroom. Both of these factors may be connected to a reduction in the pupil misbehaviour and disciplinary problems detected in larger classes (Wilson, 2002).

In smaller classes, it is possible for students with low levels of ability to receive more attention from the teacher, with the result that not all students necessarily profit equally. More generally, teachers are able to devote more of their time to educational content (the tasks students must complete) and less to classroom management (for example, maintaining order) in smaller classes. An increased amount of time spent on task contributes to enhanced academic achievement.

It has often been pointed out, however, that teachers do not necessarily change the way they teach when faced with smaller classes and therefore do not take advantage of all of the benefits offered by a smaller class size. Research suggests that such situations do indeed exist in practice (e.g., Blatchford & Mortimore, 1994; Shapson, Wright, Eason & Fitzgerald, 1980).

Anderson (2000) addressed the question of why reductions in class size should be expected to enhance student achievement and part of his theory was tested in Annevelink, Bosker and Doolaard (2004). To explain the relationship between class size and achievement, Anderson developed a causal model, which starts with reduced class size and ends with student achievement. Anderson noted that small classes would not, in and of themselves, solve all educational problems. The number of students in a classroom can have only an indirect effect on student achievement. As Zahorik (1999) states: “Class size, of course, cannot influence academic achievement directly. It must first influence what teachers and students do in the classroom before it can possibly affect student learning” (p. 50). In other words, what teachers do matters. Anderson's causal model of the effect of reduced class size on student achievement is depicted in Figure 1.

Anderson's model predicts that a reduced class size will have direct positive effects on the following three variables: 1) Disciplinary problems, 2) Knowledge of student, and 3) Teacher satisfaction and enthusiasm. Each of these variables, in turn, begins a separate path. Fewer disciplinary problems are expected to lead to more instructional time, which in combination with teacher knowledge of the external test, produces greater opportunity to learn. In combination with more appropriate, personalised instruction and greater teacher effort, more instructional time potentially produces greater student engagement in learning as well as more in‐depth treatment of content.

Greater knowledge of students is expected to provide more appropriate personalised instruction, and in combination with more instructional time and greater teacher effort, potentially produces greater student engagement in learning and more in‐depth treatment of content.

Greater teacher satisfaction and enthusiasm are expected to result in greater teacher effort, which in combination with more instructional time and more appropriate, personalised instruction produces greater student engagement in learning and more in‐depth treatment of content.

Finally greater student achievement is the expected result of a combination of the three variables: Greater opportunity to learn, greater student engagement in learning, and more in‐depth treatment of content.

The path from greater knowledge of students through appropriate, personalised instruction and student engagement in learning to student achievement is tested in Annevelink et al. (2004) on students in Grade 1 in 46 Dutch schools in the school year 1999‐2000. Personalised instruction is operationalised as the number of specific types of interactions. Teachers seeking to provide more personalised instruction are expected to provide fewer interactions directed at the organization and personal interactions, and more interactions directed at the task and praising interactions. These changes in interactions are expected to result in a situation where the student spends more time on task.

The level of student engagement is operationalised as the amount of time a student spends on task. Students who spend more time on task are expected to achieve higher learning results.

Smaller classes were related to more interactions of all kinds. More task‐directed and praising interactions resulted in more time spent on task, which in turn was related to higher student achievement, as expected. Note that the increase in organizational and personal interactions in smaller classes was contrary to expectations, whereas the increase in task‐directed and praising interactions was consistent with expectations (Annevelink et al., 2004).

Figure 1

An explanation of the impact of class size on student achievement (Anderson, 2000).

1.4 WHY IT IS IMPORTANT TO DO THE REVIEW

Class size is one of the most researched educational interventions in social science, yet there is no clear consensus on the effectiveness of small class sizes for improving student achievement. While one strand of class size research points to small and insignificant effects of smaller classes, another points to positive and significant effects on student achievement of smaller classes.

The early meta‐analysis by Glass and Smith (1979) analysed the outcomes of 77 studies including 725 comparisons between smaller and larger class sizes on student achievement. They concluded that a class size reduction had a positive effect on student achievement. Hedges and Stock (1983) reanalysed Glass and Smith's data using different statistical methods, but found very little difference in the average effect sizes across the two analysis methods.

However, the updated literature reviews by Hanushek (Hanushek, 1989; 1999; 2003) cast doubt on these findings. His reviews looked at 276 estimates of pupil‐teacher ratios as a proxy for class size, and most of these estimates pointed to insignificant effects. Based on a vote counting method, Hanushek concluded that “there is no strong or consistent relationship between school resources and student performance” (Hanushek, 1989, p. 47). Krueger (2003), however, points out that Hanushek relies too much on a few studies, which reported many estimates from even smaller subsamples of the same dataset. Many of the 276 estimates were from the same dataset but estimated on several smaller subsamples, and these many small sample estimates are more likely to be insignificant. The vote counting method used in Hanushek's original literature review (Hanushek, 1989) is also criticised by Hedges et al. (1994), who offer a reanalysis of the data from Hanushek's reviews using more sophisticated synthesis methods. Hedges et al. (1994) used a combined significance test. They tested two null hypotheses: 1) no positive relation between the resource and output and 2) no negative relation between the resource and output. The tests determine whether the data are consistent with the null hypothesis being true in all studies or false in at least some of the studies. Further, Hedges et al. (1994) reported the median standardized regression coefficient. The conclusion is that “it shows systematic positive relations between resource inputs and school outcomes” (Hedges et al., 1994, p. 5). Hence, depending on which synthesis method is considered appropriate, conclusions based on the same evidence are quite different.

The divergent conclusions of the above‐mentioned reviews are further based on non‐experimental evidence, combining measurements from primary studies that have different specifications and assumptions. According to Grissmer (1999), the different specifications and assumptions, as well as the appropriateness of the specifications and assumptions, account for the inconsistency of the results of the primary studies.

The Tennessee STAR experiment provides rare evidence of the effect of class size from a randomized controlled trial (RCT). The STAR experiment was implemented in Tennessee in the 1980s, assigning kindergarten children to either normal sized classes (around 22 students) or small classes (around 15 students). The study ran for four years, until the assigned children reached third grade. Yet even with this kind of evidence, researchers do not agree on the conclusion.

According to Finn and Achilles (1990), Nye et al. (1999) and Krueger (1999), STAR results show that class size reduction increased student achievement. However, Hanushek (1999; 2003) questions these results because of attrition from the project, crossover between treatments, and selective test taking, which may have violated the initial randomization.

While the class size debate on what can be concluded based on the same evidence is acceptable and meaningful in the research community, it is probably of less help in guiding decision‐makers and practitioners. If research is to inform practice, there must be an attempt to reach some agreement about what the research does and does not tell us about the effectiveness of interventions as well as what conclusions can be reasonably drawn from research. The researchers must reach a better understanding of questions such as: for whom does class size reduction have an effect? When does class size reduction have an effect on student achievement? How small does a class have to be in order to be advantageous?

The purpose of this review is to systematically uncover relevant studies in the literature that measure the effects of class size on academic achievement and synthesize the effects in a transparent manner.

2 Objectives

The purpose of this review is to systematically uncover relevant studies in the literature that measure the effects of class size on academic achievement. We will synthesize the effects in a transparent manner and, where possible, we will investigate the extent to which the effects differ among different groups of students such as high/low performers, high/low income families, or members of minority/non‐minority groups, and whether timing, intensity, and duration have an impact on the magnitude of the effect.

3 Methods

3.1 TITLE REGISTRATION AND REVIEW PROTOCOL

The title for this systematic review was approved by The Campbell Collaboration on 9 October 2012. The systematic review protocol was published on March 3, 2015. Both the title registration and the protocol are available in the Campbell Library at:

https://www.campbellcollaboration.org/library/small‐class‐sizes‐student‐achievement‐primary‐and‐secondary‐schools.html

3.2 CRITERIA FOR CONSIDERING STUDIES FOR THIS REVIEW

3.2.1 Types of studies

The study designs eligible for inclusion were:

We included study designs that used a well‐defined control group; i.e. the control or comparison condition was students in classes with more students than in the treatment classes.

Non‐randomised studies, where the reduction of class size has occurred in the course of usual decisions outside the researcher's control, must demonstrate pre‐treatment group equivalence via matching, statistical controls, or evidence of equivalence on key risk variables and participant characteristics. These factors are outlined in section 3.4.3 under the subheading of Confounding, and the methodological appropriateness of the included studies was assessed according to the risk of bias model outlined in section 3.4.3.

Different studies used different types of data. Some used test score data on individual students and actual class‐size data for each student. Others used individual student data but average class‐size data for students in that grade in each school. Still others used average scores for students in a grade level within a school and average class size for students in that school. We only included studies that used measures of class size and measures of outcome data at the individual or class level. We excluded studies that relied on measures of class size and measures of outcomes aggregated to a level higher than the class (e.g., school or school district).

Some studies did not have actual class size data and used the average student‐teacher ratio within the school (or at higher levels, e.g. school districts). Studies only considering average class size measured as student‐teacher ratio within a school (or at higher levels) were not eligible.

3.2.2 Types of participants

We included children in grades kindergarten to 12 (or the equivalent in European countries) in general education. Studies that met inclusion criteria were accepted from all countries. We excluded children in home‐school, in pre‐school programs, and in special education.

3.2.3 Types of interventions

The intervention in this review is a reduction in class size, i.e. a comparison of classes with larger and smaller numbers of students. The more precisely class size is measured, the more reliable the findings of a study will be.

Studies only considering the average class size measured as student‐teacher ratio within a school (or at higher levels) were not eligible. Studies where the intervention was the assignment of an extra teacher (or teaching assistants or other adults) to a class were not eligible either. The assignment of additional teachers (or teaching assistants or other adults) to a classroom is not the same as reducing the size of the class, and this review focused exclusively on the effects of reducing class size. We acknowledge that class size can change per subject or possibly vary during the day. The precision of the class size measure was recorded.

3.2.4 Types of outcome measures

Primary outcomes

The primary focus was on measures of academic achievement. Academic achievement outcomes included reading and mathematics. Outcome measures had to be standardised measures of academic achievement. The primary outcome variables used in the identified studies were standardised reading and mathematics tests (Stanford Achievement Test (SAT), Item Response Theory‐scaled scores, State wide End‐of‐Grade test (EOG) and NovLex (a lexical database for French elementary‐school readers)).

Studies were only included if they considered one or more of the primary outcomes.

Secondary outcomes

We planned to code the following effect sizes as secondary outcomes when available: standardised test in other academic subjects at primary school level (e.g. in science or second language) and measures of global academic performance (e.g. Woodcock‐Johnson III Tests of Achievement, Stanford Achievement Test (SAT), Grade Point Average). None of these secondary outcomes were reported in studies that could be used in the data synthesis.

3.2.5 Duration of follow‐up

All follow‐up durations reported in the primary studies were recorded.

Time points for measures we planned to consider were:

All studies that could be used in the data synthesis reported outcomes in the short run only, i.e. by the end of the school year in which treatment was given.

3.2.6 Types of settings

The location of the intervention was classes in grades kindergarten to 12 (or the equivalent in European countries). Classes in regular private, public or boarding schools were eligible. Home‐schools would have been excluded.

3.3 SEARCH METHODS FOR IDENTIFICATION OF STUDIES

3.3.1 Bibliographical database searches

The original electronic searches for this review were performed in 2015. Those searches covered content from 1980‐2015. In February 2017 the searches were updated to cover content from 2015‐2017. The 2017 update involved minor changes to the electronic resources searched. These changes are described below. The following electronic databases were searched:

ERIC (EBSCO‐host) ‐ searched from 1980‐2017

SocIndex (EBSCO‐host) ‐ searched from 1980‐2017

EconLit (EBSCO‐host) ‐ searched from 1980‐2017

PsycInfo (EBSCO‐host) ‐ searched from 1980‐2017

Academic Search Premier (EBSCO‐host) ‐ searched from 2015‐2017

Teacher Reference Center (EBSCO‐host) ‐ searched from 2015‐2017

Education Research Complete (EBSCO‐host) ‐ searched from 1980‐2015

International Bibliography of the Social Sciences (ProQuest‐host) ‐ searched from 1980‐2015

ProQuest Dissertations & Theses A&I (ProQuest‐host) ‐ searched from 1980‐2015

Social Science Citation Index (ISI Web of Science) ‐ searched from 1980‐2017

Science Citation Index (ISI Web of Science) ‐ searched from 1980‐2017

3.3.2 Searching other resources

We also searched in other electronic resources for relevant publications:

Campbell Collaboration Library ‐ searched from 1980‐2017

Centre for Reviews and Dissemination Databases ‐ searched from 1980‐2017

EPPI‐Centre Systematic Reviews ‐ Database of Education Research ‐ searched from 1980‐2017

Social Care Online ‐ searched from 1980‐2017

Bibliotek.dk (Danish National Library portal) ‐ searched from 1980‐2015

Bibsys.no (Norwegian National Library portal) ‐ searched from 1980‐2015

Libris.kb.se (Swedish National Library portal) ‐ searched from 1980‐2015

3.3.3 Grey literature search

We searched specific electronic repositories for additional grey literature:

What Works Clearinghouse – U.S. Department of Education ‐ searched from 1980‐2017

EDU.au.dk – Danish Clearinghouse for Education ‐ searched from 1980‐2017

European Educational Research Association ‐ searched from 1980‐2017

American Education Research Association ‐ searched from 1980‐2017

Social Science Research Network ‐ searched from 1980‐2017

Google Scholar ‐ searched from 2015‐2017

3.3.4 Hand search

We hand‐searched the following journals for additional references:

Middle School Journal – (2014‐2015)

Elementary School Journal – (2014‐2015)

American Educational Research Journal – (2014‐2015)

Learning Environments Research – (2014‐2015)

3.3.5 Search documentation

Selected search strings from the recent search update as well as the resources searched in the original 2015 search can be found in Appendix 11.1.

3.4 DATA COLLECTION AND ANALYSIS

3.4.1 Selection of studies

Under the supervision of review authors, two review team assistants first independently screened titles and abstracts to exclude studies that were clearly irrelevant. Studies considered eligible by at least one assistant, or studies where there was not enough information in the title and abstract to judge eligibility, were retrieved in full text. The full texts were then screened independently by two review team assistants under the supervision of the review authors. Any disagreements about eligibility were resolved by the review authors. Exclusion reasons for studies that otherwise might be expected to be eligible were documented and presented in the appendix.

The study inclusion criteria were piloted by the review authors (see Appendix 11.3). The overall search and screening process was illustrated in a flow‐diagram. None of the review authors were blind to the authors, institutions, or the journals responsible for the publication of the articles.

3.4.2 Data extraction and management

Two review authors independently coded and extracted data from included studies. A coding sheet was piloted on several studies and no revision was necessary (see Appendix 11.4). Disagreements were minor and were resolved by discussion. Data and information were extracted on: available characteristics of participants, intervention characteristics and control conditions, research design, sample size, risk of bias and potential confounding factors, outcomes, and results. Extracted data were stored electronically. Analysis was conducted in RevMan5 and Stata.

3.4.3 Assessment of risk of bias in included studies

We assessed the methodological quality of studies using a risk of bias model developed by Prof. Barnaby Reeves in association with the Cochrane Non‐Randomised Studies Methods Group. This model is an extension of the Cochrane Collaboration's risk of bias tool and covers risk of bias in non‐randomised studies that have a well‐defined control group.

The extended model is organised in the same way and follows the same steps as the risk of bias model in the 2008 version of the Cochrane Handbook, chapter 8 (Higgins & Green, 2008). The extension to the model is explained in the three following points:

The refined assessment is pertinent when thinking of data synthesis as it operationalizes the identification of studies (especially in relation to non‐randomised studies) with a very high risk of bias. The refinement increases transparency in assessment judgements and provides justification for not including a study with a very high risk of bias in the meta‐analysis.

Risk of bias judgement items

The risk of bias model used in this review is based on nine items (see Appendix 11.5). The nine items refer to:

In the 5‐point scale, 1 corresponds to Low risk of bias and 5 corresponds to High risk of bias. A score of 5 on any of the items assessed on the 5‐point scale translates to a risk of bias so high that the findings will not be considered in the data synthesis (because they are more likely to mislead than inform).
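The exclusion rule can be illustrated with a minimal sketch. The item names and scores below are hypothetical and are not taken from the review's coding sheet; `None` marks an item that was not rated on the 5‐point scale.

```python
def usable_in_synthesis(item_scores):
    """Return True unless any item rated on the 5-point scale scores 5.

    item_scores maps a (hypothetical) item name to an integer 1-5,
    or to None when the item was not rated on the 5-point scale.
    """
    rated = [score for score in item_scores.values() if score is not None]
    return all(score < 5 for score in rated)


# Hypothetical ratings for two studies (illustration only).
study_a = {"confounding": 3, "selective_reporting": 2, "other_bias": None}
study_b = {"confounding": 5, "selective_reporting": 1, "other_bias": 2}

print(usable_in_synthesis(study_a))
print(usable_in_synthesis(study_b))
```

Under this rule a single score of 5 is sufficient for exclusion, regardless of how the study scores on the remaining items.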

Confounding

An important part of the risk of bias assessment of non‐randomised studies is how the studies deal with confounding factors (see Appendix 11.5). Selection bias is understood as systematic baseline differences between groups and can therefore compromise comparability between groups. Baseline differences can be observable (e.g. age and gender) and unobservable (to the researcher; e.g. motivation). There is no single non‐randomised study design that always deals adequately with the selection problem: Different designs represent different approaches to dealing with selection problems under different assumptions and require different types of data. There can be particularly great variations in how different designs deal with selection on unobservables. The “adequate” method depends on the model generating participation, i.e. assumptions about the nature of the process by which participants are selected into a program. A major difficulty in estimating causal effects of class size on student outcomes is the potential endogeneity of class size, stemming from the processes that match students with teachers, and schools. Not only do families choose neighbourhoods and schools, but principals and other administrators assign students to classrooms. Because these decision makers utilize information on students, teachers and schools, information that is often not available to researchers, the estimators are quite susceptible to biases from a number of sources.

The primary studies must at least demonstrate pre‐treatment group equivalence via matching, statistical controls, or evidence of equivalence on key risk variables and participant characteristics. For this review, we identified the following observable confounding factors to be most relevant: age and grade level, performance at baseline, gender, socioeconomic background and local education spending. In each study, we assessed whether these confounding factors had been considered, and in addition we assessed other confounding factors considered in the individual studies. Furthermore, we assessed how each study dealt with unobservables.

Importance of pre‐specified confounding factors

The motivation for focusing on age and grade level, performance at baseline, gender, socioeconomic background and local education spending is given below.

Generally, the development of cognitive functions relating to school performance and learning is age dependent, and furthermore systematic differences in performance level often refer to systematic differences in preconditions for further development and learning of both cognitive and social character (Piaget, 2001; Vygotsky, 1978).

Therefore, to be sure that an effect estimate is a result from a comparison of groups with no systematic baseline differences it is important to control for the students' grade level (or age) and their performance at baseline (e.g. reading level, mathematics level).

With respect to gender, it is well known that there are gender differences in school performance (Holmlund & Sund, 2005). Girls outperform boys with respect to reading and boys outperform girls with respect to mathematics (Stoet & Geary, 2013). Although part of the literature finds that these gender differences have vanished over time (Hyde, Fennema, & Lamon, 1990; Hyde & Linn, 1988), we find it important to include this potential confounder.

Students from more advantaged socioeconomic backgrounds on average begin school better prepared to learn and receive greater support from their parents during their schooling years (Ehrenberg et al., 2001). Further, there is evidence that class size may be negatively correlated with students' socioeconomic backgrounds. For example, in a study of over 1,000 primary schools in Latin America, Willms and Somers (2001) found that the correlation between the pupil/teacher ratio in the school and the socioeconomic level of students in the school was about –.15. Moreover, Willms and Somers (2001) found that schools enrolling students from higher socioeconomic backgrounds tended to have better infrastructures, more instructional materials, and better libraries. The correlations of these variables with school‐level socioeconomic status varied between .26 and .36.

Finally, as outlined in the background section, students with socio‐economically disadvantaged backgrounds perform poorly in school tests (OECD, 2010).

Therefore, the accuracy of the estimated effects of class size will depend crucially on how well socioeconomic background is controlled for. Socioeconomic background factors are, e.g. parents' educational level, family income, minority background, etc.

3.4.4 Measures of treatment effect

For continuous outcomes, effect sizes with 95% confidence intervals were calculated using means and standard deviations where available, or alternatively from mean differences, standard errors and 95% confidence intervals (whichever were available), using the methods suggested by Lipsey & Wilson (2001). Hedges' g was used for estimating standardised mean differences (SMD).
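As an illustration of this kind of calculation, the following is a minimal sketch of Hedges' g with a 95% confidence interval computed from group means, standard deviations and sample sizes. It assumes the common small‐sample correction factor 1 − 3/(4df − 1) and a large‐sample variance approximation; the function name and the example numbers are hypothetical, and the exact formulas applied in the review may differ in detail.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, se, ci_low, ci_high) for a standardised mean difference."""
    df = n_t + n_c - 2
    # Pooled within-group standard deviation.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / sd_pooled
    # Small-sample correction factor (a common approximation).
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Large-sample variance approximation for d, scaled by j squared.
    var_g = j ** 2 * ((n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c)))
    se = math.sqrt(var_g)
    return g, se, g - 1.96 * se, g + 1.96 * se


# Hypothetical example: small class (treatment) versus regular class (control).
g, se, low, high = hedges_g(mean_t=12.0, mean_c=10.0, sd_t=2.0, sd_c=2.0,
                            n_t=50, n_c=50)
print(round(g, 3), round(se, 3), round(low, 3), round(high, 3))
```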

Software for storing data and statistical analyses were Excel and RevMan 5.0.

3.4.5 Unit of analysis issues

To account for possible statistical dependencies, we examined a number of issues: we assessed whether suitable cluster analysis was used (e.g. cluster summary statistics, robust standard errors, the use of the design effect to adjust standard errors, multilevel models and mixture models), if assignment of units to treatment was clustered, whether individuals had undergone multiple interventions, whether there were multiple treatment groups, and whether several studies were based on the same data source.

Cluster assignment to treatment

We checked for consistency in the unit of allocation and the unit of analysis, as statistical analysis errors can occur when they are different. In cases where study investigators had not applied appropriate analysis methods that control for clustering effects, we estimated the intra‐cluster correlation (Donner, Piaggio, & Villar, 2001) and corrected the effect size and standard error. Based on the analysis in Stockford (2009), we used an intra‐cluster correlation (ρ) of 0.22. We report the corrected results and the non‐corrected results. We used the following formulas (see Hedges, 2007, page 349):

where n is cluster size and N^T, N^C are treatment and control group sample sizes and N is total sample size.
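As an illustration, a correction of this kind can be sketched in code. The sketch assumes equal cluster sizes and the adjustment usually attributed to Hedges (2007) for an effect size that was computed while ignoring clustering: the effect size is multiplied by sqrt(1 − 2(n − 1)ρ/(N − 2)), and the sampling variance is inflated by the design effect 1 + (n − 1)ρ plus a term that depends on the squared effect size. The function name and the example numbers are hypothetical.

```python
import math


def cluster_adjust(d, n_t, n_c, n, rho=0.22):
    """Adjust an effect size and its standard error for clustering.

    d        : effect size computed ignoring clustering
    n_t, n_c : treatment and control group sample sizes
    n        : (common) cluster size
    rho      : intra-cluster correlation
    """
    big_n = n_t + n_c
    # Adjusted effect size.
    d_adj = d * math.sqrt(1 - 2 * (n - 1) * rho / (big_n - 2))
    # First variance term, inflated by the design effect.
    term1 = (big_n / (n_t * n_c)) * (1 + (n - 1) * rho)
    # Second variance term, proportional to the squared adjusted effect size.
    numerator = ((big_n - 2) * (1 - rho) ** 2
                 + n * (big_n - 2 * n) * rho ** 2
                 + 2 * (big_n - 2 * n) * rho * (1 - rho))
    denominator = 2 * (big_n - 2) * ((big_n - 2) - 2 * (n - 1) * rho)
    variance = term1 + d_adj ** 2 * numerator / denominator
    return d_adj, math.sqrt(variance)


# Hypothetical example: 100 students per arm, classes of 20, rho = 0.22.
print(cluster_adjust(0.5, 100, 100, 20))
# With rho = 0 the adjustment reduces to the unclustered case.
print(cluster_adjust(0.5, 100, 100, 20, rho=0.0))
```

With ρ = 0 the effect size is unchanged and the variance reduces to (N^T + N^C)/(N^T N^C) + d²/(2(N − 2)), which is a useful check on the implementation.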

Multiple Interventions per Individual

There were no studies with multiple interventions per individual.

Multiple Studies using the Same Sample of Data

Five studies analysed the same population, using data from the Third International Mathematics and Science Study (TIMSS) data set from 1995. Three studies used TIMSS data from 2011. Data from the National Educational Longitudinal Study (NELS; data from the USA) were used in five studies.

Two studies analysed the same US population using the Early Childhood Longitudinal Study‐Kindergarten Class of 1998‐1999 data set. Five studies analysed data from Indiana's Prime Time Project (1984‐1988). Five studies analysed the Student Achievement Guarantee in Education Program (SAGE) implemented in Wisconsin in 1996‐2001. Three studies analysed the same sample of students from Israel. Four studies analysed the same population using the PRIMA survey which contains information on Dutch pupils who were enrolled in grades 2, 4, 6 and 8 in the school‐year 1994/95. Two studies used the same sample of Swedish students from 1998 to 1999. Finally, four studies analysed the British Class Size Study (1996‐1999). We reviewed all studies, but in the meta‐analysis we only included one estimate of the effect from each sample of data in order to avoid dependencies between the “observations” (i.e. the estimates of the effect) in the meta‐analysis. The choice of which estimates to include was based on our risk of bias assessment of the studies. We chose the estimate from each sample of data from the study that we judged to have the least risk of bias due to confounding.
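The selection rule described above can be sketched as follows. The study labels, data source names, and risk scores are hypothetical, and ties are resolved here by keeping the first estimate encountered, which is an assumption of the sketch and not a rule stated in the review.

```python
def pick_one_per_sample(estimates):
    """Keep, for each data source, the estimate with the lowest
    confounding risk-of-bias score (lower means less risk of bias)."""
    best = {}
    for est in estimates:
        key = est["data_source"]
        if key not in best or est["confounding_risk"] < best[key]["confounding_risk"]:
            best[key] = est
    return list(best.values())


# Hypothetical estimates: two studies share data source "X".
estimates = [
    {"study": "study_a", "data_source": "X", "confounding_risk": 3, "smd": 0.10},
    {"study": "study_b", "data_source": "X", "confounding_risk": 2, "smd": 0.05},
    {"study": "study_c", "data_source": "Y", "confounding_risk": 4, "smd": -0.02},
]

for est in pick_one_per_sample(estimates):
    print(est["study"], est["data_source"], est["smd"])
```

Keeping one estimate per data source means that each entry in the meta‐analysis comes from a distinct sample.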

One RCT (the STAR experiment conducted in Tennessee in 1985–1989) was reported in several studies (45 studies reported in 51 papers). We reviewed all studies but it was unclear which study should be judged to have the least risk of bias. We reported all relevant results from the studies analysing STAR but none of the studies were included in the meta‐analysis of non‐STAR studies.

Multiple Time Points

All studies that could be used in the data synthesis reported outcomes in the short run only.

3.4.6 Dealing with missing data

Where studies had missing summary data, such as missing standard deviations, we calculated SMDs from mean differences, standard errors and 95% confidence intervals (whichever were available), using the methods suggested by Lipsey & Wilson (2001). We requested information from the principal investigators (if current contact information could be located) if not enough information was provided to calculate an effect size and standard error.
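A minimal sketch of such a conversion is given below. It assumes an independent two‐group comparison, so that the standard error of the mean difference equals the pooled standard deviation times sqrt(1/n_t + 1/n_c), and it assumes a normal‐based 95% confidence interval. The function names and numbers are hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Recover a standard error from a symmetric normal-based confidence interval."""
    return (upper - lower) / (2 * z)


def smd_from_mean_diff(mean_diff, se_diff, n_t, n_c):
    """Return (d, se_d) from a raw mean difference and its standard error."""
    # Back out the pooled standard deviation from the standard error.
    sd_pooled = se_diff / math.sqrt(1 / n_t + 1 / n_c)
    d = mean_diff / sd_pooled
    # Large-sample standard error of the standardised mean difference.
    se_d = math.sqrt((n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c)))
    return d, se_d


# Hypothetical example: mean difference of 2.0 with 95% CI [0.04, 3.96].
se_diff = se_from_ci(0.04, 3.96)
print(smd_from_mean_diff(2.0, se_diff, n_t=50, n_c=50))
```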

3.4.7 Assessment of heterogeneity

Heterogeneity among primary outcome studies was assessed with the Chi‐squared (Q) test, and the I‐squared, and τ‐squared statistics (Higgins, Thompson, Deeks, & Altman, 2003). Any interpretation of the Chi‐squared test was made cautiously on account of its low statistical power.
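For illustration, the three statistics can be computed from study effect sizes and their variances as sketched below. The sketch assumes the DerSimonian and Laird moment estimator for τ‐squared, which is one common choice and is not necessarily the estimator used in the review; the input values are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return (Q, I2 in per cent, tau2) for a set of study effect sizes."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    # Fixed-effect (inverse variance) weighted mean.
    mean_fe = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - mean_fe) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # DerSimonian-Laird moment estimator, truncated at zero.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # I-squared: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2, tau2


# Hypothetical effect sizes and variances.
print(heterogeneity([0.0, 1.0], [0.1, 0.1]))
```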

3.4.8 Data synthesis

All studies that could be used in the data synthesis reported outcomes in the short run only, i.e. by the end of the school year in which treatment was given. We carried out our meta‐analyses using the standardised mean differences (SMD). All analyses were inverse variance weighted using random effects statistical models that incorporate both the sampling variance and between study variance components into the study level weights. Random effects weighted mean effect sizes were calculated using 95% confidence intervals.
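A minimal sketch of inverse variance weighted random effects pooling is shown below. Each study's weight is 1/(v_i + τ²), where v_i is the sampling variance and τ² the between study variance. The sketch assumes the DerSimonian and Laird estimator for τ² and a normal‐based 95% confidence interval; both are assumptions of the illustration, and the input values are hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """Return (pooled, se, ci_low, ci_high, tau2) under a random effects model."""
    k = len(effects)
    w_fixed = [1.0 / v for v in variances]
    sum_w = sum(w_fixed)
    fixed_mean = sum(w * y for w, y in zip(w_fixed, effects)) / sum_w
    # Between study variance (DerSimonian-Laird, truncated at zero).
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(w_fixed, effects))
    c = sum_w - sum(w ** 2 for w in w_fixed) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights combine sampling and between study variance.
    w_random = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(w_random, effects)) / sum(w_random)
    se = math.sqrt(1.0 / sum(w_random))
    return pooled, se, pooled - 1.96 * se, pooled + 1.96 * se, tau2


# Hypothetical SMDs and sampling variances from three studies.
print(random_effects_pool([0.10, 0.02, -0.05], [0.004, 0.002, 0.003]))
```

When τ² is estimated as zero, the random effects weights coincide with the fixed effect weights.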

3.4.9 Sensitivity analysis

Sensitivity analysis was used to evaluate whether the pooled effect sizes were robust across components of methodological quality.

For methodological quality, we performed sensitivity analysis for study design and the confounding item of the risk of bias checklists, respectively. Sensitivity analysis was further used to examine the robustness of conclusions in relation to inclusion of a result with an unclear sign, inclusion of effect sizes from the STAR experiment and to multiplying the reported effect with a standard deviation reduction in class size in the studies using class size as a continuous variable.

4 Results

4.1 DESCRIPTION OF STUDIES

4.1.1 Results of the search

The search was performed between 2015 and February 2017.

The results are summarised in Figure 1 in section 11.2. The total number of potentially relevant records was 8,128 after excluding duplicates (database: 7,434; grey, hand search, snowballing and other resources: 694). All 8,128 records were screened based on title and abstract; 7,754 were excluded for not fulfilling the first level screening criteria and 374 records were ordered for retrieval and screened in full text. Of these, 226 did not fulfil the second level screening criteria and were excluded. Eighteen records were unobtainable despite efforts to locate them through libraries and searches on the internet. The references are listed in section 8.3.

A total of 127 unique studies, reported in 148 papers, were included in the review. Further details of the included and excluded studies are provided in section 10.

4.1.2 Included studies

The search resulted in a final selection of 127 studies, reported in 148 papers, which met the inclusion criteria for this review. The 127 studies analysed 55 different populations. A large number of studies (45) analysed data from the STAR experiment (class size reduction in grade K‐3) and its follow up data.

Of the 82 studies not analysing data from the STAR experiment, only six could be used in the data synthesis. Fifty‐eight studies could not be used in the data synthesis as they were judged to have too high a risk of bias on the confounding item (51), the other bias item (4) or the selective reporting item (3). Eighteen studies did not provide enough information to enable us to calculate an effect size and standard error, or did not provide results in a form that could be used in the data synthesis.
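The study counts reported above can be checked with simple arithmetic. The sketch below uses only the numbers given in this section; the variable names are hypothetical.

```python
# Counts reported in section 4.1.2.
total_studies = 127
star_studies = 45
non_star_studies = total_studies - star_studies

used_in_synthesis = 6
# Excluded for too high risk of bias: confounding, other bias, selective reporting.
too_high_risk = 51 + 4 + 3
insufficient_information = 18

accounted_for = used_in_synthesis + too_high_risk + insufficient_information

print(non_star_studies)
print(too_high_risk)
print(accounted_for == non_star_studies)
```

The three groups (6, 58 and 18) sum to the 82 non‐STAR studies, so every non‐STAR study is accounted for.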

4.1.2.1 STAR studies

A large number of studies (45 studies reported in 51 papers) analysed data from the STAR experiment (class size reduction in grade K‐3) and its follow‐up data.

The four‐year STAR experiment was conducted in Tennessee in 1985–1989, to assess the effectiveness of small classes compared with regular‐sized classes and of teachers' aides in regular‐sized classes on improving cognitive achievement in kindergarten and in the first, second, and third grades. According to the Technical report (Word, 1994) and Word et al. (1990), the goal of the STAR experiment was to have approximately 100 small classes with 13‐17 students (S), 100 regular classes with 22‐25 students (R), and 100 regular with aide classes with 22‐25 students (RA). In Word et al. (1994) it is reported that in the 1985‐86 year (the first year of the experiment), the STAR project had 128 small classes (approximately 1,900 students), 101 regular classes (approximately 2,300 students), and 99 regular classes with teacher aides (approximately 2,200 students). Both students and teachers were randomised, and randomisation was done within schools so that at least one of each class type (S, R and RA) was present at each school. Every class was to remain the same type for four years and a new teacher was randomly assigned to each class in each subsequent grade.

Four studies provided results for grade K‐3 that could be used in the data synthesis. The first study, by Folger and Breda (1989), provided effect sizes comparing small classes to regular classes for each grade level (K‐3). The results of the analysis conducted by Folger and reported in Folger and Breda (1989) were also reported in Word et al. (1990) and Word et al. (1994). Both reports by Word et al. provide a summary of original results from the primary analyses of the STAR experiment. The primary analyses were analysis‐of‐variance models conducted by Professor Finn. However, only a summary of the analyses showing significance levels is reported (.05, .01, .001; all levels are reported only as ≤ a threshold and not as the exact level of significance), which cannot be used in the data synthesis. The second study, by Finn, Gerber, Achilles and Boyd‐Zaharias (2001), provided effect sizes comparing small classes to regular classes for each grade level (K‐3) but used different decision rules in selecting a sample for analysis than Folger and Breda (1989). In addition, Finn et al. (2001) included covariates in the analysis. The third study, by Nye, Achilles, Boyd‐Zaharias, Fulton & Wallenhorst (1992/1994) (Nye, Achilles, Boyd‐Zaharias, Fulton & Wallenhorst (1994) is a published and shorter version of the 1992 paper), provided effect sizes comparing small classes to the average of regular and regular with aide classes; apart from the different comparison, they also used different decision rules in selecting a sample for analysis than Folger & Breda (1989). The effect sizes from the analysis in Nye et al. (1992/1994) are also reported in Finn (1998), Finn & Achilles (1999) and Nye, Achilles, Boyd‐Zaharias & Fulton (1993). Finally, effect sizes comparing small classes to the average of regular and regular with aide classes were also provided in the study by Hanushek (1999).

Which of these four studies' effect estimates should be included in the data synthesis is not obvious, as the decision rule described in the protocol cannot be applied (all four studies analyse the same RCT).

The four studies differed in terms of both the chosen comparison condition and decision rules in selecting a sample for analysis (see table 4.1) and which one should be judged to have the least risk of bias is not obvious. Below we describe the different possibilities for choosing a comparison and selecting a sample for analysis.

The numbers of S, R and RA classes and students, as reported in the Technical report (Word, 1994) and Word et al. (1990), are probably the numbers of students and classes that initially were randomised to any of the three conditions (S, R and RA). However, a considerable proportion of classes did not fall into the range they were intended to. According to the STAR Database User's Guide (Finn et al., 2007, using a table of the distribution of classes by grade and designation reported in Achilles, 1999) between 18 and 32 per cent of classes each year were ‘out of range’, falling in the range of either 18‐21 students or 26‐30 students (see section 10.3 for details). In addition, a total of 14 regular and regular with aide classes fell in the range of small classes throughout one of the four years but were not considered out of range according to Finn et al. (2007). The four studies providing effect estimates of the STAR experiment either excluded, included or did not report how they handled the out of range classes. In addition, the range of regular sized classes used in the four studies differed; only one study used the range 22‐25 (see table 4.1).

In 2. Grade a number of schools and teachers were randomly chosen to receive special STAR training. A second choice of selection of analysis sample concerns whether to include or exclude the classes whose teachers received STAR training and in addition it is unclear how many actually received training. According to Word et al. (1990) and Folger and Breda (1989), 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3 received Project STAR training. According to Word et al. (1994) p. 73, 67 teachers received training in grade 2 and on page 117 it is stated that all teachers (57 teachers and 57 classes) from 13 schools received training in 2. Grade and all teachers from the same 13 schools (57 classes) received training in 3. Grade. According to Finn et al. (2007) the training was given to 54 second grade teachers from 15 STAR schools. The four studies either excluded, included or did not report how they handled these classes (see table 4.1).

The four studies also differed in the comparison condition they chose. They either compared small classes to regular classes only or to the average of regular and regular with aide classes. Which comparison is most appropriate for this review is however not obvious. At the beginning of 1. Grade approximately half of the students in regular and regular with aide classes interchanged classes (see section 10.3 for details). At the beginning of 2. Grade (3. Grade) 6 (5) per cent of the students in regular and regular with aide classes interchanged classes. Which choice of comparison is appropriate concerning the analysis for grades 1‐3 is thus unclear.

In addition to the interchange between regular and regular with aide classes, each year students from small classes moved to regular or regular‐with‐aide classes and students from regular and regular with aide classes moved to small classes (6, 4 and 4 per cent at the beginning of 1., 2. and 3. Grade). In total 25 per cent of all students moved class type at some point. Whether all of these students actually moved classes or a part of the reported movement of students between classes was due to reclassification of class type (small or regular sized) is unclear. The reported number of students moving to and from classes with aide cannot be due to reclassification between small and regular sized classes. At least some reclassification must have occurred though, as the following two pieces of evidence show. First, according to the numbers reported in the Technical report (Word et al., 1994), the distribution of class type was not constant in the 13 schools randomly chosen to receive STAR training. It is reported that there are 21 small classes, 19 regular classes and 17 regular with aide classes in these schools in 2. Grade. In 3. Grade it is reported that there are 25 small, 15 regular and 17 regular with aide classes in the same 13 schools. Thus four classes were apparently reclassified from regular sized to small even though classes were to remain the same type for four years. Second, according to the Technical report (Word et al., 1994) two schools in 3. Grade had incomplete test data and were removed. Compared to the total number of classes in 2. Grade however, only the number of regular classes is reduced from second to third grade (by 11). Some classes must have been reclassified, as randomisation was done within schools so each school had at least one class of each type (S, R and RA). None of the four studies providing effect estimates were explicit about how they handled this moving around of students (and classes).

The four studies are characterised concerning comparison, selection of sample for analysis and method of estimation in table 4.1. Only the study by Hanushek (1999) used the range 22‐25 for regular sized classes. The study compares small classes to the average of regular and regular with aide classes and otherwise nothing is reported concerning sample selection (out of range classes and STAR trained teachers) nor how the treatment was defined (as received or intended). The study by Folger and Breda (1989) compares small classes to regular classes only but uses a range of 21‐28 students for regular classes. It is reported that STAR trained teachers and their classes are included and out of range classes are excluded, but it is not reported how out of range classes are defined (for example, are the 14 regular and regular with aide classes that fell in the range of small classes excluded, and is the definition of out of range classes different from that reported in Finn et al., 2007, considering the different range of regular classes?). The study by Nye et al. (1992/1994) compares small classes to the average of regular and regular with aide classes and uses a range of 22‐26 students for regular sized classes. It is reported that STAR trained teachers and their classes are excluded and out of range classes are included, but it is not reported how out of range classes are categorised (for example, are the 14 regular and regular with aide classes that fell in the range of small classes considered small, and are the classes in the range 18‐21 categorised as small or regular?). The study by Finn et al. (2001) compares small classes to regular classes only and uses a range of 22‐26 students for regular classes. Otherwise nothing is reported concerning sample selection (out of range classes and STAR trained teachers) or how the treatment was defined (as received or intended). The study includes covariates in the analysis.

We find it very difficult to decide which study or effect estimate is the ‘right’ one to include in the data synthesis. Contrary to usual practice we will therefore not choose one study to include in the data synthesis but will report the results of all four studies in section 4.3 and further examine the robustness of our conclusions when including the extremes (smallest and largest) of the range of effect sizes from the STAR experiment in section 4.3.5.

Concerning the follow up study of the STAR experiment (known as the Lasting benefits study, LBS) a technical report providing effect estimates concerning grade 4, 5, 6, 7 and 8 was published each year. However, only one of the technical reports could be located (Nye et al., 1992, reporting results for grade 5). The remaining technical reports (Nye et al., 1991, 1993, 1994 and 1995) for grade 4, 6, 7 and 8 were unobtainable. The results for grade 4 are however reported in Finn & Achilles (1989). In addition the effect sizes from the technical reports for grade 4 and 5 are also reported in Nye, Achilles, Zaharias & Fulton (1993), Achilles, Nye, Zaharias & Fulton (1993) and Finn & Achilles (1999). Finn & Achilles (1999) also report the effect sizes from the technical reports for grade 6 and 7. The effect sizes from the technical report for grade 8 could not be located. Finn, Gerber, Achilles & Boyd‐Zaharias (2001) report effect sizes for grade 8 (and grade 4 and 6) in a reanalysis of the follow up data. None of these studies reporting results using follow up data from the STAR experiment could however be used in the data synthesis due to too high risk of bias (see section 4.2).

Several other studies reported results from a variety of re‐analyses of the STAR experiment (and follow up data) but none of them could be used in the data synthesis. An overview of the reasons for exclusions from the data synthesis is given in section 10.2.

Table 4.1

Characteristics of studies analysing STAR data used in the data synthesis

	Folger, 1989	Nye, 1992/1994	Finn, 2001	Hanushek, 1999
Comparison	R	R + RA	R	R + RA
Size of R and RA classes used	21‐28	22‐26	22‐26	22‐25
Out‐of‐range classes	Excluded	Included	Not reported	Not reported
STAR trained teachers	Included	Excluded	Not reported	Not reported
Regression with covariate adjustment	No	No	Yes	No
Intention to treat/treatment as received	Not reported	Not reported	Not reported	Not reported

4.1.2.2 Non‐STAR studies

Of the 82 studies (reported in 97 papers) not analysing data from the STAR experiment (or follow up data), only six could be used in the data synthesis.

Five studies (West & Wößmann, 2006; Wößmann & West, 2006; Wößmann, 2003; Wößmann, 2005b and Pong & Pallas, 2001) analysed the same population, using data from the Third International Mathematics and Science Study (TIMSS) data set from 1995. None of these studies were used in the data synthesis as all five studies were judged to have a score of 5 on the risk of bias scale for the confounding item. Three studies used TIMSS data from 2011 (Konstantopoulos & Li, 2016; Li & Konstantopoulos, 2017 and Li, 2015). All three studies were judged 5 on the confounding item and were not included in the analysis. Data from the National Educational Longitudinal Study (NELS data from USA) were used in five studies (Akerhielm, 1995; Boozer and Rouse, 2001; Dee & West, 2011; Hudson, 2011 and Maasoumi, Millimet & Rangaprasad, 2005). The studies by Boozer and Rouse (2001) and Akerhielm (1995) were judged to have a score of 5 on the risk of bias scale for the confounding item and were excluded from the data synthesis. The studies by Dee and West (2011) and Maasoumi, Millimet and Rangaprasad (2005) did not provide results we could use in the data synthesis (results were reported as differences between subjects and first or second order stochastic dominance tests respectively). The study by Hudson (2011) was used in the data synthesis.

Two studies (Milesi & Gamoran, 2006 and Wenfan & Qiuyun, 2005) analysed the same US population using the Early Childhood Longitudinal Study‐Kindergarten Class of 1998‐1999 data set. The study by Wenfan and Qiuyun (2005) was judged to have a too high risk of bias (scored 5 on the confounding item) and was excluded from the data synthesis. The study by Milesi and Gamoran (2006) was used in the data synthesis. Five studies analysed data from Indiana's Prime Time Project (1984‐1988) (Gilman, 1988; Gilman, Swan & Stone, 1988; McGiverin, 1989; Sanogo & Gilman, 1994 and Tillitsky et al., 1988). The four studies by Gilman (1988), Gilman et al. (1988), McGiverin (1989) and Tillitsky et al. (1988) were all rated 5 on the risk of bias scale and the study by Sanogo and Gilman (1994) did not provide results we could use in the data synthesis (it does not report what type of classes are included). Five studies analysed the Student Achievement Guarantee in Education Program (SAGE) implemented in Wisconsin in 1996‐2001 (Maier et al., 1997; Molnar, Smith & Zahorik, 1997; Molnar, Smith & Zahorik, 1998; Molnar et al., 1999 and Molnar et al., 2001). None of the studies provided results that could be used in the data synthesis (for details see section 10.1). Three studies analysed the same sample of students from Israel (Angrist & Lavy, 1999; Lavy, 2001 and Otsu, Xu & Matsushita, 2015). The two studies by Angrist and Lavy (1999) and Lavy (2001) were both judged to have a too high risk of bias (scored 5 on the confounding item) and in the study by Otsu et al. (2015) relevant results were presented graphically and no effect sizes or standard errors could be extracted.

Four studies (Dobbelsteen, Levin & Oosterbeek, 2002; Levin, 2001; Ma & Koenker, 2006 and Gerritsen, Plug & Webbink, 2017) analysed the same population using the PRIMA survey which contains information on Dutch pupils who were enrolled in grades 2, 4, 6 and 8 in the school‐year 1994/95. Three studies (Dobbelsteen et al., 2002; Levin, 2001 and Ma & Koenker, 2006) were however judged to have a too high risk of bias (scored 5 on the confounding item) and were excluded from the data synthesis. The study by Gerritsen et al. (2017) was used in the data synthesis. Another two studies (Krueger & Lindahl, 2002 and Lindahl, 2005) used the same sample of Swedish students from 1998 to 1999. Both were judged to have a too high risk of bias (scored 5 on the confounding item).

Finally, four studies analysed the British Class Size Study (1996‐1999) (Blatchford & Basset, 2003; Blatchford, Bassett, Goldstein & Martin, 2003; Blatchford, Goldstein, Martin & Browne, 2002 and Carpenter, Goldstein & Rasbash, 2003). Blatchford et al., 2002 and Carpenter et al., 2003 were both judged to have a too high risk of bias (scored 5 on the selective reporting item) and were excluded from the data synthesis. Neither the study by Blatchford and Basset (2003) nor the study by Blatchford et al. (2003) provided information that enabled us to calculate an effect size and standard error (see section 10.1 for details).

In Table 4.2 we show the total number of studies, not analysing the STAR experiment that met the inclusion criteria for this review. The first column shows the total number of studies grouped by country of origin. The second column shows the number of these studies that did not provide enough data to calculate an effect estimate. The third column gives the number of studies that were coded with very high risk of bias. The fourth column gives the number of studies that were excluded from the data synthesis due to overlapping samples. The last column gives the total number of studies used in the data synthesis.

Fifty‐eight studies were judged to have a score of 5 on the risk of bias scale for either the confounding item (51), the other bias item (4) or the selective reporting item (3) (see a supplementary document for the detailed risk of bias assessments). In accordance with the protocol, we excluded these studies from the data synthesis on the basis that they would be more likely to mislead than inform. Eighteen studies did not provide enough information to enable us to calculate an effect size and standard error or did not provide results in a form that could be used in the data synthesis. All studies (those not analysing STAR data) are listed in table 10.1 in section 10.1 along with the reason if the study is not used in the data synthesis.

The main characteristics of the six studies (not analysing STAR) used in the data synthesis are shown in table 4.3.

The studies used in the data synthesis were from the USA, the Netherlands and France; one was an RCT and five were NRS. None of the studies were conducted recently: the oldest used data from 1990 and the most recent was conducted in the early 2000s. The grades investigated spanned kindergarten to 3. Grade, and one study investigated grade 10. The sample sizes varied; the smallest study investigated 104 students and the largest study investigated 11,567 students. The class size reductions analysed varied from a minimum of one student in four studies, through a minimum of seven students in another study (small classes of fewer than 18 students and large classes of more than 23 students), to a minimum of 8 students in the last study (small classes of 10‐12 students and large classes of 20‐25 students).

Table 4.2

Number of Included Studies, Not Using STAR Data

			Reduction due to
Country	Total	Missing data	Too high risk of bias	Used same data sets	Used in data synthesis
Australia	1	‐	1	‐	0
Bolivia	1	‐	1	‐	0
Canada	1	1	‐	‐	0
Colombia	1	‐	1	‐	0
Cyprus	1	‐	1	‐	0
Denmark	1	‐	1	‐	0
France	3	1	‐	‐	2
Germany	1	1	‐	‐	0
Greece	1	‐	1	‐	0
Hong Kong	1	‐	1	‐	0
Israel	3	1	2	‐	0
Italy	1	‐	1	‐	0
Japan	3	‐	3	‐	0
Lesotho	1	‐	1	‐	0
Multiple	8	‐	8	‐	0
New Zealand	1	‐	1	‐	0
NL	5	‐	4	‐	1
Norway	2	‐	2	‐	0
Poland	1	‐	1	‐	0
Sri Lanka	1	‐	1	‐	0
Sweden	2	‐	2	‐	0
UK	5	2	3	‐	0
USA	37	12	22	‐	3
Total	82	18	58	0	6

Table 10.3.1

Number of students and transfers in percent, Kindergarten to 1. Grade

				1. Grade
Kindergarten		Total number	Drop out	Small	Regular	Regular/aide	Total
	Small	1900	26	68	3	3	100
	Regular	2194	30	6	34	30	100
	Regular/aide	2231	29	5	34	32	100
	Total	6325	29	24	25	22	100
	Transfer to 1 G	4515

Table 4.3

Characteristics of Studies Used in the Data Synthesis

Study	Bressoux, 2009	Ecalle, 2006	Gerritsen, 2017
Country	France	France	Netherlands
Time period	1991‐1992	2002‐2003	1994‐1995
Grade	3	1	2
Study design	NRS	RCT	NRS
Class size	Mean (SD): 22.9 (4.3)	S: 10‐12, R: 20‐25	Mean (SD): 24.07 (4.5)
Number of students	Total 1,680	S: 570; R: 622	Total 470
Number of classes	Total 100	S: 100; R: 100	NR
Study	Hudson, 2011	Milesi, 2006	Munoz, 2001
Country	USA	USA	USA
Time period	1990	1998‐1999	1999‐2000
Grade	10	KG	3
Study design	NRS	NRS	NRS
Class size	Mean (SD): Reading: 22.61 (6.3); Mathematics: 23.37 (7.1)	S: less than 18, R: 18‐23, L: more than 23	S: less than 19, L: more than 18 (‘usual’ size is 24)
Number of students	NR	Total 11,567	S: 47; L: 57
Number of classes	NR	Total 2,437	NR

4.1.3 Excluded studies

In addition to the 127 studies that met the inclusion criteria for this review, 38 studies (reported in 50 papers) at first sight appeared relevant but did not meet our criteria for inclusion. The studies and reasons for exclusion are given in a supplementary document.

4.2 RISK OF BIAS IN INCLUDED STUDIES

The risk of bias coding for each of the 127 studies is shown in a supplementary document.

4.2.1 STAR studies

Forty‐five studies analysed data from the STAR experiment and its follow up data. Both children and teachers were randomly allocated within schools to the three types of classes but the method is not described. All studies analysing the STAR experiment were judged Unclear on the sequence generation item and Low risk of bias on the allocation concealment item (as the allocation was non‐sequential) with the exception of one study (Harvey, 1994) which analysed a subgroup (the subgroup is retainees, i.e. selected on a potential outcome variable).

Only four studies provided results for grade K‐3, that can be used in the data synthesis. In addition three other studies provided results that can be used in the data synthesis but analyse only one grade (K or 1). Seven studies reported results from one or more of the five studies that can be used in the data synthesis. Seventeen studies provided no results that can be used in the data synthesis. Eleven studies analysed STAR follow up data (known as the Lasting Benefits Study LBS) and were all given a score of 5 on the Other risk of bias item corresponding to a risk of bias so high that the findings should not be considered in the data synthesis. Another three studies (analysing STAR data, not the follow up) were given a score of 5 on the Incomplete outcome data item (one study) and the Other risk of bias item (two studies).

4.2.2 Non‐STAR studies

All but two of the studies that did not analyse STAR (or follow up) data used non‐randomised designs, and these were all judged to have a high risk of bias on the sequence generation item and the allocation concealment item. The two studies using randomised designs did not report the method of randomisation and were judged unclear on the sequence generation and allocation concealment items. All studies were judged 4 on the blinding item. None of the studies had an a priori protocol or an a priori analysis plan.

A summary of the risk of bias associated with confounding, incomplete data, other bias and selective reporting for the 64 studies from which it was possible to extract an effect estimate is shown in Table 4.5. Fifty‐one studies were given a score of 5 on the confounding item, corresponding to a risk of bias so high that the findings should not be considered in the data synthesis. For these 51 studies, we did not find it relevant to judge the remaining items because of their already high risk of bias. Of the remaining 13 studies, four were given a score of 5 on the Other risk of bias item and three were given a score of 5 on the Selective reporting item, corresponding to a risk of bias so high that the findings should not be considered in the data synthesis. For these seven studies, we did not find it relevant to judge the remaining items because of their already high risk of bias. None of the other studies were given a score of 5 on the incomplete data item.

Table 4.5

Risk of Bias ‐ Distribution of the Studies Not Analysing STAR Data

Risk of bias item	Judgement								Total number of studies
	High	Low	Unclear	1	2	3	4	5
Sequence generation	80	0	2	‐	‐	‐	‐	‐	82
Allocation concealment	80	0	2	‐	‐	‐	‐	‐	82
Blinding	‐	‐	0	0	0	0	82	0	82
Incomplete data	‐	‐	2	2	1	1	0	0	6
Selective reporting	‐	‐	0	5	0	1	0	3	9
Other bias	‐	‐	0	4	1	1	0	4	10
Confounding	‐	‐	0	2	0	1	2	51	56

4.3 SYNTHESIS OF RESULTS

In order to carry out a meta‐analysis, every study must have a comparable type of effect size. All studies reported standardised mean differences (SMD) and variances or data that enabled calculation of standardised mean differences and variances. All studies, not analysing STAR data, reported outcomes by the end of the treatment (end of the school year) only. The STAR experiment was a four year longitudinal study with outcomes reported by the end of each school year.

All outcomes are scaled such that a positive effect size favours the students in small classes, i.e. when an effect size is positive a class size reduction improves the students' achievement.

4.3.1 STAR studies

Four studies provided effect estimates that could be used in the data synthesis.

The four studies differed in terms of both the chosen comparison condition and decision rules in selecting a sample for analysis. Contrary to usual practice we report the results of all four studies and do not pool the results with the studies not analysing STAR data. We took into consideration the ICC in the results reported for the STAR experiment and corrected the effect sizes and standard errors using ρ = 0.22. Only the standard errors changed (increased) due to the correction, implying wider confidence intervals than reported in the studies. The uncorrected results are shown in section 10.4.
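As an illustration of why a clustering correction widens the confidence intervals while leaving the point estimate unchanged, the following minimal sketch applies the widely used design‐effect approximation, 1 + (m − 1)ρ. This section does not state the exact correction formula applied in the review, so the formula choice, the function names, the average class size of 15 and the uncorrected standard error of 0.03 are all assumptions made for illustration only; ρ = 0.22 is the value given above.

```python
import math

def design_effect(avg_cluster_size, icc):
    """Variance inflation factor for clustered data: 1 + (m - 1) * rho."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

def cluster_adjusted_se(se, avg_cluster_size, icc):
    """Inflate a standard error that ignored clustering (assumed approach)."""
    return se * math.sqrt(design_effect(avg_cluster_size, icc))

def ci95(estimate, se):
    """Normal-approximation 95% confidence interval."""
    half_width = 1.959964 * se
    return (estimate - half_width, estimate + half_width)

# Hypothetical inputs: an SMD of 0.21 with an uncorrected SE of 0.03 and
# classes of about 15 students. Only rho = 0.22 comes from the text.
smd, se_uncorrected, class_size, rho = 0.21, 0.03, 15, 0.22
se_corrected = cluster_adjusted_se(se_uncorrected, class_size, rho)

print("uncorrected CI:", ci95(smd, se_uncorrected))
print("corrected CI:  ", ci95(smd, se_corrected))
```

Under these assumed inputs the point estimate is untouched and only the interval widens, which is the qualitative pattern described above.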

All reported results indicated a positive effect favouring the treated; all of the study‐level effects were statistically significant. The study‐level effect sizes for reading varied between 0.17 and 0.34 and the study‐level effect sizes for mathematics varied between 0.15 and 0.33, see table 4.6. The effect sizes for reading reported in Hanushek (1999) were generally smaller than the other effect sizes for reading for each grade. Otherwise no clear patterns could be found.

Table 4.6

Effect Sizes from the STAR Experiment

	Folger, 1989	Nye, 1992/1994	Finn, 2001	Hanushek, 1999
Read SMD [95% CI]
Kindergarten	0.21 [0.07, 0.35]	0.18 [0.06, 0.30]	0.21 [0.07, 0.35]	0.17 [0.05, 0.29]
1. Grade	0.34 [0.20, 0.48]	0.24 [0.12, 0.36]	0.30 [0.16, 0.44]	0.23 [0.11, 0.35]
2. Grade	0.26 [0.12, 0.40]	0.23 [0.11, 0.35]	0.26 [0.12, 0.40]	0.20 [0.08, 0.32]
3. Grade	0.24 [0.10, 0.38]	0.26 [0.14, 0.38]	0.22 [0.10, 0.34]	0.22 [0.10, 0.34]
Mathematics SMD [95% CI]
Kindergarten	0.17 [0.03, 0.31]	0.15 [0.03, 0.27]	0.19 [0.05, 0.33]	0.17 [0.03, 0.31]
1. Grade	0.33 [0.19, 0.47]	0.27 [0.15, 0.39]	0.31 [0.17, 0.45]	0.26 [0.14, 0.38]
2. Grade	0.23 [0.09, 0.37]	0.20 [0.08, 0.32]	0.25 [0.11, 0.39]	0.19 [0.07, 0.31]
3. Grade	0.21 [0.07, 0.35]	0.23 [0.11, 0.35]	0.15 [0.01, 0.29]	0.18 [0.06, 0.30]

4.3.2 Non‐STAR studies

Six studies provided standardised mean differences and variances, or data that enabled calculation of standardised mean differences and variances, that could be used in the data synthesis. No adjustments were necessary for clustering, as the studies either did not analyse whole classes (only one or a few students in a class), included class random effects or used a two level model (student and class).

Three studies compared the achievement of students in small classes to the achievement of students in larger classes (defined as reported in table 4.3). The class size reductions in these studies varied from a minimum of one student (the intended reduction was six students) in Munoz (2001), a minimum of seven students in Milesi (2006) to a minimum of 8 students in Ecalle (2006). Three studies (Bressoux, 2009; Gerritsen, 2017 and Hudson, 2011) included class size as a continuous variable in their models. Thus, the reported coefficients reflect the effect of a one student increase in class size on achievement. All three studies reported mean class size as well as the standard deviation of class size. We will use the effect of a standard deviation reduction in class size (as reported in the studies) in the data synthesis and investigate the robustness of results in the sensitivity analysis. Thus the results of the study by Bressoux (2009) will reflect a class size reduction of four students and the study by Gerritsen (2017) will reflect a class size reduction of five students. Concerning the study by Hudson (2011) it is, however, unclear what the correct sign of the effect is. The coefficient labels in the table of results (Table 3 page 17) are ‘Class size’ and the coefficient values reported are positive. Nevertheless, the interpretation in the text is that there is a positive effect of a class size reduction on achievement in reading as well as mathematics. Nowhere in the paper is it reported that the variable ‘class size’ is somehow rescaled to a variable reflecting decreasing class sizes. Thus, either the signs of the class size coefficients are incorrect or the interpretations in the text are incorrect. The results of this study will not be pooled with the other five studies but reported separately and included in the sensitivity analysis.
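The rescaling of a per‐student regression coefficient to the effect of a one standard deviation reduction in class size can be sketched as follows. The function name and the example coefficient and standard error are hypothetical; only the class size standard deviation of 4.3 (Bressoux, 2009, table 4.3) and the sign convention (a positive effect size favours small classes) are taken from the text.

```python
def sd_reduction_effect(beta_per_student, se_per_student, sd_class_size):
    """Convert the effect of a one-student INCREASE in class size into the
    effect of a one-standard-deviation REDUCTION in class size.

    The sign is flipped so that a positive value favours smaller classes;
    the standard error is scaled by the same factor.
    """
    effect = -beta_per_student * sd_class_size
    se = abs(sd_class_size) * se_per_student
    return effect, se

# Hypothetical coefficient: each additional student lowers achievement by
# 0.01 SD (standard error 0.004). The SD of class size, 4.3, is from table 4.3.
effect, se = sd_reduction_effect(-0.01, 0.004, 4.3)
print(f"SMD for a one-SD reduction: {effect:.3f} (SE {se:.4f})")
```

The one‐student variant examined in the sensitivity analysis corresponds to setting the scaling factor to 1 instead of the standard deviation.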

4.3.3 Reading

Three of the reported results indicated a positive effect favouring the treated and two indicated a negative effect favouring the comparison; three of the study‐level effects were statistically non‐significant.

The weighted average was positive and statistically significant. The random effects weighted standardised mean difference was 0.11 (95% CI 0.05 to 0.16, p = 0.0003). Although the Q‐test is notoriously underpowered to detect heterogeneity in small meta‐analyses, the estimated τ² is 0.00 and I² is 0%, suggesting that there is no heterogeneity among these five studies. The forest plot is displayed in Figure 4.1.
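For readers unfamiliar with the quantities reported here, the sketch below shows how a random effects weighted average, Q, τ² and I² can be computed from study‐level effect sizes and variances. It uses the DerSimonian‐Laird estimator, which is an assumption: this section does not state which estimator the review used. The input values are hypothetical and are not the review's data.

```python
import math

def random_effects_dl(effects, variances):
    """Random effects meta-analysis with the DerSimonian-Laird tau^2 estimator."""
    k = len(effects)
    w = [1.0 / v for v in variances]               # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]   # random effects weights
    mean = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "mean": mean,
        "se": se,
        "ci95": (mean - 1.959964 * se, mean + 1.959964 * se),
        "Q": q,
        "tau2": tau2,
        "I2": i2,
    }

# Hypothetical study-level SMDs and variances, for illustration only.
result = random_effects_dl([0.12, 0.08, 0.15, -0.02, 0.10],
                           [0.004, 0.006, 0.010, 0.020, 0.003])
print(result)
```

When Q does not exceed its degrees of freedom, τ² and I² are truncated at zero and the random effects average coincides with the fixed effect average.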

The reported result in Hudson (2011) was a SMD of 0.03 [95% CI 0.01 to 0.04].

Figure 4.1

Reading

4.3.4 Mathematics

The study by Ecalle (2006) did not report results for mathematics. Two of the reported results indicated a positive effect favouring the treated and two indicated a negative effect favouring the comparison; two of the study‐level effects were statistically non‐significant. The weighted average was negative and statistically non‐significant. The random effects weighted standardised mean difference was ‐0.03 (95% CI ‐0.22 to 0.16, p = 0.75). The estimated τ² is 0.02 and I² is 69%, implying that there is some heterogeneity among these four studies. The forest plot is displayed in Figure 4.2.

The reported result in Hudson (2011) was a SMD of 0.02 [95% CI 0.01 to 0.04].

Figure 4.2

Mathematics

4.3.5 Sensitivity analysis

Sensitivity analyses were planned to evaluate whether the pooled effect sizes were robust across study design and components of methodological quality. We found one randomised controlled trial, and evaluated the impact of study design. For methodological quality, we further carried out sensitivity analyses for the confounding risk of bias component of the risk of bias checklists. We examined the robustness of our conclusions when we excluded the study reporting results from a randomised controlled trial and when we excluded the study with a risk of bias score of 4 on the confounding item. The analyses are performed separately by outcome, essentially replicating the meta‐analyses conducted in 4.3.3 and 4.3.4. We further examined the robustness of our conclusions when we did not multiply the reported effects by a standard deviation reduction in class size in the studies using class size as a continuous variable and when we included the reported result from the study with an unclear sign of the effect, including the effect both as a positive effect and as a negative effect. Last, we examined the robustness of our conclusions when including the extremes (smallest and largest) of the range of effect sizes from the STAR experiment.

The results of excluding the RCT study and the study with a score of 4 on the confounding risk of bias item are provided in table 4.7 and displayed in forest plots in section 12.

There were no appreciable changes in the results following removal of any of the studies.

In summary, the conclusions of the main syntheses do not change.

The results of not multiplying the reported effects by a standard deviation reduction in class size in the studies using class size as a continuous variable, of including the study with an unclear sign of the effect (Hudson, 2011) and of including the extremes of the range of effect sizes from the STAR experiment are provided in table 4.8 and displayed in forest plots in section 12.

The reading outcome lost statistical significance when Hudson (2011) was included with a negative SMD and when Bressoux (2009) and Gerritsen (2017) were included with a one student reduction in class size. Otherwise, there were no appreciable changes in the results.

In summary, the conclusion of the main synthesis concerning reading is not robust to these changes: it no longer holds when Hudson (2011) is included with a negative SMD or when a one student reduction in class size is used for Bressoux (2009) and Gerritsen (2017), although it is unchanged when Hudson (2011) is included with a positive SMD. The conclusion concerning mathematics does not change.

Table 4.7

Sensitivity Analysis. Exclusion of the RCT Study and the Study with a Score of 4 on the Confounding Risk of Bias Items. Separately by Outcome. Standardised Mean Difference (SMD) with 95% Confidence Interval (CI).

				95% CI
Outcome	Studies excluded	Number of studies k	Mean SMD	Lower	Upper
Reading	None (main analysis)	5	0.11	0.05	0.16
	RCT	4	0.10	0.03	0.16
	Confounding score of 4	4	0.11	0.05	0.17
Mathematics	None (main analysis)	4	‐0.03	‐0.22	0.16
	Confounding score of 4	3	0.06	‐0.08	0.19

Table 4.8

Sensitivity Analysis. One Student Class Size Reduction, Inclusion of Standardised Mean Difference (SMD) with Unclear Sign and Extremes of the Range of STAR SMDs. Separately by Outcome. SMD with 95% Confidence Interval (CI).

				95% CI
Outcome	Change to analysis	Number of studies k	Mean SMD	Lower	Upper
Reading		5	0.11	0.05	0.16
	One student reduction in class size in Bressoux, 2009 and Gerritsen, 2017	5	0.03	‐0.01	0.07
	Include with positive SMD [Hudson (2011)]	6	0.07	0.01	0.12
	Include with negative SMD [Hudson (2011)]	6	0.06	‐0.03	0.15
	Include KG [Hanushek (1999)]	6	0.12	0.07	0.17
	Include 1G [Folger (1989)]	6	0.14	0.05	0.24
Mathematics		4	‐0.03	‐0.22	0.16
	One student reduction in class size in Bressoux, 2009 and Gerritsen, 2017	4	‐0.00	‐0.07	0.07
	Include Hudson (2011) with positive SMD	5	0.02	‐0.07	0.11
	Include Hudson (2011) with negative SMD	5	‐0.00	‐0.11	0.10
	Include Finn (2001), 3G	5	0.03	‐0.10	0.17
	Include Folger (1989), 1G	5	0.05	‐0.13	0.23
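
The statements about statistical significance can be read directly off Table 4.8: a pooled SMD is significant at the 5 per cent level when its 95% confidence interval excludes zero. The following is a minimal sketch of that check, using the values copied from Table 4.8. The row labels, including "Main analysis" for the unlabelled first row of each outcome, and the function names are illustrative and not part of the review.

```python
# Pooled SMDs and 95% CIs copied from Table 4.8: (label, smd, lower, upper).
# "Main analysis" is an illustrative label for the unlabelled baseline row.
READING = [
    ("Main analysis", 0.11, 0.05, 0.16),
    ("One student reduction", 0.03, -0.01, 0.07),
    ("Hudson (2011) with positive SMD", 0.07, 0.01, 0.12),
    ("Hudson (2011) with negative SMD", 0.06, -0.03, 0.15),
    ("Hanushek (1999), KG", 0.12, 0.07, 0.17),
    ("Folger (1989), 1G", 0.14, 0.05, 0.24),
]
MATHEMATICS = [
    ("Main analysis", -0.03, -0.22, 0.16),
    ("One student reduction", -0.00, -0.07, 0.07),
    ("Hudson (2011) with positive SMD", 0.02, -0.07, 0.11),
    ("Hudson (2011) with negative SMD", -0.00, -0.11, 0.10),
    ("Finn (2001), 3G", 0.03, -0.10, 0.17),
    ("Folger (1989), 1G", 0.05, -0.13, 0.23),
]


def excludes_zero(lower, upper):
    """Return True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


def not_significant(rows):
    """Return the labels of rows whose 95% CI includes zero."""
    return [label for label, _smd, lower, upper in rows
            if not excludes_zero(lower, upper)]


if __name__ == "__main__":
    print("Reading, CI includes zero:", not_significant(READING))
    print("Mathematics, CI includes zero:", not_significant(MATHEMATICS))
```

For reading, only the one student reduction and the inclusion of Hudson (2011) with a negative SMD give an interval that includes zero; for mathematics, every interval includes zero.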

5 Discussion

5.1 SUMMARY OF MAIN RESULTS

This review focused on the effect of reducing the class size on students' achievement. The available evidence does suggest that there is an effect on reading achievement, although the effect is small. We found a statistically significant positive effect of reducing the class size on reading. The effect on mathematics achievement was negative and not statistically significant. The effects were measured by standardised mean differences. The weighted average reading effect was 0.11 and the weighted average mathematics effect was ‐0.03. Measured as the probability‐of‐benefit (POB) statistic, defined as the probability that a randomly selected score from the treated population (small classes) would be greater than a randomly selected score from the comparison population, the reading POB was 0.531. A standardised mean difference of 0.11 in reading therefore corresponds to a 53 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population. The lower and upper bounds of the 95% confidence interval correspond to a 51 and a 55 per cent chance, respectively, of a randomly selected score of the treated being higher than a score from the comparison population.

A standardised mean difference of ‐0.03 in mathematics corresponds to a 49 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population. The lower and upper bounds of the 95% confidence interval correspond to a 44 and a 55 per cent chance, respectively, of a randomly selected score of the treated being higher than a score from the comparison population.
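
The conversion from an SMD to the POB statistic is not spelled out in this section. A minimal sketch follows, under the assumption (ours, not stated here) that scores in both populations are normally distributed with equal variance, in which case POB = Φ(SMD/√2), where Φ is the standard normal distribution function. The function name is illustrative. Under this assumption the sketch reproduces the percentages reported above.

```python
import math


def probability_of_benefit(smd):
    """P(random treated score > random comparison score).

    Assumes normal outcomes with equal variance in both populations,
    so that POB = Phi(smd / sqrt(2)). Using Phi(x) = 0.5 * (1 + erf(x / sqrt(2))),
    this simplifies to 0.5 * (1 + erf(smd / 2)).
    """
    return 0.5 * (1.0 + math.erf(smd / 2.0))


if __name__ == "__main__":
    # Point estimates and 95% CI bounds reported for reading and mathematics.
    for outcome, values in (("reading", (0.11, 0.05, 0.16)),
                            ("mathematics", (-0.03, -0.22, 0.16))):
        smd, lower, upper = (probability_of_benefit(v) for v in values)
        print(f"{outcome}: POB {smd:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

An SMD of zero gives a POB of exactly 0.5, which is why effects of this size translate into probabilities only a few percentage points away from an even chance.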

None of the studies that could be used in the meta‐analysis provided secondary outcomes.

5.2 OVERALL COMPLETENESS AND APPLICABILITY OF EVIDENCE

In this review we included a total of ten studies in the data synthesis, and of these only five were used in the meta‐analysis. This number is very low compared to the large number of studies (127) meeting the inclusion criteria. The reduction was caused by three different factors. A total of 45 studies analysed data from the STAR experiment. Only four of these studies could be used in the data synthesis, and none of them were included in the meta‐analysis because the decision rule described in the protocol could not be applied.

Of the remaining 82 studies not analysing STAR data, 18 studies did not report effect estimates or provide data that would allow the calculation of an effect size. Fifty eight studies were judged to have a very high risk of bias (5 on the scale) and, in accordance with the protocol, we excluded these from the data synthesis on the basis that they would be more likely to mislead than inform.
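
The study counts above can be tallied as follows. This is a small arithmetic sketch using only the numbers reported in this section. The final step, dropping one study with an unclear sign (Hudson, 2011) to arrive at the five studies in the meta‐analysis, is our reading of the sensitivity analysis rather than something stated in this paragraph, and the variable names are illustrative.

```python
def study_flow():
    """Tally the studies from inclusion to meta-analysis, as reported in the text."""
    total_included = 127
    star = 45
    non_star = total_included - star            # 82 studies not analysing STAR data

    no_effect_estimate = 18                     # no usable effect size reported
    very_high_risk_of_bias = 58                 # score of 5, excluded per protocol
    non_star_usable = non_star - no_effect_estimate - very_high_risk_of_bias

    star_usable = 4                             # STAR studies usable in data synthesis
    data_synthesis = non_star_usable + star_usable

    # Assumption: the one non-STAR study left out of the main meta-analysis
    # is the study with an unclear sign of the effect (Hudson, 2011).
    unclear_sign = 1
    meta_analysis = non_star_usable - unclear_sign

    return {
        "non_star": non_star,
        "non_star_usable": non_star_usable,
        "data_synthesis": data_synthesis,
        "meta_analysis": meta_analysis,
    }


if __name__ == "__main__":
    for step, count in study_flow().items():
        print(f"{step}: {count}")
```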

If all the 82 studies had provided an effect estimate with lower risk of bias, the final list of useable studies in the data synthesis would have been larger, which in turn would have provided a more robust literature on which to base conclusions.

The five studies used in the meta‐analysis covered France, the Netherlands and USA, whereas 41 countries were represented by the 82 studies. The geographical coverage thus became narrower as studies from Australia, Belgium, Bolivia, Canada, Chinese Taipei, Colombia, Croatia, Cyprus, Czech Republic, Denmark, England, Germany, Greece, Hong Kong, Hungary, Iceland, Ireland, Israel, Italy, Japan, Korea, Lesotho, Lithuania, Malta, New Zealand, Norway, Poland, Portugal, Romania, Scotland, Singapore, Slovak Republic, Slovenia, Spain, Sri Lanka, Sweden, Switzerland and UK could not be used in the data synthesis. This is a clear limitation of the review.

All the studies used in the meta‐analysis were restricted to grade levels kindergarten to grade 3. This is also a clear limitation of the review.

It was not possible to examine the impact of the moderators.

None of the studies were eligible for analysis of any of the secondary outcomes.

5.3 QUALITY OF THE EVIDENCE

The majority of studies used non‐randomised designs. Overall the risk of bias in the included studies was high.

Among the 82 studies not analysing STAR data, fifty eight studies were judged to be at very high risk of bias. Among the 45 studies analysing STAR data, 14 studies were judged to be at very high risk of bias.

The risk of bias was examined using a tool for assessing risk of bias incorporating non‐randomised studies. We attempted to enhance the quality of the evidence in this review by excluding studies judged to be at very high risk of bias using this tool. We believe this process excluded those studies that are more likely to mislead than inform.

Furthermore, we performed a number of sensitivity analyses for each outcome to check whether the obtained results are robust across study design and methodological quality, to inclusion of a result with an unclear sign, inclusion of effect sizes from the STAR experiment and to multiplying the reported effect with a standard deviation reduction in class size in the studies using class size as a continuous variable.

To check the robustness across study design and methodological quality, we removed the study reporting results from a randomised controlled trial and we removed the study with risk of bias score of 4 on the confounding item. The overall conclusions did not change.

The reading outcome, however, lost statistical significance when the study with an unclear sign was included with a negative SMD and when the two studies using class size as a continuous variable were included with a one student reduction in class size instead of the one standard deviation reduction in class size reported in the studies. Otherwise the conclusions did not change.

There was overall inconsistency in the direction of effects on both the reading outcome and the mathematics outcome. Some effects favoured small classes and some effects favoured regular classes.

5.4 LIMITATIONS AND POTENTIAL BIASES IN THE REVIEW PROCESS

We believe that all the publicly available studies on the effect of a reduction in class size on student achievement up to the censor date were identified during the review process. However, eighteen references were not obtained in full text.

We believe that there are no other potential biases in the review process as two members of the review team independently coded the included studies. Any disagreements were resolved by discussion. Further, decisions about inclusion of studies and assessment of study quality were made by two review authors independently and minor disagreements resolved by discussion. Numeric data extraction was made by one review author and was checked by a second review author.

5.5 AGREEMENTS AND DISAGREEMENTS WITH OTHER STUDIES OR REVIEWS

To our knowledge this is the first systematic review of the literature on the effects on student achievement of reducing the class size; no directly comparable literature exists.

Early related contributions are the meta‐analysis by Glass and Smith (1979) and the updated literature reviews by Hanushek (Hanushek, 1989; 1999; 2003). Both samples of studies, however, included a number of studies analysing pupil‐teacher ratios and not the actual class size and both contributions included several estimates from the same datasets. Glass & Smith (1979) analysed 725 comparisons from their 77 included studies and, based on a meta‐regression model, concluded: ‘There is little doubt that, other things equal, more is learned in smaller classes’ (p.15). The overall effect size (SMD) is 0.088 and they find no differential effects of subject taught.

Hanushek's quantitative summary of the literature is based on 277 estimates drawn from 59 studies. Based on a vote counting method, Hanushek concluded that “there is no strong or consistent relationship between school resources and student performance” (Hanushek, 1989, p. 47).

A more recent review is found in Shin & Chung (2009), which, however, also includes several estimates from the same data set but only includes studies analysing actual class size. Further, only studies conducted in the US and published in the period from 1989 to 2008 were included. Ultimately, 17 studies were included for analysis of which 8 are studies analysing STAR data. They computed a total of 120 effect sizes from the 17 studies. Based on a random effects model they find that combining all 120 effect estimates (of which 78 are from STAR) without considering dependence between them the pooled standardised mean difference (SMD) is 0.20. When dependence is taken into consideration, by using state as the unit of analysis (they use the average SMD per state implying the effect size used for Tennessee is a simple average of the 78 SMD based on STAR data), the pooled SMD decreases to 0.08.

Most recently, Chingos (2013) offers a review, though not a systematic review, which like the two earlier reviews also includes actual class size and pupil‐teacher ratio without distinguishing between them. No data synthesis is performed, but a narrative synthesis is given (although effect sizes from each included study are shown where possible) and the overall conclusion is: ‘The evidence on the efficacy of class size is clearly mixed, with one high‐quality study finding quite large effects, another finding no effects, and a handful finding effects in between’ (p. 430).

The conclusions of these earlier reviews are, with the exception of Hanushek's reviews, that the evidence is either mixed or favours small classes. However, none of the reviews properly take into consideration the dependence between effect estimates used in the analyses and, with the exception of Shin & Chung (2009), they do not distinguish between actual class size and pupil/teacher ratio. Therefore the results are not directly comparable to the results of our review. The available evidence analysed in our systematic review does suggest that there is an effect of reducing class size on student achievement, although only in reading and the size of the effect is small. As such, the conclusions are not inconsistent, even though the earlier reviews are based on different inclusion criteria concerning the intervention and on substantially different approaches and statistical methods compared to ours.

6 Authors' conclusions

6.1 IMPLICATIONS FOR PRACTICE AND POLICY

The effectiveness of small class sizes for improving student achievement has been one of the most debated issues in educational research. One strand of class size research points to small and insignificant effects, another points to positive and significant effects. In this review, the intervention has been class size reduction. Studies only considering average class size measured as student‐teacher ratio at school level (or higher levels) were not included.

We have found evidence that there is an effect on reading achievement, although the effect is very small. We found a statistically significant positive effect of reducing the class size on reading. The effect on mathematics achievement was negative and not statistically significant.

Measured as the probability‐of‐benefit (POB) statistic, defined as the probability that a randomly selected score from the treated population (small classes) would be greater than a randomly selected score from the comparison population, the overall reading effect corresponds to a 53 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population. The overall effect on mathematics achievement corresponds to a 49 per cent chance that a randomly selected score of a student from the treated population of small classes is greater than the score of a randomly selected student from the comparison population.

Class size reduction is costly and the available evidence points to no or only very small effect sizes of small classes in comparison to larger classes. Taking the individual variation in effects into consideration, we cannot rule out the possibility that small classes may be counterproductive for some students. It is therefore crucial to know more about the relationship between class size and achievement and how it influences what teachers and students do in the classroom in order to determine where money is best allocated.

6.2 IMPLICATIONS FOR RESEARCH

In this review we found evidence that reducing the class size results in an increased reading score, although the impact is very small. We found no evidence of an impact on the mathematics score.

By excluding from the data synthesis studies judged to be at very high risk of bias this review aimed at enhancing the quality of the evidence on the effects of reducing class size. We believe this process excluded those studies that are more likely to mislead than inform on the true effect sizes. Overall the risk of bias in the studies included in the review was high. Many of the available studies were judged to be at very high risk of bias. Fifty‐one of the studies not analysing STAR data were given a score of 5 on the confounding item, corresponding to a risk of bias so high that the findings should not be considered in the data synthesis. Of the remaining 13 studies, four were given a score of 5 on the Other risk of bias item and three were given a score of 5 on the Selective reporting item, corresponding to a risk of bias so high that the findings should not be considered in the data synthesis, leaving only six studies to be meta‐analysed.

Some of the studies judged to be at very high risk of bias based the analysis on an instrumental variable (IV) design relying on an average of class size (grade or regional) as instrument or a rule of maximum class size (and some studies in addition restricted the analysis to intervals around the discontinuities in class size induced by maximum class‐size rules). These studies, however, failed to deliver convincing arguments that the identification strategies were not subject to too high risk of selection. In general, the studies relying on an average class size as instrument did not explain or discuss the assumption that the instrument does not affect outcomes other than through its effect on class size, and in some cases even the (first stage) effect on class size was very weak. In general, there was a lack of country specific information given in the studies using rules of maximum class size (does the rule apply to all schools, and to what extent is it binding?). In addition, in some studies the IV class size was based on enrolment by the end of the school year and not the beginning, which made it potentially endogenous.

A further concern about the practical use of effect sizes from studies using rules of maximum class size as instrument is that, between the discontinuities triggered by the rules, predicted class size varies with actual enrolment, which is a function of the covariates. Therefore, predicted class size is not a valid instrument except when the rule triggers a change in the number of classes. Further, identification arises only when the rule binds, so if one uses a rule that binds only in some schools, one learns about the effects of class size only for those schools.

In general, studies using IV for causal inference only provide an estimate for a specific group, namely people whose behaviour changes due to changes in the particular instrument used. It is not informative about effects on never‐takers and always‐takers because the instrument does not affect their treatment status. The estimated effect is thus applicable only to the subpopulation whose treatment status is affected by the instrument. As a consequence, the effects differ for different IVs and care has to be taken as to whether they provide useful information. The effect is interesting when the instrument it is based on is interesting in the sense that it corresponds to a policy instrument of interest. Further, if those that are affected by the instrument are not affected in the same way, the IV estimate is an average of the impacts of changing treatment status in both directions, and cannot be interpreted as a treatment effect. To turn the IV estimate into a local average treatment effect (LATE) requires a monotonicity assumption: the movements induced by the instrument go in one direction only, from no treatment to treatment. The IV estimate, interpreted as a LATE, is only applicable to the complier population, those that are affected by the instrument in the ‘right way’. It is not possible to characterise the complier population as an observation's subpopulation cannot be determined and defiers do not exist by assumption.

In the binary‐treatment – binary‐instrument context, the IV estimate can, given monotonicity, be interpreted as a LATE; i.e. the average treatment effect for the subpopulation of compliers. If treatment or instruments are not binary, interpretation becomes more complicated. In the binary‐treatment – multivalued‐instrument (ordered to take values from 0 to J) context, the IV estimate, given monotonicity, is a weighted average of pairwise LATE parameters (comparing subgroup j with subgroup j−1). The IV estimate can thus be interpreted as the weighted average of average treatment effects in each of the J subgroups of compliers. In the multivalued‐treatment (ordered to take values from 0 to T) – multivalued‐instrument (ordered to take values from 0 to J) context, the IV estimate for each pair of instrument values, given monotonicity, is a weighted average of the effects from going from t−1 to t for persons induced by the change in the value of the instrument to move from any level below t to the level t or any level above. Persons can be counted multiple times in forming the weights.

As the effect of class size belongs to the multivalued‐treatment – multivalued‐instrument category, the results of the studies using IV for causal inference would have been very difficult, not to say impossible, to interpret and use for any practical purposes even if they had delivered convincing arguments that the instruments used were not subject to high risk of selection.
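
To make the simplest case discussed above concrete (binary treatment, binary instrument, no defiers), the following sketch builds a small population of compliers, always‐takers and never‐takers and computes the IV (Wald) estimate as the difference in mean outcomes between the two instrument values divided by the difference in treatment rates. All group sizes, baseline scores and effects are hypothetical numbers chosen for illustration, and the names are ours. The sketch covers only the binary case, not the multivalued setting that applies to class size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    name: str        # "complier", "always-taker" or "never-taker"
    count: int       # hypothetical number of students at EACH instrument value
    y0: float        # hypothetical outcome without treatment
    effect: float    # hypothetical individual treatment effect


def takes_treatment(group, z):
    """Treatment status under instrument value z (monotonicity: no defiers)."""
    if group.name == "always-taker":
        return 1
    if group.name == "never-taker":
        return 0
    return z  # compliers follow the instrument


def wald_estimate(groups):
    """(E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])."""
    n = sum(g.count for g in groups)
    mean_y, mean_d = {}, {}
    for z in (0, 1):
        y_total = sum(g.count * (g.y0 + takes_treatment(g, z) * g.effect)
                      for g in groups)
        d_total = sum(g.count * takes_treatment(g, z) for g in groups)
        mean_y[z], mean_d[z] = y_total / n, d_total / n
    return (mean_y[1] - mean_y[0]) / (mean_d[1] - mean_d[0])


def average_effect(groups):
    """Average treatment effect over the whole population."""
    n = sum(g.count for g in groups)
    return sum(g.count * g.effect for g in groups) / n


# Hypothetical population: the three groups have different effects.
GROUPS = [
    Group("complier", 40, 50.0, 2.0),
    Group("always-taker", 30, 60.0, 5.0),
    Group("never-taker", 30, 40.0, -2.0),
]

if __name__ == "__main__":
    print("IV (Wald) estimate:", round(wald_estimate(GROUPS), 6))
    print("Population average effect:", round(average_effect(GROUPS), 6))
```

In this constructed example the IV estimate equals the compliers' effect (2.0) and differs from the population average effect (1.7), because the outcomes of always‐takers and never‐takers do not change with the instrument and cancel out of the numerator.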

As studies from a variety of countries (38 countries) could not be used in the data synthesis the geographical coverage of the evidence of the effects of reducing the class size became rather narrow, covering only three countries, two European and the US.

The planned examination of potential moderators of the effect, such as gender, age, intensity and duration, was not possible due to the low number of studies included in the data synthesis. If effect sizes from all the countries represented in the review had been useable in the data synthesis, additional valuable information about the heterogeneous effects of reducing the class size may have resulted.

These considerations point to the need for future studies that more thoroughly discuss the identifying assumptions and justify their choice of method by considering and reporting all relevant data and tests. Further, future studies should rely on identification strategies where the resulting effect sizes are manageable to interpret and use for practical and political purposes.

It would be natural to consider conducting a large randomised controlled trial (or a series of large RCTs) with specific allocation to small or standard size classes. Specific attention would also have to be paid to stringency in terms of conducting a well‐designed RCT with low risk of bias as well as ensuring that the sample sizes are large enough to enable sufficient power. The trial or trials should be designed, conducted and reported according to methodological criteria for rigour in respect of internal and external validity in order to achieve robust results regarding both the short‐term and the longer‐term effects.

7 Methods Not Implemented

7.1.1 Assessment of reporting bias

We were unable to comment on the possibility of publication bias because there were insufficient studies for the construction of funnel plots.

7.1.2 Moderator analysis and investigation of heterogeneity

We planned to investigate the following factors with the aim of explaining observed heterogeneity: study‐level summaries of participant characteristics (studies considering a specific age (or grade level) group or socioeconomic status group, or studies where separate effects by high/low socioeconomic status or by age (grade level) are available), intensity (size of reduction and initial class size) and duration (number of years in a small class).

There were, however, insufficient studies for moderator analysis to be performed.

8 References

8.1 REFERENCES TO INCLUDED STUDIES

A reference marked with ‐ is a working paper attached to the primary reference listed just above it.

8.1.1 STAR studies

8.1.2 Non‐STAR studies

8.2 REFERENCES TO EXCLUDED STUDIES

8.3 REFERENCES TO UNOBTAINABLE STUDIES

8.4 ADDITIONAL REFERENCES

9 Information about this review

9.1 REVIEW AUTHORS

Lead review author:
Name:	Trine Filges
Title:	Senior Researcher
Affiliation:	SFI‐Campbell
Address:	Herluf Trollesgade 11
City, State, Province or County:	Copenhagen
Postal Code:	1052
Country:	Denmark
Phone:	45 33480926
Email:	tif@sfi.dk
Co‐authors:
Name:	Christoffer Scavenius Sonne‐Schmidt
Title:	Researcher
Affiliation:	SFI‐Campbell
Address:	Herluf Trollesgade 11
City, State, Province or County:	Copenhagen
Postal Code:	1052
Country:	Denmark
Phone:	45 33480971
Email:	css@sfi.dk
Name:	Anne Marie Klint Jørgensen
Title:	Librarian/Information Specialist
Affiliation:	SFI‐Campbell
Address:	Herluf Trollesgade 11
City, State, Province or County:	Copenhagen
Postal Code:	1052
Country:	Denmark
Phone:	45 33480868
Email:	amk@sfi.dk

9.2 ROLES AND RESPONSIBILITIES

Below is listed who is responsible for the following areas:

9.3 SOURCES OF SUPPORT

SFI Campbell.

9.4 DECLARATIONS OF INTEREST

None.

9.5 PLANS FOR UPDATING THE REVIEW

We plan to update the review with a frequency of two years. Trine Filges will be responsible.

9.6 AUTHOR DECLARATION

Authors' responsibilities

By completing this form, you accept responsibility for maintaining the review in light of new evidence, comments and criticisms, and other developments, and updating the review at least once every five years, or, if requested, transferring responsibility for maintaining the review to others as agreed with the Coordinating Group. If an update is not submitted according to agreed plans, or if we are unable to contact you for an extended period, the relevant Coordinating Group has the right to propose the update to alternative authors.

Publication in the Campbell Library

The Campbell Collaboration places no restrictions on publication of the findings of a Campbell systematic review in a more abbreviated form as a journal article either before or after the publication of the monograph version in Campbell Systematic Reviews. Some journals, however, have restrictions that preclude publication of findings that have been, or will be, reported elsewhere, and authors considering publication in such a journal should be aware of possible conflict with publication of the monograph version in Campbell Systematic Reviews. Publication in a journal after publication or in press status in Campbell Systematic Reviews should acknowledge the Campbell version and include a citation to it. Note that systematic reviews published in Campbell Systematic Reviews and co‐registered with the Cochrane Collaboration may have additional requirements or restrictions for co‐publication. Review authors accept responsibility for meeting any co‐publication requirements.

I understand the commitment required to update a Campbell review, and agree to publish in the Campbell Library. Signed on behalf of the authors:

Form completed by: Trine Filges Date: 10 October 2018

10 Characteristics of included studies

10.1 NON‐STAR STUDIES

Study	Used/reason not used in data synthesis	Treatment year (s)	Country
Achilles, 1995	Too high risk of bias on the confounding item	1991‐1994	USA
Akerhielm, 1995	Too high risk of bias on the confounding item	1988	USA
Angrist, 1999	Too high risk of bias on the confounding item	1991	Israel
Angrist, 2014	Too high risk of bias on the confounding item	2009‐2011	Italy
Annevelink, 2004	Too high risk of bias on the confounding item	2000‐2001	NL
Blatchford, 2002	Too high risk of bias on the selective reporting item	1996/97	UK
Blatchford, 2003a	Collection of results from British Class Size Study. Cannot assess RoB as not enough information is provided. Only one effect size reported (but not number of observations used, so cannot calculate standard errors), the rest reported as NS or a narrative description such as ‘there was found to be an effect’.	1996/97 and maybe 1997/98	UK
Blatchford, 2003b	No results reported other than graphs without CI.	1996‐1999	UK
Bonesrønning, 2003	Too high risk of bias on the confounding item	1998‐2000	Norway
Boozer, 1995	Too high risk of bias on the confounding item	1988	USA
Boozer, 2001a	Too high risk of bias on the confounding item	1985‐1990	New Zealand
Boozer, 2001b	Too high risk of bias on the confounding item	1988	USA
Borland, 2005	Too high risk of bias on the confounding item	1990	USA
Bosworth, 2014	Not enough information provided to calculate standard errors	2001‐2002	USA
Bressoux, 2009	Used in data synthesis	1991‐1992	France
Breton, 2012	Too high risk of bias on the confounding item	1997	Colombia
Burde, 1990	Too high risk of bias on the confounding item	1988	USA
Carpenter, 2003	Too high risk of bias on the selective reporting item	1996/1997	UK
Chargois, 2008	Too high risk of bias on the confounding item	2007	USA
Clanet, 2010	Only report the significance level and only sign of the effects that are significant	2001‐2002	France
Costello, 1992	Too high risk of bias on the confounding item	1995	USA
Dee, 2011	Subject specific test score, may be mathematics, reading, science or history but not specified. First difference between subjects is outcome	1988	USA
Dennis, 1986	Too high risk of bias on the confounding item	1985‐1986	USA
Dharmadasa, 1995	Too high risk of bias on the confounding item	1989	Sri Lanka
Dieterle, 2013	Only have data at required level for two of three grades and do not provide useable separate results	2003‐2004	USA
Dobbelsteen, 2002	Too high risk of bias on the confounding item	1994/1995	NL
Ecalle, 2006	Used in data synthesis	2002‐2003	France
Galton, 2012	Too high risk of bias on the confounding item	2004‐2008	Hong Kong
Gerritsen, 2017	Used in data synthesis	1994‐2005	NL
Gilman, 1988a	Too high risk of bias on the confounding item	1984‐1988	USA
Gilman, 1988b	Too high risk of bias on the confounding item	1985	USA
Haenn, 2002	Too high risk of bias on the confounding item	1994/1995 to probably 2001	USA
Hallinan, 1985	Too high risk of bias on the confounding item	Not reported	USA
Hirschfeld, 2016	Too high risk of bias on the confounding item	2016	USA
Hojo, 2011	Too high risk of bias on the confounding item	2007	Japan
Hojo, 2013	Too high risk of bias on the other bias item	2003	Japan
Hudson, 2011	Used in data synthesis	1990	USA
Iacovou, 2002	Too high risk of bias on the confounding item	1965, 1969 and 1974	UK
Iversen, 2013	Too high risk of bias on the confounding item	2003‐2004	Norway
Jakubowski, 2006	Too high risk of bias on the confounding item	2002‐2004	Poland
Konstantopoulos, 2014	Too high risk of bias on the confounding item	2001	Greece
Konstantopoulos, 2016	Too high risk of bias on the confounding item	2003 and 2007	Cyprus
Konstantopoulos, 2016	Too high risk of bias on the confounding item	2011	Multiple
Krueger, 2002	Too high risk of bias on the confounding item	1998‐1999	Sweden
Lavy, 2001	Too high risk of bias on the confounding item	1991	Israel
Levin, 2001	Too high risk of bias on the confounding item	1994/1995	NL
Li, 2015	Too high risk of bias on the confounding item	2011	Multiple
Li, 2017	Too high risk of bias on the confounding item	2011	Multiple
Lindahl, 2005	Too high risk of bias on the confounding item	1998	Sweden
Ma, 2006	Too high risk of bias on the confounding item	1994/1995	NL
Maier, 1997	A Regular classroom refers to a classroom with one teacher. Most regular classrooms have 15 or fewer students, but a few exceed 15. A 2‐Teacher Team classroom is a class where two teachers work collaboratively to teach as many as 30 students. A Shared‐Space classroom is a classroom that has been fitted with a temporary wall that creates two teaching spaces, each with one teacher and about 15 students. A Floating Teacher classroom is a room consisting of one teacher and about 30 students, except during reading, language arts, and mathematics instruction when another teacher joins the class to reduce the ratio to 15:1. Only analyse effect of type of classroom within SAGE schools.	1995‐1996	USA
Maples, 2009	Too high risk of bias on the confounding item	2006‐2007	USA
McGiverin, 1989	Too high risk of bias on the confounding item	1984‐85	USA
Merritt, 2011	Too high risk of bias on the other bias item	2010	USA
Milesi, 2006	Used in data synthesis	1998‐1999	USA
Molnar, 1998	See [Maier, 1997]	1997‐1998	USA
Molnar, 1999a	See [Maier, 1997]	1998‐1999	USA
Molnar, 1999b	See [Maier, 1997]	1996‐1998	USA
Molnar, 2001	See [Maier, 1997]	2000‐2001	USA
Moshoeshoe, 2015	Too high risk of bias on the confounding item	2000	Lesotho
Munoz, 2001	Used in data synthesis	1999‐2000	USA
Murdoch, 1986	Only report p values from a multivariate model (8 outcomes) with CS, age, gender and school, separated by grade	1984‐1985	USA
Maasoumi, 2005	No method/results we can use (first or second order stochastic dominance tests)	1988	USA
Nandrup, 2016	Too high risk of bias on the confounding item	2009/2010‐2011/2012	Denmark
NICHD, 2004	Not enough information provided to calculate standard errors	1990‐1991	USA
Otsu, 2015	Relevant results are presented graphically and no ES and SE can be extracted. (Uses selected data of Angrist and Lavy (1999); schools with either one or two classes in grade 4)	1991	Israel
Pollard, 1995	Too high risk of bias on the confounding item	1990‐1992 and 1996‐1997	USA
Pong, 2001	Too high risk of bias on the confounding item	1994‐1995	Multiple
Sanogo, 1994	Reproduction of STAR and Indiana PRIME Time results (Word et al. 1990 and Tillitsky, Gilman, Mohr, and Stone, 1988). Do not report what type of classes are included in the PRIME Time results	1985‐1989 and 1984‐1987	USA
Shapson, 1980	They do not report outcomes for all groups for all years, so we cannot determine the effect of being randomized to one of the four arms.	1977‐1979	Canada
Tienken, 2009	Too high risk of bias on the confounding item	2001‐2006	USA
Tillitsky, 1988	Too high risk of bias on the confounding item	1984‐1987	USA
Uhrain, 2016	Too high risk of bias on the confounding item	2012‐2013	USA
Urquiola, 2006	Too high risk of bias on the other bias item	1993	Bolivia
Watson, 2016	Too high risk of bias on the confounding item	2008‐2012	Australia
Wenfan, 2005	Too high risk of bias on the confounding item	1998‐1999	USA
West, 2006	Too high risk of bias on the confounding item	1994‐1995	Multiple
Wiermann, 2005	Difference between mathematics and physics test scores (the chemistry/biology and the reading/biology differences scores 5)	2000	Germany
Wößmann, 2006	Too high risk of bias on the confounding item	1994‐1995	Multiple
Wößmann, 2003	Too high risk of bias on the confounding item	1994‐1995	Multiple
Wößmann, 2005a	Too high risk of bias on the confounding item	1995	Japan and Singapore
Wößmann, 2005b	Too high risk of bias on the confounding item	1995	Multiple

10.2 STAR STUDIES

Study	Used/not used in data synthesis	Notes
Achilles, 1993a	Not used in data synthesis	STAR. Reproduction of the results in Word et al. (1990) (significance levels from analysis‐of‐variance models) and further results on various subgroups (for example entering STAR in grade 1 or results on retained/not retained etc.)
Achilles, 1993b	Provide effect sizes from other studies.	Grade 4 results reproduced from Finn (1989) and Grade 5 results reproduced from Nye (1992) and judged 5 in the other risk of bias data item. Separate results for S vs R and R vs RA
Balestra, 2014	Provide no results that can be used in data synthesis	STAR (quantile regression) only reported for kindergarten and 1. grade and Lasting Benefit Study reanalysis of graduation from high school (not an outcome of this review)
Bingham, 1994	Provide no results that can be used in data synthesis	STAR reanalysis. No useful data provided (only means)
Chetty, 2011	Provide no results that can be used in data synthesis	STAR no useful outcomes provided. Test score as the average mathematics and reading percentile rank score attained in the student's year of entry into the experiment is only relevant outcome reported for this review.
Ding, 2005	Provide no results that can be used in data synthesis	STAR reanalysis. None of the analyses can be used for this review. Analyses the effect of each class size in the range 12‐28 relative to 22. Further report results from regressions where class size is interacted with several covariates.
Ding, 2010	Not used in data synthesis	STAR reanalysis. Structural equation model. Effects of number of years (and sequence) treated
Ding, 2011	Provide no results that can be used in data synthesis	STAR reanalysis. Uses KG data only. Do not separate R and RA. Regression with small class interacted with covariates
Douglas, 1989	Provide no results that can be used in data synthesis	Report percent of variance accounted for by factors (among others class size) affecting mean class achievement
Finn, 1989	Provide effect sizes for grade 4. Too high risk of bias (other bias item)	Report means, SD's and effect sizes for grade 4
Finn, 1990a	Provide results and data that can be used in data synthesis (although only for grade 1)	Report effect sizes, comparing small classes to the mean of regular and regular with aide. Report means for each of the three conditions and report standard deviations based on students in regular classes. Report total number of students and number of classes in the three conditions. Results divided on location (inner‐city, rural etc.) also provided. A growth analysis of students participating in the same classroom arrangement for both years and who had complete data (35%) performed but is given 5 on incomplete data
Finn, 1990b	Too high RoB	STAR reanalysis for those in same class arrangement for 3 years (K‐2. grade) Judged 5 in RoB (incomplete outcome data)
Finn, 1998	Provide effect sizes from other studies.	Reporting of effect sizes (KG‐3) from Nye, 1993 and Nye, 1992/1994.
Finn, 1999	Provide results from the LBS technical reports grade 4‐7. Could use results for grade 6 and 7 as the technical reports for these grades are not available (scores 5 on the other risk of bias item though). Otherwise no results are provided that can be used in data synthesis.	Reporting of effect sizes (KG‐3) from Finn (1998) (who reports effect sizes from other studies). Reporting of effect sizes for grades 4, 5, 6 and 7 from Finn et al. (1989); the LBS Technical Reports: Nye et al. (1992); Nye et al. (1993) (study not available) and Nye et al. (1994) (study not available). The result for 6. Grade is to a large extent different from the result reported in Finn (2001). Calculate Grade Equivalence effect sizes (not an outcome of this review) and behaviour effect sizes
Finn, 2001	Provide effect sizes for grade KG‐3 and grade 4, 6 and 8. Grade 4, 6 and 8 judged 5 on the other risk of bias item.	Reanalysis of STAR and LBS. Report effect sizes, comparing small classes to regular classes. Do not report whether classes of trained teachers or out‐of‐range classes are excluded or not. Report the total number of students used, though not per grade for KG‐3. Results are slightly different than the results reported in Folger (1989) for KG‐3 grade and in Finn (1989) for 4. Grade and to a large extent different from the result reported in Finn (1999) for grade 6. LBS results judged 5 in RoB (other bias)
Finn, 2005	Too high RoB	Analysis of high school graduation. Judged 5 in RoB (other bias)
Folger, 1989	Provide effect sizes for grade KG‐3. Used in data synthesis	It is most likely small classes compared to regular classes. Includes the teachers receiving STAR training although it is unclear how many teachers were trained. According to Word (1990) and this study, 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3 received Project STAR training. According to Word et al. (1994) p. 73, 67 teachers received training in grade 2 and on page 117 it is stated that all teachers (57 teachers and 57 classes) from 13 schools received training in 2. Grade and all teachers from the same 13 schools (57 classes) received training in 3. Grade. The distribution of class type is not constant in these 13 schools; in 2. Grade it is reported there are 21 S, 19 R and 17 RA and in 3. Grade there are 25 S, 15 R and 17 RA. According to Finn et al. (2007): Second, during the summer between grade 1 and grade 2 (summer 1987), a three‐day training course was given to 54 second‐grade teachers (out of 340) from 15 STAR schools. The training was the same for all 54 teachers, since the assignment to class types had not yet been made. Excludes out‐of‐range classes although unclear how they are defined. Uses a range of 21‐28 students for regular classes (originally the range was 22‐25). Analysis of STAR includes the 67 teachers receiving STAR training (although reports that it is 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3) and excludes out‐of‐range classes, results also shown in Word (1990 and 1994).
Hanushek, 1999	Provide effect sizes for grade KG‐3. Used in data synthesis	Compares small classes to the mean of regular and regular with aide. Do not explicitly report the numbers used for analysis but probably include the classes of trained teachers and out‐of‐range classes. Report the numbers with achievement data.
Harvey, 1994	Too high RoB	STAR data, only retainees used (reanalysis). Judged 5 in RoB (other bias)
Jackson, 2013	Provide no results that can be used in data synthesis	Reanalysis uses only kindergarten and 1. Grade and a composite z‐score (average of mathematics, reading and word scores).
Jacobs, 1987	Provide no results that can be used in data synthesis and too high RoB	Is judged 5 in RoB (incomplete outcome data). Results in tables 3, 4 and 5 (for three different outcomes) have main effect for class type (not small separated out). Cross tabulation of the 3 outcomes in tables 6, 7 and 8 but only raw totals and percent scoring low/middle/high and other tables subdivided on several covariates. Scores for small class size are given in fig. 20 and 38, but no standard deviation
Konstantopoulos, 2008	Provide no results that can be used in data synthesis	STAR reanalysis. Quantile regression with covariates (gender, ethnicity and SES). Whether achievement distribution used is taken over Treated/Control or Treated+Control is not reported
Konstantopoulos, 2009	Provide no results that can be used in data synthesis and too high RoB	Reanalysis of STAR and Lasting Benefits Study data. ITT and IV analyses (same quantile regression effect of 3. grade treatment in 4‐8 grade separately), also available, and a dose analysis (judged 5 in RoB, other bias). Unclear what their achievement distribution is.
Konstantopoulos, 2011	Not used in data synthesis. Too high RoB	Reanalysis of STAR data. ITT analysis. Each school treated as an individual RCT ‐ effect size from linear regression (with small class and regular with aide compared to regular classes in the same model, cannot separate teacher effect from treatment effect in schools with only one small class and/or only one regular class (approximately 43% of schools had only one small class and 81% had only one small and/or one regular class)) ‐ overall mean calculated by inverse variance weighted random effects model. Judged 5 in RoB (other bias)
Krueger, 1999	Provide no results that can be used in data synthesis	STAR reanalysis. Average percentile scores in mathematics, reading and word (not shown separately) used for analysis.
Krueger, 2001a	Too high RoB and provide no results that can be used in data synthesis	Same analyses as Krueger & Whitmore (2001), with updated data (in addition they only report weighted averages of percentages and do not report the numbers used for analysis, so results cannot be used).
Krueger, 2001b	Too high RoB and provide no results that can be used in data synthesis	STAR follow up. Analysis of scores on two high school entrance exams is judged 5 in RoB (other bias). Analysis of entrance exam taken or not is also available (not an outcome of this review)
Mckee, 2010	Not used in data synthesis	STAR reanalysis. Only KG and merge R and RA. OLS w/wo school FE controlling for teachers with fewer than three years of experience and teachers with an advanced degree, and for the student's race‐ethnicity, gender, age, special education status, whether or not they are repeating kindergarten, attendance record, and subsidized lunch eligibility. Specifications that do not include school fixed effects also include indicators for community type (suburban, rural, urban, and inner‐city). Transform test scores to have zero mean and SD of one
McKee, 2015	Not used in data synthesis	STAR reanalysis. Use only KG and pool R and RA classes and transform test scores to have zero mean and SD of one and include covariates
Mosteller, 1995	Provides results from other articles only	Provides results from other articles: Finn, J.D., and Achilles, C.M. Answers and questions about class size: A state‐wide experiment. American Educational Research Journal (1990) 27, 3:557–77, Table 5. And Word, E., Johnston, J., Bain, H.P., et al. Student/Teacher Achievement Ratio (STAR): Tennessee's K‐3 class size study, Nashville: Tennessee Department of Education, Figures 1 and 2.
Nye, 1992	Too high RoB. Not used in the data synthesis	Technical report for fifth grade of the Lasting Benefits Study. Scores 5 on the incomplete outcome data item (and other risk of bias)
Nye, 1993	Results for KG‐3 grade used in the data synthesis. Results for grade 4 and 5 are reproduced from Finn (1989), Nye et al. (1991) (not available) and Nye (1992).	Results for grade KG‐3 are obtained comparing small classes to the mean of regular and regular with aide, also divided on white/minority (same analysis and results as in Nye, 1992/1994). Excludes the 67 teachers receiving STAR training (it is 67 teachers according to the technical report (Word 1994) page 73 (text and table IV‐12 providing the numbers used for analysis) but on page 117 and 192 and according to Word (1990) and Folger & Breda (1989) it was 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3) and includes out‐of‐range classes. Numbers used for KG and 1 grade are 5734 and 5905. Do not report the numbers used for 2. and 3. Grade analyses. Report effect sizes for grade 4 and 5 comparing small to regular. Grade 4 results reproduced from Finn, 1989 and Nye et al., 1991 (not available) and Grade 5 results reproduced from Nye (1992).
Nye, 1992/1994	Results for KG‐3 grade used in the data synthesis. Results for grade 4 and 5 are reproduced from Finn (1989), Nye et al. (1991) (not available) and Nye (1992).	Compares small classes to the mean of regular and regular with aide, also divided on white/minority (same analysis and results as in Nye, 1993). Excludes the 67 teachers receiving STAR training (it is 67 teachers according to the technical report (Word, 1994) page 73 (text and table IV‐12 providing the numbers used for analysis) but on page 117 and 192 and according to Word (1990) and Folger & Breda (1989) it was 57 teachers in grade 2 from 13 randomly chosen schools and another 57 teachers in grade 3) and includes out‐of‐range classes. Numbers used for KG and 1 grade are 5734 and 5905. Do not report the numbers used for 2. and 3. Grade analyses. Report effect sizes for grade 4 and 5 comparing small to regular. Grade 4 results reproduced from Finn, 1989 and Nye et al., 1991 (not available) and Grade 5 results reproduced from Nye (1992).
Nye, 2000a	Provide no results that can be used in data synthesis	Hierarchical linear regression model separate for each grade and reading and mathematics including gender, SES and minority status, interaction of small class and gender, SES and minority respectively and (three way) interaction of small class, gender and minority and a similar analysis with three way interaction: small class, gender and SES. Coefficient estimates with stars. Cannot be used. Also available are effect sizes (d's) separated by white/minority and high/low SES and ES's by gender within race (white/minority) and SES (high/low) (but do not report number of observations used so we cannot calculate standard errors).
Nye, 2000b	Provide no results that can be used in data synthesis	Three analyses (two separate models for treatment as received (a two level and a three level model) and a three level model for treatment as assigned) each comparing regular to small and (for the two level model only) regular with aide (in the three level model regular and regular with aide are assumed to be the same).. Analysis separate for each grade and reading and mathematics including gender and SES, interaction of small class and gender (although coefficients shown report they are for gender and minority interaction?), geographic location of school, teacher experience, school SES and school minority. Effect size estimates with stars (indicating significance level).
Nye, 2001a	Too high RoB	STAR follow up (9. Grade). Two analyses: 1) Students who participated at least 1 year and were part of the trial in 3. Grade; 2) students participating all 4 years. Judged 5 in RoB (incomplete outcome data)
Nye, 2001b	Provide no results that can be used in data synthesis and too high RoB	STAR reanalysis, grade 1‐3, special sample: it is unclear whether some of the students in the control group they use have spent some years in a small class (the control group is characterised by: small class in some or no grades, see table 1). In the analysis for each grade they include only treated who were in small class for that grade and all previous grades. Unclear whether the control group is required to have been in the experiment for all previous grades but probably not, the total sample size increases from grade 1 to 3 whereas the treated group considerably decreases. Grade 2 and 3 judged 5 in RoB (incomplete outcome data) and it is not possible to calculate standard errors (so results for grade 1 cannot be used either)
Nye, 2002	Provide no results that can be used in data synthesis	Analysis separate for each grade and reading and mathematics including gender, SES, minority status, low achiever (below median within classes at end of kindergarten) and interaction of small class and low achiever. Coefficient estimates with stars (indicating significance level). Cannot be used. Table 1 provides effect sizes (d's) separated by low/high achievers (relative within class at end of kindergarten) (but do not report number of observations used so we cannot calculate standard errors).
Prais, 1996	Provides results from other articles and otherwise provide no results that can be used in data synthesis	STAR reanalysis. Reproduction of the Technical reports (Word, 1994) (mathematics/reading average scores) table p. 47/47 and figure p. 54/53, figure p.65/64, figure p.78/77 and figure p. 92/93 and (own) calculation of yearly value added and 3 years average of value added.
Schanzenbach, 2007	Not used in data synthesis	ITT reanalysis using composite mathematics and reading. Also provide results for composite test score for 4, 5, 6, 7 and 8 grade.
Shin, 2012	Provide no results that can be used in data synthesis	STAR reanalysis using newcomers each year only and separate by race. Several analyses: 1) ITT (by IV, random assignment as IV for actual class size, i.e. multiple CS reduction levels and include new students each year also) separated by race and controlling for race and the race difference in same equation; 2) same as 1) but in a structural simultaneous model. They investigate whether there is school‐level confounding, by comparing a model with school‐level fixed‐effects to a model without fixed‐effects (comparison of 3L ITT and 2L ITT in tables 2 and 3)
Shin, 2011	Provide no results that can be used in data synthesis	Same analyses as Shin (2012), but not separated by race. They investigate whether there is school‐level confounding, by comparing a model with school‐level fixed‐effects to a model without fixed‐effects (comparison of 3L ITT and 2L ITT in tables 4 and 5)
Sohn, 2015	Too high RoB	LBS reanalysis (CTBS data) 4., 6. and 8. grade. Analyse number of years in small class and divide on ‘effective’ (i.e. significant difference) and ineffective schools (also show total). Results cannot be used
Word, 1990 and 1994	Final report for grade KG‐3. Only significance levels are reported (cannot be used). Summary of relevant results (effect sizes) from Folger (1989) can be used.	Summary of original results. Only significance levels are reported (analysis‐of‐variance model results cannot be used as they are only reported as a summary of the analyses showing significance levels (.05, .01, .001, all levels are <=)). Provide effect sizes for KG‐3 grade from an analysis conducted by Folger (also provided in Folger & Breda, 1989).

Table 4.4

Risk of Bias ‐ Distribution of the 45 Studies Analysing STAR Data

Relevant results reported are from other included studies	7
STAR follow up data (LBS)	11
Provide no results that can be used in the data synthesis	17
Used in data synthesis	4
Provide results for only one grade (K or 1) that can be used	3
Too high risk of bias	3
Total	45

10.3 STAR STUDENTS AND CLASSES

Table 10.3.2

Number of students and transfers in percent, 1. Grade to 2. Grade

2. Grade
1. Grade		Total number	Drop out	Small	Regular	Regular/aide
	Small	1925	23	75	1	1	100
	Regular	2584	28	6	58	8	100
	Regular/aide	2320	26	2	5	67	100
	Total	6829	26	24	24	26	100
	Newcomers in 1. Grade	2314
	Transfer to 2. Grade	5049

Table 10.3.3

Number of students and transfers in percent, 2. Grade to 3. Grade

3. Grade
2. Grade		Total number	Drop out	Small	Regular	Regular/aide
	Small	2016	19	78	2	2	100
	Regular	2329	23	7	64	7	100
	Regular/aide	2495	21	2	3	74	100
	Total	6840	21	26	23	30	100
	Newcomers in 2. Grade	1791
	Transfer to 3. Grade	5413

Table 10.3.4

Total transfers

	Number	Per cent
Total drop out	5017	43
Total movers	2843	25
Total stayers	3740	32
Total STAR students	11600	100
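
The percentages in Table 10.3.4 can be reproduced from the reported counts. The following is a minimal sketch, assuming the per cent column is each count divided by the total number of STAR students and rounded to the nearest whole number; the variable names are illustrative and not taken from the review.

```python
# Counts copied from Table 10.3.4 (total transfers among STAR students).
counts = {"drop out": 5017, "movers": 2843, "stayers": 3740}

# The three groups are assumed to partition all STAR students.
total = sum(counts.values())

# Share of each group in per cent, rounded to whole numbers as in the table.
shares = {group: round(100 * n / total) for group, n in counts.items()}

for group, pct in shares.items():
    print(f"Total {group}: {counts[group]} ({pct} per cent)")
print(f"Total STAR students: {total}")
```

Under that assumption the rounded shares match the per cent column and the counts add up to the reported total.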

Table 10.3.5

Distribution of STAR classes by grade (Kindergarten‐3) by designation S (Small), R (Regular), and RA (Regular with Aide)

	Class size	K (number of classes)			1 (number of classes)			2 (number of classes)			3 (number of classes)
		S	R	RA	S	R	RA	S	R	RA	S	R	RA
B	11										2
B	12	8			2			3			2
A	13	19			14			16			15
	14	22			18			27			17
	15	23		1	31			32			31
	16	31	4		16	1		29	1		31		1
	17	24	4	1	33	1		19			27
B	18		1	2	6	2		6			10	1
	19		7	6	3	4	3	1	3	3	5		4
	20		6	6	1	10	6		2	1		9	13
	21		14	12		18	18		7	11		11	12
C	22		20	20		27	15		23	21		13	16
	23		16	21		19	20		20	21		10	14
	24		19	14		16	11		22	25		15	14
	25		6	6		7	9		9	15		16	15
B	26		4	3		5	9		6	7		5	12
	27		1	6		2	4		4	1		5	8
	28			1		1	2		1			2	6
	29					1	2		2	2		2	2
	30					1	1
Total as reported		127	99	99	124	115	100	133	100	107	140	90	107
Total as calculated		127	102	99	124	115	100	133	100	107	140	89	117
Too small¹		8	36	28	2	36	27	3	13	15	4	21	30
Too large²		0	5	10	10	10	18	7	13	10	15	14	28
Total		8	41	38	12	46	45	10	26	25	19	35	58
Out‐of‐range as reported (sum of B's)		8	33	36	12	44	45	10	25	25	19	35	57
Per cent too large for S and too small for R and RA³		0	35	28	8	31	27	5	13	14	11	24	26

Table 10.3.6

Number of STAR classes by grade (Kindergarten‐3) and designation S (small), R (Regular) and RA (Regular with Aide)

Class size
		11‐12	13‐17	18‐21	22‐25	26‐30	Total
Kindergarten	S	8	119	0	0	0	127
	R	0	8	28	61	5	102
	RA	0	2	26	61	10	99
1. Grade	S	2	112	10	0	0	124
	R	0	2	34	69	10	115
	RA	0	0	27	55	18	100
2. Grade	S	3	123	7	0	0	133
	R	0	1	12	74	13	100
	RA	0	0	15	82	10	107
3. Grade	S	4	121	15	0	0	140
	R	0	0	21	54	14	89
	RA	0	1	29	59	28	117

10.4 STAR UNCORRECTED EFFECT SIZES

Technical report:

Word, E.R., Johnston, J., Bain, H.P., Fulton, B.D., Zaharias, J.B., Achilles, C.M., Lintz, M.N., Folger, J. & Breda, C. (1994). The state of Tennessee's Student/Teacher Achievement Ratio (STAR) Project: Technical report 1985–1990. Nashville: Tennessee State Department of Education, 1994.

Finn et al., 2007

Finn, J.D., Boyd‐Zaharias, J., Fish, R.M. & Gerber, S.B. (2007). Project STAR and Beyond: Database User's Guide. HEROS, Incorporated.

	Folger, 1989	Nye, 1992/1994	Finn, 2001	Hanushek, 1999
Reading SMD [95% CI]
Kindergarten	0.21 [0.15, 0.27]	0.18 [0.12, 0.24]	0.21 [0.15, 0.27]	0.17 [0.11, 0.23]
1. Grade	0.34 [0.28, 0.40]	0.24 [0.18, 0.30]	0.30 [0.24, 0.36]	0.23 [0.17, 0.29]
2. Grade	0.26 [0.20, 0.32]	0.23 [0.17, 0.29]	0.26 [0.20, 0.32]	0.20 [0.14, 0.26]
3. Grade	0.24 [0.18, 0.30]	0.26 [0.20, 0.32]	0.22 [0.16, 0.28]	0.22 [0.16, 0.28]
Mathematics SMD [95% CI]
Kindergarten	0.17 [0.11, 0.23]	0.15 [0.09, 0.21]	0.19 [0.13, 0.25]	0.17 [0.11, 0.23]
1. Grade	0.33 [0.27, 0.39]	0.27 [0.21, 0.33]	0.31 [0.25, 0.37]	0.26 [0.20, 0.32]
2. Grade	0.23 [0.17, 0.29]	0.20 [0.14, 0.26]	0.25 [0.19, 0.31]	0.19 [0.13, 0.25]
3. Grade	0.21 [0.15, 0.27]	0.23 [0.17, 0.29]	0.15 [0.09, 0.21]	0.18 [0.12, 0.24]
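
For readers who want to reuse these estimates, a standard error can be backed out of each reported interval. This is a minimal sketch, assuming the intervals are symmetric normal-approximation intervals (point estimate plus or minus z times the standard error); this section does not state how the intervals were computed, and the function name `se_from_ci` is hypothetical.

```python
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Back out a standard error from a symmetric confidence interval.

    Assumes the interval is estimate +/- z * SE with a normal critical value.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return (upper - lower) / (2 * z)


# Kindergarten reading, Folger, 1989 column: 0.21 [0.15, 0.27]
se = se_from_ci(0.15, 0.27)
print(f"Implied standard error: {se:.4f}")
```

Every interval in the table has a half-width of 0.06, so under this assumption each cell implies roughly the same standard error.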

11 Appendices

11.1 SEARCH DOCUMENTATION

Examples of search strings used to search different host services: EBSCO, ProQuest, ISI Web of Science.

ERIC (EBSCO)

Latest search 14/2/2017. Search string from 2017 update. Search is limited from 20150101‐20171231. Search performed in full text.

International Bibliography of the Social Sciences (ProQuest)

Latest search January 2015. Search limited to 1980‐2015. Search performed in full text.

Science Citation Index & Social Science Citation Index (ISI Web of Science)

Latest search 14/2/2017. Search string from 2017 update. Search is limited from 20150101‐20171231.

Centre for Reviews and Dissemination Databases

Latest search January 2017. Search limited to 2015‐2017. Search performed in full text.

This search string was also utilised on Campbell Collaboration Library, EPPI‐Centre Systematic Reviews ‐ Database of Education, Social Care Online with minor modifications.

Searches on National Library Portals

Searches on these portals were performed in English, Danish, Swedish and Norwegian. Searches were last performed in 2015. Searches were limited to 1980‐2015.

Bibliotek.dk

Libris

BIBSYS

Grey literature sources

Latest searches performed in 2017.

Google Scholar

Search	Terms	Results
S16	S13 AND S15 – limited to 20150101‐20171231	235
S15	S5 AND S14	30,732
S14	student* OR pupil*	745,639
S13	S10 AND S11 AND S12	250
S12	S3 OR S4	12,413
S11	S1 OR S2	805
S10	S5 OR S6 OR S7 OR S8 OR S9	46,459
S9	Intellect* N2 develop*	197
S8	DE “Intellectual Development”	56
S7	School* N1 (performan* OR achiev*)	492
S6	Academic* N2 (performance* or achiev* or abilit* or outcome*)	4,251
S5	learn* OR develop* OR perform* OR achiev* OR abilit* OR outcome* OR improve*	46,459
S4	DE “Middle Schools” OR DE “Elementary Schools” OR DE “Secondary Schools” OR DE “Junior High Schools”	4,407
S3	(primary N1 School) OR (elementary N1 school) OR (secondary N1 school) OR (middle N1 school) OR (junior N1 high*)	12,413
S2	DE “Class Size” OR DE “Classroom Environment” OR DE “Crowding” OR DE “Flexible Scheduling” OR DE “Small Classes” OR DE “Teacher Student Ratio”	753
S1	class N2 size*	193

Search	Terms	Results
S1	(“class size”) OR class size* OR class near/2 size*	1690
S2	((“class size”) OR class size* OR class near/2 size) OR (((“classroom environment”) OR classroom environment OR classroom near/1 environment) OR ((“crowding”) OR crowding)) OR ((flexible NEAR/1 scheduling* OR (“flexible scheduling”)) OR (“small classes” OR small NEAR/1 classes))	3423
S3	((“primary schools”) OR primary school* OR primary NEAR/1 school) OR (((“elementary school students” OR “elementary schools”) OR elementary school OR elementary NEAR/1 school) OR ((“secondary schools”) OR secondary school OR secondary near/1 school)) OR (((“middle schools”) OR middle school OR middle near/1 school) OR ((“junior high schools” OR “junior high school students”) OR junior high OR junior near/1 high))	13820
S4	((“learning”) OR learn) OR (((“development”) OR develop OR child development) OR ((“performance”) OR perform)) OR (((“achievement”) OR achieve) OR ((“intellectual ability” OR “ability”) OR intelle near/2 abili)) OR (((“outcomes”) OR outcome) OR ((“improvement”) OR improve*))	639914
S5	(school NEAR/1 (performan* OR achiev)) OR (((“intellectual development”) OR intellectual near/1 development) OR (intelle* near/2 develop*))	2516
S6	((school NEAR/1 (performan* OR achiev)) OR (((“intellectual development”) OR intellectual near/1 development) OR (intelle* near/2 develop))) OR (((“learning”) OR learn) OR (((“development”) OR develop* OR child development) OR ((“performance”) OR perform)) OR (((“achievement”) OR achieve) OR ((“intellectual ability” OR “ability”) OR intelle near/2 abili)) OR (((“outcomes”) OR outcome) OR ((“improvement”) OR improve*)))	639924
S7	(((school NEAR/1 (performan* OR achiev)) OR (((“intellectual development”) OR intellectual near/1 development) OR (intelle* near/2 develop))) OR (((“learning”) OR learn) OR (((“development”) OR develop* OR child development) OR ((“performance”) OR perform)) OR (((“achievement”) OR achieve) OR ((“intellectual ability” OR “ability”) OR intelle near/2 abili)) OR (((“outcomes”) OR outcome) OR ((“improvement”) OR improve)))) AND (((“primary schools”) OR primary school OR primary NEAR/1 school) OR (((“elementary school students” OR “elementary schools”) OR elementary school OR elementary NEAR/1 school) OR ((“secondary schools”) OR secondary school OR secondary near/1 school)) OR (((“middle schools”) OR middle school OR middle near/1 school) OR ((“junior high schools” OR “junior high school students”) OR junior high OR junior near/1 high))) AND (((“class size”) OR class size* OR class NEAR/2 size) OR (((“classroom environment”) OR classroom environment OR classroom near/1 environment) OR ((“crowding”) OR crowding)) OR ((flexible NEAR/1 scheduling* OR (“flexible scheduling”)) OR (“small classes” OR small NEAR/1 classes)))	189

Search	Results	Search Terms
# 17	503	#16 OR #14
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 16	8	#15 AND #13 AND #12
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 15	29	(TI = (“class size*”))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 14	503	#13 AND #12 AND #11
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 13	2,073,907	#9 OR #8 OR #7 OR #6 OR #5 OR #4
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 12	75,716	#10 OR #3
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 11	10,953	#2 OR #1
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 10	67,485	(TS = (student* OR pupil*))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 9	826,667	(TS = (intellect* OR develop*))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 8	151	(TS = (“intellectual Development”))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 7	1,030,786	(TS = (school OR perform* OR achiev*))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 6	1,658,747	(TS = ((learn* OR develop* OR perfrom* OR achiev* OR abilit* OR outcome* OR improve*)))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 5	1,335,501	(TS = ((academic* OR performance* OR achiev* OR abilit* OR outcome* OR improve*)))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 4	1,863,667	(TS = ((learn* or develop* or perform* or achiev* or abilit* or outcome*)))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 3	17,143	(TS = ((primary school) OR (elementary school) OR (secondary school) OR (middle school) OR (junior high*) OR (“middle schools”) OR (“elementary schools”) OR (“secondary schools”) OR (“junior high schools”)))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 2	2,439	(TS = (“class size” OR “classroom environment” OR “crowding” OR “flexible scheduling” OR “small classes” OR “teacher student ratio”))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017
# 1	8,715	(TS = (class size*))
		Indexes = SCI‐EXPANDED, SSCI Timespan = 2015‐2017

Search	Terms	Hits
1	class size*	0
2	“Class Size” OR “Classroom Environment” OR DE “Crowding” OR “Flexible Scheduling” OR “Small Classes” OR “Teacher Student Ratio”	3
3	(Primary School) or (Elementary school) or (secondary school) or (middle school) or (Junior high) or (“Middle Schools”) OR (“Elementary Schools”) OR (“Secondary Schools”) OR (“Junior High Schools”)	29
4	learn* or develop* or perform* or achiev* or abilit* or outcome*	39712
5	learn* or develop* or perform* or achiev* or abilit* or outcome* or improve*	41059
6	Intellectual Development*	7
7	“Class Size” OR “Classroom Environment” OR DE “Crowding” OR “Flexible Scheduling” OR “Small Classes” OR “Teacher Student Ratio” AND (Primary School) or (Elementary school) or (secondary school) or (middle school) or (Junior high) or (“Middle Schools”) OR (“Elementary Schools”) OR (“Secondary Schools”) OR (“Junior High Schools”)	3
8	“Class Size” OR “Classroom Environment” OR DE “Crowding” OR “Flexible Scheduling” OR “Small Classes” OR “Teacher Student Ratio” AND Learn* or develop* or perform* or achiev* or abilit* or outcome* or improve* or Intellectual Development* AND Primary School* or Elementary school* or secondary school* or middle school* or Junior high or “Middle Schools” OR “Elementary Schools” OR “Secondary Schools” OR “Junior High Schools”	3

Search	Terms	Hits
1	EM: class size	10
2	TI: class size*	30
3	(TI: class size* OR EM: class size) – Limiters: 1980‐2015, bøger + artikler + tidsskrifter + e‐bøger, engelsk, dansk, norsk, svensk	40

Search	Terms	Hits
1	EM: “klassestørrelse*” ‐ EMNE, Bøger, tidsskrifter, artikler, 1980‐, dansk, svensk, norsk, engelsk	36

Search	Term	Hits
1	class size* ‐ Limiters: Keywords	175
2	class size* ‐ Limiters: Title	32
3	class size* ‐ Limiters: Subject	24
4	“class size” OR subject:(class size) OR title:(class size*)	145

Search	Terms	Hits
1	klasstorlek* ‐ Fritekst	34
2	tit:klasstorlek* ‐ Titel	5
3	zamn:“^Klasstorlek^” ‐ Emne	15
4	S1 OR S2 OR S3	35

Search	Terms	Hits
1	Class Size* OR “Classroom Environment” OR “Crowding” OR “Flexible Scheduling” OR “Small Classes” OR “Teacher Student Ratio”	11,669
2	Class Size* OR “Classroom Environment” OR “Crowding” OR “Flexible Scheduling” OR “Small Classes” OR “Teacher Student Ratio” AND Primary School* OR Elementary school* OR secondary school* OR middle school* OR Junior high OR “Middle Schools”	228
3	((“Class Size” OR “Classroom Environment” OR “Crowding” OR “Flexible Scheduling” OR “Small Classes” OR “Teacher Student Ratio”) AND (Primary School OR Elementary school* OR secondary school* OR middle school* OR Junior high OR “Middle Schools”) AND (learn* OR develop* OR perform* OR achiev* OR abilit* OR outcome* OR improve*))	188

Search	Terms	Hits
1	(“klassestørrelse*”)	21

Web‐source	Search	Terms	Limiters	Hits
What Works Clearinghouse ‐ U.S. Department of Education	1	class size*	Reviewed Studies	2
	2	small class*	Reviewed Studies	3
	3	classroom environment*	Reviewed Studies	12
edu.au.dk ‐ clearinghouse	1	class size*	Publications	85
European Educational Research Association	1	class size*		26
American Educational Research Association (AERA)	1	class size*		205
Social Science Research Network (SSRN)	1	class size*	Title, Abstract, Abstract ID & Keywords	762
	2	“class size”	Title	60
	3	“class size”	Title, Abstract, Abstract ID & Keywords	154

Search Documentation Template	Insert terms/details below
• Authors
• Publication
• Journal ISSN
• All of the words	class size
• Any of the words	effect RCT random review intervention trial teach learn achievement student
• None of the words
• The phrase
• Year of publication between	2015‐2017
• Data Source	Google Scholar
• Title words only	X
• Results	55
• Search Date	02/02‐2017.

11.2 FLOW CHART FOR LITERATURE SEARCH

Figure 11.1

11.3 FIRST AND SECOND LEVEL SCREENING

First level screening is on the basis of titles and abstracts. Second level screening is on the basis of full text.

The study will be excluded if one or more of the answers to questions 1‐3 are ‘No’. If the answers to questions 1 to 3 are ‘Yes’ or ‘Uncertain’, then the full text of the study will be retrieved for second level eligibility. All unanswered questions need to be posed again on the basis of the full text. If not enough information is available, or if the study is unclear, the author of the study will be contacted if possible.
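The first level decision rule described above can be sketched in Python. This is only an illustration under stated assumptions: the `Answer` enum and the `first_level_decision` function are hypothetical names and are not part of the review's own tooling.

```python
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NO = "no"
    UNCERTAIN = "uncertain"


def first_level_decision(answers):
    """Apply the first level rule to the answers for questions 1-3.

    Returns "exclude" if any answer is 'No', otherwise "retrieve full text".
    """
    if len(answers) != 3:
        raise ValueError("expected one answer for each of questions 1-3")
    # A single 'No' is enough to exclude the study.
    if any(a is Answer.NO for a in answers):
        return "exclude"
    # Every answer is 'Yes' or 'Uncertain': proceed to second level screening.
    return "retrieve full text"
```

For example, a study with answers Yes, Uncertain, Yes would go forward to full text screening, while a study with a single No would be excluded.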

First level screening questions are based on titles and abstracts

Question 1 guidance:

The intervention in this review is a reduction in class size. Studies that only consider the student‐teacher ratio are not eligible, nor are studies where the intervention is the assignment of an extra teacher (or teaching assistants or other adults) to a class.

Question 2 guidance:

Regular private, public or boarding schools are eligible. We exclude children in home‐school, in pre‐school programs, and in special education.

Question 3 guidance:

We are only interested in primary quantitative studies with a comparison group, where the authors have analysed the data. We are not interested in theoretical papers on the topic or surveys/reviews of studies of the topic. (This question may be difficult to answer on the basis of titles and abstracts alone.)

Second level screening questions based on full text

Question 4 guidance

Some studies use test score data on individual students and actual class‐size data for each student. Others use individual student data but average class‐size data for students in that grade in each school. Still others use average scores for students in a grade level within a school and average class size for students in that school. We will only include studies that use data on the individual or class level. We will exclude studies that rely on data aggregated to a level higher than the class.
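A minimal sketch of the Question 4 rule, assuming each study has been coded with the highest level of aggregation in the data it relies on. Only "individual" and "class" are named in the guidance; the labels "grade", "school" and "district", as well as the function name `passes_question_4`, are hypothetical and used for illustration only.

```python
# Ordered from least to most aggregated. The levels above "class" are
# illustrative assumptions, not categories taken from the review.
AGGREGATION_LEVELS = ["individual", "class", "grade", "school", "district"]


def passes_question_4(highest_level: str) -> bool:
    """Return True if the most aggregated data are at class level or below.

    Raises ValueError for a label that is not in AGGREGATION_LEVELS.
    """
    rank = AGGREGATION_LEVELS.index(highest_level.strip().lower())
    return rank <= AGGREGATION_LEVELS.index("class")
```

Under this coding, a study relying on grade‐level or school‐level averages would fail the check, matching the exclusion of data aggregated above the class.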

11.4 CODING FORM

Outcome measures

Instructions: Please enter outcome measures in the order in which they are described in the report. Note that a single outcome measure can be completed by multiple sources and at multiple points in time (data from specific sources and time‐points will be entered later).

OUTCOME DATA

DICHOTOMOUS OUTCOME DATA

Repeat as needed

CONTINUOUS OUTCOME DATA

Names of author(s)

Title

Language

Journal

Year

Country

Participant characteristic (age, grade level, gender, socioeconomic status, ethnicity)

Duration of class size reduction (years)

Class size (divide into treated/comparison)

Type of data used in study (administrative, questionnaire, other (specify))

Level of aggregation (individual or class)

Time period covered by analysis (divide into intervention and follow up)

Sample size (divide into treated/comparison)

#	Outcome & measure	Reliability & Validity	Format	Direction	Pg# & notes
1		Info from:  Other samples  This sample  Unclear	Dichotomy Continuous	High score or event is  Positive  Negative  Can't tell

OUTCOME	TIME POINT (s) (record exact time from participation, there may be more than one, record them all)	SOURCE	VALID Ns	CASES	NON‐CASES	STATISTICS	Pg. # & NOTES
		Questionnaire  Admin data  Other (specify)  Unclear	Participation	Participation	Participation	RR (risk ratio)  OR (odds ratio)  SE (standard error)  95% CI  DF  P‐ value (enter exact p value if available)  Chi2  Other
			Comparison	Comparison	Comparison

OUTCOME	TIME POINT (s) (record exact time from participation, there may be more than one, record them all)	SOURCE (specify)	VALID Ns	Means	SDs	STATISTICS	Pg. # & NOTES
		Questionnaire  Admin data  Other (specify)  Unclear	Participation	Participation	Participation	P  t  F  Df  ES  Other
			Comparison	Comparison	Comparison

11.5 ASSESSMENT OF RISK OF BIAS IN INCLUDED STUDIES

Risk of bias table

Risk of bias tool

Studies for which RoB tool is intended

The risk of bias model was developed by Prof. Barnaby Reeves in association with the Cochrane Non‐Randomised Studies Methods Group. This model, an extension of the Cochrane Collaboration's risk of bias tool, covers risk of bias in both randomised controlled trials (RCTs and QRCTs) and in non‐randomised studies (NRCTs and NRSs).

The point of departure for the risk of bias model is the Cochrane Handbook for Systematic Reviews of Interventions (Higgins & Green, 2008). The existing Cochrane risk of bias tool needs elaboration when assessing non‐randomised studies because, for non‐randomised studies, particular attention should be paid to selection bias / risk of confounding. An additional item on confounding is used only for non‐randomised studies (NRCTs and NRSs) and is not used for randomised controlled trials (RCTs and QRCTs).

Assessment of risk of bias

Issues when using modified RoB tool to assess included non‐randomised studies:

Confounding worksheet

Confounders described by researchers

Tick (yes⁰/no¹ judgment) if confounder considered by the researchers [Cons'd?]

Score (1[good precision] to 5[poor precision]) precision with which confounder measured

Score (1[balanced] to 5[major imbalance]) imbalance between groups

Score (1[very careful] to 5[not at all careful]) care with which adjustment for confounder was carried out
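One way to record a row of the confounding worksheet is sketched below. The `ConfounderEntry` class and its field names are hypothetical; only the 1 to 5 scales and their end‐point labels come from the worksheet description above. A score of `None` is assumed here to stand for "not scored" or "irrelevant".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfounderEntry:
    name: str
    considered: bool                  # did the researchers consider it?
    precision: Optional[int] = None   # 1 = good precision, 5 = poor precision
    imbalance: Optional[int] = None   # 1 = balanced, 5 = major imbalance
    adjustment: Optional[int] = None  # 1 = very careful, 5 = not at all careful

    def __post_init__(self):
        # Reject scores outside the 1-5 range used by the worksheet.
        for field_name in ("precision", "imbalance", "adjustment"):
            score = getattr(self, field_name)
            if score is not None and not 1 <= score <= 5:
                raise ValueError(
                    f"{field_name} must be between 1 and 5, got {score}"
                )
```

A reviewer could then create, for instance, `ConfounderEntry("Gender", True, precision=1, imbalance=2, adjustment=1)` and leave `precision` unset for confounders where it does not apply.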

User guide for unobservables

Selection bias is understood as systematic baseline differences between groups and can therefore compromise comparability between groups. Baseline differences can be observable (e.g. age and gender) and unobservable (to the researcher; e.g. motivation and ‘ability’). There is no single non‐randomised study design that always solves the selection problem. Different designs solve the selection problem under different assumptions and require different types of data. In particular, designs differ in how they deal with selection on unobservables. The “right” method depends on the model generating participation, i.e. assumptions about the nature of the process by which participants are selected into a programme.

As there is no universally correct way to construct counterfactuals, we will assess the extent to which the identifying assumptions (the assumptions that make it possible to identify the counterfactual) are explained and discussed (preferably the authors should make an effort to justify their choice of method). We will look for evidence that authors using e.g. (this is NOT an exhaustive list):

Natural experiments:

Discuss whether they face a truly random allocation of participants and whether there is any change of behaviour in anticipation of e.g. policy rules.

Instrumental variable (IV):

Explain and discuss the assumption that the instrumental variable does not affect outcomes other than through its effect on participation.

Matching (including propensity scores):

Explain and discuss the assumption that there is no selection on unobservables, only selection on observables.

(Multivariate, multiple) Regression:

Explain and discuss the assumption that there is no selection on unobservables, only selection on observables. Further discuss the extent to which they compare comparable people.

Regression Discontinuity (RD):

Explain and discuss the assumption that there is a (strict!) RD treatment rule. It must not be changeable by the agent in an effort to obtain or avoid treatment. Continuity in the expected impact at the discontinuity is required.

Difference‐in‐difference (Treatment‐control‐before‐after):

Explain and discuss the assumption that outcomes of participants and nonparticipants evolve over time in the same way.
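The difference‐in‐difference assumption above is often written as a "parallel trends" condition. The notation below is an assumption introduced here for illustration and does not appear in the review: $Y_t(0)$ is the outcome without treatment at time $t$ ($t=0$ before, $t=1$ after), and $D=1$ marks participants.

```latex
\[
E\big[\,Y_{1}(0) - Y_{0}(0) \mid D = 1\,\big]
\;=\;
E\big[\,Y_{1}(0) - Y_{0}(0) \mid D = 0\,\big]
\]
```

In words, absent the intervention, the average change over time would have been the same for participants and nonparticipants.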

Item	Judgement	Description (quote from paper, or describe key information)
1. Sequence generation
2. Allocation concealment
3. Confounding
4. Blinding?
5. Incomplete outcome data addressed?
6. Free of selective reporting?
7. Free of other bias?
8. A priori protocol?
9. A priori analysis plan?

Assessment of how researchers dealt with confounding
Method for identifying relevant confounders described by researchers:	yes	□
	no	□
If yes, describe the method used:
Relevant confounders described:	yes	□
	no	□
List confounders described on next page
Method used for controlling for confounding
At design stage (e.g. matching, regression discontinuity, instrumental variable):
………………………………………
………………………………………
………………………………………
At analysis stage (e.g. stratification, regression, difference‐in‐difference):
………………………………………
………………………………………
………………………………………
Describe confounders controlled for below

Confounder	Considered	Precision	Imbalance	Adjustment
Gender	□	□	□	□
Age	□	□	□	□
Grade level	□	□	□	□
Socioeconomic status	□	□	□	□
Baseline achievement	□	□	□	□
Local education spending	□	□	□	□
Unobservables	□	Irrelevant	□	□
Other:	□	□	□	□

12 Data and analyses

Sensitivity analysis: Reading

Sensitivity analysis: Mathematics

Supporting information

Filges, T., Sonne‐Schmidt, C. S., Nielsen, B. C. V. Small class sizes for improving student achievement in primary and secondary schools. Campbell Systematic Reviews 2018:10. DOI: 10.4073/csr.2018.10