Methodological Studies

Making Sense of Effect Sizes: Systematic Differences in Intervention Effect Sizes by Outcome Measure Type

Pages 134-161 | Received 28 May 2021, Accepted 31 Mar 2022, Published online: 01 Aug 2022
 

Abstract

One challenge in understanding “what works” in education is that effect sizes may not be comparable across studies, raising questions for practitioners and policymakers using research to select interventions. One factor that consistently relates to the magnitude of effect sizes is the type of outcome measure. This article uses study data from the What Works Clearinghouse to determine average effect sizes by outcome measure type. Outcome measures were categorized by whether the group who developed the measure potentially had a stake in the intervention (non-independent) or not (independent). Using meta-analysis and controlling for study quality and intervention characteristics, we find larger average effect sizes for non-independent measures than for independent measures. Results suggest that larger effect sizes for non-independent measures are not due to differences in implementation fidelity, study quality, or intervention or sample characteristics. Instead, non-independent and independent measures appear to represent partially but minimally overlapping latent constructs. Findings call into question whether policymakers and practitioners should make decisions based on non-independent measures when they are ultimately responsible for improving outcomes on independent measures.

Acknowledgements

The authors thank the following for providing thoughtful feedback on earlier versions of this paper: Matthew Soldner, Commissioner of the National Center for Education Evaluation and Regional Assistance (NCEE); Elizabeth Eisner, Associate Commissioner of NCEE; Jonathan Jacobson, Branch Chief of the Knowledge Utilization Division of NCEE; Erin Pollard, Education Research Analyst at NCEE; and members of the What Works Clearinghouse (WWC) Statistics, Website, and Training (SWAT) Team and the Statistical, Technical, and Analysis Team (STAT) Measurement Small Working Group. The authors also acknowledge Robert Slavin (1950–2021) for providing feedback on an earlier version of this paper, and note that this article builds on his long-standing work in this area.

Disclosure Statement

This article was written in Betsy Wolf's official capacity as part of the national conversation on education. It is intended to promote the exchange of ideas among researchers and policymakers and to express views as part of ongoing research and analysis; it does not necessarily reflect the official views of the U.S. Department of Education.

Open Scholarship

This article has earned the Center for Open Science badge for Open Data through Open Practices Disclosure. The data are openly accessible at https://osf.io/z79ju.

Open Research Statements

Study and Analysis Plan Registration

There is no study and analysis plan registration associated with this manuscript.

Data, Code, and Materials Transparency

The data, code, and materials that support the findings of this study are openly accessible at: https://osf.io/z79ju/files/osfstorage.

Design and Analysis Reporting Guidelines

This manuscript was not required to disclose its use of reporting guidelines, as it was initially submitted prior to JREE mandating the disclosure of open research practices in April 2022.

Transparency Declaration

This manuscript was not required to submit a transparency declaration, as it was initially submitted prior to JREE mandating the disclosure of open research practices in April 2022.

Replication Statement

This manuscript reports an original study.

Notes

1 Instruments for researcher and developer measures are often not included in the original studies, making it difficult to determine whether these instruments cover narrow or broad domains. Therefore, researcher and developer measures were coded as a category mutually exclusive of the narrow and broad measure categories.

2 National assessments are commercial or government assessments used by school districts or post-secondary institutions across the country to assess competency in a topic area.

3 Given the nature of most behavioral outcomes, very few measures in the behavior domain were classified as "broad"; only schoolwide measures of school climate received that classification.

4 The technical appendix contains information about how the WWC’s outcome domains were revised for this paper.

5 We dropped any covariates that were redundant with the study fixed effects.

6 We assumed effect sizes within studies to be dependent and correlated at ρ = .80, although we do not know the true covariance structure. Results were not sensitive to changes in the assumed covariance structure.
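To make the assumed dependence concrete, the sketch below (hypothetical code, not the authors' implementation; the function name and example variances are invented) builds the working within-study covariance matrix implied by a constant correlation of ρ = .80 between effect sizes in the same study:

```python
import numpy as np

def working_cov(sampling_variances, rho=0.80):
    """Assumed covariance matrix for one study's effect sizes:
    cov(d_i, d_j) = rho * se_i * se_j for i != j, with the
    sampling variances kept on the diagonal."""
    v = np.asarray(sampling_variances, dtype=float)
    se = np.sqrt(v)
    cov = rho * np.outer(se, se)
    np.fill_diagonal(cov, v)
    return cov

# A study with three effect sizes and hypothetical sampling variances:
print(working_cov([0.04, 0.05, 0.03]))
```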

7 The 95% prediction interval contains 95% of the values of the effect sizes in the study population and is calculated as $\left(\mu - 1.96\sqrt{\tau^2 + \omega^2},\; \mu + 1.96\sqrt{\tau^2 + \omega^2}\right)$, where $\mu$ is the average effect size, $\tau^2$ is the between-study variance in the effect sizes, and $\omega^2$ is the within-study variance in the effect sizes. While robust variance estimation does not require a normality assumption, $\tau^2$ and $\omega^2$ are accurately estimated only when the normality assumption is met; otherwise, these estimates are approximations.
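A minimal sketch of this calculation, using hypothetical input values (the function name is illustrative only):

```python
import math

def prediction_interval(mu, tau2, omega2, z=1.96):
    """95% prediction interval: mu +/- z * sqrt(tau2 + omega2)."""
    half_width = z * math.sqrt(tau2 + omega2)
    return mu - half_width, mu + half_width

# Hypothetical inputs: average effect size .25, between-study
# variance .04, within-study variance .01.
print(prediction_interval(0.25, 0.04, 0.01))  # approximately (-0.188, 0.688)
```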

8 We corrected the observed correlation for measurement error using the formula $r_{\text{corrected}} = \dfrac{r_{\text{observed}}}{\sqrt{r_{xx}\, r_{yy}}}$, where $r_{\text{corrected}}$ is the correlation corrected for measurement error, $r_{\text{observed}}$ is the observed correlation, and $r_{xx}$ and $r_{yy}$ are the reliabilities of the non-independent and independent measures, respectively (Wiernik & Dahlke, 2020). Reliability information was not available for the majority of outcome measures in our study data; therefore, we used the average reliability (.845) across all outcome measures in the WWC data. For the subsample of our study data where we did observe the reliability of outcome measures, the average reliability was .86 for both researcher and narrow measures.
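For illustration, a short sketch of this correction (the observed correlation below is hypothetical; the .845 reliability is the average reported above):

```python
import math

def correct_for_attenuation(r_observed, r_xx, r_yy):
    """Spearman's correction for attenuation: divide the observed
    correlation by the square root of the product of the two
    measures' reliabilities."""
    return r_observed / math.sqrt(r_xx * r_yy)

# With r_xx = r_yy = .845, the denominator is simply .845.
print(correct_for_attenuation(0.40, 0.845, 0.845))  # ~0.473
```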

9 We conducted this analysis using the Vevea & Coburn Shiny application available at https://vevealab.shinyapps.io/WeightFunctionModel/.

10 These models also include the covariates previously listed; when the covariates varied within studies, we applied the average values by study and outcome measure type.
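As an illustration of this averaging step (hypothetical data and column names; not the authors' code), covariates that vary within a study can be collapsed to study-by-measure-type means:

```python
import pandas as pd

# Hypothetical effect-size-level data: one row per effect size.
df = pd.DataFrame({
    "study_id":     [1, 1, 1, 2, 2],
    "measure_type": ["independent", "non-independent", "independent",
                     "independent", "non-independent"],
    "pct_minority": [40.0, 42.0, 44.0, 60.0, 58.0],
})

# Average each covariate within study and outcome measure type.
collapsed = df.groupby(["study_id", "measure_type"], as_index=False).mean()
print(collapsed)
```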

11 Along with RCTs, RDDs are eligible for the highest WWC study rating of "Meets without reservations." As in RCTs, the treatment assignment procedure in RDDs is part of the research design and fully known (Rossi et al., 2019).

12 There were 67 studies that included at least one non-independent (researcher or developer) measure and at least one independent (broad or narrow) measure, but only 50 studies contained both an independent and non-independent measure in the same outcome domain as determined by the WWC. We restricted to the same outcome domain for this descriptive analysis because it is plausible that an intervention may affect achievement in one outcome domain (e.g., literacy) but not another (e.g., mathematics). The meta-analytic models control for the outcome domains.

13 We conducted post-hoc Wald tests using the metafor R package.

14 The WWC selects studies to review for a variety of purposes, which may not result in a representative study sample.
