ABSTRACT
Meta-analysis has become commonplace within sport and exercise science for synthesising and summarising empirical studies. However, most research in the field focuses upon mean effects, particularly the effects of interventions to improve outcomes such as fitness or performance. Individual responses to interventions are thought to vary considerably, and interest has therefore increased in exploring precision or personalised exercise approaches. Interventions often affect not only the mean of an outcome, but also its variation. Exploration of variation in studies such as randomised controlled trials (RCTs) can yield insight into interindividual heterogeneity in response to interventions and help determine the generalisability of effects. Yet larger sample sizes than those required for typical mean effects are needed when probing variation. Thus, in a field with small samples such as sport and exercise science, exploration of variation through a meta-analytic framework is appealing. Despite the value of embracing and exploring variation alongside mean effects in sport and exercise science, it is rarely applied to research synthesis through meta-analysis. We introduce and evaluate different effect size calculations, along with models for meta-analysis of variation, using relatable examples from resistance training RCTs.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 To clarify language here for those unfamiliar, the term and concept model is used commonly in statistics. A statistical model is essentially a specification of what we think the data generating process might be for a given situation. In the context of meta-analyses, the data are usually the individual effects that we have extracted from studies, i.e., the results of each study. The model, expressed in mathematical formulae, is intended to approximate the processes that we assume led to the generation of the data.
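The assumed data generating process of a simple random-effects meta-analysis can be made concrete with a short simulation. This is an illustrative sketch, not the authors' code; all values (true mean effect, heterogeneity, sampling variances) are hypothetical.

```python
import numpy as np

# Sketch of the random-effects data generating process: each study's observed
# effect is y_i = mu + u_i + e_i, where u_i ~ N(0, tau^2) captures
# between-study heterogeneity and e_i ~ N(0, v_i) captures sampling error.
rng = np.random.default_rng(42)

mu, tau = 0.4, 0.2                    # hypothetical true mean effect and heterogeneity SD
k = 10_000                            # a large number of simulated "studies"
v = rng.uniform(0.01, 0.1, size=k)    # hypothetical per-study sampling variances

u = rng.normal(0.0, tau, size=k)          # study-specific deviations from mu
e = rng.normal(0.0, np.sqrt(v), size=k)   # sampling errors
y = mu + u + e                            # the observed study effects a meta-analysis would model
```

With many simulated studies the mean of `y` lands close to `mu`, while its spread reflects both heterogeneity and sampling error, which is exactly the decomposition a random-effects model tries to recover.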
2 Effect size is an agnostic term used for a family of statistics which communicate the strength of a given “effect” resulting from research. This includes descriptive statistics ranging from mean raw values to correlation coefficients and everything in between (Caldwell & Vigotsky, Citation2020) including, as we shall see, statistics describing variation.
3 This estimation can be done using a variety of methods, and how the different methods perform is an area of ongoing investigation; a full discussion is beyond the scope of this paper. We note, however, that the models we present all utilise Restricted Maximum Likelihood (REML) estimation.
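As a hedged sketch of what REML estimation involves here (not the authors' implementation), the between-study variance of a simple random-effects model can be estimated by numerically maximising the restricted log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reml_tau2(y, v):
    """REML estimate of the between-study variance tau^2,
    given observed effects y and their sampling variances v."""
    y, v = np.asarray(y, float), np.asarray(v, float)

    def neg_restricted_loglik(tau2):
        w = 1.0 / (v + tau2)                 # inverse-variance weights
        mu_hat = np.sum(w * y) / np.sum(w)   # weighted mean effect at this tau^2
        # Restricted log-likelihood (additive constants dropped)
        ll = -0.5 * (np.sum(np.log(v + tau2))
                     + np.log(np.sum(w))
                     + np.sum(w * (y - mu_hat) ** 2))
        return -ll

    res = minimize_scalar(neg_restricted_loglik, bounds=(0.0, 10.0), method="bounded")
    return res.x

# Toy data: hypothetical effects and sampling variances
y = [0.2, 0.5, 0.8, 0.1, 0.6]
v = [0.05, 0.04, 0.06, 0.05, 0.04]
tau2 = reml_tau2(y, v)
```

In practice one would use dedicated software (e.g., the metafor package in R), which also handles multilevel structures; the point here is only that REML adjusts the likelihood for the estimation of the mean, reducing the downward bias of ordinary maximum likelihood estimates of tau².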
4 Hence current efforts to conduct direct replications (see https://ssreplicationcentre.com/).
5 For example, strength might be examined in different studies using different operationalisations including one repetition maximum testing or maximum voluntary contractions. Or the same operationalisations may be employed but different exercises such as the squat or bench press.
6 Though notably not all meta-analyses use magnitude-based effect sizes. Indeed, some explicitly use what Caldwell and Vigotsky (Citation2020) term signal-to-noise effect sizes (e.g., Heidel et al. (Citation2022)).
7 For those unfamiliar with the terminology, an estimator for a statistic is unbiased if it produces parameter estimates that are, on average, correct. Thus a bias-corrected statistic is one which would be biased without the correction applied, but with it has been shown to be unbiased.
8 We will refer to both merely as the SMD throughout the manuscript for simplicity and note that, throughout, when reporting a “SMD” we are reporting the bias-corrected version. We also note that another magnitude-based effect size, Glass’ Δ, is commonly recommended as it is the simplest form of SMD, though it makes the assumption of the intervention having no effect on the denominator (i.e., variance; Caldwell and Vigotsky (Citation2020)).
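The small-sample bias correction referred to here is the familiar Hedges-type adjustment. The sketch below, with our own variable names and toy values, shows the pooled-SD SMD with the approximate correction factor J = 1 − 3/(4·df − 1), alongside Glass' Δ for contrast:

```python
import math

def smd_bias_corrected(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Pooled-SD standardised mean difference with the approximate
    small-sample correction J = 1 - 3 / (4 * df - 1), df = n_t + n_c - 2."""
    df = n_t + n_c - 2
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled
    j = 1 - 3 / (4 * df - 1)   # correction factor, always slightly below 1
    return j * d

def glass_delta(mean_t, mean_c, sd_c):
    """Glass' delta: standardises by the control-group SD only."""
    return (mean_t - mean_c) / sd_c

# Hypothetical post-intervention summary statistics
g = smd_bias_corrected(25.0, 20.0, 9.0, 11.0, 12, 12)
```

Because J < 1, the corrected SMD is always slightly smaller in magnitude than the uncorrected version, with the difference shrinking as sample size grows.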
9 Exploration of methodological approaches and their impact on heterogeneity have also been explored in preclinical research (Usui et al., Citation2021).
10 Though notably, in the case of health behaviour studies, volunteering for a study could conceivably motivate someone to alter various habits even when they are assigned to a control group, thus influencing change scores.
11 For one clear example, see Vigotsky et al. (Citation2020), who show that the mean and standard deviation for baseline strength values typically scale with one another across most studies.
12 The authors of the meta-analysis did not make their extracted data openly available, nor did they respond to our request for the extracted data. Further, their original analysis included 119 studies; however, we were unable to extract data for our analyses from 8 of these for a variety of reasons (e.g., only percentage change data were reported, or no standard deviations for control groups were reported).
13 Regression analyses are likely familiar to most readers: in the simplest form they try to predict the value of some dependent variable from some independent variable(s). This can be extended to meta-analytic synthesis, where the independent variables reflect characteristics associated with the effects included. For example, they may reflect characteristics of the sample in the study from which the effect was extracted, such as age or sex, or they might reflect characteristics of the intervention received, such as the dose or frequency of exposure.
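This extension can be sketched as a weighted regression of study effects on a moderator. The example below (our own illustration, not the authors' analysis) treats tau² as known for simplicity, whereas in practice it would itself be estimated (e.g., by REML); the "weekly dose" moderator and all values are hypothetical.

```python
import numpy as np

def meta_regression(y, v, x, tau2):
    """Minimal meta-regression: weighted least squares with
    inverse-variance weights 1 / (v_i + tau^2)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    X = np.column_stack([np.ones_like(y), np.asarray(x, float)])  # intercept + moderator
    W = np.diag(1.0 / (v + tau2))
    # Generalised least squares solution: (X'WX)^-1 X'Wy
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta  # [intercept, slope for the moderator]

# Toy data: effects that tend to grow with a hypothetical dose moderator
y = [0.1, 0.2, 0.4, 0.5, 0.7]
v = [0.04, 0.05, 0.04, 0.05, 0.04]
dose = [1, 2, 3, 4, 5]
beta = meta_regression(y, v, dose, tau2=0.02)
```

The slope coefficient then answers the meta-regression question directly: how much the expected effect changes per unit of the moderator, after weighting studies by their precision.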
14 It is worth noting that in the sport and exercise sciences, similarly to other fields that examine the effects of experimental interventions, the most common study design for testing or estimating intervention effects is the randomised pretest-posttest-control design (i.e., an intervention and control, or other intervention, group randomly allocated and measured pre- and post-exposure). We presented the SMD and related effect sizes in Equations 5 and 9 merely for simplicity in the introduction, but note that extensions of these for such 2 × 2 (i.e., condition × time) study designs have been presented in detail elsewhere (see: Gurevitch et al. (Citation2000); Morris et al. (Citation2007); Morris (Citation2008); Lajeunesse (Citation2011, Citation2015)) and these are the effect sizes used in the meta-analyses referred to here.
15 We also explored for signs of small study bias in the SMDs, including publication bias favouring the finding of intervention effects, given that the relative lack of awareness of variation-based effect sizes in the field implies that the SMDs might be more influenced by such biases. There did not appear to be any obvious small study bias in the dataset (see https://osf.io/stqr3).
16 We use the term arm to refer to an intervention group-control group contrast, to accommodate studies including multiple intervention groups. This is so as not to confuse the reader with the use of group to designate either the RT intervention group(s) or control group separately. Thus, in the instances of models using effect sizes relating to comparisons between an intervention group and control group (i.e., the SMD and the variation-based effect sizes), we calculate comparisons between each intervention group (i.e., arm) and the control group. Thus, where a study had for example two RT interventions and a control, two separate arms would be coded (RT intervention 1 compared to control, and RT intervention 2 compared to control). Data were coded such that study and arm had implicit nesting.
17 Technically, then, the random effects model presented earlier is also a mixed effects model. It is traditionally referred to as the random-effects model, though.
18 In contrast to the models presented examining effect sizes relating to comparisons between an intervention group and control group, in the arm-based models the term arm refers to both the intervention group(s) and control group. Thus, where a study had for example two RT interventions and a control, three separate arms would be coded (RT intervention 1, RT intervention 2, and control). Data were again coded such that study and arm had implicit nesting.
19 We do not have to limit ourselves to only fixed effect predictor terms as we have here. Indeed, for mixed effects models generally some argue that models should use a maximal random effects structure including both random intercepts and slopes (i.e., that the effect of the predictor term can vary within different levels of the model and is also assumed to come from an overarching distribution of slopes), and their correlations, to enhance generalisability of inferences (Barr et al., Citation2013). We could model a categorical variable for the outcome type and, using random effects, include it in the model as a dummy coded variable (i.e., hypertrophy = 0, and strength = 1), estimating the overall average slope or regression coefficient for the outcome type along with deviations (random slopes) from it for each study and each arm. These model specifications do not assume that the difference between outcomes is fixed, but allow it to vary between studies and arms. We could also do the same and include random slopes for the mean predictor, thus allowing the strength of the mean-variation relationship to also vary between studies and arms. Indeed, we fit a range of models with the outcome type and the mean as predictors with (1) random intercepts only for study and arm, (2) the inclusion of correlated random slopes for the outcome type by study, (3) the inclusion of correlated random slopes for the outcome type by study and arm, (4) the inclusion of correlated random slopes for the mean by study, (5) the inclusion of correlated random slopes for the mean by study and arm, (6) the inclusion of correlated random slopes for both predictors by study, and (7) the inclusion of correlated random slopes for both predictors by study and arm. The comparison of these models using Bayes factors (Kass & Raftery, Citation1995) from approximate Bayesian information criteria (Wagenmakers, Citation2007), to determine under which the observed data are most likely, is included in the supplementary materials (https://osf.io/3tv6x). There was very strong evidence supporting the random intercepts plus correlated random slopes for both predictors by study model compared to all others, and so this is presented here.
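The model-comparison step described above can be sketched briefly. Under the Wagenmakers (2007) approach, a Bayes factor between two models is approximated from their BIC values as BF ≈ exp(ΔBIC / 2); the BIC values below are hypothetical, purely to illustrate the arithmetic:

```python
import math

def approx_bayes_factor(bic_0, bic_1):
    """BIC-approximate Bayes factor favouring model 0 over model 1
    (Wagenmakers, 2007): BF01 ~= exp((BIC1 - BIC0) / 2)."""
    return math.exp((bic_1 - bic_0) / 2)

# Hypothetical values: a candidate model whose BIC is 12 units lower than
# a competing model's is very strongly favoured on the Kass & Raftery
# (1995) scale (BF > 150).
bf = approx_bayes_factor(bic_0=100.0, bic_1=112.0)
```

This is why even modest-looking BIC differences between candidate random-effects structures can translate into "very strong" evidence labels: the evidence scale is exponential in the BIC difference.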
20 Note, as with the models examining baseline scores, we similarly explored change scores with group (RT vs CON) and the mean as predictors with (1) random intercepts only for study and arm, (2) the inclusion of correlated random slopes for group by study, (3) the inclusion of correlated random slopes for the mean by study and arm, (4) the inclusion of correlated random slopes for the mean by study, (5) the inclusion of correlated random slopes for both group and the mean by study, and (6) the inclusion of correlated random slopes for both group and the mean by study and by arm (we do not include the models with random slopes for group by arm as in this model each arm refers to a particular group, RT or CON, and so no arm provides data for both). The comparison of these models using Bayes factors (Kass & Raftery, Citation1995) from approximate Bayesian information criteria (Wagenmakers, Citation2007), to determine under which the observed data are most likely, is included in the supplementary materials (see https://osf.io/b5deh for strength and https://osf.io/5ektd for hypertrophy). Similar to the models for baseline scores, there was very strong evidence supporting the random intercepts plus correlated random slopes for both predictors by study and by arm model for strength, and random intercepts plus correlated random slopes for both predictors by study for hypertrophy, compared to all other models, and so these are presented here. All estimates for the difference between RT and CON, where positive values indicate RT increased variation in change scores and negative values indicate it decreased variation, can be seen in the supplementary materials (https://osf.io/5g7ce), all of which revealed similar conclusions.
21 It is perhaps worth explaining the assumptions that the different models explored make regarding the mean-variation relationship. For example, the variability ratio and coefficient of variation ratio models can be thought of as similar in that they both make fixed assumptions about the relationship between mean and variance; in the former it is assumed to be zero, and in the latter it is assumed to be proportional, i.e., one. In both, however, this is a strong assumption. The multilevel meta-regressions, on the other hand, actually estimate this relationship (i.e., the value of the slope or regression coefficient for the mean), and in models where random slopes are included it is also allowed to vary between studies and/or arms (i.e., in some studies there may be a more or less strong relationship compared to others). Mean-variation relationships are important to consider when exploring variation effects, but it is also important to consider whether this relationship is assumed to be some fixed proportional value (i.e., as the coefficient of variation ratio does) or whether it should be estimated from the data and might also vary across studies and arms (i.e., as the multilevel meta-regression models allow). It should also be noted that these models all assume that the mean is estimated without error, which is clearly not the case. Given that for most effects that might be included in such models we can determine the sampling variance for the mean, one approach to address this might be to employ models that incorporate the variance on this predictor (i.e., measurement error or errors-in-variables models); a full discussion is beyond the scope of this paper. It is not necessarily clear which model should be preferred here, and fortunately substantive conclusions are impacted little by model specification, but thought should be given to the assumptions each makes and the fit of each model to the data.
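The contrast between these two fixed assumptions can be made concrete with the log variability ratio (lnVR) and log coefficient of variation ratio (lnCVR). The sketch below follows the general forms popularised by Nakagawa et al. (2015), with the usual small-sample correction terms; the group labels (e = experimental/RT, c = control) and any values used are our own illustration:

```python
import math

def ln_vr(sd_e, n_e, sd_c, n_c):
    """Log variability ratio: ignores the mean entirely
    (assumes a zero mean-variance relationship)."""
    correction = 1 / (2 * (n_e - 1)) - 1 / (2 * (n_c - 1))
    return math.log(sd_e / sd_c) + correction

def ln_cvr(mean_e, sd_e, n_e, mean_c, sd_c, n_c):
    """Log coefficient of variation ratio: scales each SD by its mean
    (assumes a proportional mean-variance relationship)."""
    cv_e, cv_c = sd_e / mean_e, sd_c / mean_c
    correction = 1 / (2 * (n_e - 1)) - 1 / (2 * (n_c - 1))
    return math.log(cv_e / cv_c) + correction
```

Note that when the two groups' means are equal the two effect sizes coincide; they diverge exactly when the means differ, which is when the choice of assumed mean-variance relationship matters.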
22 See supplementary materials (https://osf.io/e6vpr) for examples from model estimates for both the SMD and a variation-based effect size (used for simplicity of presenting moderator analysis results) across a range of categorical and continuous predictors for both strength and hypertrophy outcomes. There were no obvious moderators of the variation effects in particular.
23 Indeed, it can be seen from the figures that many of the individual study effect estimates have very large sampling errors.