Editorial

Power, precision, and sample size estimation in sport and exercise science research


The majority of papers submitted to the Journal of Sports Sciences are experimental. The data are collected from a sample of the population and then used to test hypotheses and/or make inferences about that population. A common question in experimental research is therefore “how large should my sample be?”. Broadly, there are two approaches to estimating sample size – using power and using precision. If a study uses frequentist hypothesis testing, it is common to conduct a power calculation to determine how many participants would be required to reject the null hypothesis assuming an effect of a given size is present. That is, if there’s an effect of the treatment (of given size x), a power calculation will determine approximately how many participants would be required to detect that effect (of size x or larger) a given percentage of the time (often 80%). Power calculations as conducted in popular software programmes such as G*Power (Faul et al., 2009) typically require inputs for the estimated effect size, alpha, power (1 – β), and the statistical tests to be conducted. All of these inputs are subjective (or informed by previous studies), and it is up to the researcher to decide the most appropriate balance between type 1 error rate (false positive), type 2 error rate (false negative), cost, and time. In contrast, estimating sample size via precision involves estimating how many participants would be required for the frequentist confidence interval or Bayesian credible interval resulting from a statistical analysis to be of a certain width. The implication is that a narrower confidence interval or credible interval allows a more precise estimation of where the “true” population parameter (e.g., mean difference) might be.
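To make these inputs concrete, the sketch below shows how such a calculation might be run in R using the built-in power.t.test function (one alternative to G*Power). The effect size, alpha, and power values are purely illustrative assumptions and would need to be justified for any particular study.

# A priori power calculation for an independent-groups design.
# The inputs (d = 0.5, alpha = 0.05, power = 0.80) are illustrative only.
power.t.test(delta = 0.5,            # standardised mean difference to detect
             sd = 1,                 # with sd = 1, delta is interpreted as Cohen's d
             sig.level = 0.05,       # alpha (type 1 error rate)
             power = 0.80,           # 1 - beta (type 2 error rate)
             type = "two.sample",
             alternative = "two.sided")
# Returns the required n per group (approximately 64 in this example).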

To get a sense of the sample sizes and methods used to estimate sample size by studies submitted to the Journal of Sports Sciences, we randomly selected 120 papers submitted over the previous three years. The data were positively skewed, so the median (median absolute deviation) sample size was 19 (11). Of these 120 papers, only 12 included a formal a priori sample size estimation based on power, and 1 estimated sample size using a precision approach. Although the 12 papers that did include an a priori power calculation identified the effect size to be detected, alpha, and power, all of those papers failed to include full information on the statistical test(s) to be conducted to detect the chosen effect size, and 4 failed to include a convincing rationale for why the given effect size was chosen.

In order to understand why this is a problem, we need to examine problems with studies that are not adequately powered to detect what could be considered a meaningful effect. As outlined by Brysbaert (2019) and others (Button et al., 2013; Ioannidis, 2005, 2008; Ioannidis et al., 2011), the problems with underpowered studies are numerous. For example, the type 2 error rate is increased; statistically significant effects, when they are detected, will likely overestimate the population effect size (by a considerable amount); a greater proportion of statistically significant effects will be type 1 errors; statistically significant effects are more likely to have low precision in the population estimate; and underpowered studies are less replicable. In regard to overestimating the population effect size, the Open Science Collaboration (2015) conducted 100 replications of psychology studies using high-powered designs and reported that the mean effect size (r = 0.2; ~d = 0.4) was approximately half the magnitude of that reported in the original studies. Moreover, Fraley and Vazire (2014) reported that the mean sample size used in psychology studies was 104 participants, yet the mean power to detect an effect size of d = ~0.4 (r = ~0.2) was only 50%. If we contrast that with the median sample size of 19 for papers submitted to the Journal of Sports Sciences, it’s quite likely that we have a problem with underpowered studies in sport and exercise science. Although this is a serious problem, and one we’ve heard before (Beck, 2013; Heneghan et al., 2012), there are a number of solutions. Although there are multiple ways of increasing power (Kruschke, 2015), the obvious solution is to substantially increase the sample size of studies in our field. This would certainly increase the power/precision (and quality) of studies and might also reduce the number of papers submitted to academic journals and the pressure on over-stretched reviewers (the Journal of Sports Sciences has experienced a 40% increase in the number of submissions between 2017 and 2019).
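To put the survey findings in context, the achieved power of a typical submission can be approximated in the same way. The sketch below treats the median sample size of 19 as a per-group n in an independent-groups design and assumes the d = ~0.4 effect size discussed above; both are illustrative assumptions rather than properties of any particular study.

# Approximate power of a two-group design with n = 19 per group
# to detect a standardised mean difference of d = 0.4 (alpha = 0.05).
power.t.test(n = 19,
             delta = 0.4,
             sd = 1,
             sig.level = 0.05,
             type = "two.sample",
             alternative = "two.sided")
# Power is roughly 0.22, well below the conventional 0.80 target.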

Although larger sample sizes are encouraged, how sample size is estimated and how data are collected are equally important. If researchers do conduct an a priori sample size estimation, they will most likely do so via a power calculation. Ensuring that studies are adequately powered is important, yet sample size estimation via power analysis serves only one purpose – to estimate the sample size required to reject the null hypothesis if indeed there’s an effect of a given size. However, a power calculation does not identify the minimum sample size that would ensure a precise estimate of the population parameter (Maxwell et al., 2008). To achieve the latter, we need to estimate sample size using precision – sometimes called accuracy in parameter estimation (AIPE) when using a frequentist confidence interval (Kelley et al., 2003; Kelley & Rausch, 2006; Maxwell et al., 2008). In contrast to the traditional sample size estimation based on power, the AIPE approach bases the sample size estimation on what is required to achieve a certain width of confidence interval. The width of the confidence interval narrows as the sample size increases (approximately in proportion to the square root of the sample size), such that to halve the interval the sample size must increase by approximately a factor of four (Cumming & Calin-Jageman, 2017). Consequently, the AIPE approach can sometimes require very large sample sizes to obtain high precision (Kelley & Rausch, 2006). The R package MBESS (Kelley, 2019) can be used to estimate sample size using the AIPE approach, as can ESCI software (Cumming & Calin-Jageman, 2017). Although sample-size calculations are contextual and therefore influenced by the research design, an example using the MBESS ss.aipe.smd function is useful to highlight the approach (a code sketch reproducing these calculations follows below). For a standardised mean difference (Cohen’s d) of 0.4 between two groups, achieving a 95% confidence interval with a width of 0.6 (0.3 either side of the point estimate) would require a sample size of at least 88. Using the median Journal of Sports Sciences sample size of 19 as described earlier, a confidence interval width of 1.3 (0.65 either side of the point estimate) would be achieved. This means that for d = 0.4 the confidence interval would range from −0.25 (small negative effect) to 1.05 (large positive effect), and therefore such an interval is clearly imprecise.

Although some argue for a move from using power to AIPE for sample size estimation (Cumming & Calin-Jageman, 2017; Kelley et al., 2003), the approach still suffers from using a frequentist confidence interval, which is inherently tied to the p value and all of its problems (Cohen, 1994; McShane et al., 2019; Wasserstein & Lazar, 2016). The confidence interval also contains no distributional information, which means that all values within the interval are equally likely (Kruschke & Liddell, 2018). The probability of the true population parameter being within the confidence interval is either 1 or 0, because the chosen probability (e.g., 95%) refers to the long-run process of generating the interval, not the interval itself (Barker & Schofield, 2008; Morey et al., 2016). Some argue that because the confidence interval is a theoretical long-run pre-data procedure with a fixed probability (e.g., 95%), there is no guarantee that a post-data confidence interval will contain the population parameter at all, or have the desired precision (Morey et al., 2016). Moreover, most researchers incorrectly interpret the confidence interval like a Bayesian credible interval (Kruschke & Liddell, 2018), which does contain distributional information and can be used to obtain direct probabilities for the true population parameter (Kruschke, 2013).
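The AIPE example above could be reproduced along the following lines. This is a sketch that assumes the MBESS package is installed; the effect size and target interval width are simply those used in the example, not recommendations.

# install.packages("MBESS")  # if not already installed
library(MBESS)

# Sample size for a 95% CI of total width 0.6 around a
# standardised mean difference (Cohen's d) of 0.4.
ss.aipe.smd(delta = 0.4, conf.level = 0.95, width = 0.6)
# Returns the necessary sample size per group (approximately 88 here).

# For comparison, the CI obtained with groups of 19 each:
ci.smd(smd = 0.4, n.1 = 19, n.2 = 19, conf.level = 0.95)
# The interval spans roughly -0.25 to 1.05, a width of about 1.3.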

Although power analysis and AIPE can be used to estimate sample size, both approaches result in a fixed N. An alternative is to use sequential testing (Kelley et al., 2018; Rouder, 2014). Sequential testing involves collecting data until an a priori stopping rule is satisfied. One possible advantage of sequential designs is that sample sizes might be smaller than fixed-N designs, yet with the same error rates (Lakens, 2014; Schönbrodt et al., 2017). Sequential testing can be incorporated into null hypothesis significance testing (Kelley et al., 2018; Lakens, 2014), although it has been criticised for this use because only a limited number of interim tests can be performed (Schönbrodt et al., 2017; Wagenmakers, 2007), and Kruschke (2013) contends that it will inevitably lead to a 100% false alarm rate (falsely rejecting the null hypothesis). Alternatively, model comparison (hypothesis testing) or parameter estimation using Bayesian methods avoids such criticisms (Rouder, 2014). That is, when computing Bayes factors (Schönbrodt et al., 2017) or estimating the highest density interval (credible interval) of the posterior distribution (parameter estimation), Bayesians are free to monitor the data as often as they wish while the data are being collected (Wagenmakers et al., 2018). For example, to help researchers embrace sequential designs when using Bayes factors, Bayes Factor Design Analysis (BFDA) has recently been developed (Schönbrodt & Wagenmakers, 2018; Stefan et al., 2019). When using a sequential design, BFDA helps researchers determine when data collection should stop once there is strong evidence (as determined by a particular Bayes factor) for either the null hypothesis or the alternative hypothesis. In this scheme, the researcher outlines a priori the Bayes factor at which data collection will end (e.g., BF10 > 10). As the data accumulate, the Bayes factor is continuously monitored, and once it reaches the set threshold, data collection ceases. Researchers can also set a minimum and maximum N and determine the probability of obtaining misleading evidence (false positives/negatives). As an example of how to use BFDA, a web-based Shiny app has been developed to allow calculations for an independent-groups t-test with directional hypotheses to be performed (Stefan et al., 2019).
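The sequential logic described above can be sketched in R with the BayesFactor package (the BFDA software provides a fuller treatment, including design analysis before data collection). Everything in this sketch is an illustrative assumption: the simulated effect of d = 0.4, the stopping threshold of 10, and the minimum and maximum per-group sample sizes were chosen only to show the monitoring loop.

# Sequential monitoring of a Bayes factor with an a priori stopping rule.
# Illustrative choices: stop when BF10 > 10 or BF10 < 1/10, with a minimum
# of 10 and a maximum of 100 participants per group.
library(BayesFactor)

set.seed(1)
min_n <- 10
max_n <- 100
threshold <- 10

group_a <- numeric(0)
group_b <- numeric(0)

for (n in 1:max_n) {
  # In a real study these would be newly collected observations; here they
  # are simulated from populations separated by d = 0.4.
  group_a <- c(group_a, rnorm(1, mean = 0.4, sd = 1))
  group_b <- c(group_b, rnorm(1, mean = 0.0, sd = 1))

  if (n < min_n) next

  bf10 <- extractBF(ttestBF(x = group_a, y = group_b))$bf
  if (bf10 > threshold || bf10 < 1 / threshold) {
    cat("Stopping at n =", n, "per group; BF10 =", round(bf10, 3), "\n")
    break
  }
}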

As suggested by a number of authors (Cumming, 2014; Kruschke & Liddell, 2018), planning a study based on obtaining a given precision in the parameter estimate has some advantages over the use of power. Sequential designs using Bayesian hypothesis testing or parameter estimation offer a number of advantages over frequentist methods (Rouder, 2014; Schönbrodt & Wagenmakers, 2018). Although we’ve heard some of these calls before in sport and exercise science (Barker & Schofield, 2008; Bernards et al., 2017), the software required to conduct Bayesian data analysis has until recently been inaccessible for many or difficult to use. However, we now have access to Bayesian methods through a range of packages in R (R Core Team, 2020) as well as menu-driven software such as JASP (JASP Team, 2020) and SPSS (IBM Corp, 2019).

The Journal of Sports Sciences recommends that submissions of experimental studies include a formal a priori sample size estimation and rationale. As outlined in this editorial, this requirement could be satisfied using a variety of methods, and other methods for power analysis are also available (Kruschke, 2013; Weiss, 1997). Whatever the method chosen, authors should report the full range of information required to enable the sample size estimation and rationale to be examined and checked by editors, reviewers, and ultimately, by readers. This should include any software used, the exact inputs to calculations, a rationale for those inputs, stopping rules, and the statistical tests used to test a hypothesis or estimate a population parameter. As with any other aspect of the method section, readers should be able to replicate your sample size calculations and thereby judge whether your study is adequately powered and/or sufficiently precise to answer the research question(s) posed and support the conclusions reached.

We are all probably guilty of conducting underpowered and imprecise studies, and as such we all have a vested interest in changing the way we plan and conduct research. We hope that our recommendations outlined above will encourage authors to consider more fully the related issues of power, precision and sample size estimation and how they can change their practice to allow more robust outcomes from their research, and ultimately, better science.

Acknowledgments

Valuable comments on the editorial were provided by Dr Tony Myers and Dr Keith Lohse.

Disclosure statement

No potential conflict of interest was reported by the authors.

References
