
Knowing how effective an intervention, treatment, or manipulation is and increasing replication rates: accuracy in parameter estimation as a partial solution to the replication crisis

Pages 59-77 | Received 20 Jun 2017, Accepted 10 Feb 2020, Published online: 07 May 2020

Abstract

Objective

Although basing conclusions on confidence intervals for effect size estimates is preferred over relying on null hypothesis significance testing alone, confidence intervals in psychology are typically very wide. One reason may be a lack of easily applicable methods for planning studies to achieve sufficiently tight confidence intervals. This paper presents tables and freely accessible tools to facilitate planning studies for the desired accuracy in parameter estimation for a common effect size (Cohen’s d). In addition, the importance of such accuracy is demonstrated using data from the Reproducibility Project: Psychology (RPP).

Results

It is shown that the sampling distribution of Cohen’s d is very wide unless sample sizes are considerably larger than what is common in psychology studies. This means that effect size estimates can vary substantially from sample to sample, even with perfect replications. The RPP replications’ confidence intervals for Cohen’s d have widths of around 1 standard deviation (95% confidence interval from 1.05 to 1.39). Therefore, point estimates obtained in replications are likely to vary substantially from the estimates from earlier studies.

Conclusion

The implication is that researchers in psychology, and funders, will have to get used to conducting considerably larger studies if they are to build a strong evidence base.

As Cohen learned and taught, “the primary product of a research inquiry is one or more measures of effect size, not p values,” and, “having found the sample effect size, you can attach a p value to it, but it is far more informative to provide a confidence interval” (1990, p. 1310). Cohen was not alone in this conviction: the case for effect sizes and confidence intervals has been made excellently and extensively (Cohen, 1988, 1992; Cumming & Finch, 2001; Gardner & Altman, 1986; Thompson, 2002), and resulted in the imperative to always report confidence intervals (American Psychological Association, 2008, 2009). For example, the American Psychological Association “stresses that […] reporting elements such as effect sizes, confidence intervals, and extensive description are needed to convey the most complete meaning of the results” (2009, p. 33).

Despite this apparent consensus that psychological science would benefit from consistently computing effect size measures and their corresponding confidence intervals (note that confidence intervals come with problems of their own, which we discuss in the discussion), few tools have been provided to facilitate planning studies for a desired confidence interval width. Power tables (Cohen, 1988, 1992) and software (Champely, 2016; Faul et al., 2007) for determining the required sample size when conducting null hypothesis significance tests (NHSTs) are quite common and well-known. However, if a researcher desires to obtain an association strength estimate (or, ‘effect size’ estimate[1]) with a given accuracy in an experimental setting (e.g. a 95% confidence interval with a maximum width of .2 for a Cohen’s ds[2] value that is estimated to be .4 in the population), then very few tools exist that are accessible to psychological researchers with a modest background in statistics. Maxwell, Kelley, and Rausch (2008) do provide a visualisation of sample size requirements for confidence intervals, but no tables or tools. The very insightful Effect Size Confidence Intervals (ESCI) spreadsheet accompanying Cumming (2014) and Cumming and Calin-Jageman (2016) does provide a dynamic interface allowing sample size computations, and is very helpful in understanding the dynamics at play as well. Other work is based on the sampling distribution of correlation coefficients, such as the script and tables provided by Moinester and Gottfried (Moinester & Gottfried, 2014; based on Bonett & Wright, 2000). Schönbrodt and Perugini introduced the Corridor of Stability to denote correlation estimates that are not only close to the true correlation (i.e. the population value), but which also remain close as data collection progresses (Lakens & Evers, 2014; Schönbrodt & Perugini, 2013).

These latter correlation-based approaches, while similar in their main message, rely on the (sometimes simulated) sampling distribution of Pearson’s r instead of that of Cohen’s ds.[3] This is problematic when planning experiments for two reasons. First, a relatively minor problem is that conversion between r and d is not a straightforward affair: for example, r = .3 converts to ds = 0.63 instead of ds = 0.5, and r = .5 converts to ds = 1.15 instead of ds = 0.8 (Cohen, 1988; McGrath & Meyer, 2006).[4] This means that estimates of required sample sizes for experiments, derived from, for example, moderate (r = .3) or strong (r = .5) correlations, will underestimate the required sample sizes for moderate (ds = 0.5) or large (ds = 0.8) Cohen’s ds values. More generally speaking: using correlation-based approaches may result in underestimates of the required sample sizes if the researcher is unaware of this and does not pay close attention.

Second, a major problem is that, as we shall see further on, whereas the sampling distribution of the correlation coefficient becomes narrower as the population correlation (i.e. the true effect size) approaches −1 or 1, this does not happen with the sampling distribution of Cohen’s ds. In fact, the opposite happens: the sampling distribution of Cohen’s ds becomes slightly wider as the difference between the two means in the population increases. Individual values of d and r can be converted between the two metrics, but entire sampling distributions cannot. The 95% confidence interval for r = .5 is tighter than the 95% confidence interval for r = 0 (with equal sample sizes), but the 95% confidence interval for ds = 0.8 is wider than the 95% confidence interval for ds = 0.

Thus, currently, when planning an experiment to study whether an intervention, treatment, or behaviour change principle (BCP, see Crutzen & Peters, 2018; such as Intervention Mapping’s behaviour change methods, Kok et al., 2016, or a behaviour change technique, BCT; Abraham & Michie, 2008) is effective (and therefore, how effective it is), researchers have limited access to free and easily accessible tools to compute the required sample size. This paucity of tools and the associated neglect of planning for accurate estimation of effects may in part explain why confidence intervals in psychological research are very wide (Brand & Bradley, 2016). In this paper, we provide both power tables and an easy-to-use tool to facilitate planning of studies that aim to draw conclusions about how effective a manipulation or treatment is, with confidence intervals of a priori determined width.

Why confidence intervals are crucial in intervention evaluations

First, we will briefly summarize why confidence intervals are so valuable (although they still have their own interpretational problems; see the discussion). All point estimates computed from sample data, such as estimates of Cohen’s ds or Pearson’s r, vary from sample to sample. Therefore, they are by themselves not informative when the goal is to learn about the population instead of about one random sample. A point estimate’s interpretation requires knowledge about how much it can be expected to vary from sample to sample, in other words, knowledge about its sampling distribution. A Cohen’s ds of 0.5 might mean that in the population, two means differ by half a standard deviation, but if the corresponding sampling distribution is sufficiently wide, population values of 0.1 or 0.9 might also be plausible on the basis of that same dataset. Information about the variance of a statistic’s sampling distribution is most commonly conveyed using confidence intervals.

A confidence interval is an interval that will contain the population value of the corresponding statistic in a given percentage of samples. For example, if in a sample of 100 participants a Cohen’s ds value of 0.5 is found, the corresponding 95% confidence interval is [.10; .90] (see below for an explanation of this computation). This 95% confidence interval will contain the population value of Cohen’s ds in 95% of the samples if the same study were repeated infinitely.[5] This interval is useful in that it allows inferring that, for example, a population value of −2 seems unlikely, and more generally, that any substantial harmful effects of this treatment seem unlikely. Computing an interval with a higher confidence level, such as 99% or 99.99%, makes it possible to make statements about the population with almost complete accuracy and certainty: only one in every 10,000 samples will yield a 99.99% confidence interval that does not contain the population value.

In other words, confidence intervals around effect size measures are valuable instruments when the goal is to establish how strongly two variables are associated in the population, for example when establishing the effectiveness of a treatment, intervention, or the effect of a behaviour change technique or other experimental manipulation. Tight confidence intervals enable more confident statements about association strength (e.g., difference between conditions in an experiment), and therefore, researchers will usually want their confidence intervals to be as tight as possible. A common example is when researchers want to establish the likely population effect size in order to properly power a main study. In such a situation, researchers may conduct a pilot study to establish that effect size, and then power their main study accordingly, which makes acquiring an accurate effect size estimate a central concern.

For example, take the scenario above. Let us assume that a researcher correctly estimates a population value of dpop = 0.5, and that the researcher then conducts a pilot study and coincidentally happens to obtain a sample value of exactly ds = 0.5. If the researcher obtains this value using two groups of 64 participants each, the corresponding confidence interval is [0.15; 0.85]. This means that on the basis of that one dataset, population values for Cohen’s dpop of 0.15 and 0.85 (the lower and upper bounds of the confidence interval) are both equally likely. The high likelihood that the population value of Cohen’s dpop is considerably lower than 0.5, the obtained point estimate, would make it unwise to power for that value of Cohen’s d. Instead, it would make sense to power for, for example, the lower bound of the confidence interval. In this case, if the researcher wanted to use null hypothesis significance testing (NHST) for their main study, this would mean powering that study on Cohen’s d = .15, which would require 2312 participants to obtain a power of 95%, and 1398 participants to obtain a power of 80%. If the researcher had obtained a tighter confidence interval, this lower bound would have had a higher value, reflecting the higher certainty as to the population value of Cohen’s dpop. One could argue that if the researcher had already obtained such an accurate estimation of the association (i.e., with a tight confidence interval), there would no longer be a need for another study that utilised NHST. This is a sensible argument, underlining the importance of basing required sample size estimates on desired confidence interval widths (Kraemer et al., 2006).
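For readers who want to reproduce these numbers, a minimal R sketch follows, using the ufs function introduced later in this paper and the pwr package for NHST power analysis (presumably the software cited above as Champely, 2016); exact output may differ slightly due to rounding.

# Sketch: reproduce the confidence interval and power calculations above.
# 95% confidence interval around a pilot estimate of ds = 0.5 from 2 x 64 participants:
ufs::cohensdCI(d = .5, n = 128);
# yields approximately [0.15; 0.85]
# Powering the main study on the lower bound (d = .15) rather than the point estimate:
pwr::pwr.t.test(d = .15, power = .95, sig.level = .05, type = "two.sample");
# approximately 1156 participants per group (2312 in total)
pwr::pwr.t.test(d = .15, power = .80, sig.level = .05, type = "two.sample");
# approximately 699 participants per group (1398 in total)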

Tight confidence intervals are also valuable in other settings. For example, when conducting null hypothesis significance testing and rejecting a null hypothesis, the researcher concludes that it is likely that the two variables are associated in the population. However, without knowing how strongly the variables are associated, their association might have no practical or clinical significance. In addition, without some measure of association strength, the Numbers Needed to Treat (NNT; for an application to behaviour change, see Gruijters & Peters, 2017) and cost effectiveness cannot be established. The researcher therefore commonly proceeds to compute a measure of association strength such as an effect size measure, but since the obtained point estimate changes value from sample to sample, it cannot inform the researcher of the likely association strength in the population. To conclude anything about the population, confidence intervals are normally computed, and given this intention to learn how strongly variables are associated in a population, researchers usually want these to be sufficiently tight.

When evaluating intervention effectiveness, it is clear that determining effect size is necessary for computing the NNT or cost effectiveness, and this also applies to the study of effectiveness of BCPs. However, in more fundamental (or basic) research, variables’ operationalisations sometimes have no meaningful scales, rendering effect sizes of secondary importance. In such studies, if no meaningful effect size estimate can be computed, the value of tight confidence intervals around the available effect size measures may be less clear at first glance. However, as will become clear, to yield results that are likely to replicate, such studies, too, require tight confidence intervals (or more accurately, narrow sampling distributions).

Thus, when planning a study, it can often be unwise if researchers limit themselves to computation of the sample size required to reject the null hypothesis in a given proportion of samples and given a specified expected association strength in the population (i.e. NHST-based power analysis). To be able to make useful statements about the likely strength of the studied association, researchers should also plan for confidence intervals of a given width. Although this point has been made repeatedly (e.g. Cumming, 2014; Maxwell et al., 2008), the width of confidence intervals in current psychological research (Brand & Bradley, 2016) implies that it has not yet been widely implemented. The tool we will now present is designed to facilitate this implementation in a situation where two groups of participants are compared.

How to compute the required sample size

We implemented this tool in the open source package ufs (Peters, 2019) for the open source statistical package R (R Core Team, 2018), which is often used in conjunction with the graphical user interface provided by the open source software RStudio (RStudio Team, 2019).[6] We also implemented it in the ufs module for the open source application jamovi (jamovi project, 2019). We will first introduce the R package. To install this package, the following command can be used in an R analysis script or entered in the R console:

install.packages("ufs", repos="https://cran.rstudio.com/");

This command only needs to be run once: the package will remain installed. After installing the package, the following command can be used to request sample sizes:

ufs::pwr.cohensdCI(.5, w=.1);

The command above requests the sample size required to obtain a confidence interval with a margin of error (‘half-width’, argument ‘w’) of 0.1, assuming Cohen’s dpop has a value of 0.5 in the population (the first argument, which can optionally be named ‘d’), therefore specifying a desired confidence interval with a total width of 0.2, from 0.4 to 0.6. This function will return the required total sample size. If a user wishes to receive more extensive results, the argument ‘extensive = TRUE’ can be used to also return the requested and obtained lower and upper bounds of the confidence interval, and the desired confidence level can be specified using argument ‘conf.level’ (the default confidence level of 95% is used when nothing is specified):

ufs::pwr.cohensdCI(.5, w=.1, extensive = TRUE);
ufs::pwr.cohensdCI(.5, w=.1, conf.level=.99);

Under the hood, this function uses an iterative procedure where the sample size is increased from 4 in steps of 100, then 10, then 1, to find the smallest sample size that yields a confidence interval with the desired width (or rather, tightness). To find the confidence interval of Cohen’s ds, an approximation of the quantile function of the distribution of Cohen’s d is used. This approximation is achieved by converting the Cohen’s d value to a Student’s t value and then using the quantile function of Student’s t to obtain the relevant t values, which are then converted back to Cohen’s d (using “MBESS::conf.limits.nct” from the “MBESS” package; Kelley, 2018; also see Kelley & Pornprasertmanit, 2016). This function to compute the confidence interval for a given confidence level, value of Cohen’s ds, and sample size is also available in the ufs package:

ufs::cohensdCI(.5, 128, .95);

The first argument specifies the point estimate of Cohen’s ds, the second argument the sample size, and the third argument the desired confidence level (these can also be named using respectively ‘d’, ‘n’, and ‘conf.level’, the last of which has a default value of .95 and therefore can be omitted). This function returns the confidence interval.
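To make the procedure described above more tangible, the following sketch computes the same confidence interval directly from the noncentral t distribution using MBESS::conf.limits.nct. This is a simplified illustration of the approach, not the exact internals of the ufs functions.

# Sketch: 95% confidence interval for Cohen's ds via the noncentral t distribution.
d <- .5;                                  # observed Cohen's ds
n1 <- 64; n2 <- 64;                       # group sizes
multiplier <- sqrt(n1 * n2 / (n1 + n2));  # converts d to t and back
tValue <- d * multiplier;
degreesOfFreedom <- n1 + n2 - 2;
limits <- MBESS::conf.limits.nct(ncp = tValue, df = degreesOfFreedom, conf.level = .95);
c(lower = limits$Lower.Limit / multiplier, upper = limits$Upper.Limit / multiplier);
# should approximately match ufs::cohensdCI(.5, 128, .95), i.e. roughly [0.15; 0.85]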

The ufs jamovi module is an interface to these same R functions. The module can be installed from the jamovi library, and will add a ufs menu. This menu contains the analyses “Sample Size for Accuracy: Cohen’s d” (an interface to ufs::pwr.cohensdCI()) and “Effect Size Confidence Interval: Cohen’s d” (an interface to ufs::cohensdCI()). Selecting one of these analyses will open a dialog where the arguments can be specified, after which the required sample size or resulting confidence interval is computed.

We have used the functions from the ufs R package to produce Table 1 (required sample sizes for 95% confidence intervals) and Table 2 (required sample sizes for 99% confidence intervals). Both tables show the required sample size for desired total confidence interval widths varying from a tenth of a standard deviation to an entire standard deviation, for population values of Cohen’s dpop of 0.2, 0.5, and 0.8 (the tentative qualitative labels denoting small, moderate, and strong associations). Two important implications follow from these tables.

Table 1. The required sample sizes for obtaining Cohen’s ds 95% confidence intervals of the desired width.

Table 2. The required sample sizes for obtaining Cohen’s ds 99% confidence intervals of the desired width.
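As an aside, such tables can be regenerated with a few lines of R using the function described above; note that the ‘w’ argument is the half-width, so desired total widths are divided by two. The following is a sketch:

# Sketch: required total sample sizes for 95% confidence intervals of various
# total widths, for population d values of 0.2, 0.5, and 0.8 (cf. Table 1).
totalWidths <- seq(.1, 1, by = .1);
dValues <- c(.2, .5, .8);
requiredNs <- sapply(dValues, function(d)
  sapply(totalWidths, function(w) ufs::pwr.cohensdCI(d, w = w / 2)));
rownames(requiredNs) <- paste0("width=", totalWidths);
colnames(requiredNs) <- paste0("d=", dValues);
requiredNs;
# For 99% intervals (cf. Table 2), add conf.level = .99 to the pwr.cohensdCI() call.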

First, it is important to realise that, because Cohen’s d and Student’s t values are not bounded (unlike Pearson’s r values), the association strength does not matter much for the required sample size when planning for a Cohen’s d confidence interval. Higher expected population values of Cohen’s d require slightly larger samples. This is the opposite of the dynamics of correlation coefficient estimation, where stronger associations require smaller samples (Moinester & Gottfried, 2014). The effect is also much smaller than when estimating correlation coefficients: when estimating Cohen’s d values, differences in the expected population value have much less influence on the required sample size.

Second, the required sample sizes for somewhat precise estimation of effect sizes are considerably larger than those required to reject the null hypothesis assuming a given effect size: for comparison, the required sample sizes for detecting small, moderate, and strong associations with 80%, 90%, and 95% power are shown in Table 3. Although 128 participants suffice to detect a moderate effect with 80% power, the corresponding 95% confidence interval for ds would run from 0.15 to 0.85: a total width of over half a standard deviation. In other words, on the basis of this dataset, it is not possible to say whether the effect would be trivial or large. And this is the situation when the point estimate represents a moderate effect: if instead a small effect is found, even with 128 participants, the 95% confidence interval for ds runs from −0.15 to 0.55, so it would not be possible to say whether the effect is absent or of moderate strength in the population. To help get a firmer grasp of these dynamics, it can be useful to visualise the sampling distribution of Cohen’s ds. To do this, the argument ‘plot = TRUE’ can be used when calling the ufs::cohensdCI() function:

Table 3. The required sample sizes for obtaining 80%, 90%, and 95% power.

ufs::cohensdCI(d=.5, n = 128, plot = TRUE);

This function shows the sampling distribution of Cohen’s d for a given sample size, assuming that the specified value of Cohen’s d is the population value, and with the confidence interval shown. This sampling distribution is the distribution from which one value of Cohen’s ds is randomly drawn when a study is conducted with the specified sample size, assuming that the association (or effect) in the population has the magnitude of the specified value of Cohen’s dpop. Visualising this is useful when planning studies or interpreting results, as it helps to get a feel for the variation that can be expected in the obtained effect sizes. In this case, Figure 1 clearly illustrates the unpredictability of the obtained effect size estimate when a study is conducted with 128 participants in a situation where the population effect is Cohen’s dpop = 0.5.

Figure 1. The sampling distribution of Cohen’s d for a population effect size of dpop = 0.5 and a total sample size of 128 participants. Any Cohen’s ds point estimate that is obtained in a study of 128 participants is drawn at random from this distribution, assuming that in the population, Cohen’s dpop is indeed 0.5. The 95% confidence interval that a researcher would compute based on a sample estimate of ds = 0.5 is shown in blue [.15;.85].
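The same picture emerges from a brute-force simulation. The sketch below (assuming normally distributed scores and equal group sizes) draws many samples from a population in which dpop = 0.5 and computes Cohen’s ds in each, showing how widely the estimates scatter with 128 participants.

# Sketch: simulate the sampling distribution of Cohen's ds for dpop = 0.5, n = 128.
set.seed(42);
nPerGroup <- 64;
dPop <- 0.5;
simulatedD <- replicate(10000, {
  control <- rnorm(nPerGroup, mean = 0, sd = 1);
  treatment <- rnorm(nPerGroup, mean = dPop, sd = 1);
  pooledSD <- sqrt((var(control) + var(treatment)) / 2);
  (mean(treatment) - mean(control)) / pooledSD;
});
quantile(simulatedD, c(.025, .5, .975));
# The 2.5th and 97.5th percentiles land close to 0.15 and 0.85, mirroring Figure 1.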


A sampling distribution-based perspective on the replication crisis

These wide sampling distributions mean that even when researchers power their studies quite highly for a given effect size for the purpose of null hypothesis significance testing, the obtained effect size estimates will still vary erratically from study to study. When studies are powered less strongly, and unfortunately most studies in psychology remain embarrassingly underpowered for all but the largest effect sizes (Bakker et al., 2012), observed effect sizes can vary even more. From this point of view, the somewhat depressing results of large-scale replication studies (e.g. Open Science Collaboration, 2015) seem to make sense. Most replicated studies (the original studies, that is) were extremely underpowered, even from an NHST point of view. This means that regardless of whether those original studies were correct regarding whether the hypothesized associations exist, the obtained effect sizes were pretty much random. To explore this, we selected all original studies with two-cell designs included in the Reproducibility Project: Psychology (Open Science Collaboration, 2015), available at https://osf.io/fgjvw. We extracted the sample sizes and effect size estimates from these studies and used these to construct the 95% confidence intervals.[7] We repeated this using the sample sizes and effect sizes found in each study’s replication (valid statistics were extracted for 70 effect size estimates). Both sets of confidence intervals are shown in diamond plots (Peters, 2017) in Figure 2. From both sets of confidence intervals, we extracted the widths, which are shown in Figure 3.
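The sketch below illustrates the kind of computation involved. It assumes a data frame with hypothetical columns ‘d’ (a replication’s Cohen’s d estimate) and ‘n’ (its total sample size); the actual file at https://osf.io/fgjvw uses its own variable names and requires some preprocessing.

# Sketch: confidence intervals and their widths for a set of effect size estimates.
# 'rppReplications' is a hypothetical data frame with columns 'd' and 'n'.
ciBounds <- t(mapply(function(d, n) ufs::cohensdCI(d = d, n = n),
                     rppReplications$d, rppReplications$n));
ciWidths <- ciBounds[, 2] - ciBounds[, 1];
median(ciWidths);             # reported as 1.00 for the replications
t.test(ciWidths)$conf.int;    # a 95% confidence interval for the mean width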

Figure 2. Confidence intervals for the original studies and replications in the Reproducibility Project: Psychology. Note the scale of the x axes.


Figure 3. Confidence interval widths for the original studies and replications in the Reproducibility Project: Psychology.


These figures confirm the expectations formulated above. The original studies did not allow accurate effect size estimates: in almost all cases, the effect size estimates’ sampling distributions were very wide. Even the replications do not allow accurate effect size estimates, again with confidence interval widths around 1 (the 95% confidence interval for the widths runs from 1.05 to 1.39 with a median width of 1.00). This means that the effect size estimates of these replications, too, were drawn from very wide sampling distributions. Therefore, replicating these replications with similar power (or more accurately, similar sample sizes) can yield very different effect size estimates. Note that although we used data from the Reproducibility Project: Psychology as an example, these wide sampling distributions can help explain results of other replication projects (e.g. Hagger et al., 2016) as well.

For determining how strongly two variables are associated, sample size estimates from power analyses based on null hypothesis significance testing are simply not good enough. This same logic holds for original research, of course. If researchers want their study to have a high likelihood of being replicable, the effect size they find should be drawn from a sufficiently narrow sampling distribution, or, in other words, the confidence intervals for their effect size estimates should be sufficiently tight. Tables 1 and 2 can help researchers to determine the sample size required for simple designs. For more complicated designs, where multivariate associations are estimated and sampling distributions are therefore conditional upon covariance between variables, each study may require a dedicated simulation to determine the required sample size. Researchers who do not have access to such simulations can at least use Tables 1 and 2 to get some sense of the order of magnitude of the sample sizes one should consider.
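Such a dedicated simulation can follow the same logic as the tables: for each candidate sample size, generate many datasets under plausible population values, fit the intended model, and record the width of the confidence interval of interest. The sketch below does this for a hypothetical two-group design with one covariate; all population values are illustrative assumptions.

# Sketch: simulation-based sample size planning for a two-group design with a
# covariate, using the width of the group coefficient's 95% confidence interval
# as the criterion. The effect of 0.5 and covariate weight of 0.3 are assumptions.
set.seed(42);
simulateWidths <- function(nPerGroup, nSim = 500) {
  replicate(nSim, {
    group <- rep(0:1, each = nPerGroup);
    covariate <- rnorm(2 * nPerGroup);
    outcome <- 0.5 * group + 0.3 * covariate + rnorm(2 * nPerGroup);
    ci <- confint(lm(outcome ~ group + covariate))["group", ];
    ci[2] - ci[1];
  });
}
candidateNs <- c(100, 200, 400, 800);    # per group
medianWidths <- sapply(candidateNs, function(n) median(simulateWidths(n)));
data.frame(nPerGroup = candidateNs, medianCIwidth = medianWidths);
# Choose the smallest n with an acceptable width; for more assurance, use a
# higher quantile of the simulated widths instead of the median.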

Discussion

To determine whether an intervention is effective is to determine how effective an intervention is. After all, the knowledge that an effect is unlikely to be zero in a population has little value if that non-zero value might still represent a trivial effect. Determining whether an intervention is worthwhile requires establishing cost effectiveness, and such calculations require accurate effect size estimates. Similarly, when studying behaviour change principles (BCPs, e.g., BCTs or Intervention Mapping’s methods for behaviour change), the goal is to establish how effective a given method can be for changing a given determinant (Kok et al., 2016; Peters et al., 2015). This information is required during intervention development to decide which behaviour change methods to select for targeting the relevant determinants (Kok, 2014). Therefore, health psychologists who evaluate intervention effectiveness or conduct experiments to examine the effectiveness of behaviour change principles may find Tables 1 and 2, as well as the function ufs::pwr.cohensdCI(), useful. When these tables and this function are used to plan studies, the resulting body of evidence will be more likely to replicate. An added advantage is that such studies will do well on indices such as the replicability index, because the median power against all but the smallest effect sizes will be very high (Schimmack, 2016). Note, though, that replication depends on many other factors than sample size alone (Amrhein et al., 2019).

Shifting attention from null hypothesis significance tests to the accuracy of parameter estimates comes with a more acute awareness of the fact that any effect size estimate (in fact, anything computed from sample data) is randomly drawn from the corresponding sampling distribution. In every replication these estimates will take on different values, and the width of the sampling distribution determines how far these values can lie apart. Therefore, learning how effective an intervention is (and therefore, whether it is effective), or learning how effective a BCP is (and therefore, whether it is effective), or more generally, learning whether (and therefore, how strongly) two variables are associated, requires narrow sampling distributions of the effect size estimate. Achieving sufficiently narrow sampling distributions, and therefore, tight confidence intervals and accurate parameter estimates, requires much larger sample sizes than are commonly seen in the literature. When surveying the literature, it would be easy to get the impression that experiments to assess the effectiveness of BCPs such as goal setting, implementation intentions, or fear appeals would require only a few dozen, or perhaps a few hundred, participants. As the examples in this paper show, this is not true. Whereas for correlation coefficients, strong population effects mean that smaller samples suffice to achieve accurate estimates (Bonett & Wright, 2000; Moinester & Gottfried, 2014), for Cohen’s ds (i.e., the effect size measure used for comparing two groups) the sampling distribution even becomes slightly wider as the population effect increases.

This shift from NHST to sampling distribution-based thinking (and the accompanying acute awareness of the instability of each sample point estimate) does not mean that estimation of association strengths should become the sole focus of health psychology research. It is important that parameter estimation is used in parallel with hypothesis testing. Not necessarily tests of the null hypothesis, nor significance tests using p values, let alone NHST (Cumming, 2014; Morey et al., 2014); but tests of theoretical hypotheses nonetheless (Morey et al., 2014). Sampling distribution-based thinking can facilitate formulating conditions for theory confirmation or refutation. For example, one can establish in advance which values should lie in the confidence interval and which should lie outside it, by committing oneself to considering a theory refuted by a dataset if a 99% confidence interval with a half-width of d = .1 includes values of .1 or lower. Such a scenario might be reasonable when testing the theoretical hypothesis that, for example, making coping plans has an effect on binge drinking. Note that setting these decision criteria (which confidence level to use, and how tight one wants the confidence interval to be) is inevitably subjective to a degree. The practice of full disclosure enables other researchers to apply different criteria to the same dataset (Peters et al., 2012).
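As an illustration, the sketch below plans for such a decision rule and then applies it, using the functions introduced earlier; the anticipated population value of d = 0.3 and the observed result are arbitrary assumptions for the example.

# Sketch: plan for a 99% confidence interval with a half-width of 0.1, then apply
# the rule "consider the theory refuted if the interval includes values of .1 or lower".
requiredN <- ufs::pwr.cohensdCI(d = .3, w = .1, conf.level = .99);
requiredN;
# After data collection, with a (hypothetical) observed d and the realised sample size:
observedD <- .25;
ci <- ufs::cohensdCI(d = observedD, n = requiredN, conf.level = .99);
theoryRefuted <- ci[1] <= .1;   # is the lower bound at or below .1?
theoryRefuted;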

Note that the results of equivalence tests, another approach to refuting theory (Lakens, 2017), can also vary wildly with small sample sizes. When using the Two One-Sided Tests (TOST) procedure, at an alpha of .05, one only requires 69 participants to have 80% power to reject associations stronger (or effects larger) than d = 0.5 (Lakens, 2017, Table 1). However, with 69 participants, the sampling distribution of Cohen’s d, from which the point estimate of Cohen’s ds obtained in any given study is drawn, is still very wide. If the population value of Cohen’s dpop is 0, the 95% confidence interval runs from −0.47 to 0.47; if the population d = 0.6, from 0.12 to 1.08. Even in this last scenario, obtaining a low point estimate of Cohen’s d in any given study is still quite plausible, because the low sample size means that the point estimate is drawn from a very wide sampling distribution. TOST lacks consistent replicability as much as NHST does, unless the study is very highly powered (e.g. a power of 99% to detect (NHST) or reject (TOST) a small effect of d = 0.2).
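The two intervals mentioned here (for 69 participants in total) can be reproduced with the function introduced earlier:

ufs::cohensdCI(d = 0, n = 69);    # roughly [-0.47; 0.47]
ufs::cohensdCI(d = .6, n = 69);   # roughly [0.12; 1.08]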

Limitations

In its current version, the sample size planning function does not yet account for the desired assurance level. Assurance is a parameter that allows one to take into account the variation in sample variance from study to study. Like the mean, the variance is a random variable, drawn at random from its sampling distribution. This variance (or rather, its square root, the standard deviation) is used to compute the standard error of the mean’s sampling distribution. Therefore, a researcher could be lucky and happen to obtain a relatively low variance estimate (and a tight confidence interval), or unlucky and obtain a relatively high variance estimate (and a wide confidence interval). Specifying the desired assurance allows a researcher to estimate the sample size required to obtain a confidence interval of the specified width in a given proportion of studies. ESCI does allow the user to specify the desired assurance level (Cumming, 2014; Cumming & Calin-Jageman, 2016). It can therefore be useful to use both tools in parallel; the presently introduced R functions can be used to efficiently and reproducibly obtain a range of estimates for different scenarios, and once one or several scenarios have been selected, the estimates can be fine-tuned using ESCI. Note that taking the standard deviation’s sampling distribution into account by also parametrizing assurance leads to even higher sample size estimates. Therefore, the main message of the current paper, that NHST power analyses often substantially underestimate the sample sizes required to obtain replicable results, remains the same.
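Although the function itself does not yet take assurance into account, the concept can be explored by simulation: plan a sample size for a target width, then check in how many simulated studies the realised confidence interval is actually that narrow. The sketch below does so under illustrative assumptions (dpop = 0.5, normally distributed scores, equal groups).

# Sketch: how often does a study planned for a 95% CI with a half-width of 0.2
# actually achieve that total width of 0.4?
set.seed(42);
plannedN <- ufs::pwr.cohensdCI(d = .5, w = .2);   # planned total sample size
nPerGroup <- ceiling(plannedN / 2);
achievedWidths <- replicate(1000, {
  control <- rnorm(nPerGroup);
  treatment <- rnorm(nPerGroup, mean = .5);
  observedD <- (mean(treatment) - mean(control)) /
    sqrt((var(control) + var(treatment)) / 2);
  ci <- ufs::cohensdCI(d = observedD, n = 2 * nPerGroup);
  ci[2] - ci[1];
});
mean(achievedWidths <= .4);   # proportion of simulated studies achieving the planned width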

NHST, problems with confidence intervals, and Bayesian statistics

One of the reasons that use of NHST is discouraged in many situations is the widespread misinterpretation of p values (Amrhein et al., 2019; Wasserstein & Lazar, 2016; also see this special issue: Wasserstein et al., 2019). For example, Greenland et al. (2016) list 18 common misinterpretations of p values. This problem, however, is not entirely solved (though perhaps alleviated a bit) by using confidence intervals: the same authors list five common misinterpretations of confidence intervals (also see Morey et al., 2016). The first of these is the false belief that “[t]he specific 95% confidence interval presented by a study has a 95% chance of containing the true effect size.” This interpretation of confidence intervals is simultaneously widespread, intuitive, and wrong; and worse, frequentist methods cannot provide any such estimates. Instead, they yield an interval that, if the same study were repeated infinitely, will contain the population value in a given proportion of those studies. It would be much more informative to have an interval that, with a given probability, contains the population value.

Bayesian methods can provide such an interval, which is called the credible interval. Researchers trained in Bayesian methods, therefore, are inclined to compute credible intervals instead of confidence intervals. Unfortunately, many researchers are unfamiliar with Bayesian statistics (despite the increasing popularity of user-friendly and freely available tools such as JASP; JASP Team, 2018). Fortunately, however, in situations where no informative prior is available, Bayesian methods and frequentist procedures tend to result in similar intervals (Albers et al., 2018; for another interesting exercise in comparing different statistical approaches, see Dongen et al., 2019). Because researchers who evaluate behaviour change interventions usually evaluate newly developed, complex interventions, informative priors will rarely be available. Therefore, in such situations, thinking of a 95% confidence interval as an interval that has a 95% probability of containing the population effect size is hardly problematic in a practical sense (despite remaining formally incorrect, of course). Furthermore, even if confidence intervals are poorly understood, the shift towards sampling distribution-based thinking (i.e. a more acute awareness that all point estimates ‘dance around’ and as such, are often uninformative) that accompanies habitual use of confidence intervals remains valuable, also as a partial solution to the replication crisis.

Besides credible intervals, other approaches exist that can aid in sample size computations for accurate estimation. For example, the closeness procedure recommended by Trafimow et al. (2018) lets researchers compute the sample size required to obtain an estimate that, with a given confidence level, deviates no more than a given desired closeness from the population value. This approach is based on separate estimation of the means, as opposed to estimation of the difference between means. The variance of the sampling distribution of the difference (i.e. Cohen’s d) is larger than that of the sampling distribution of each mean, so which approach fits better depends on the researcher’s scenario (for details, see Trafimow, 2018). Note that when precise and replicable estimates are required, the methods yield similarly high sample size estimates (see Table 1 in Trafimow, 2018).
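As a rough illustration of the flavour of such precision-based planning (and explicitly not Trafimow et al.'s exact procedure), the textbook normal-approximation sample size for estimating a single mean to within a fraction f of a standard deviation with confidence c is (z/f) squared:

# Sketch: approximate n needed for the sample mean to fall within f standard
# deviations of the population mean with confidence c (normal approximation only;
# see Trafimow, 2018, for the actual procedure).
closenessN <- function(f, c = .95) {
  z <- qnorm(1 - (1 - c) / 2);
  ceiling((z / f)^2);
}
closenessN(f = .1);            # within a tenth of a standard deviation, 95% confidence
closenessN(f = .1, c = .99);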

A related consideration is that, as Trafimow et al. (2018) argue, conclusions are ideally never based on single studies. Yet health psychologists often do applied research, working with politicians, policymakers, and stakeholder organisations, who often prefer (and sometimes demand) answers based on single studies, as opposed to waiting years or decades. Similarly, if funders fund the development of a behaviour change intervention, they commonly fund one evaluation study, not several. Both unfortunate facts of life often necessitate designing studies that allow conclusions that are as accurate as possible based on that single study.

Conclusion and implications

We come to the somewhat depressing conclusion that the sample sizes that are apparently the norm in experimental research grossly underestimate what is required if the goal is to achieve replicable results. Replicable results require tight confidence intervals, because tight confidence intervals mean that the sampling distribution of the effect size is narrow, which means that in replications, effect size estimates of similar magnitude will be obtained. Conversely, if confidence intervals are wide, this means the sampling distribution of the effect size is wide, which means replications can obtain very different effect size point estimates. In such scenarios, significant results can easily disappear in a replication, and appear again in a third study. This has two implications.

First, funders should become aware of these dynamics, and cease funding small-scale studies. Conducting a psychological study will require more resources than funders (and researchers) are used to. On the other hand, there is no reason why conducting psychological studies should intrinsically be so much cheaper or quicker than research in other fields. Creating the conditions necessary for studying the subject matter of a field has costs. In some fields, researchers require clean rooms (e.g. Peters & Tichem, 2016) or magnetic resonance imaging equipment (and also many participants; Szucs & Ioannidis, 2017). In psychology, the sampling and measurement error we have to deal with mean that, to obtain sufficiently narrow sampling distributions for the associations we study, we require many measurements. Of course, ‘many measurements’ need not necessarily mean ‘many participants’, and in fact, intensive longitudinal methods are likely an even better solution when testing theories that make predictions about processes that occur within, rather than between, persons (see Inauen et al., 2016 for an excellent example; and see Naughton & Johnston, 2014 for an accessible introduction and tutorial on n-of-1 designs). Regardless of whether measurements within or between participants are increased, funders will have to get used to considerably higher costs in terms of the time and funds required for one study.

Second, authors, reviewers, editors, publishers, and universities issuing press releases should be very tentative when drawing conclusions based on wide confidence intervals. Odds are, these conclusions will fail to replicate. Also, ethical review committees and institutional review boards should take this into account, to make sure that the scarce (often public) resources that are invested in research are not wasted on studies with sample sizes so low that the results are unlikely to replicate. In fact, for authors, reviewers, editors, and publishers, these considerations are not only methodological, but ethical as well (Crutzen & Peters, 2017). Universities and publishers have a responsibility to critically assess their press releases, and there is a point where an overenthusiastic press release becomes the spreading of misinformation through a failure to properly scrutinise the evidence. It would be useful to start formulating rules of thumb as to how many datapoints are required before it is possible to have enough confidence in study outcomes to warrant a press release. Based on the present paper, one could, for example, argue that it is perhaps not justifiable to publish a press release about samples with only a hundred datapoints (especially if the study was an experiment; Peters & Gruijters, 2017).

On the bright side, once one has resigned oneself to this unpleasant truth, a brilliant future may rise from the ashes. When using sufficiently large samples, very accurate statements can be made with high confidence. For example, with 750 participants (375 in each group), a 95% confidence interval for ds = 0.2 has a total width of .29, which allows one to draw conclusions with relative certainty: a moderate effect is unlikely to attenuate to a weak effect in a replication. These large sample sizes come with a bonus: even the 99% confidence interval has a total width of only .38. In a study with 1500 participants (750 in each group), a 95% confidence interval has a width of only .2, and at that sample size, a 99% confidence interval still has a total width of only .27, about a third of a standard deviation. This means that the population effect size will lie outside the confidence interval in only one out of a hundred studies; in other words, for any given study, the likelihood that the population effect size will be captured in the confidence interval is very large. Another advantage is that in experiments with such high sample sizes, randomization is very likely to succeed, whereas with a few hundred participants, randomization may still plausibly result in non-equivalent groups with respect to a relevant moderator (Peters & Gruijters, 2017). Perhaps this is why it was rumoured that Cohen’s “idea of the perfect study is one with 10,000 cases and no variables” (Cohen, 1990, p. 1305).
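The interval widths quoted in this paragraph can be verified with the function introduced earlier:

ufs::cohensdCI(d = .2, n = 750);                     # 95% CI, total width of about .29
ufs::cohensdCI(d = .2, n = 750, conf.level = .99);   # total width of about .38
ufs::cohensdCI(d = .2, n = 1500);                    # total width of about .2
ufs::cohensdCI(d = .2, n = 1500, conf.level = .99);  # total width of about .27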

Combined with full disclosure of materials and data (Crutzen et al., 2012; Peters et al., 2012) and complete transparency regarding the research proceedings (Peters et al., 2017), conducting studies with sufficiently large sample sizes to enable accurate parameter estimation enables building a solid basis of empirical evidence. If this is combined with careful testing, development (Earp & Trafimow, 2015), and application (Peters & Crutzen, 2017) of theory, this can yield a theory and evidence base that can then confidently be used in the development of behaviour change interventions, eventually contributing to improvements in health and well-being. Conducting studies with sufficiently large sample sizes is as close to a guarantee of replication as one is likely to come. This is an important message to funders as well: if the goal is to build a strong, replicable evidence base in psychology, it is necessary to fund studies with sample sizes that are considerably larger than what was funded in the past. However, although the price is high (literally), the promised rewards are plentiful.

Acknowledgements

We would like to thank Robert Calin-Jageman and Geoff Cumming for constructive corrections on the preprint of this paper, Guy Prochilo for pointing out an inconsistency in the algorithms and constructive comments, and the editor Rob Ruiter and reviewers Rink Hoekstra and David Trafimow for constructive comments during the peer review process.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 Statistically, all effects are simply associations: whether an association involves variables that are manipulated or only measured is theoretically crucial but statistically irrelevant.

2 Following Lakens (2013), we use the s subscript to unequivocally refer to the between-samples Cohen’s d; note that Goulet-Pelletier and Cousineau (2018) use dp for this same form of Cohen’s d.

3 Note that the exact distribution of Pearson’s r is available in the R package SuppDists (Wheeler, 2016).

4 A number of free and easy-to-use tools exist that can help get a handle on how different values of r and d convert to each other. One is the FromR2D2 spreadsheet by Daniel Lakens, hosted at the Open Science Framework at https://osf.io/ixgcd. In addition, a family of conversion functions is available in the R package userfriendlyscience, such as convert.r.to.d and convert.d.to.r.

5 Note that whether this single confidence interval of [.10; .90] is among that 95% is not known: knowing this would require knowing the population value, knowledge of which would make collecting a sample redundant in the first place.

6 The analysis script and produced files are all available at the Open Science Framework at https://osf.io/5ejd8.

7 For some of these studies, no effect size estimate was available. For these studies, we constructed the confidence interval around zero to obtain the narrowest possible (i.e. most optimistic) confidence intervals.

References