Editorial

Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors


Null hypothesis statistical tests are often conducted in healthcare research [Citation1], including in the physiotherapy field [Citation2]. Despite their widespread use, null hypothesis statistical tests have important limitations. This co-published editorial explains statistical inference using null hypothesis statistical tests and the problems inherent to this approach; examines an alternative approach for statistical inference (known as estimation); and encourages readers of physiotherapy research to become familiar with estimation methods and how the results are interpreted. It also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.

What is statistical inference?

Statistical inference is the process of making inferences about populations using data from samples [Citation1]. Imagine, for example, that some researchers want to investigate something (perhaps the effect of an intervention, the prevalence of comorbidity or the usefulness of a prognostic model) in people after stroke. It is not feasible for the researchers to test all stroke survivors in the world; instead, they can only recruit a sample of stroke survivors and conduct their study with that sample. Typically, such a sample makes up a minuscule fraction of the population, so the result from the sample is likely to differ from the result in the population [Citation3]. Researchers must therefore use their statistical analysis of the data from the sample to infer what the result is likely to be in the population.
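
To illustrate, consider a minimal simulation sketch in Python (the population, outcome scores and sample sizes are invented for illustration, not drawn from any real study). An entire population of outcome scores is generated, several small studies each draw their own sample, and the sample means scatter around the population mean; this sampling variability is why inference from a sample is needed at all.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Hypothetical 'population': an outcome score for one million people
    population = rng.normal(loc=50, scale=10, size=1_000_000)
    print(f"Population mean: {population.mean():.2f}")

    # Each study can measure only a small sample, so sample means vary by chance
    for study in range(5):
        sample = rng.choice(population, size=30, replace=False)
        print(f"Study {study + 1}: sample mean = {sample.mean():.2f}")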

What are null hypothesis statistical tests?

Traditionally, statistical inference has relied on null hypothesis statistical tests. Such tests involve positing a null hypothesis (e.g. that there is no effect of an intervention on an outcome, that there is no effect of exposure on risk or that there is no relationship between two variables). Such tests also involve calculating a p-value, which quantifies the probability (if the study were to be repeated many times) of observing an effect or relationship at least as large as the one that was observed in the study sample, if the null hypothesis is true. Note that the null hypothesis refers to the population, not the study sample.
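
To make the definition concrete, the following minimal sketch (Python, with invented outcome scores; the group labels and values are hypothetical) computes a p-value for a two-group comparison under the null hypothesis of no difference between the population means:

    from scipy import stats

    # Invented outcome scores for two small groups
    treatment = [52, 48, 57, 60, 49, 55, 61, 50]
    control = [45, 50, 42, 47, 49, 44, 51, 46]

    # Two-sample t-test: the p-value is the probability of observing a
    # difference at least this large (in imagined repeats of the study)
    # if the null hypothesis is true in the population
    result = stats.ttest_ind(treatment, control)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")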

Because the reasoning behind these tests is linked to the imagined repetition of the study, they are said to be conducted within a ‘frequentist’ framework. In this framework, the focus is on how much a statistical result (e.g. a mean difference, a proportion or a correlation) would vary across the repeats of the study. If the data obtained from the study sample indicate that the result would be similar across the imagined repeats of the study, this is interpreted as an indication that the result is in some way more credible.

One type of null hypothesis statistical test is significance testing, developed by Fisher [Citation4–6]. In significance testing, if a result at least as large as the result observed in the study would be unlikely to occur in the imagined repeats of the study if the null hypothesis is true (as reflected by p < 0.05), then this is interpreted as evidence that the null hypothesis is false. Another type of null hypothesis statistical test is hypothesis testing, developed by Neyman and Pearson [Citation4–6]. Here, two hypotheses are posited: the null hypothesis (i.e. that there is no difference in the population) and the alternative hypothesis (i.e. that there is a difference in the population). The p-value tells the researchers which hypothesis to accept: if p ≥ 0.05, retain the null hypothesis; if p < 0.05, reject the null hypothesis and accept the alternative. Although these two approaches are mathematically similar, they differ substantially in how they should be interpreted and reported. Despite this, many researchers do not recognise the distinction and analyse their data using an unreasoned hybrid of the two methods.

Problems with null hypothesis statistical tests

Regardless of whether significance testing or hypothesis testing (or a hybrid) is considered, null hypothesis statistical tests have numerous problems [Citation4,Citation5,Citation7]. Five crucial problems are explained in Box 1. Each of these problems is fundamental enough to make null hypothesis statistical tests unfit for use in research. This may surprise many readers, given how widely such tests are used in published research [Citation1,Citation2].

It is also surprising that the widespread use of null hypothesis statistical tests has persisted for so long, given that the problems in Box 1 have been repeatedly raised in healthcare journals for decades [Citation8,Citation9], including physiotherapy journals [Citation10,Citation11]. There has been some movement away from null hypothesis statistical tests, but the use of alternative methods of statistical inference has increased slowly over decades, as seen in analyses of healthcare research, including physiotherapy trials [Citation2,Citation12]. This is despite the availability of alternative methods of statistical inference and promotion of those methods in statistical, medical and physiotherapy journals [Citation10,Citation13–16].

Box 1. Problems with null hypothesis statistical tests. Modified from Herbert [Citation26].

Box 2. Resources that provide additional information to respond to questions about the transition from null hypothesis statistical tests to estimation methods.

Estimation as an alternative approach for statistical inference

Although there are multiple alternative approaches to statistical inference [Citation13], the simplest is estimation [Citation17]. Estimation is based on a frequentist framework but, unlike null hypothesis statistical tests, its aim is to estimate parameters of populations using data collected from the study sample. The uncertainty or imprecision of those estimates is communicated with confidence intervals [Citation10,Citation14].

A confidence interval can be calculated from the observed study data, the size of the sample, the variability in the sample and the confidence level. The confidence level is chosen by the researcher, conventionally at 95%. This means that if hypothetically the study were to be repeated many times, 95% of the confidence intervals would contain the true population parameter. Roughly speaking, a 95% confidence interval is the range of values within which we can be 95% certain that the true parameter in the population actually lies.
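
This frequentist meaning of the confidence level can be checked by simulation. The sketch below (Python, with a hypothetical population whose mean is known by construction) draws many samples, computes a 95% confidence interval for the mean from each, and counts how often the intervals contain the true value; the proportion should be close to 95%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=2)
    true_mean, n, repeats = 50.0, 30, 10_000

    covered = 0
    for _ in range(repeats):
        sample = rng.normal(loc=true_mean, scale=10, size=n)
        # 95% CI for a mean: estimate +/- t * standard error
        se = sample.std(ddof=1) / np.sqrt(n)
        t_crit = stats.t.ppf(0.975, df=n - 1)
        if sample.mean() - t_crit * se <= true_mean <= sample.mean() + t_crit * se:
            covered += 1

    print(f"Proportion of CIs containing the true mean: {covered / repeats:.1%}")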

Confidence intervals are often discussed in relation to treatment effects in clinical trials [Citation18,Citation19], but a confidence interval can be placed around almost any statistic calculated from study data, including a mean difference, risk, odds, relative risk, odds ratio, hazard ratio, correlation, proportion, absolute risk reduction, relative risk reduction, number needed to treat, sensitivity, specificity, likelihood ratio, diagnostic odds ratio, or difference in medians.
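
As one example beyond treatment effects, a confidence interval for a proportion (such as a prevalence estimate) can be computed with the normal approximation. This is a minimal sketch with invented counts; for small samples or extreme proportions, a Wilson or exact interval behaves better than this simple Wald interval.

    import math
    from scipy import stats

    # Invented example: 24 of 60 sampled patients report a comorbidity
    events, n = 24, 60
    p_hat = events / n

    # Wald 95% CI for a proportion: estimate +/- z * standard error
    z = stats.norm.ppf(0.975)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    print(f"Prevalence {p_hat:.0%}, "
          f"95% CI {p_hat - z * se:.0%} to {p_hat + z * se:.0%}")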

Interpretation of the results of the estimation approach

To use the estimation approach well, it is not sufficient simply to report confidence intervals. Researchers must also interpret the relevance of the information portrayed by the confidence intervals and consider the implications arising from that information. The migration from statistical significance and p-values to estimation methods is littered with examples of researchers calculating confidence intervals at the behest of editors, but then ignoring those intervals and instead interpreting their study’s result dichotomously as statistically significant or non-significant depending on the p-value [Citation20]. Interpretation is crucial.

Some authors have proposed a ban on terms related to the interpretation of null hypothesis statistical testing. One prominent example is an editorial published in The American Statistician [Citation13], which introduced a special issue on statistical inference. It states:

The American Statistical Association Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of “statistical significance” be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.

This may seem radical and unworkable to researchers with a long history of null hypothesis statistical testing, but many concerns can be allayed. First, such a ban would not discard decades of existing research reported with null hypothesis statistical tests; the data generated in such studies maintain their validity and will often be reported in sufficient detail for confidence intervals to be calculated. Second, reframing the study’s aim involves a simple shift in focus from whether the result is statistically significant to gauging how large and how precise the study’s estimate of the population parameter is. (For example, instead of aiming to determine whether a treatment has an effect on stroke survivors, the aim is to estimate the size of the average effect. Instead of aiming to determine whether a prognostic model is predictive, the aim is to estimate how well the model predicts.) Third, the statistical imprecision of those estimates can be calculated readily. Existing statistical software packages already calculate confidence intervals, including free software such as R [Citation21,Citation22]. Lastly, learning to interpret confidence intervals is relatively straightforward.

Many researchers and readers initially come to understand how to interpret confidence intervals around estimates of the effect of a treatment. In a study comparing treatment versus control with a continuous outcome measure, the study’s best estimate of the effect of the treatment is usually the average between-group difference in the outcome. To account for the fact that estimates based on a sample may differ by chance from the true value in the population, the confidence interval provides an indication of the range of values above and below that estimate where the true average effect in the relevant clinical population may lie. The estimate and its confidence interval should be compared against the ‘smallest worthwhile effect’ of the intervention on that outcome in that population [Citation23]. The smallest worthwhile effect is the smallest benefit from an intervention that patients feel outweighs its costs, risk and other inconveniences [Citation23]. If the estimate and the ends of its confidence interval are all more favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile by patients in that clinical population. If the effect and its confidence interval are less favourable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered trivial by patients in that clinical population. Results with confidence intervals that span the smallest worthwhile effect indicate a benefit with uncertainty about whether it is worthwhile. Results with a narrow confidence interval that spans no effect indicate that the treatment’s effects are negligible, whereas results with a wide confidence interval that spans no effect indicate that the treatment’s effects are uncertain. For readers unfamiliar with this sort of interpretation, some clear and non-technical papers with clinical physiotherapy examples are available [Citation10,Citation14,Citation18,Citation19].
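
The comparison against the smallest worthwhile effect can be made explicit. The sketch below (Python; the trial change scores and the threshold of 3 points are invented for illustration) estimates a mean between-group difference with its 95% confidence interval using Welch’s method and interprets the interval against a prospectively nominated smallest worthwhile effect.

    import numpy as np
    from scipy import stats

    # Invented change scores and an invented smallest worthwhile effect (SWE)
    treatment = np.array([12.0, 9.5, 14.1, 11.3, 10.2, 13.6, 8.8, 12.4])
    control = np.array([7.1, 6.4, 9.0, 5.8, 8.2, 6.9, 7.5, 6.0])
    swe = 3.0

    vt = treatment.var(ddof=1) / len(treatment)
    vc = control.var(ddof=1) / len(control)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(vt + vc)
    # Welch-Satterthwaite degrees of freedom for unequal variances
    df = se**4 / (vt**2 / (len(treatment) - 1) + vc**2 / (len(control) - 1))
    t_crit = stats.t.ppf(0.975, df=df)
    low, high = diff - t_crit * se, diff + t_crit * se
    print(f"Mean difference {diff:.1f}, 95% CI {low:.1f} to {high:.1f}")

    if low > swe:
        print("Whole CI above the SWE: effect typically considered worthwhile.")
    elif high < swe:
        print("Whole CI below the SWE: effect typically considered trivial.")
    else:
        print("CI spans the SWE: benefit of uncertain worth.")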

Interpretation of estimates of treatment effects and their confidence intervals relies on knowing the smallest worthwhile effect (sometimes called the minimum clinically important difference) [Citation23]. For some research questions, such a threshold has not been established or has been established with inadequate methods. In such cases, researchers should consider conducting a study to establish the threshold or at least nominate the threshold prospectively.

Readers who understand the interpretation of confidence intervals around treatment effect estimates will find the interpretation of confidence intervals around many other types of estimates quite familiar. Roughly speaking, the confidence interval indicates the range of values around the study’s main estimate where the true population result probably lies. To interpret a confidence interval, we simply describe the practical implications of all values inside the confidence interval [Citation24]. For example, in a diagnostic test accuracy study, the positive likelihood ratio tells us how much more likely a positive test finding is in people who have the condition than it is in people who do not have the condition. A diagnostic test with a positive likelihood ratio greater than about 3 is typically useful and greater than about 10 is very useful [Citation25]. Therefore, if a diagnostic test had a positive likelihood ratio of 4.8 with a 95% confidence interval of 4.1 to 5.6, we could anticipate that the true positive likelihood ratio in the population is both useful and similar to the study’s main estimate. Conversely, if a study estimated the prevalence of depression in people after anterior cruciate ligament rupture at 40% with a confidence interval from 5% to 75%, we may conclude that the main estimate is suggestive of a high prevalence but too imprecise to conclude that confidently.
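
For the diagnostic example, the positive likelihood ratio and its 95% confidence interval can be computed from a 2 × 2 table using the standard log method. Here is a minimal sketch with an invented table (these counts are not the source of the values quoted above):

    import math
    from scipy import stats

    # Invented 2x2 table: test result (rows) by condition present/absent (columns)
    tp, fp = 90, 30   # positive test: true positives, false positives
    fn, tn = 10, 120  # negative test: false negatives, true negatives

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)

    # Log method: standard error of ln(LR+), then exponentiate the limits
    se_log = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    z = stats.norm.ppf(0.975)
    low = math.exp(math.log(lr_pos) - z * se_log)
    high = math.exp(math.log(lr_pos) + z * se_log)
    print(f"LR+ = {lr_pos:.1f}, 95% CI {low:.1f} to {high:.1f}")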

ISPJE member journals’ policy regarding the estimation approach

The executive of the ISPJE strongly recommends that member journals seek to foster use of the estimation approach in the papers they publish. In line with that recommendation, the editors who have co-authored this editorial advise researchers that their journals will expect manuscripts to use estimation methods instead of null hypothesis statistical tests. We acknowledge that it will take time to make this transition, so editors will give authors the opportunity to revise manuscripts to incorporate estimation methods if the manuscript seems otherwise potentially viable for publication. Editors may assist authors with those revisions where required.

Readers who require more detailed information to address questions about the topics raised in this editorial are referred to the resources in Box 2, such as the Research Note on the problems of significance and hypothesis testing [Citation25] and an excellent textbook that addresses confidence intervals and the application of estimation methods in various research study designs with clinical physiotherapy examples [Citation26]. Both are readily accessible to researchers and clinicians without any prior understanding of the issues.

Quantitative research studies in physiotherapy that are analysed and interpreted using confidence intervals will provide more valid and relevant information than those analysed and interpreted using null hypothesis statistical tests. The estimation approach is therefore of great potential value to the researchers, clinicians and consumers who rely upon physiotherapy research, and that is why ISPJE is recommending that member journals foster the use of estimation in the articles they publish.

Acknowledgements

We thank Professor Rob Herbert from Neuroscience Research Australia (NeuRA) for his presentation to the ISPJE on this topic and for comments on a draft of this editorial.

References

  • Nickerson RS. Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods. 2000;5(2):241–301.
  • Freire APCF, Elkins MR, Ramos EM, et al. Use of 95% confidence intervals in the reporting of between-group differences in randomized controlled trials: analysis of a representative sample of 200 physical therapy trials. Braz J Phys Ther. 2019;23(4):302–310.
  • Altman DG, Bland JM. Uncertainty and sampling error. BMJ. 2014;349:g7064.
  • Barnett V. Comparative statistical inference. London, New York: Wiley; 1973.
  • Royall RM. Statistical evidence: a likelihood paradigm. 1st ed. London, New York: Chapman & Hall; 1997.
  • Gigerenzer G, Swijtink Z, Porter T, et al. The empire of chance: how probability changed science and everyday life. Cambridge, England: Cambridge University Press; 1989.
  • Goodman SN, Royall R. Evidence and scientific research. Am J Public Health. 1988;78(12):1568–1574.
  • Ziliak S, McCloskey D. The cult of statistical significance: how the standard error costs us jobs, justice, and lives. Ann Arbor, USA: University of Michigan Press; 2008.
  • Hubbard R. Corrupt research: the case for reconceptualizing empirical management and social science. Thousand Oaks, USA: Sage; 2016.
  • Herbert RD. How to estimate treatment effects from reports of clinical trials. I: continuous outcomes. Aust J Physiother. 2000;46(3):229–235.
  • Maher CG, Sherrington C, Elkins M, et al. Challenges for evidence-based physical therapy: accessing and interpreting high-quality evidence on therapy. Phys Ther. 2004;84(7):644–654.
  • Yi D, Ma D, Li G, et al. Statistical use in clinical studies: is there evidence of a methodological shift? PLOS One. 2015;10(10):e0140159.
  • Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05”. Am Stat. 2019;73(Suppl. 1):1–19.
  • Herbert RD. How to estimate treatment effects from reports of clinical trials. II: dichotomous outcomes. Aust J Physiother. 2000;46(4):309–313.
  • Sim J, Reid N. Statistical inference by confidence intervals: issues of interpretation and utilization. Phys Ther. 1999;79(2):186–195.
  • Rothman KJ. Disengaging from statistical significance. Eur J Epidemiol. 2016;31(5):443–444.
  • Cumming G. Understanding the new statistics: effect sizes, confidence intervals, and meta-analysis. Multivariate applications series. New York: Routledge; 2012.
  • Kamper SJ. Showing confidence (intervals). Braz J Phys Ther. 2019;23(4):277–278.
  • Kamper SJ. Confidence intervals: linking evidence to practice. J Orthop Sports Phys Ther. 2019;49(10):763–764.
  • Fidler F, Thomason N, Cumming G, et al. Editors can lead researchers to confidence intervals, but can't make them think: statistical reform lessons from medicine. Psychol Sci. 2004;15(2):119–126.
  • R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. https://www.R-project.org/
  • RStudio Team. RStudio: integrated development for R. Boston, USA: RStudio, Inc.; 2019. http://www.rstudio.com/
  • Ferreira M. Research note: the smallest worthwhile effect of a health intervention. J Physiother. 2018;64(4):272–274.
  • Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305–307.
  • Herbert R. Research note: significance testing and hypothesis testing: meaningless, misleading and mostly unnecessary. J Physiother. 2019;65(3):178–181.
  • Herbert RD, Jamtvedt G, Mead J, et al. Practical evidence-based physiotherapy. 2nd ed. Oxford: Elsevier; 2011.
  • Boos DD, Stefanski LA. P-Value precision and reproducibility. Am Stat. 2011;65(4):213–221.
  • Wasserstein R, Lazar N. The ASA’s statement on p-Values: context, process, and purpose. Am Stat. 2016;70(2):129–133.
  • International Committee of Medical Journal Editors. ICMJE recommendations for the conduct, reporting, editing and publication of scholarly work in medical journals. 2013. http://www.icmje.org/icmje-recommendations.pdf.
  • McGough JJ, Faraone SV. Estimating the size of treatment effects: moving beyond p values. Psychiatry. 2009;6(10):21–29.
  • Hayat MJ, Chandrasekhar R, Dietrich MS, et al. Moving otology beyond p < 0.05. Otol Neurotol. 2020;41:578–579.
  • Hayat MJ, Staggs VS, Schwartz TA, et al. Moving nursing beyond p < .05. Res Nurs Health. 2019;42(4):244–245.
  • Cumming G, Fidler F, Kalinowski P, et al. The statistical recommendations of the American Psychological Association Publication Manual: effect sizes, confidence intervals, and meta‐analysis. Aust J Psychol. 2012;64(3):138–146.
  • Calin-Jageman RJ, Cumming G. Estimation for better inference in neuroscience. eNeuro. 2019;6(4):ENEURO.0205-19.2019.
  • Schreiber JB. New paradigms for considering statistical significance: a way forward for health services research journals, their authors, and their readership. Res Social Adm Pharm. 2020;16(4):591–594.
  • Erickson RA, Rattner BA. Moving beyond p < 0.05 in ecotoxicology: a guide for practitioners. Environ Toxicol Chem. 2020;39:1657–1669.
  • Smith RJ. P > .05: the incorrect interpretation of “not significant” results is a significant problem. Am J Phys Anthropol. 2020;172(4):521–527.
  • Percie du Sert N, Ahluwalia A, Alam S, et al. Reporting animal research: explanation and elaboration for the ARRIVE guidelines 2.0. PLOS Biol. 2020;18(7):e3000411.
