Editorial

‘Trustworthy’ systematic reviews can only result in meaningful conclusions if the quality of randomized clinical trials and the certainty of evidence improves: an update on the ‘trustworthy’ living systematic review project


What is the ‘trustworthy’ living systematic review project?

The ‘trustworthy’ living systematic review (SR) project began in 2022 with the ambitious goal of synthesizing verifiable moderate- to high-certainty evidence that could be confidently translated into strong clinical practice recommendations. Relying upon this type of evidence: 1) prevents strongly discordant clinical recommendations based totally or partially on low-quality evidence [Citation1]; 2) breaks the cycle of the grossly overused phrase, ‘These results, however, should be interpreted with caution’ [Citation2]; and 3) builds a foundation for synthesizing living, ‘trustworthy’ research for the practicing clinician that can be periodically updated [Citation3]. Our methods involved establishing the prospective validity of the identified randomized clinical trials (RCTs) (i.e. ensuring the studies were prospectively registered, then conducted and reported consistent with the registry). Establishing prospective validity through a prospective registry is critical because authors fail to follow the established rules after collecting the data and obtaining a result 73% of the time [Citation2]; this is a form of potentially inappropriate research behavior known as post-randomization bias [Citation4].

Post-randomization biases create an environment where researchers report what they create from the data instead of how the methods and the data answer the research question(s) of interest [Citation4]. Research questions should be asked a priori, not once data collection is underway or completed. Likewise, the data analysis plan is connected to the testable hypotheses established at the outset and, therefore, cannot be a ‘wait and see’ response to the data collected. These post-hoc modifications to prospective intent are often used, intentionally or unintentionally, to create type I research errors (i.e. false-positive findings), where deviating from the game’s rules after playing the game creates statistically significant findings where none exist [Citation5]. One side effect of these behaviors is that they render the tools used to assess the quality of RCTs, the certainty of RCTs, confidence in SRs, and clinical applicability useless, because these tools are predicated on the assumption of prospective validity [Citation6].

Additionally, our protocol ensured that the studies were externally valid (i.e. the inclusion and exclusion criteria were identified), at least moderately internally valid (i.e. a Physiotherapy Evidence Database (PEDro) score of 6 or higher [Citation7]), and had a low to moderate risk of bias on the Cochrane Collaboration Risk of Bias 2 tool (RoB 2) [Citation3].

What did we find?

After completing three ‘trustworthy’ SRs using the parameters we outlined in our published protocol [Citation7], we identified [Citation8] a single ‘trustworthy’ RCT [Citation9] guiding the use of manual therapy for treating patients with non-radicular cervical spine impairments and identified three trustworthy RCTs [Citation9–11] when investigating the effects of manual therapy on quantitative sensory testing and patient-reported outcome measures in participants with musculoskeletal impairments [Citation12]. Our most recent effort reviewed RCTs investigating the use of manual therapy to treat patients with shoulder dysfunction [Citation13]. Unfortunately, we could not identify any ‘trustworthy’ RCTs after evaluating prospective validity, external validity, internal validity, and risk of bias, which prevented a further assessment regarding the certainty of the evidence using the GRADE criteria [Citation13].

What are the GRADE criteria, and how are they used to establish the certainty of randomized clinical trials?

The protocol of the ‘trustworthy’ living SR project [Citation3] was designed to ensure that downgrading did not occur as the RCT data were synthesized into the certainty of the observed effect sizes and practice recommendations using the GRADE criteria. The GRADE criteria are used to establish how much confidence can be placed in the synthesized effect. Although a thorough discussion of the GRADE criteria is beyond the scope of this paper, further reading can be found in the evidence-based medicine (EBM) toolkit created by the British Journal of Sports Medicine [Citation14].

As the GRADE criteria are applied, the certainty of the evidence starts high if all studies included in the SR are RCTs [Citation15]. This certainty may then be downgraded by assessing the included studies for: 1) risk of bias (on the RoB 2); 2) imprecision (i.e. the accuracy of the 95% confidence interval (CI)); 3) inconsistency (i.e. statistical heterogeneity, judged by the degree of overlap among the studies’ CIs or by a synthesized interval that crosses zero in the meta-analysis); 4) indirectness (i.e. the patients or outcomes differ from those in the recommendation being made from the RCTs); and 5) publication bias (i.e. assessed through statistical methods).

Why is it essential not to synthesize RCTs that result in downgrading?

GRADE certainty ratings can be found in Table 1 [Citation14].

Table 1. GRADE certainty ratings.

These certainty ratings mean that downgrading the evidence on one or two of the five criteria above creates low- to very-low-certainty ratings of the synthesized RCTs on the GRADE. This translates into an observed effect that might be, or probably is, wrong and ‘… should be interpreted with caution’ [Citation2]. The question then becomes: should this level of certainty in the RCT evidence be translated into any clinical practice recommendations when synthesized in an SR? When this happens, we cannot assume that an intervention is or is not effective [Citation16]. When reliable evidence does not exist, the absence of this evidence cannot be translated into any discordant clinical practice recommendations [Citation1]. A conclusion based on this absence of reliable evidence cannot provide data-driven evidence of any effect [Citation16]. The most honest interpretation, when this occurs, is not to change clinical practice and not to recommend the treatment [Citation16].
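The downgrading arithmetic described above can be sketched in code. This is an illustrative sketch only, not an official GRADE tool; the function name and the simple one-level-per-downgrade bookkeeping are our assumptions for illustration.

```python
# Illustrative sketch (NOT an official GRADE tool): a body of RCT evidence
# starts at 'high' certainty and loses one level per downgrade across the
# five GRADE criteria, flooring at 'very low'.
GRADE_LEVELS = ["very low", "low", "moderate", "high"]

CRITERIA = {"risk of bias", "imprecision", "inconsistency",
            "indirectness", "publication bias"}

def grade_certainty(downgrades):
    """`downgrades` maps criterion name -> levels removed (0, 1, or 2)."""
    assert set(downgrades) <= CRITERIA, "unknown GRADE criterion"
    level = len(GRADE_LEVELS) - 1 - sum(downgrades.values())
    return GRADE_LEVELS[max(level, 0)]

# Downgrading on just two criteria already yields 'low' certainty:
print(grade_certainty({"imprecision": 1, "inconsistency": 1}))  # prints "low"
```

The sketch makes the editorial's point concrete: two single-level downgrades are enough to move a body of RCT evidence from ‘high’ to ‘low’ certainty.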

What do we know about the methodological quality and confidence in the conclusions of physical therapy SRs?

A Measurement Tool to Assess Systematic Reviews 2 (AMSTAR 2) is a critical appraisal tool used to assess SRs that include RCTs of healthcare interventions [Citation17]. The AMSTAR 2 has seven critical domains: a protocol registered before the commencement of the review (item 2); adequacy of the literature search (item 4); justification for excluding individual studies (item 7); risk of bias from the individual studies included in the review (item 9); appropriateness of meta-analytical methods (item 11); consideration of risk of bias when interpreting the results of the review (item 13); and assessment of the presence and likely impact of publication bias (item 15) [Citation17]. Table 2 includes the criteria for how the AMSTAR 2 is rated.

Table 2. Rating overall confidence in the results of the review [Citation17].

Beyond the certainty of the synthesized RCTs included within SRs, the SR’s methodological quality impacts how confident clinicians can be in the SR’s conclusions. When confidence in the SR’s results is critically low, the SR should not be relied on; when it is low, the SR may not provide an accurate and comprehensive summary of the available studies that address the question of interest. Again, this situation is one where the synthesis of RCTs using SR methodology creates unreliable evidence. When reliable evidence does not exist, the most honest interpretation is that this evidence cannot be translated into clinical practice recommendations: clinical practice should not change, and the treatment should not be recommended [Citation16].
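The AMSTAR 2 confidence rule summarized in Table 2 can also be sketched as a simple decision rule. This is a hedged illustration of the published rating algorithm (Shea et al. [Citation17]); the function name and integer-count interface are ours.

```python
# Illustrative sketch of the AMSTAR 2 overall-confidence rule (Shea et al.,
# 2017): flaws in the seven critical domains dominate the rating, while
# non-critical weaknesses matter less.
def amstar2_confidence(critical_flaws, noncritical_weaknesses):
    if critical_flaws > 1:
        return "critically low"   # more than one critical flaw
    if critical_flaws == 1:
        return "low"              # one critical flaw
    if noncritical_weaknesses > 1:
        return "moderate"         # no critical flaws, several weaknesses
    return "high"                 # at most one non-critical weakness

print(amstar2_confidence(0, 1))  # prints "high"
print(amstar2_confidence(2, 0))  # prints "critically low"
```

Note how quickly confidence collapses: a single critical flaw (e.g. no prospective registration) caps the rating at ‘low’ regardless of how well the rest of the SR was conducted.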

What have we learned about heterogeneity (variability) in published research?

The GRADE criteria include imprecision and inconsistency, which concern the accuracy of the 95% CI and the overlap of CIs across RCTs (or a synthesized interval that crosses zero in the meta-analysis), respectively, as reasons to downgrade the certainty of the RCT evidence. Both criteria are related to the statistical heterogeneity, or accuracy, of the data. It is known that the false-positive rate of p-values is approximately 30% [Citation18]. Researchers, therefore, should assess the size of the average treatment effect (ATE) (i.e. is it large enough to be clinically meaningful?) and the variability of that effect [Citation5]. Heterogeneity measures inform the reader of the variability of the ATE and provide a range of where the ATE could land if the RCT or SR were repeated. The standard measure of heterogeneity in RCTs is the 95% CI, and the most frequently used measures of heterogeneity in SRs are the I2 statistic and the prediction interval. High levels of statistical heterogeneity are the primary factor limiting the generalizability [Citation19] and clinical relevance of research findings.
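The SR heterogeneity measures named above can be made concrete with a small numerical sketch of fixed-effect pooling, Cochran's Q, and the I2 statistic. All study effects and standard errors below are made-up numbers for illustration only.

```python
import numpy as np

# Illustrative sketch: inverse-variance pooling, Cochran's Q, and I².
# The per-study effects and standard errors are invented for illustration.
effects = np.array([0.30, 0.55, 0.10, 0.42])  # study effect sizes
se = np.array([0.12, 0.15, 0.10, 0.20])       # study standard errors

w = 1.0 / se**2                               # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)      # pooled (synthesized) effect
Q = np.sum(w * (effects - pooled)**2)         # Cochran's Q
df = len(effects) - 1
I2 = max(0.0, (Q - df) / Q) * 100.0           # I² as a percentage

print(f"pooled effect = {pooled:.2f}, Q = {Q:.2f}, I² = {I2:.0f}%")
```

Even this four-study toy example lands in the ‘moderate’ I2 range, which previews the point developed below: with so few studies, the point estimate of I2 says little on its own without its 95% CI.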

It is known that when 95% CIs overlap in assessments of differences between and within groups, the accuracy of the finding is not precise enough to rely upon, even if statistically significant findings are present and the effect size appears clinically meaningful. This is a strong indicator that, if the study were repeated, the effect size obtained could land anywhere within the CI, challenging the clinical utility and reproducibility of the study’s findings.

SR heterogeneity is commonly expressed via the I2 statistic. The I2 statistic assesses the variability between the individual studies’ effects and the synthesized effect of the meta-analysis [Citation20]. The I2 is reported as a percentage, with 25% considered low, 50% considered moderate, and 75% considered high [Citation21]. Heterogeneity is expected when comparing results across studies [Citation19], but strict guidelines for performing a meta-analysis based solely on heterogeneity as measured by the I2 statistic are lacking [Citation22]. It has been suggested that the clinician consumer should determine whether the heterogeneity levels in a meta-analysis are appropriate [Citation22]. This is problematic on several levels.

  1. The decision to perform a meta-analysis, including the heterogeneity threshold at which pooling is appropriate, should be prospectively registered. A meta-analysis should not be performed if the data do not meet this prospectively derived heterogeneity threshold.

  2. The I2 is a mathematical point estimate of heterogeneity that should be reported with its 95% CI so that the reader can determine the estimate’s accuracy. This is a problem in published SRs, as these 95% CIs are often not reported, and it is known that even I2 point estimates of 0% (i.e. suggesting no statistical heterogeneity) can have 95% CIs that range from 0 to 79% [Citation23]. When this occurs, it strongly suggests that the point estimate of heterogeneity is too imprecise to be trusted. In other words, the variability strongly suggests that the meta-analysis should not be performed. This is even more challenging considering that the I2 can be imprecise and biased in small SRs with few included studies [Citation24], which is often the case. Bias in the I2 can lead to over- or under-estimating heterogeneity in small SRs, with positive bias present when heterogeneity is truly small and negative bias present when heterogeneity is substantial [Citation24].

  3. A more accurate means of expressing heterogeneity is the prediction interval [Citation20,Citation25]. The prediction interval statistically estimates the variability of the effect across the synthesized studies. Unfortunately, the prediction interval is not reported in most SRs. When reported, prediction intervals are often wide and cross zero even when the 95% CI of the synthesized effect does not [Citation26]. This is important, as the prediction interval helps to identify imprecision and inconsistency when applying the GRADE and, in this case, would downgrade the rating on the GRADE from high to low based on these two criteria.
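The prediction-interval arithmetic behind point 3 can be illustrated with made-up numbers. The formula is the standard one (pooled effect ± t with k−2 degrees of freedom × sqrt(τ² + SE²) [Citation25]); every numerical input below is an assumption chosen only to show how a significant pooled CI and a zero-crossing prediction interval can coexist.

```python
import math

# Illustrative sketch of a 95% prediction interval for a random-effects
# meta-analysis: pooled ± t_{k-2} * sqrt(tau² + SE(pooled)²).
# All inputs are invented for illustration.
pooled = 0.40     # random-effects pooled effect
se_pooled = 0.10  # standard error of the pooled effect
tau2 = 0.09       # between-study variance (tau²)
k = 5             # number of studies
t_crit = 3.182    # 97.5th percentile of the t distribution, k - 2 = 3 df

half_width = t_crit * math.sqrt(tau2 + se_pooled**2)
lo, hi = pooled - half_width, pooled + half_width

# The pooled 95% CI (0.40 ± 1.96 * 0.10, i.e. roughly 0.20 to 0.60)
# excludes zero, yet the prediction interval crosses zero: a future
# study could plausibly observe a null or negative effect.
print(f"95% prediction interval: ({lo:.2f}, {hi:.2f})")
```

This is exactly the situation described above: a statistically significant synthesized effect whose prediction interval nonetheless spans zero, which would trigger GRADE downgrading for inconsistency and imprecision.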

What conclusions can we draw from the certainty of RCTs and confidence in SRs?

In 2023, it was identified that 95% of the SRs published in the International Society of Physiotherapy Journals Editors (ISPJE) member journals indexed in MEDLINE were rated as ‘critically low’ on the AMSTAR 2 [Citation27]. Additionally, it was identified that 87.5% of the ISPJE member journals do not require prospective SR registration [Citation27]. An SR that is not prospectively registered is downgraded on the AMSTAR 2 by one level. These findings grossly erode the evidence base of physical therapy and highlight the urgency of improving the quality of RCTs and SRs. There is no evidence-based practice without our ability to interpret findings accurately [Citation28].

The findings of these ‘trustworthy’ SRs do not suggest that manual therapy is ineffective. The findings indicate that there is an absence of moderate- to high-certainty evidence that can be confidently translated into strong practice recommendations through the SR process. Similar certainty and confidence challenges, related to heterogeneity and conflicts of interest, exist for other interventions used by physical therapists, including: 1) blood flow restricted training [Citation29]; 2) cognitive behavioral therapy [Citation30]; 3) foam rolling [Citation31]; 4) instrument-assisted soft tissue mobilization [Citation32]; 5) kinesiotaping [Citation33]; 6) movement system impairments [Citation34,Citation35]; 7) pain neuroscience education [Citation36]; 8) stratified interventions using the STarT Back Screening Tool [Citation37]; 9) therapeutic exercises [Citation38,Citation39]; and 10) trigger point dry needling [Citation40], to name a few. Additionally, these challenges are not unique to physical therapy interventions, as medical interventions also display unknown levels of certainty and confidence in observed effects [Citation41–44].

The findings (or gross lack thereof) signal two potential future paths. First, if the profession fails to respond by not generating trustworthy published evidence, then there will be no future need to rely upon living SRs, or to consider this sector of the profession evidence-based. In contrast, if the profession chooses the alternate path and begins to generate published evidence consistent with strict adherence to professional and ethical standards of performing, reporting, and publishing research, then the idea of a living SR project can assist with translating future published evidence into clinical practice. Given the current lack of certainty surrounding the evidence, publishers, editors, and authors should scrutinize the clinical practice recommendations made from RCTs, and from the SRs that synthesize them, in the published research. An honest interpretation of critically low to low certainty RCT evidence, synthesized into SRs with critically low to low confidence, is that it should not be used to make clinical recommendations, as unreliable evidence cannot and should not be used to make any decisions. The question then becomes: are we, as a profession, more interested in making grossly overused, cautious, and discordant practice recommendations, always finding an effect, or in forging the difficult path ahead to establish a solid base of high-quality evidence and meaningful practice recommendations?

References

  • Yao L, Guyatt GH, Djulbegovic B. Can we trust strong recommendations based on low quality evidence? BMJ. 2021 Nov 25;375:n2833. doi: 10.1136/bmj.n2833
  • Gaylor JM. An overused phrase: interpreted with caution. J Clin Epidemiol. 2013 Feb;66(2):238–239. doi: 10.1016/j.jclinepi.2012.01.007
  • Riley SP, Swanson BT, Shaffer SM, et al. Protocol for the development of a ‘trustworthy’ living systematic review and meta-analyses of manual therapy interventions to treat neuromusculoskeletal impairments. J Man Manip Ther. 2022 Sep 9:1–11. doi: 10.1080/10669817.2022.2119528
  • Cook C, Garcia AN. Post-randomization bias. J Man Manip Ther. 2020 May;28(2):69–71. doi: 10.1080/10669817.2020.1739153
  • Riley SP, Swanson BT, Cook CE. “Trustworthiness,” confidence in estimated effects, and confidently translating research into clinical practice. Arch Physiother. 2023 Apr 6;13(1):8. doi: 10.1186/s40945-023-00162-9
  • Riley SP, Swanson BT, Shaffer SM, et al. Why do ‘Trustworthy’ living systematic reviews matter? J Man Manip Ther. 2023 Aug;31(4):215–219. doi: 10.1080/10669817.2023.2229610
  • Riley SP, Swanson BT, Shaffer SM, et al. Protocol for the development of a ‘trustworthy’ living systematic review and meta-analyses of manual therapy interventions to treat neuromusculoskeletal impairments. J Man Manip Ther. 2022:1–11. doi: 10.1080/10669817.2022.2119528
  • Riley SP, Shaffer SM, Flowers DW, et al. Manual therapy for non-radicular cervical spine related impairments: establishing a ‘Trustworthy’ living systematic review and meta-analysis. J Man Manip Ther. 2023:1–15. doi: 10.1080/10669817.2023.2201917
  • Valera-Calero A, Gallego-Izquierdo T, Malfliet A, et al. Endocrine response after cervical manipulation and mobilization in people with chronic mechanical neck pain: a randomized controlled trial. Eur J Phys Rehabil Med. 2019;55(6):792–805. doi: 10.23736/S1973-9087.19.05475-3
  • Carrasco-Martínez F, Ibáñez-Vera AJ, Martínez-Amat A, et al. Short-term effectiveness of the flexion-distraction technique in comparison with high-velocity vertebral manipulation in patients suffering from low-back pain. Complement Ther Med. 2019;44:61–67. doi: 10.1016/j.ctim.2019.02.012
  • de Oliveira, RF, Costa LOP, Nascimento LP, et al. Directed vertebral manipulation is not better than generic vertebral manipulation in patients with chronic low back pain: a randomised trial. J Physiother. 2020;66(3):174–179. doi: 10.1016/j.jphys.2020.06.007
  • Riley SP, Swanson BT, Shaffer SM, et al. Does manual therapy meaningfully change quantitative sensory testing and patient reported outcome measures in patients with musculoskeletal impairments related to the spine?: a ‘trustworthy’ systematic review and meta-analysis. J Man Manip Ther. 2023:1–16. doi: 10.1080/10669817.2023.2247235
  • Flowers DW, Swanson BT, Shaffer SM, et al. Is there ‘trustworthy’ evidence for using manual therapy to treat patients with shoulder dysfunction?: a systematic review. PLOS ONE. 2024;19(1):e0297234. doi: 10.1371/journal.pone.0297234
  • BMJ Best Practice. What is GRADE? BMJ Publishing Group Limited; 2024 [cited 2024 Feb 1]. Available from: https://bestpractice.bmj.com/info/us/toolkit/learn-ebm/what-is-grade/
  • Andrews JC, Schunemann HJ, Oxman AD, et al. GRADE guidelines: 15. Going from evidence to recommendation-determinants of a recommendation’s direction and strength. J Clin Epidemiol. 2013 Jul;66(7):726–735. doi: 10.1016/j.jclinepi.2013.02.003
  • Feres M, Feres MFN. Absence of evidence is not evidence of absence. J Appl Oral Sci. 2023 Mar 27; 31:ed001. doi: 10.1590/1678-7757-2023-ed001
  • Shea BJ, Reeves BC, Wells G, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017 Sep 21;358:j4008. doi: 10.1136/bmj.j4008
  • Karpen SC. P value problems. Am J Pharm Educ. 2017 Nov;81(9):6570. doi: 10.5688/ajpe6570
  • Deeks JJ, Higgins JPT, Altman DG. Chapter 10: analysing data and undertaking meta-analyses. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, et al., editors. Cochrane handbook for systematic reviews of interventions version 6.3 (updated February 2022). Cochrane; 2022.
  • Borenstein M. Research note: In a meta-analysis, the I2 index does not tell us how much the effect size varies across studies. J Physiother. 2020;66(2):135–139. doi: 10.1016/j.jphys.2020.02.011
  • Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–560. doi: 10.1136/bmj.327.7414.557
  • Israel H, Richter RR. A guide to understanding meta-analysis. J Orthop Sports Phys Ther. 2011;41(7):496–504. doi: 10.2519/jospt.2011.3333
  • Teichert F, Karner V, Doding R, et al. Effectiveness of exercise interventions for preventing neck pain: a systematic review with meta-analysis of randomized controlled trials. J Orthop Sports Phys Ther. 2023 Oct;53(10):594–609. doi: 10.2519/jospt.2023.12063
  • von Hippel PT. The heterogeneity statistic I2 can be biased in small meta-analyses. BMC Med Res Methodol. 2015;15(1):1–8. doi: 10.1186/s12874-015-0024-z
  • IntHout J, Ioannidis JP, Rovers MM, et al. Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open. 2016;6(7):e010247. doi: 10.1136/bmjopen-2015-010247
  • Kovanur Sampath K, Treffel L, Pt O, et al. Changes in biochemical markers following a spinal manipulation - a systematic review update. J Man Manip Ther. 2024 Feb;32(1):28–50. doi: 10.1080/10669817.2023.2252187
  • Riley SP, Swanson BT, Shaffer SM, et al. Is the quality of systematic reviews influenced by prospective registration: a methods review of systematic musculoskeletal physical therapy reviews. J Man Manip Ther. 2022 Aug 8:1–14.
  • Jette AM. Without scientific integrity, there can be no evidence base. Phys Ther. 2005 Nov;85(11):1122–1123. doi: 10.1093/ptj/85.11.1122
  • Miller BC, Tirko AW, Shipe JM, et al. The systemic effects of blood flow restriction training: a systematic review. Int J Sports Phys Ther. 2021;16(4):978–990. doi: 10.26603/001c.25791
  • Fordham B, Sugavanam T, Edwards K, et al. Cognitive-behavioural therapy for a variety of conditions: an overview of systematic reviews and panoramic meta-analysis. Health Technol Assess. 2021 Feb;25(9):1–378. doi: 10.3310/hta25090
  • Konrad A, Tilp M, Nakamura M. A comparison of the effects of foam rolling and stretching on physical performance. A systematic review and meta-analysis. Front Physiol. 2021;12:720531. doi: 10.3389/fphys.2021.720531
  • Seffrin CB, Cattano NM, Reed MA, et al. Instrument-assisted soft tissue mobilization: a systematic review and effect-size analysis. J Athl Train. 2019 Jul;54(7):808–821. doi: 10.4085/1062-6050-481-17
  • Ramirez-Velez R, Hormazabal-Aguayo I, Izquierdo M, et al. Effects of kinesio taping alone versus sham taping in individuals with musculoskeletal conditions after intervention for at least one week: a systematic review and meta-analysis. Physiotherapy. 2019 Dec;105(4):412–420. doi: 10.1016/j.physio.2019.04.001
  • Azevedo DC, Ferreira PH, Santos HO, et al. Movement system impairment-based classification treatment versus general exercises for chronic low back pain: randomized controlled trial. Phys Ther. 2018 Jan 1;98(1):28–39. doi: 10.1093/ptj/pzx094
  • Salamh PA, Hanney WJ, Boles T, et al. Is it time to normalize scapular dyskinesis? The incidence of scapular dyskinesis in those with and without symptoms: a systematic review of the literature. Int J Sports Phys Ther. 2023;18(3):558–576. doi: 10.26603/001c.74388
  • Wood L, Hendrick PA. A systematic review and meta-analysis of pain neuroscience education for chronic low back pain: short-and long-term outcomes of pain and disability. Eur J Pain. 2019 Feb;23(2):234–249. doi: 10.1002/ejp.1314
  • Rhon DI, Greenlee TA, Poehlein E, et al. Effect of risk-stratified care on disability among adults with low back pain treated in the military health system: a randomized clinical trial. JAMA Netw Open. 2023 Jul 3;6(7):e2321929. doi: 10.1001/jamanetworkopen.2023.21929
  • Karlsson M, Bergenheim A, Larsson MEH, et al. Effects of exercise therapy in patients with acute low back pain: a systematic review of systematic reviews. Syst Rev. 2020 Aug 14;9(1):182. doi: 10.1186/s13643-020-01412-8
  • Hayden JA, Ellis J, Ogilvie R, et al. Exercise therapy for chronic low back pain. Cochrane Database Syst Rev. 2021 Sep 28;9(9):CD009790. doi: 10.1002/14651858.CD009790.pub2
  • Gattie E, Cleland JA, Snodgrass S. The effectiveness of trigger point dry needling for musculoskeletal conditions by physical therapists: a systematic review and meta-analysis. J Orthop Sports Phys Ther. 2017 Mar;47(3):133–149. doi: 10.2519/jospt.2017.7096
  • Goldacre B, Drysdale H, Powell-Smith A, et al. The COMPare trials project 2016. [cited 2023 Feb 16]. Available from: www.COMPare-trials.org
  • Goldacre B, Drysdale H, Dale A, et al. COMPare: a prospective cohort study correcting and monitoring 58 misreported trials in real time. Trials. 2019 Feb 14;20(1):118. doi: 10.1186/s13063-019-3173-2
  • Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021 Apr;76(4):472–479. doi: 10.1111/anae.15263
  • Van Noorden R. Medicine is plagued by untrustworthy clinical trials. How many studies are faked or flawed? Nature. 2023 Jul;619(7970):454–458. doi: 10.1038/d41586-023-02299-w
