Editorial

Artificial Intelligence in Cancer Clinical Research: II. Development and Validation of Clinical Prediction Models


Introduction

In the first segment of this series on Artificial Intelligence in Cancer Clinical Research, we began by developing a better understanding of human intelligence (Citation1). The goals of AI often revolve around creating algorithms and machines capable of outperforming or amplifying human performance of complex tasks, exceeding our own capabilities in both speed and complexity. While AI has demonstrated unprecedented capabilities in addressing logical, algorithmic, or computable tasks, it has long been clear that there are activities of the human mind that are particularly challenging to replicate or even define. The other primary goal of AI is thus to emulate or, at least, better understand the inner workings and unique capabilities of the human brain. Many of these capabilities appear to reside outside the logical and computational aspects of the brain, enabling unique facets of human intelligence such as insight, intuition, discovery, imagination, discrimination, and recognition, arguably representing a deeper understanding of the meaning of our world and our actions.

Background

While considerable progress has been made in modeling the operational aspects of the human brain through neural networks and their connections, we are often left with very interesting results, perhaps surpassing other approaches, that largely leave us in the dark as to how they were achieved. Unlike logical and symbolic reasoning, which can generally be traced back to understand how the results were obtained, the workings of connected neural networks often leave us with promising results that we cannot entirely trace or fully understand. In no aspect of AI and machine learning (ML) are these issues more apparent or potentially problematic than in the development and validation of diagnostic, prognostic, and predictive models. In fact, it is in this very important realm of clinical research that a better understanding of these different approaches to AI is most critically needed, along with a more complete appreciation of the challenges and limitations of AI-guided predictive modeling and the results that emerge.

Over time, the purposes of prognostic models have largely remained the same: better understanding disease processes; improving the design and analysis of clinical trials through risk stratification; comparing outcomes between treatment groups with appropriate adjustment; defining risk groups based on prognosis; and predicting disease outcome more accurately to guide clinical decision-making, including treatment selection and patient counseling (Citation2). Altman and Lyman have previously discussed the common causes of variation in the results of prognostic factor studies, including small sample size, adjustment for differing variables, the use of different cut-points in biomarker studies, consideration of different patient subsets, and the questionable use of stepwise variable selection procedures (Citation2). Important criteria for assessing such studies include clearly stating, a priori and in a formal study protocol, the study hypotheses, planned subgroup analyses, and previously identified prognostic factors; explicitly identifying the study population, including inclusion and exclusion criteria; and providing a detailed discussion of patient treatment assignment and prognostic factor assays. It is critical to estimate the sample size needed to achieve the desired power to detect meaningful differences in outcomes, recognizing that the effective sample size is driven by the number of outcome events. Finally, the planned analysis must be detailed in advance of the study, including hypothesis testing on subgroups and any anticipated interactions (Citation2).
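To make the relationship between sample size and outcome events concrete, the brief sketch below checks the expected number of events against the number of candidate predictor parameters. It is a minimal illustration only, with entirely hypothetical numbers, and is not a substitute for the formal sample-size methods discussed in the modeling literature.

```python
# Minimal sanity check (hypothetical numbers): expected events per
# candidate predictor parameter for a planned binary-outcome model.

def events_per_predictor(n_patients: int, event_rate: float,
                         n_candidate_params: int) -> float:
    """Expected outcome events divided by candidate predictor parameters."""
    return (n_patients * event_rate) / n_candidate_params

n_patients = 800          # planned accrual (hypothetical)
event_rate = 0.15         # anticipated event proportion (hypothetical)
n_candidate_params = 20   # parameters across all candidate predictors,
                          # counting dummy variables and nonlinear terms

epp = events_per_predictor(n_patients, event_rate, n_candidate_params)
print(f"expected events: {n_patients * event_rate:.0f}, "
      f"events per parameter: {epp:.1f}")

# A widely quoted minimum is ~10 events per parameter; formal criteria
# often demand considerably more.
if epp < 10:
    print("Planned sample size is likely inadequate for this candidate set.")
```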

As noted previously, and despite the rapid progress witnessed across a broad array of medical fields, general limitations to the accuracy, reliability, and validity of AI approaches have been demonstrated, often imposing serious constraints on the application of such models in medicine, including clinical cancer research. Multiple examples of false-negative and false-positive results have led to the reporting of factually erroneous findings and underscore the continuing need for clinical oversight. A comprehensive evaluation of large language models (LLMs), a subset of AI models, in the field of oncology was recently reported for five publicly available LLMs using a battery of 2044 oncology questions drawn from a novel validation set (Citation3). Although the authors note that model performance was good overall, all LLMs tested demonstrated clinically significant error rates, including overconfidence and consistent inaccuracies. The authors urge standardized evaluations of the strengths and limitations of LLMs in order to guide patients as well as medical professionals.

Importantly, broad and readily available access to patient-level data with AI has confirmed the need for rigorous ethical review, a priori research approval, and informed consent. In addition to the pervasiveness of digital and AI illiteracy, there is the fundamental challenge of understanding and explaining study results when the internal processes appear to operate in a ‘black box’. This is true of all highly complex analyses but is particularly acute in the rapidly expanding areas of neural network and generative AI applications. Our inability to truly understand and explain how results were generated is a major challenge for all stakeholders, including clinicians, policy makers, and patients. As recently reported by Szolovits, generative AI models often demonstrate amazing abilities while at the same time producing serious errors (Citation4). As he notes, we have little understanding of why such models work as well as they do, nor of when and why they often give incorrect responses. He emphasizes the need for additional research and considerable caution in using such models, especially for clinical applications, given our lack of understanding of how and why they achieve such apparently miraculous results and our inability to anticipate when serious errors may occur.

Limitations of AI in the development of clinical prediction models

While AI developments are apparent across virtually all fields of medicine and public health, much attention has been directed toward clinical decision-making in oncology and improving the screening, diagnosis, staging, and treatment of patients with cancer, along with research and educational applications (Citation5, Citation6). While some efficiencies in the generation of diagnostic results have been demonstrated and appear quite reproducible for simple pathology or radiology images, more detailed evaluation of AI algorithms has shown significant limitations in areas such as tumor subtyping, patient subgroups defined by complex risk patterns, and the optimal dosing of treatment and supportive care (Citation7). However, nowhere are the challenges more apparent than in the development and validation of clinical prognostic and prediction models.

Numerous studies have reported a range of methodologic challenges to the use of ML and other AI techniques in the development and validation of diagnostic, prognostic, and predictive risk models across medical disciplines. Importantly, many of the major limitations associated with AI-driven prognostic and predictive modeling studies are the same limitations that pervade the development of risk models using logistic regression and other more conventional modeling approaches (Table 1) (Citation8). As with conventional modeling, some caution is needed regarding the distinction between prognostic and prediction models, as well as those in which risk factors are both prognostic and predictive. Prognostic factors are generally considered those patient or tumor characteristics that affect patient outcome, whereas predictive factors are those that define the effect of an intervention on clinical outcomes. The authors of many studies, including those discussed here, often do not distinguish between prognostic and predictive models for purposes of methodologic critique. For that reason, the systematic reviews and methodologic limitations reported here should be considered to apply to both prognostic and predictive modeling studies alike. Relatedly, the tools used to assess the quality of multivariable modeling studies, including risk of bias, analytic methodology, and reporting, are generally applied interchangeably to diagnostic, prognostic, and prediction modeling studies (Citation9, Citation10). Importantly, a consistent reduction in the performance of predictive models in independent populations highlights the essential importance of validation in the separate populations in which the models are most likely to be applied, including underserved individuals.

Table 1. Essential considerations in risk model development and validation [Citation8].
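As a minimal illustration of why performance so often drops in independent populations, the sketch below fits a logistic model to a simulated development cohort and evaluates it in an "external" cohort whose predictor effects and baseline risk differ. All data, cohort sizes, and coefficients are synthetic assumptions chosen for illustration only.

```python
# Sketch: performance of a model carried from a development cohort to an
# external population with weaker predictor effects and a higher baseline
# event rate (a common real-world scenario). All data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate_cohort(n, coefs, intercept):
    """Simulate a cohort with a binary outcome from a logistic model."""
    X = rng.normal(size=(n, len(coefs)))
    logit = X @ np.asarray(coefs) + intercept
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, y

X_dev, y_dev = simulate_cohort(2000, [0.8, -0.5, 0.3], intercept=-1.5)
X_ext, y_ext = simulate_cohort(1000, [0.4, -0.2, 0.1], intercept=-0.5)

model = LogisticRegression().fit(X_dev, y_dev)

for label, X, y in [("development", X_dev, y_dev), ("external", X_ext, y_ext)]:
    p = model.predict_proba(X)[:, 1]
    print(f"{label}: AUC={roc_auc_score(y, p):.3f}, "
          f"mean predicted risk={p.mean():.3f}, observed rate={y.mean():.3f}")
```

Both discrimination and agreement between predicted and observed risk degrade in the external cohort, which is precisely the information that validation in the target population is meant to surface.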

Dhiman and colleagues conducted a systematic review of the risk of bias in the development and validation of prognostic model studies across 62 individual publications using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), demonstrating an overall high risk of bias judgement in 84% of developed models and 51% of validated models (Citation11). Bias was introduced most commonly in the analysis, frequently related to insufficient sample size and the use of split-sample internal validation approaches. The authors concluded that prognostic models in oncology developed using ML demonstrate a high risk of bias, limiting their use in clinical decision-making, and urged better adherence to quality standards. The same group also argues that the reporting of prognostic modeling studies based on ML methods needs considerable improvement, with median adherence to the TRIPOD reporting items of 41% and only 19% of studies reporting at least 50% adherence (Citation12). Adherence was lower in model development studies (38%) and only somewhat higher in studies that also included model validation (49%).
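For readers unfamiliar with the alternatives to the split-sample internal validation criticized above, the sketch below illustrates bootstrap optimism correction, one commonly recommended resampling approach attributed to Harrell. The data and model are synthetic stand-ins; a real analysis would also repeat any variable selection inside each bootstrap loop.

```python
# Sketch of bootstrap optimism correction for internal validation:
# refit the model in each bootstrap resample, measure how much better it
# looks on its own resample than on the original data, and subtract that
# average optimism from the apparent performance. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1]))))

def fit_auc(X_train, y_train, X_eval, y_eval):
    m = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_eval, m.predict_proba(X_eval)[:, 1])

apparent = fit_auc(X, y, X, y)   # model judged on its own training data

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)                    # bootstrap resample
    boot_apparent = fit_auc(X[idx], y[idx], X[idx], y[idx])
    boot_original = fit_auc(X[idx], y[idx], X, y)  # bootstrap model, full data
    optimism.append(boot_apparent - boot_original)

corrected = apparent - np.mean(optimism)
print(f"apparent AUC {apparent:.3f}, optimism-corrected AUC {corrected:.3f}")
```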

Very similar findings were observed for ML-based prediction models in a systematic review by Andaur Navarro and colleagues, who evaluated the risk of bias in 152 identified studies using the PROBAST tool (Citation13). Of the 171 separate analyses, 87% were rated at high risk of bias. The analysis domain was most commonly affected, with 56% of models developed without adequate events per candidate predictor, 41% with inadequate handling of missing values, and 39% with inadequate assessment of overfitting. Although the majority of studies used appropriate data to develop and externally validate the ML-based tools, information on blinding to the primary outcomes and to the predictors was absent in 40% and 52% of the developed models, respectively. The authors concluded that most studies of ML-based prediction models demonstrate poor methodologic quality and are at high risk of bias. Further analysis by the same group found that most ML studies reported only the development of prediction models (87.5%), focused on binary outcomes (86.2%), and did not report sample size estimation (82.2%) (Citation14). They also noted that calibration metrics were not reported in 94.6% of studies, with the area under the receiver operating characteristic curve ranging from 0.45 to 1.00.
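The calibration metrics in question are straightforward to compute. As a minimal sketch, using synthetic predictions and outcomes, the example below estimates the two quantities most often recommended, the calibration slope and calibration-in-the-large, by regressing the observed binary outcome on the log-odds of the model's predicted risks.

```python
# Sketch of basic calibration metrics from predicted risks and outcomes.
# y_obs and p_pred below are synthetic placeholders for a real model's
# validation-set outcomes and predictions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
p_true = rng.uniform(0.05, 0.6, n)
y_obs = rng.binomial(1, p_true)
p_pred = np.clip(p_true * 1.4, 0.01, 0.99)   # deliberately miscalibrated

logit_pred = np.log(p_pred / (1.0 - p_pred))  # log-odds of predicted risks

# Calibration slope: ideal value 1; values <1 suggest overfitting or
# overly extreme predictions.
slope_fit = sm.Logit(y_obs, sm.add_constant(logit_pred)).fit(disp=0)
print("calibration slope:", round(slope_fit.params[1], 2))

# Calibration-in-the-large: intercept estimated with the log-odds as an
# offset; ideal value 0 (positive = risks underestimated, negative =
# risks overestimated, as in this synthetic example).
citl_fit = sm.GLM(y_obs, np.ones(n), family=sm.families.Binomial(),
                  offset=logit_pred).fit()
print("calibration-in-the-large:", round(citl_fit.params[0], 2))
```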

Christodoulou and colleagues conducted a systematic review of clinical prediction models utilizing either ML or regression techniques (Citation15). The most common ML methods included classification trees, random forests, artificial neural networks, and support vector machines. Potential bias in the validation procedures was detected in 68% of the studies, and calibration was not reported in 79%. Across 282 comparisons between logistic regression and ML modeling methods, the authors found essentially no difference in discrimination among the 145 comparisons at low risk of bias, while reporting significantly better discrimination with logistic regression among the 137 comparisons considered at high risk of bias. They concluded that there is no evidence for the superior performance of ML over logistic regression in developing clinical prediction models (Citation15).
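The kind of head-to-head comparison tabulated in that review can be illustrated with a brief sketch: cross-validated discrimination of logistic regression versus a random forest on the same tabular data. The dataset here is synthetic and the settings are arbitrary assumptions; on modest tabular clinical data, the two approaches frequently perform similarly, echoing the review's conclusion.

```python
# Sketch: comparing cross-validated discrimination of logistic regression
# and a random forest on the same synthetic tabular dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.8], random_state=0)  # ~20% event rate

models = [("logistic regression", LogisticRegression(max_iter=1000)),
          ("random forest", RandomForestClassifier(random_state=0))]

for name, model in models:
    aucs = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC {aucs.mean():.3f} (SD {aucs.std():.3f})")
```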

Further analyses of the studies described above highlight evidence for spin practices, overinterpretation, and inconsistent or otherwise poor reporting standards among the majority of ML-based predictive modeling studies (Citation16–18). More than half of the identified publications utilized overly strong or leading words in their title, abstract, results, discussion, or conclusion (Citation17). Many studies also made recommendations for clinical use of the modeling results in the main text without reporting any external validation (Citation16). Andaur Navarro and colleagues ultimately conclude that the completeness of reporting of prediction model studies developed using ML methods, like those using regression-based techniques, is poor, often missing information critical for the appropriate use of the resulting models, including model specification and performance (Citation18). Among the 152 articles considered, they report adherence to a median of 38.7% of the items in the TRIPOD statement, with no article adhering to all elements. Few reported on the flow of participants, blinding of predictors, model specification, or model predictive performance.

As noted above, the critical elements of conducting and reporting clinical prediction models are the same whether the models are based on ML or on more traditional regression methods. Updated TRIPOD guidance for reporting clinical prediction models using either approach has recently been published (Citation19). It includes an updated checklist for the reporting of such models that differs little from previous versions but is worth reviewing and summarizing. In the background, the need to describe any known health inequities between sociodemographic groups represented in the study population is emphasized. Perhaps not adequately emphasized is the need to identify anticipated differences between the study population and the populations to which the model is likely to be applied. Critically important is the need to acknowledge differences in the sources of data and population characteristics between the model derivation set and that used for model validation. Essential data elements to be reported include the dates of data collection, including the start and end of participant accrual, the start and completion of the intervention of interest, and the duration of follow-up. Additional critical elements include eligibility criteria and reasons for subject inclusion and exclusion. The outcome being predicted must be clearly defined and fully specified, including the time horizon, how and when it was assessed, the rationale for choosing it, and any efforts at blinding subjects or investigators. The choice and basis for selection of the initial predictors considered for model entry should also be detailed.

All of these challenges, along with the overall variety and complexity of ML modeling studies, have made interpreting and applying the results of such studies difficult for clinicians and even many methodologists (Table 2). However, the majority of limitations in the design, conduct, analysis, and reporting of such studies are very similar, if not identical, to those that have long threatened the validity and utility of modeling studies based on conventional methods, only made more acute by the increased complexity, unfamiliarity, or even ‘black box’ nature of ML models. Based on the critical analyses and reviews discussed above, it is apparent that many investigators utilizing ML approaches for prognostic and predictive modeling have not been sufficiently attentive to the fundamental requirements for conducting accurate and valid modeling studies worthy of further consideration and trustworthy for clinical application (Citation9–12, Citation14–28). There is clearly a need for investigators to go back to basics and apply their fundamental understanding of what is essential for generating reliable, accurate, and valid modeling results. We cannot allow ourselves to be confused or misled by the mystique of and enthusiasm for ML techniques, no matter how alluring to investigators, reviewers, and editors. Far more work is required, along with high-quality, methodologically sound, and reproducible results confirmed by other investigators in other settings, before the results of such models can be safely and effectively applied.

Table 2. Risk of bias in AI-driven clinical prognostic and predictive models [Citation9–28].

Conclusions

Despite the many impressive advances in AI technology, important limitations exist with both traditional and AI-guided modeling efforts that must be addressed in order to provide accurate, reliable, and valid results and trustworthy clinical recommendations. These include, first and foremost, establishing the quality, integrity, and generalizability of the data used for both model development and model validation in the setting to which the results are to be applied. The same attention to study design, rigorous analysis, and reporting of results is needed in AI modeling studies as in those based on traditional methods, while acknowledging the additional complexity and often limited understanding of AI-driven processing. Particular attention must be paid to issues of sample size, missing data, an adequate number of events per predictor, the challenge of overfitting, appropriate blinding to outcomes, reporting of calibration metrics and model performance, and the great importance of an appropriate population and methods for model validation. It is particularly important that the results of AI modeling efforts not be overinterpreted or exaggerated, given strong evidence for frequent ‘spinning’ of AI methods and results. Finally, a thorough and honest discussion of the limitations of all aspects of modeling studies, including the potential impact on health inequities, is important for both scientific interpretation and appropriate clinical application. Careful adherence to the TRIPOD reporting items provides important guidance for investigators and reassurance to editors, reviewers, and clinicians who may apply the results of modeling efforts in clinical oncology practice.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The author(s) reported there is no funding associated with the work featured in this article.

References

  • Lyman GH, Kuderer NM. Artificial intelligence in cancer clinical research: I. Introduction. Cancer Invest. 2024:1–4. doi:10.1080/07357907.2024.2347784.
  • Altman DG, Lyman GH. Methodological challenges in the evaluation of prognostic factors in breast cancer. Breast Cancer Res Treat. 1998;52(1–3):289–303. doi:10.1023/a:1006193704132.
  • Rydzewski NR, Dinakaran D, Zhao SG, Ruppin E, Turkbey B, Citrin DE, Patel KR. Comparative evaluation of LLMs in clinical oncology. NEJM AI. 2024;1(5):1–11. doi:10.1056/AIoa2300151.
  • Szolovits P. Large language models seem miraculous, but science abhors miracles. NEJM AI. 2024. doi:10.1056/AIp2300103.
  • Rösler W, Altenbuchinger M, Baeßler B, Beissbarth T, Beutel G, Bock R, et al. An overview and a roadmap for artificial intelligence in hematology and oncology. J Cancer Res Clin Oncol. 2023;149(10):7997–8006. doi:10.1007/s00432-023-04667-5.
  • Liao J, Li X, Gan Y, Han S, Rong P, Wang W, et al. Artificial intelligence assists precision medicine in cancer treatment. Front Oncol. 2022;12:998222. doi:10.3389/fonc.2022.998222.
  • Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27(12):2176–2182. doi:10.1038/s41591-021-01595-0.
  • Lyman GH, Msaouel P, Kuderer NM. Risk model development and validation in clinical oncology: lessons learned. Cancer Invest. 2023;41:1–11.
  • Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. doi:10.1136/bmj.g7594.
  • Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Reitsma JB, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7):e048008. doi:10.1136/bmjopen-2020-048008.
  • Dhiman P, Ma J, Andaur Navarro CL, Speich B, Bullock G, Damen JAA, et al. Risk of bias of prognostic models developed using machine learning: a systematic review in oncology. Diagn Progn Res. 2022;6(1):13. doi:10.1186/s41512-022-00126-w.
  • Dhiman P, Ma J, Navarro CA, et al. Reporting of prognostic clinical prediction models based on machine learning methods in oncology needs to be improved. J Clin Epidemiol. 2021;138:60–72.
  • Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ. 2021;375:n2281. doi:10.1136/bmj.n2281.
  • Andaur Navarro CL, Damen JAA, van Smeden M, et al. Systematic review identifies the design and methodological conduct of studies on machine learning-based prediction models. J Clin Epidemiol. 2023;154:8–22.
  • Christodoulou E, Ma J, Collins GS, et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.
  • Andaur Navarro CL, Damen JAA, Takada T, et al. Systematic review finds "spin" practices and poor reporting standards in studies on machine learning-based prediction models. J Clin Epidemiol. 2023;158:99–110.
  • Dhiman P, Ma J, Andaur Navarro CL, et al. Overinterpretation of findings in machine learning prediction model studies in oncology: a systematic review. J Clin Epidemiol. 2023;157:120–133.
  • Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review. BMC Med Res Methodol. 2022;22(1):12. doi:10.1186/s12874-021-01469-6.
  • Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378.
  • Gichoya JW, Thomas K, Celi LA, et al. AI pitfalls and what not to do: mitigating bias in AI. Br J Radiol. 2023;96:20230023.
  • Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–453. doi:10.1126/science.aax2342.
  • Adler-Milstein J, Redelmeier DA, Wachter RM. The limits of clinician vigilance as an AI safety bulwark. JAMA. 2024;331(14):1173–1174. doi:10.1001/jama.2024.3620.
  • Liu Y, Chen P-HC, Krause J, Peng L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA. 2019;322(18):1806–1816. doi:10.1001/jama.2019.16489.
  • Cabral S, Restrepo D, Kanjee Z, Wilson P, Crowe B, Abdulnour R-E, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med. 2024;184(5):581–583. doi:10.1001/jamainternmed.2024.0295.
  • Clift AK, Dodwell D, Lord S, Petrou S, Brady M, Collins GS, et al. Development and internal-external validation of statistical and machine learning models for breast cancer prognostication: cohort study. BMJ. 2023;381:e073800. doi:10.1136/bmj-2022-073800.
  • Damen JAA, Moons KGM, van Smeden M, et al. How to conduct a systematic review and meta-analysis of prognostic model studies. Clin Microbiol Infect. 2023;29:434–440.
  • Levis B, Snell KIE, Damen JAA, Hattle M, Ensor J, Dhiman P, et al. Risk of bias assessments in individual participant data meta-analyses of test accuracy and prediction models: a review shows improvements are needed. J Clin Epidemiol. 2024;165:111206.
  • Dhiman P, Ma J, Andaur Navarro CL, Speich B, Bullock G, Damen JAA, et al. Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review. BMC Med Res Methodol. 2022;22(1):101. doi:10.1186/s12874-022-01577-x.
