Statistical Innovation in Healthcare: Celebrating the Past 40 Years and Looking Toward the Future - Special issue for the 2021 Regulatory-Industry Statistics Workshop

Statistical Considerations and Challenges for Pivotal Clinical Studies of Artificial Intelligence Medical Tests for Widespread Use: Opportunities for Inter-Disciplinary Collaboration

Pages 476-490 | Received 02 Dec 2021, Accepted 10 Jan 2023, Published online: 14 Feb 2023

References

  • Abbas, H., Garberson, F., Liu-Mayo, S., Glover, E., and Wall, D. P. (2020), “Multi-modular AI Approach to Streamline Autism Diagnosis in Young Children,” Scientific Reports, 10, 5014. DOI: 10.1038/s41598-020-61213-w.
  • Altman, D. G., McShane, L. M., Sauerbrei, W., and Taube, S. E. (2012), “Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK): Explanation and Elaboration,” PLoS Medicine, 9, e1001216. DOI: 10.1371/journal.pmed.1001216.
  • Altman, D. G., Vergouwe, Y., Royston, P., and Moons, K. G. M. (2009), “Prognosis and Prognostic Research: Validating a Prognostic Model,” BMJ, 338, b605. DOI: 10.1136/bmj.b605.
  • Arora, S., Stouffer, G. A., Kucharska-Newton, A. M., Qamar, A., Vaduganathan, M., Pandey, A., Porterfield, D., Blankstein, R., Rosamond, W. D., Bhatt, D. L., and Caughey, M. C. (2019), “Twenty Year Trends and Sex Differences in Young Adults Hospitalized With Acute Myocardial Infarction,” Circulation, 139, 1047–1056. DOI: 10.1161/CIRCULATIONAHA.118.037137.
  • Biewald, L. (2019), “Why are Machine Learning Projects so Hard to Manage?,” available at https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641
  • Bland, J. M., and Altman, D. G. (1986), “Statistical Methods for Assessing Agreement between Two Methods of Clinical Measurement,” Lancet, 1, 307–310. DOI: 10.1016/S0140-6736(86)90837-8.
  • Bland, J. M., and Altman, D. G. (1999), “Measuring Agreement in Method Comparison Studies,” Statistical Methods in Medical Research, 8, 135–160. DOI: 10.1177/096228029900800204.
  • Bland, J. M., and Altman, D. G. (2007), “Agreement between Methods of Measurement with Multiple Observations per Individual,” Journal of Biopharmaceutical Statistics, 17, 571–582. DOI: 10.1080/10543400701329422.
  • Bleeker, S. E., Moll, H. A., Steyerberg, E. W., Donders, A. R. T., Derksen-Lubsen, G., Grobbee, D. E., and Moons, K. G. M. (2003), “External Validation is Necessary in Prediction Research: A Clinical Example,” Journal of Clinical Epidemiology, 56, 826–832. DOI: 10.1016/s0895-4356(03)00207-5.
  • Brenner, H., and Gefeller, O. (1997), “Variation of Sensitivity, Specificity, Likelihood Ratios, and Predictive Values with Disease Prevalence,” Statistics in Medicine, 16, 981–991. DOI: 10.1002/(sici)1097-0258(19970515)16:9<981::aid-sim510>3.0.co;2-n.
  • Campbell, G. (2021), “The Role of Statistics in the Design and Analysis of Companion Diagnostic (CDx) Studies,” Biostatistics & Epidemiology, 5, 218–231. DOI: 10.1080/24709360.2021.1913706.
  • CDC (2019), “Coronary Heart Disease, Myocardial Infarction, and Stroke,” available at https://www.cdc.gov/aging/agingdata/docs/Coronary-Stroke-Brief-508.pdf
  • Choi, B. C. (1997), “Causal Modeling to Estimate Sensitivity and Specificity of a Test When Prevalence Changes,” Epidemiology, 8, 80–86. DOI: 10.1097/00001648-199701000-00013.
  • Clinical and Laboratory Standards Institute (CLSI) (2014), “Evaluation of Precision Performance of Quantitative Measurement Methods; Approved Guideline,” (3rd ed.), CLSI document EP5-A3. Wayne: Clinical and Laboratory Standards Institute.
  • Code of Federal Regulations (CFR), “Intended Use,” Title 21, Volume 8, available at https://www.ecfr.gov/current/title-21/chapter-I/subchapter-H/part-801/subpart-A/section-801.4
  • Cook, N. R. (2007), “Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction,” Circulation, 115, 928–935. DOI: 10.1161/CIRCULATIONAHA.106.672402.
  • De, A., Meier, K., Tang, R., Li, M., Gwise, T., Gomatam, S., and Pennello, G. (2013), “Evaluation of Heart Failure Biomarker Tests: A Survey of Statistical Considerations,” Journal of Cardiovascular Translational Research, 6, 449–457. DOI: 10.1007/s12265-013-9470-3.
  • Diamond, G. A. (1986), “Reverend Bayes’ Silent Majority: An Alternative Factor Affecting Sensitivity and Specificity of Exercise Electrocardiography,” American Journal of Cardiology, 57, 1175–1180. DOI: 10.1016/0002-9149(86)90694-6.
  • Feng, J., Emerson, S., and Simon, N. (2021), “Approval Policies for Modifications to Machine Learning-based Software as a Medical Device: A Study of Bio-Creep,” Biometrics, 77, 31–44. DOI: 10.1111/biom.13379.
  • Feng, J., Pennello, G., and Petrick, N. (2022), “Sequential Algorithmic Modification with Test Data Reuse,” in Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence (UAI 2022), PMLR (Vol. 180), pp. 674–684.
  • Gambre, A. S., Liew, C., Hettiarachchi, G., Lee, S., MacDonald, M., Kam, C., and Poh, A. (2018), “Accuracy and Clinical Outcomes of Coronary CT Angiography for Patients with Suspected Coronary Artery Disease: A Single-Centre Study in Singapore,” Singapore Medical Journal, 59, 413–418. DOI: 10.11622/smedj.2018096.
  • Gelman, A., and Loken, E. (2014), “The Statistical Crisis in Science,” American Scientist, 102, 460–465. DOI: 10.1511/2014.111.460.
  • Gianfrancesco, M. A., Tamang, S., Yazdany, J., and Schmajuk, G. (2018), “Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data,” JAMA Internal Medicine, 178, 1544–1547. DOI: 10.1001/jamainternmed.2018.3763.
  • Guggenmoos-Holzmann, I., and van Houwelingen, H. C. (2000), “The (In)Validity of Sensitivity and Specificity,” Statistics in Medicine, 19, 1783–1792. DOI: 10.1002/1097-0258(20000715)19:13<1783::AID-SIM497>3.0.CO;2-B.
  • ISO (International Organization for Standardization) (1994a), “Accuracy (Trueness and Precision) of Measurement Methods and Results—Part 1: General Principles and Definitions,” ISO 5725–1. Geneva: International Organization for Standardization.
  • ISO (International Organization for Standardization) (1994b), “Accuracy (Trueness and Precision) of Measurement Methods and Results—Part 2: Basic Method for the Determination of Repeatability and Reproducibility of a Standard Measurement Method,” ISO 5725–2. Geneva: International Organization for Standardization.
  • James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021), An Introduction to Statistical Learning: With Applications in R, New York: Springer.
  • Konings, H. (1982), “Use of Deming Regression in Method-Comparison Studies,” Survey of Immunologic Research, 1, 371–374. DOI: 10.1007/BF02918550.
  • Leeflang, M. M. G., Rutjes, A. W. S., Reitsma, J. B., Hooft, L., and Bossuyt, P. M. M. (2013), “Variation of a Test’s Sensitivity and Specificity with Disease Prevalence,” Canadian Medical Association Journal, 185, E537–E544. DOI: 10.1503/cmaj.121286.
  • Lijmer, J. G., Mol, B. W., Heisterkamp, S., Bonsel, G. J., Prins, M. H., van der Meulen, J. H., and Bossuyt, P. M. (1999), “Empirical Evidence of Design-Related Bias in Studies of Diagnostic Tests,” JAMA, 282, 1061–1066. DOI: 10.1001/jama.282.11.1061.
  • Linnet, K. (1993), “Evaluation of Regression Procedures for Methods Comparison Studies,” Clinical Chemistry, 39, 424–432. DOI: 10.1093/clinchem/39.3.424.
  • Mansfield, E. A. (2014), “FDA Perspective on Companion Diagnostics: An Evolving Paradigm,” Clinical Cancer Research, 20, 1453–1457. DOI: 10.1158/1078-0432.CCR-13-1954.
  • McGeechan, K., Macaskill, P., Irwig, L., Liew, G., and Wong, T. Y. (2008), “Assessing New Biomarkers and Predictive Models for Use in Clinical Practice: A Clinician’s Guide,” Archives of Internal Medicine, 168, 2304–2310. DOI: 10.1001/archinte.168.21.2304.
  • McShane, L. M., and Polley, M. Y. C. (2013), “Development of Omics-based Clinical Tests for Prognosis and Therapy Selection: The Challenge of Achieving Statistical Robustness and Clinical Utility,” Clinical Trials, 10, 653–665. DOI: 10.1177/1740774513499458.
  • Meijer, F., Honing, M., Roor, T., Toet, S., Calis, P., Olofsen, E., Martini, C., van Velzen, M., Aarts, L., Niesters, M., Boon, M., and Dahan, A. (2020), “Reduced Postoperative Pain using Nociception Level-Guided Fentanyl Dosing during Sevoflurane Anaesthesia: A Randomised Controlled Trial,” British Journal of Anaesthesia, 125, 1070–1078. DOI: 10.1016/j.bja.2020.07.057.
  • Moons, K. G. M., Altman, D. G., Reitsma, J. B., Ioannidis, J. P. A., Macaskill, P., Steyerberg, E. W., Vickers, A. J., Ransohoff, D. F., and Collins, G. S. (2015), “Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration,” Annals of Internal Medicine, 162, W1–W73. DOI: 10.7326/M14-0698.
  • Moons, K. G. M., and Harrell, F. E. (2003), “Sensitivity and Specificity should be De-emphasized in Diagnostic Accuracy Studies,” Academic Radiology, 10, 670–672. DOI: 10.1016/S1076-6332(03)80087-9.
  • Moons, K. G. M., Kengne, A. P., Grobbee, D. E., Royston, P., Vergouwe, Y., Altman, D. G., and Woodward, M. (2012b), “Risk Prediction Models: II. External Validation, Model Updating, and Impact Assessment,” Heart (British Cardiac Society), 98, 691–698. DOI: 10.1136/heartjnl-2011-301247.
  • Moons, K. G. M., Kengne, A. P., Woodward, M., Royston, P., Vergouwe, Y., Altman, D. G., and Grobbee, D. E. (2012a), “Risk Prediction Models: I. Development, Internal Validation, and Assessing the Incremental Value of a New (bio)marker,” Heart (British Cardiac Society), 98, 683–690. DOI: 10.1136/heartjnl-2011-301246.
  • Moons, K. G., van Es, G. A., Deckers, J. W., Habbema, J. D., and Grobbee, D. E. (1997), “Limitations of Sensitivity, Specificity, Likelihood Ratio, and Bayes’ Theorem in Assessing Diagnostic Probabilities: A Clinical Example,” Epidemiology, 8, 12–17. DOI: 10.1097/00001648-199701000-00002.
  • Moons, K. G. M., Wolff, R. F., Riley, R. D., Whiting, P. F., Westwood, M., Collins, G. S., Reitsma, J. B., Kleijnen, J., and Mallett, S. (2019), “PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration,” Annals of Internal Medicine, 170, W1–W33. DOI: 10.7326/M18-1377.
  • Muehlematter, U. J., Daniore, P., and Vokinger, K. N. (2021), “Approval of Artificial Intelligence and Machine Learning based Medical Devices in the USA and Europe (2015–20): A Comparative Analysis,” The Lancet Digital Health, 3, e195–e203. DOI: 10.1016/S2589-7500(20)30292-2.
  • Nature Computational Science (2021), “Moving towards Reproducible Machine Learning,” Nature Computational Science, 1, 629–630. DOI: 10.1038/s43588-021-00152-6.
  • NCI (2020), “Why Is Colorectal Cancer Rising Rapidly among Young Adults?” Available at https://www.cancer.gov/news-events/cancer-currents-blog/2020/colorectal-cancer-rising-younger-adults
  • Pennello, G., Sahiner, B., Gossmann, A., and Petrick, N. (2021), “Discussion on “Approval Policies for Modifications to Machine Learning-based Software as a Medical Device: A Study of Bio-Creep” by Jean Feng, Scott Emerson, and Noah Simon,” Biometrics, 77, 45–48. DOI: 10.1111/biom.13381.
  • Pepe, M. S. (2003), The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford: Oxford University Press.
  • Pepe, M. S., Fan, J., Seymour, C. W., Li, C., Huang, Y., and Feng, Z. (2012), “Biases Introduced by Choosing Controls to Match Risk Factors of Cases in Biomarker Research,” Clinical Chemistry, 58, 1242–1251. DOI: 10.1373/clinchem.2012.186007.
  • Qin, L.-X., Huang, H.-C., and Begg, C. B. (2016), “Cautionary Note on Using Cross-Validation for Molecular Classification,” Journal of Clinical Oncology, 34, 3931–3938. DOI: 10.1200/JCO.2016.68.1031.
  • Roscoe, D. M., Hu, Y. F., and Philip, R. (2015), “Companion Diagnostics: A Regulatory Perspective from the Last 5 Years of Molecular Companion Diagnostic Approvals,” Expert Review of Molecular Diagnostics, 15, 869–880. DOI: 10.1586/14737159.2015.1045490.
  • Rothman, K. J. (2012), Epidemiology: An Introduction (2nd ed.), Oxford: Oxford University Press.
  • Russell, S., and Norvig, P. (2009), Artificial Intelligence: A Modern Approach (4th ed.), Hoboken, NJ: Pearson.
  • Rutjes, A. W. S., Reitsma, J. B., Vandenbroucke, J. P., Glas, A. S., and Bossuyt, P. M. M. (2005), “Case–Control and Two-Gate Designs in Diagnostic Accuracy Studies,” Clinical Chemistry, 51, 1335–1341. DOI: 10.1373/clinchem.2005.048595.
  • Saad, M., Ray, L. B., Bujaki, B., Parvaresh, A., Palamarchuk, I., De Koninck, J., Douglass, A., Lee, E. K., Soucy, L. J., Fogel, S., Morin, C. M., Bastien, C., Merali, Z., and Robillard, R. (2019), “Using Heart Rate Profiles during Sleep as a Biomarker of Depression,” BMC Psychiatry, 19, 168. DOI: 10.1186/s12888-019-2152-1.
  • Schmidt, R. L., and Factor, R. E. (2013), “Understanding Sources of Bias in Diagnostic Accuracy Studies,” Archives of Pathology & Laboratory Medicine, 137, 558–565. DOI: 10.5858/arpa.2012-0198-RA.
  • Shiri, I., Sorouri, M., Geramifar, P., Nazari, M., Abdollahi, M., Salimi, Y., et al. (2021), “Machine Learning-based Prognostic Modeling using Clinical Data and Quantitative Radiomic Features from Chest CT Images in COVID-19 Patients,” Computers in Biology and Medicine, 132, 104304. DOI: 10.1016/j.compbiomed.2021.104304.
  • Sounderajah, V., Ashrafian, H., Aggarwal, R., De Fauw, J., Denniston, A. K., Greaves, F., Karthikesalingam, A., King, D., Liu, X., Markar, S. R., McInnes, M. D. F., Panch, T., Pearson-Stuttard, J., Ting, D. S. W., Golub, R. M., Moher, D., Bossuyt, P. M., and Darzi, A. (2020), “Developing Specific Reporting Guidelines for Diagnostic Accuracy Studies Assessing AI Interventions: The STARD-AI Steering Group,” Nature Medicine, 26, 807–808. DOI: 10.1038/s41591-020-0941-1.
  • Steyerberg, E. W. (2009), Clinical Prediction Models, New York: Springer.
  • Steyerberg, E. W., Harrell, F. E., Borsboom, G. J., Eijkemans, M. J., Vergouwe, Y., and Habbema, J. D. (2001), “Internal Validation of Predictive Models: Efficiency of Some Procedures for Logistic Regression Analysis,” Journal of Clinical Epidemiology, 54, 774–781. DOI: 10.1016/s0895-4356(01)00341-9.
  • Terrin, N., Schmid, C. H., Griffith, J. L., D’Agostino, R. B., and Selker, H. P. (2003), “External Validity of Predictive Models: A Comparison of Logistic Regression, Classification Trees, and Neural Networks,” Journal of Clinical Epidemiology, 56, 721–729. DOI: 10.1016/S0895-4356(03)00120-3.
  • U.S. FDA and Health Canada (2021), “Good Machine Learning Practice for Medical Device Development,” available at https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles
  • U.S. FDA (2007), “Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests,” available at https://www.fda.gov/media/71147/download
  • U.S. FDA (2013), “Design Considerations for Pivotal Clinical Investigations for Medical Devices,” available at https://www.fda.gov/media/87363/download
  • U.S. FDA (2014), “In Vitro Companion Diagnostic Devices,” available at https://www.fda.gov/media/81309/download
  • U.S. FDA (2017), “Software as a Medical Device (SaMD): Clinical Evaluation,” available at https://www.fda.gov/media/100714/download
  • U.S. FDA (2018a), “Precision Medicine,” available at https://www.fda.gov/medical-devices/in-vitro-diagnostics/precision-medicine
  • U.S. FDA (2018b), “De Novo Classification Request for IDx-DR,” available at https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN180001.pdf
  • U.S. FDA (2020a), “Multiple Function Device Products: Policy and Considerations,” available at https://www.fda.gov/media/112671/download
  • U.S. FDA (2020b), “Clinical Performance Assessment: Considerations for Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data in Premarket Notification (510(k)) Submissions,” available at https://www.fda.gov/media/77642/download
  • U.S. FDA (2021), “Requests for Feedback and Meetings for Medical Device Submissions: The Q-Submission Program,” available at https://www.fda.gov/media/114034/download
  • Usher-Smith, J. A., Sharp, S. J., and Griffin, S. J. (2016), “The Spectrum Effect in Tests for Risk Prediction, Screening, and Diagnosis,” British Medical Journal, 353, i3139. DOI: 10.1136/bmj.i3139.
  • van der Heijden, A. A., Abramoff, M. D., Verbraak, F., van Hecke, M. V., Liem, A., and Nijpels, G. (2018), “Validation of Automated Screening for Referable Diabetic Retinopathy with the IDx-DR Device in the Hoorn Diabetes Care System,” Acta Ophthalmologica, 96, 63–68. DOI: 10.1111/aos.13613.
  • Vishnuvajjala, R. L. (2015), “Issues with Training, Testing and Validation Datasets in the Development of Diagnostics,” in Proceedings of 2015 Joint Statistical Meetings, held at Seattle, WA.
  • Weiss, J. C., Natarajan, S., Peissig, P. L., McCarty, C., and Page, D. (2012), “Machine Learning for Personalized Medicine: Predicting Primary Myocardial Infarction from Electronic Health Records,” AI Magazine, 33, 33–45. DOI: 10.1609/aimag.v33i4.2438.
  • Xu, Z., and De, A. (2022), “Assessing Model Accuracy using Random Data Split: A Simulation Study,” Journal of Biopharmaceutical Statistics. DOI: 10.1080/10543406.2022.2089158.
  • Zhou, X. H., Obuchowski, N. A., and McClish, D. K. (2002), Statistical Methods in Diagnostic Medicine, New York: Wiley.
