Research Article

Wide range of applications for machine-learning prediction models in orthopedic surgical outcome: a systematic review

Pages 526-531 | Received 15 Nov 2020, Accepted 14 Apr 2021, Published online: 10 Jun 2021

Abstract

Background and purpose — Advancements in software and hardware have enabled the rise of clinical prediction models based on machine learning (ML) in orthopedic surgery. Given their growing popularity and their likely implementation in clinical practice we evaluated which outcomes these new models have focused on and what methodologies are being employed.

Material and methods — We performed a systematic search in PubMed, Embase, and Cochrane Library for studies published up to June 18, 2020. Studies reporting on non-ML prediction models or non-orthopedic outcomes were excluded. After screening 7,138 studies, 59 studies reporting on 77 prediction models were included. We extracted data regarding outcome, study design, and reported performance metrics.

Results — Of the 77 identified ML prediction models the most commonly reported outcome domain was medical management (17/77). Spinal surgery was the most commonly involved orthopedic subspecialty (28/77). The most frequently employed algorithm was neural networks (42/77). Median size of datasets was 5,507 (IQR 635–26,364). The median area under the curve (AUC) was 0.80 (IQR 0.73–0.86). Calibration was reported for 26 of the models and 14 provided decision-curve analysis.

Interpretation — ML prediction models have been developed for a wide variety of topics in orthopedics. Topics regarding medical management were the most commonly studied. Heterogeneity between studies is based on study size, algorithm, and time-point of outcome. Calibration and decision-curve analysis were generally poorly reported.

Surgical decision-making in orthopedic surgery involves weighing the benefits of an intervention against its inherent risks. Prognostic scoring tools have been devised to individualize risk prediction and thus improve surgical decision-making (Janssen et al. 2015, Pereira et al. 2016, Shah et al. 2018). Although clinical prediction models are not new, recent advancements in artificial intelligence have created a host of prediction models based on machine learning (ML) (Cabitza et al. 2018).

ML is a branch of artificial intelligence that enables computer algorithms to learn from experience in large datasets without explicit programming. Figure 1 shows 3 commonly employed algorithms. Existing reviews of machine learning studies have provided a broad overview of applications ranging from vision to natural language processing and predictive analytics (Cabitza et al. 2018). To our knowledge, no study has critically assessed the body of studies focused on ML prediction models for surgical outcome in orthopedics. These types of prediction models are most likely the first branch of artificial intelligence to be employed in clinical practice (Staartjes et al. 2020). Therefore, familiarizing practicing orthopedic surgeons with ML’s concepts and the topics these new methods have focused on can optimize their implementation in the clinic.

Figure 1. (A) Decision trees are hierarchical structures in which each node performs a test on the input value with the subsequent branches representing the outcomes. Their graphical representation as seen here makes them easy to understand and interpret. However, they are prone to overfitting. (B) Neural networks are based on interconnected nodes. The input features are represented by the first (blue) layer. The designated outcome is represented by the final (green) layer. The middle, hidden layers (blue and orange) base their output on the input they get from prior layers. Neural networks have been around for a long time and offer good discriminative abilities, but interpretation of the relationships between the different layers remains difficult. (C) Support vector machines (SVMs) perform classification by determining the optimal separating hyperplane between datapoints, which maximizes the distance between the 2 closest points of either group. They can be used for both linear and nonlinear relationships. While they remain effective in data with a great number of features, they do not work well in larger datasets.

As such, the purpose of this systematic review is to (1) evaluate which surgical outcomes orthopedic clinical prediction models have focused on, and (2) determine which techniques current prediction models use for development and validation.

Material and methods

Systematic literature search

Adhering to the 2009 PRISMA guidelines, a systematic search was performed in PubMed, Embase, and the Cochrane Library for articles published up to June 18, 2020. 2 different domains of medical subject headings (MeSH) terms and keywords were combined with “AND” and within the 2 domains the terms were combined with “OR.” The 1st domain included words related to ML and the 2nd domain words related to possible orthopedic specialties (Appendix 1, see Supplementary data). Terms were restricted to MeSH, title, abstract, and keywords. Two reviewers (PTO, OQG) independently screened all titles and abstracts for eligible articles based on predefined criteria. Eligible full-text articles were evaluated and cross-referenced for potentially relevant articles not identified by the initial search (Figure 2). Discrepancies between the 2 reviewers were adjudicated by the senior author (JHS).
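
The search logic (OR within a domain, AND between domains) can be sketched in a few lines of Python. This is an illustration only: the term lists are abbreviated, hypothetical subsets of the syntaxes in Appendix 1, and the helper name build_query is an assumption, not part of the review's tooling.

```python
# Hypothetical, abbreviated term lists; the full syntaxes are in Appendix 1.
orthopedic_terms = ['"Spine"[Mesh]', '"Hip"[Mesh]', 'Knee[Tiab]']
ml_terms = ['"Machine Learning"[Mesh]', '"deep learning"[tiab]']


def build_query(*domains):
    """Join terms within a domain with OR, then join the domains with AND."""
    blocks = ["(" + " OR ".join(terms) + ")" for terms in domains]
    return " AND ".join(blocks)


query = build_query(orthopedic_terms, ml_terms)
print(query)
```

A record is retrieved only if it matches at least 1 term from every domain, which is why adding synonyms within a domain widens the search while adding a domain narrows it.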

Figure 2. Flowchart of study inclusions and exclusions.

Eligibility criteria

Studies reporting on ML-based prediction models addressing orthopedic surgical outcomes were included; all intraoperative and postoperative outcomes were eligible. The surgical orthopedic population was defined as patients with disorders of the bones, joints, ligaments, tendons, or muscles treated by any type of operation. Excluded were (1) studies that did not include at least 1 ML-based prediction model for surgical outcome (e.g., studies reporting only logistic regression-based models), (2) non-English studies, (3) studies lacking full text, and (4) non-relevant study types such as animal studies, letters to the editor, and case reports.

Assessment of methodological quality

Quality assessment was performed based on a modified nine-item Methodological Index for Non-Randomized Studies (MINORS) checklist (Slim et al. 2003). We made it applicable to our systematic review by including disclosure, study aim, input feature, output feature, validation method, dataset distribution, performance metric, and explanation of the used AI model (Langerhuizen et al. 2019). These 9 items were scored on a binary scale: 0 (not reported or unclear) and 1 (reported and adequate).

Data extraction

Table 1 lists the data we extracted from each study. For this review, 6 main orthopedic surgical outcome domains were identified, consisting of (1) intraoperative complications (e.g., blood transfusion, prolonged operative time), (2) postoperative complications (e.g., venous thromboembolism), (3) survival, (4) patient-reported outcome measures (PROMs), (5) medical management (e.g., hospitalization), and (6) other. For studies reporting the performance of multiple ML models, the best performing ML model was used. 13 studies provided multiple models for multiple surgical outcomes; these were extracted separately, resulting in more ML models than studies. Only the 2 performance measures AUC and accuracy were extracted as they were the most commonly reported results.
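
For readers less familiar with the 2 extracted measures, the following minimal Python sketch shows how each is commonly computed for a binary outcome. The outcomes and predicted risks are hypothetical, and the 0.5 cut-off for accuracy is an assumption; the included studies may have used other thresholds.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases classified correctly at a given probability cut-off."""
    predictions = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(predictions, y_true)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random event case is ranked above a random non-event case.

    Ties count as half, which matches the Mann-Whitney formulation of the AUC.
    """
    pos = [p for p, y in zip(y_prob, y_true) if y == 1]
    neg = [p for p, y in zip(y_prob, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = event) and predicted risks, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

print(f"accuracy = {accuracy(y_true, y_prob):.2f}, AUC = {auc(y_true, y_prob):.2f}")
```

Accuracy depends on the chosen cut-off and on how common the outcome is, whereas the AUC summarizes ranking ability across all cut-offs; neither measure says anything about calibration.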

Table 1. Data extracted from each study

Study characteristics

After screening of titles and abstracts, 758 full-text articles were assessed for eligibility and ultimately 59 articles were included reporting on 77 ML prediction models (Figure 2). Median sample size was 5,818 (IQR 635–26,869). Using the MINORS criteria, all 59 articles were found to be of similar quality. All included a minimum of 8 out of 9 appraisal items (Appendix 2, see Supplementary data).

Statistics

AUC scores and accuracies in tables are expressed as they were originally reported. For studies that reported multiple results within a single outcome domain (e.g., multiple different postoperative PROMs, each with an independent AUC) averages were taken. The sizes of the training, validation, and test sets are reported as percentages of the total dataset. No meta-analysis was performed because of obvious heterogeneity between studies and in orthopedic applications. However, to summarize the findings in some quantitative form, the median AUC and accuracy of the prediction performance were calculated for all studies.
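
The summary procedure described above (average within a study, then take the median and IQR across studies) can be sketched as follows. The AUC values and study labels are hypothetical, and the quartile method is an assumption: quartile definitions differ between software packages, so results computed in Excel may differ slightly.

```python
from statistics import mean, median, quantiles

# Hypothetical AUCs per study; a study may report several within one outcome domain.
reported_aucs = {
    "study_a": [0.71, 0.75],
    "study_b": [0.80],
    "study_c": [0.84, 0.88, 0.86],
    "study_d": [0.90],
    "study_e": [0.78],
}

# Step 1: average multiple results within a single study and outcome domain.
per_study = [mean(values) for values in reported_aucs.values()]

# Step 2: summarize across studies with the median and interquartile range.
summary_median = median(per_study)
q1, _, q3 = quantiles(per_study, n=4, method="inclusive")

print(f"median AUC {summary_median:.2f} (IQR {q1:.2f}-{q3:.2f})")
```

Medians and IQRs are descriptive only; they do not weight studies by size or account for between-study heterogeneity, which is why they are not a substitute for a meta-analysis.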

We used Microsoft Excel (Version 16.31; Microsoft Inc, Redmond, WA, USA) for standardized forms for data extraction and quality assessment, and Mendeley as reference management software.

Ethics, funding, and potential conflicts of interests

Institutional review board approval was not required for this systematic review. No external funding was received. The authors have no conflicts of interest to declare.

Results

Study design

Table 2 (see Supplementary data) lists the characteristics of all included studies. More than half of the 77 models were developed with data from national databases or registries (42) (Table 3). The median number of predictor variables used in the ML model was 10 (IQR 8–15). Models using national data did not include more variables: 10 (IQR 8–13). 68 of the models had a binary distribution of the outcome variable. The most frequently employed algorithms were neural networks (42) and random forests (30). 36 of the neural networks were single-layer, 5 deep learning, and 1 convolutional. The median number of patients used was 5,507 (IQR 635–26,364). Median AUC was 0.80 (IQR 0.73–0.86) and median accuracy was 79% (IQR 75–88). Calibration was reported for 26 of the models and 23 provided Brier scores. Decision-curve analysis was employed in 14 studies. 18 provided a digital application for their prediction model.

Table 3. Characteristics of studies (n = 77). Values are count (%) unless otherwise specified

Table 2. Studies evaluating ML models for orthopedic surgical outcome prediction

Outcome

The most commonly reported outcome domains were medical management (17) and survival (16). Medical management mostly focused on discharge destination (7) and hospitalization (4). The studies on survival all addressed patient survival. 6 survival studies were in orthopedic oncology and 5 in orthopedic trauma. Both medical management and survival had a higher median AUC (0.82 and 0.84, respectively) than the overall median AUC. Spinal surgery was the most commonly involved subspecialty (28).

Discussion

Recent years have seen an increasing interest in artificial intelligence and ML in orthopedics (Bini 2018, Jayakumar et al. 2019). With this systematic review we aimed to provide an introduction to the main concepts of developing ML models for orthopedic surgeons and analyze the current application and design of these models in orthopedic surgery. We found a wide range of potential applications, ranging from predicting survival in spinal metastases and clinical outcome after shoulder arthroplasty to hospitalization after hip fracture surgery.

This systematic review has a number of limitations. 1st, due to the relative novelty of this field of research in orthopedic surgery, the variety in study designs renders comparisons and comprehensive quantitative analysis difficult. We therefore opted to perform a qualitative analysis of the current publications. Hopefully, the increasing familiarity with these types of studies will lead to better reporting and open up the possibility to perform quantitative analyses. 2nd, this review is likely influenced by publication bias. ML prediction models with good performance are more likely to be published than models with mediocre or poor performance. This positive publication bias has been shown both in medicine and computational sciences (Boulesteix et al. 2015). The performance measures presented here are therefore likely to be more favorable than those of all developed models. 3rd, despite our efforts to perform a search across multiple online libraries, we have missed a number of studies reporting ML prediction models. Whilst unfortunate, we do not think these omissions will significantly alter our findings on research topics or most utilized methodology as this review included nearly 60 studies.

This systematic review shows that ML models have been developed for a wide variety of topics across all subspecialties within orthopedics. Perhaps surprisingly, medical management was the most studied domain with the majority of models focusing on readmissions and discharge placement. Both readmissions and discharge delays impose a heavy burden on healthcare costs (Wan et al. 2016). Healthcare expenditure has risen steadily throughout the developed world in recent decades (OECD 2019). While there is enormous variation in healthcare systems, government institutions in virtually all countries have looked at improving medical management to help curb costs (Schwierz 2016). Papanicolas et al. (2018) found that activities relating to planning, regulating, and managing health services were a major factor in the difference in healthcare expenditure between the United States and 10 other high-income countries. Shrank et al. (2019) concluded that failure of care coordination, leading to unnecessary readmissions among other things, amounts to $78 billion of waste in the United States. To address this problem the Centers for Medicare and Medicaid Services started the Hospital Readmissions Reduction Program in 2012, incentivizing hospitals to lower readmission rates. Knowing in advance which patients are at risk of being readmitted within 30 days after discharge is crucial, which is a possible explanation as to why so many prediction models focus on this topic. Similarly, knowing in advance where patients are likely to be discharged to makes preventing delayed discharge a lot easier than the other interventions tried over the years (Bryan 2010, Ou et al. 2011). Furthermore, the databases available in the studies on medical management appear to be larger, enabling researchers to include more variables and create better performing prediction models. These models are more likely to be published as evidenced by the higher AUC for medical management compared to overall AUC.

Survival was the other commonly studied outcome domain. Accurately estimating remaining life-expectancy is an important feature in medical decision-making in orthopedic oncology (Pereira et al. 2016). In a patient group with only limited life-span remaining, the aim of treatment is to preserve quality of life. Accurate survival estimations can guide decision-making on whether or not to perform surgery and if so, which operative treatment should be opted for (Quinn et al. 2014). With an ageing population and cancer patients surviving longer, the incidence of bone metastases will continue to rise and prediction models will likely play an increasing role in this field (Quinn et al. 2014).

The AAOS Census 2018 showed only 8.3% of orthopedic surgeons’ primary specialty area was the spine, while one-third of the prediction models were linked to spinal surgery (AAOS Department of Clinical Quality and Value 2019). Cost reduction may also be the driving factor in the overrepresentation of spinal surgery prediction models; the economic cost of spinal surgery is large and growing with spinal fusions alone costing $30 billion annually in the United States (Johnson and Seifi 2018). Prediction models could play a role in curbing costs by improving patient selection and surgical decision-making, although this could be said for all other subspecialties. Another possible explanation for the disproportionate number is the overlap with neurosurgery. The neurosurgical field was relatively quicker to use ML to develop prediction models and had developed several models in spinal surgery earlier on (Senders et al. 2018). Finally, the field of prediction models is expanding but still small. A significant proportion of the prediction models are developed by a few research groups that happen to focus on spine surgery. With the field expanding as fast as it is with new prediction models being published every month, we expect the overrepresentation of spine surgery to be temporary in a field in its infancy.

While there is wide variation in study design, certain study design elements are fairly similar across most studies. The most common designs comprise binary outcomes; either a 70:30 or 80:20 split between training and test set; and 10-fold cross-validation (10-FCV) as method of internal validation. Wide variety exists in study size, time-point of outcome, and choice of ML algorithms. Study size is mostly defined by whether a national database or registry was used for model development. These quality improvement databases offer a large number of datapoints with a variety of variables from a diverse group of hospitals, enabling the creation of prediction models. However, these databases are sometimes flawed by errors and their generalizability is also yet to be assessed (Rolston et al. 2017). External validation remains crucial considering generalizability outside the geographical origin of the database is not ensured (Janssen et al. 2018). Institutional databases offer the advantage of more veracious data, for instance including PROM data, which can extend over longer periods of time, but often lack adequate size.
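
The common design of a hold-out test set combined with 10-fold cross-validation on the training data can be sketched as follows. This is a simplified illustration under stated assumptions: the cohort of 1,000 patients is hypothetical, the function names are our own, and the sketch splits indices only, without the stratification by outcome that many studies apply.

```python
import random


def train_test_split(indices, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold out a fraction of them as the test set."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(indices, k=10):
    """Yield (training, validation) index lists; each case is validated on once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical cohort of 1,000 patients identified by row index (80:20 split).
train_idx, test_idx = train_test_split(range(1000), test_fraction=0.2)
splits = list(k_fold(train_idx, k=10))

print(len(train_idx), len(test_idx), len(splits))
```

The test set is set aside before cross-validation so that it plays no part in model selection; performance on it is still a form of internal validation and does not replace external validation in a different population.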

Which ML algorithm is chosen seems highly random. While studies do list the pros and cons of certain algorithms, no study elaborates on why those algorithms were specifically chosen. A potential reason neural networks and random forests are selected so often is the familiarity of these algorithms. Neural networks have been around for decades, but were limited by lagging computational power (Hopfield 1988). The increase in computational power has led to a significant expansion of what neural networks can process and scientists have been able to build on the work of previous decades (Schmidhuber 2015). Future research should report on multiple ML algorithms and provide the performance measures of all models, thus enabling comparison between different approaches.

Despite the importance of performance metrics, a mere one-third of prediction models included information on calibration, similar to prior studies assessing prediction models in multiple medical domains (Bouwmeester et al. 2012, Heus et al. 2018). Calibration is important to evaluate whether the model is under- or overestimating the risk regardless of the discriminative abilities. Systematically underestimating risk can lead to undertreatment, while overestimating risk can cause overtreatment (Van Calster and Vickers 2015, Van Calster et al. 2019). To improve the quality of reporting of clinical prediction models, Collins et al. (2015) published the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement. While not tailored for ML prediction models this guideline can provide a framework for researchers to use during development. Hopefully, a more widespread adoption of the TRIPOD statement can lead to less variation in study designs and better reporting of performance metrics.
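
To make the notion of calibration concrete, the sketch below computes 3 simple summaries: the Brier score, the ratio of observed to expected events, and a grouped comparison of predicted risk with observed event rate. The data are hypothetical and the function names are our own; this is one possible way of assessing calibration, not the method of any included study.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome (0 is best)."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)


def observed_to_expected(y_true, y_prob):
    """Observed events divided by expected events; below 1 means risk is overestimated."""
    return sum(y_true) / sum(y_prob)


def calibration_table(y_true, y_prob, n_bins=2):
    """Mean predicted risk versus observed event rate within equal-sized risk groups."""
    ranked = sorted(zip(y_prob, y_true))
    size = len(ranked) // n_bins
    rows = []
    for b in range(n_bins):
        # The last group absorbs any remainder when the cases do not divide evenly.
        chunk = ranked[b * size:] if b == n_bins - 1 else ranked[b * size:(b + 1) * size]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        event_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, event_rate))
    return rows


# Hypothetical outcomes (1 = event) and predicted risks, for illustration only.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print(f"Brier score = {brier_score(y_true, y_prob):.3f}")
print(f"O/E ratio   = {observed_to_expected(y_true, y_prob):.2f}")
for mean_pred, event_rate in calibration_table(y_true, y_prob):
    print(f"predicted {mean_pred:.2f} vs observed {event_rate:.2f}")
```

In a well-calibrated model the predicted and observed columns agree in every risk group. A model can have a high AUC and still fail this check, for example when all predicted risks are uniformly too high.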

Only one-fifth of prediction models have a digital application available. The purpose of prediction models is to aid clinicians and patients in decision-making, which can be achieved only if the models are available for use. Otherwise, predictive analytics based on ML will remain a mere theoretical exercise. Furthermore, researchers should be encouraged to not only provide a digital application of their prediction model, but share their code as well. With a field in its infancy, providing code of more experienced researchers can guide beginning research groups in their endeavors. Additionally, this can greatly increase the small number of external validation studies being performed.

In conclusion, ML prediction models have been developed for a wide variety of topics in orthopedic surgery. Topics regarding medical management and survival were the most commonly studied and spine surgery was the most involved subspecialty. Heterogeneity between studies is mostly based on study size, choice of ML algorithm, and time-point of outcome. Most published prediction models showed fair to good discriminative abilities, while calibration was poorly reported. Future studies should preferably include more multi-institutional, prospective databases and develop multiple models enabling comparison between different ML approaches. Also, important performance measures such as calibration should be reported to evaluate the prediction model accurately.

All authors made a substantial contribution to the study. PTO, OQG, CO, JJV, and JHS contributed to the conception of the study. PTO and OQG screened all the titles and abstracts. PTO, OQG, AVK, and MB participated in data collection. PTO and OQG conducted the statistical analyses and prepared the manuscript. All authors contributed to interpretation of the data and participated in revision of the manuscript.

Acta thanks Max Gordon and Christoph Hubertus Lohmann for help with peer review of this study.

Supplementary data

Table 2 and Appendices 1 and 2 are available as supplementary data in the online version of this article, http://dx.doi.org/10.1080/17453674.2021.1932928

  • AAOS Department of Clinical Quality and Value. Orthopaedic Practice in the US 2018. 2019 (January): 1–68.
  • Bini S A. Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplasty 2018; 33(8): 2358–61.
  • Boulesteix A L, Stierle V, Hapfelmeier A. Publication bias in methodological computational research. Cancer Inform 2015; 14(Suppl. 5): 11–19.
  • Bouwmeester W, Zuithoff N P A, Mallett S, Geerlings M I, Vergouwe Y, Steyerberg E W, Altman D G, Moons K G M. Reporting and methods in clinical prediction research: a systematic review. PLoS Med 2012; 9(5): e1001221.
  • Bryan K. Policies for reducing delayed discharge from hospital. Br Med Bull 2010; 95(1): 33–46.
  • Cabitza F, Locoro A, Banfi G. Machine learning in orthopedics: a literature review. Front Bioeng Biotechnol 2018; 6: 75.
  • Collins G S, Reitsma J B, Altman D G, Moons K G M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. Eur Urol 2015; 67(6): 1142–51.
  • Heus P, Damen J A A G, Pajouheshnia R, Scholten R J P M, Reitsma J B, Collins G S, Altman D G, Moons K G M, Hooft L. Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement. BMC Med 2018; 16(1): 120.
  • Hopfield J J. Artificial neural networks. IEEE Circuits Devices 1988; 4(5): 3–10.
  • Janssen S J, van der Heijden A S, van Dijke M, Ready J E, Raskin K A, Ferrone M L, Hornicek F J, Schwab J H. 2015 Marshall Urist Young Investigator Award: Prognostication in patients with long bone metastases: does a boosting algorithm improve survival estimates? Clin Orthop Relat Res 2015; 473(10): 3112–21.
  • Janssen D M C, van Kuijk S M J, D’Aumerie B B, Willems P C. External validation of a prediction model for surgical site infection after thoracolumbar spine surgery in a Western European cohort. J Orthop Surg Res 2018; 13(1): 114.
  • Jayakumar P, Moore M L G, Bozic K J. Value-based healthcare: can artificial intelligence provide value in orthopaedic surgery? Clin Orthop Relat Res 2019; 477(8): 1777–80.
  • Johnson W C, Seifi A. Trends of the neurosurgical economy in the United States. J Clin Neurosci 2018; 53(2018): 20–6.
  • Langerhuizen D W G, Janssen S J, Mallee W H, Van Den Bekerom M P J, Ring D, Kerkhoffs G M M J, Jaarsma R L, Doornberg J N. What are the applications and limitations of artificial intelligence for fracture detection and classification in orthopaedic trauma imaging? A systematic review. Clin Orthop Relat Res 2019; 477(11): 2482–91.
  • OECD 2019. Health at a Glance 2019. Available at: https://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2019_4dd50c09-en.
  • Ou L, Chen J, Young L, Santiano N, Baramy L-S, Hillman K. Effective discharge planning: timely assignment of an estimated date of discharge. Aust Heal Rev 2011; 35(3): 357.
  • Papanicolas I, Woskie L R, Jha A K. Health care spending in the United States and other high-income countries. JAMA 2018; 319(10): 1024–39.
  • Pereira N R P, Janssen S J, Van Dijk E, Harris M B, Hornicek F J, Ferrone M L, Schwab J H. Development of a prognostic survival algorithm for patients with metastatic spine disease. J Bone Joint Surg Am 2016; 98(21): 1767–76.
  • Quinn R H, Randall R L, Benevenia J, Berven S H, Raskin K A. Contemporary management of metastatic bone disease: tips and tools of the trade for general practitioners. Instr Course Lect 2014; 63: 431–41.
  • Rolston J D, Han S J, Chang E F. Systemic inaccuracies in the National Surgical Quality Improvement Program database: implications for accuracy and validity for neurosurgery outcomes research. J Clin Neurosci 2017; 37(2017): 44–7.
  • Schmidhuber J. Deep learning in neural networks: an overview. Neural Networks 2015: 85–117.
  • Schwierz C. Cost-containment in the European Union 2016. https://ec.europa.eu/info/publications/economy-finance/cost-containment-policies-hospital-expenditure-european-union_en
  • Senders J T, Staples P C, Karhade A V, Zaki M M, Gormley W B, Broekman M L D, Smith T R, Arnaout O. Machine learning and neurosurgical outcome prediction: a systematic review. World Neurosurg 2018; 109: 476–486.e1.
  • Shah A A, Ogink P T, Nelson S B, Harris M B, Schwab J H. Nonoperative management of spinal epidural abscess: development of a predictive algorithm for failure. J Bone Joint Surg Am 2018; 100(7): 546–55.
  • Shrank W H, Rogstad T L, Parekh N. Waste in the US health care system: estimated costs and potential for savings. JAMA 2019; 322(15): 1501–9.
  • Slim K, Nini E, Forestier D, Kwiatkowski F, Panis Y, Chipponi J. Methodological index for non-randomized studies (Minors): development and validation of a new instrument. ANZ J Surg 2003; 73(9): 712–16.
  • Staartjes V E, Stumpo V, Kernbach J M, Klukowska A M, Gadjradj P S, Schröder M L, Veeravagu A, Stienen M N, van Niftrik C H B, Serra C, Regli L. Machine learning in neurosurgery: a global survey. Acta Neurochir (Wien) 2020; 162(12): 3081–91.
  • Van Calster B, Vickers A J. Calibration of risk prediction models: impact on decision-analytic performance. Med Decis Mak 2015; 35(2): 162–9.
  • Van Calster B, McLernon D J, Van Smeden M, Wynants L, Steyerberg E W, Bossuyt P, Collins G S, MacAskill P, McLernon D J, Moons K G M, Steyerberg E W, Van Calster B, Van Smeden M, Vickers A J. Calibration: the Achilles heel of predictive analytics. BMC Med 2019; 17(1): 1–7.
  • Wan H, Zhang L, Witz S, Musselman K J, Yi F, Mullen C J, Benneyan J C, Zayas-Castro J L, Rico F, Cure L N, Martinez D A. A literature review of preventable hospital readmissions: preceding the Readmissions Reduction Act. IIE Trans Healthc Syst Eng 2016; 6(4): 193–211.

APPENDIX 1:

Search syntaxes for the PubMed, Embase, and Cochrane databases

PubMed: June 18, 2020—6,036 hits

((“Foot”[Mesh] OR “Ankle”[Mesh] OR “Knee Joint”[Mesh] OR “Knee”[Mesh] OR “Ankle Joint”[Mesh] OR “Hip”[Mesh] OR “Hip Joint”[Mesh] OR “Hip Prosthesis”[Mesh] OR “Hip Fractures”[Mesh] OR “Shoulder Joint”[Mesh] OR “Shoulder”[Mesh] OR “Shoulder Fractures”[Mesh] OR “Shoulder Dislocation”[Mesh] OR “Elbow”[Mesh] OR “Elbow Joint”[Mesh] OR “Wrist Joint”[Mesh] OR “Spine”[Mesh] OR “Intervertebral Disc Degeneration”[Mesh] OR “Bone Neoplasms”[Mesh] OR “Arthroplasty”[Mesh] OR “Fractures, Bone”[Mesh] OR “Orthopedics”[Mesh] OR “Foot”[Tiab] OR “Ankle”[Tiab] OR Knee[Tiab] OR Hip[Tiab] OR “Shoulder”[Tiab] OR Elbow[Tiab] OR Wrist[Tiab] OR Spina*[Tiab] OR Spine*[tiab] OR “degenerative disc”[Tiab] OR “Bone Neoplasms”[Tiab] OR Arthroplast*[Tiab] OR Fractur*[Tiab] OR Orthop*[Tiab])) AND (“Artificial Intelligence”[Mesh] OR “Machine Learning”[Mesh] OR “Supervised Machine Learning”[Mesh] OR “Neural Networks Computer”[Mesh] OR “Deep Learning”[Mesh] OR “support vector machine”[MeSH Terms] OR “support vector machine”[All Fields] OR “Support Vector Machine”[Mesh] OR naive bayes[tiab] OR “bayesian learning”[tiab] OR neural network*[tiab] OR “support vector”[tiab] OR support vectors[tiab] OR random forest[tiab] OR “deep learning”[tiab] OR “machine prediction”[tiab] OR “machine intelligence”[tiab] OR “computational intelligence”[tiab] OR “computational learning”[tiab] OR “computer reasoning”[tiab] OR “machine learning”[tiab] OR convolutional network*[tiab] OR “artificial intelligence”[tiab])

Embase: June 18, 2020—2,819 hits

(‘foot’/exp/mj OR ‘ankle’/exp/mj OR ‘knee’/exp/mj OR ‘hip’/exp/mj OR ‘hip prosthesis’/exp/mj OR ‘hip fracture’/exp/mj OR ‘shoulder’/exp/mj OR ‘shoulder fracture’/exp/mj OR ‘shoulder dislocation’/exp/mj OR ‘elbow’/exp/mj OR ‘wrist’/exp/mj OR ‘spine’/exp/mj OR ‘intervertebral disk disease’/exp/mj OR ‘bone tumor’/exp/mj OR ‘arthroplasty’/exp/mj OR ‘fracture’/exp/mj OR ‘orthopedic surgery’/exp/mj OR foot:ab, ti OR ankle:ab,ti OR knee:ab,ti OR hip:ab,ti OR shoulder:ab,ti OR spine:ab,ti OR ‘degenerative disc’:ab,ti OR elbow:ab,ti OR wrist:ab,ti OR ‘bone tumor’:ab,ti OR arthroplasty:ab,ti OR fractur:ab,ti OR orthop:ab,ti) AND (‘artificial intelligence’/exp/mj OR ‘machine learning’/exp/mj OR ‘supervised machine learning’/exp/mj OR ‘artificial neural network’/exp/mj OR ‘deep learning’/exp/mj OR ‘support vector machine’/exp/mj OR ‘bayesian learning’/exp/mj OR ‘neural network’:ab,ti OR ‘naive bayes’:ab, ti OR ‘beyesian learning’:ab,ti OR ‘support vector’:ab,ti OR ‘support vectorts’:ab,ti OR ‘random forest’:ab,ti OR ‘deep learning’:ab,ti OR ‘machine prediction’:ab,ti OR ‘machine intelligence’:ab,ti OR ‘computational intelligence’:ab,ti OR ‘computer learning’:ab,ti OR ‘computer reasoning’:ab,ti OR ‘machine learning’:ab,ti OR ‘convolutional network’:ab,ti OR ‘artificial intelligence’:ab,ti)

Cochrane: June 18, 2020—315 hits

([mh Foot] OR [mh Knee] OR [mh “Knee Joint”] OR [mh “Ankle Joint”] OR [mh Hip] OR [mh “Hip Joint”] OR [mh “Hip Prosthesis”] OR [mh “Hip Fractures”] OR [mh “Shoulder Dislocation”] OR [mh Elbow] OR [mh “Elbow Joint”] OR [mh “Wrist Joint”] OR [mh Spine] OR [mh “Intervertebral Disk Degeneration”] OR [mh “Bone Neoplasms”] OR [mh Arthroplasty] OR [mh “Fractures, Bone”] OR [mh Orthopedics] OR ((Foot OR Ankle OR Knee OR Hip OR Shoulder OR Elbow OR Wrist OR Spine OR Spina* OR “degenerative disk” OR “Bone Neoplasms” OR Arthroplast* OR Fractur* OR Orthop*):ti,ab,kw)) AND (([mh “Artificial Intelligence”] OR [mh “Machine Learning”] OR [mh “Supervised Machine Learning”] OR [mh “Neural Networks (Computer)”] OR [mh “Deep Learning”] OR [mh “Support Vector Machine”] OR ((“naive bayes” OR “bayesian learning” OR “neural network*” OR “support vector” OR “support vectors” OR “random forest” OR “deep learning” OR “machine prediction” OR “machine intelligence” OR “computational intelligence” OR “computational learning” OR “computer reasoning” OR “machine learning” OR “convolutional network*” OR “artificial intelligence”):ti,ab,kw)))

Supplementary data

Appendix 2.

Critical appraisal of included studies