311
Views
0
CrossRef citations to date
0
Altmetric
Emerging and Re-Emerging Coronaviruses

Prediction models for COVID-19 disease outcomes

, , , , , , & show all
Article: 2361791 | Received 11 Feb 2024, Accepted 26 May 2024, Published online: 14 Jun 2024

ABSTRACT

SARS-CoV-2 has caused over 6.9 million deaths and continues to produce lasting health consequences. COVID-19 manifests broadly from no symptoms to death. In a retrospective cross-sectional study, we developed personalized risk assessment models that predict clinical outcomes for individuals with COVID-19 and inform targeted interventions. We sequenced viruses from SARS-CoV-2-positive nasopharyngeal swab samples between July 2020 and July 2022 from 4450 individuals in Missouri and retrieved associated disease courses, clinical history, and urban-rural classification. We integrated this data to develop machine learning-based predictive models to predict hospitalization, ICU admission, and long COVID.

The mean age was 38.3 years (standard deviation = 21.4) with 55.2% (N = 2453) females and 44.8% (N = 1994) males (not reported, N = 4). Our analyses revealed a comprehensive set of predictors for each outcome, encompassing human, environment, and virus genome-wide genetic markers. Immunosuppression, cardiovascular disease, older age, cardiac, gastrointestinal, and constitutional symptoms, rural residence, and specific amino acid substitutions were associated with hospitalization. ICU admission was associated with acute respiratory distress syndrome, ventilation, bacterial co-infection, rural residence, and non-wild type SARS-CoV-2 variants. Finally, long COVID was associated with hospital admission, ventilation, and female sex.

Overall, we developed risk assessment models that offer the capability to identify patients with COVID-19 necessitating enhanced monitoring or early interventions. Of importance, we demonstrate the value of including key elements of virus, host, and environmental factors to predict patient outcomes, serving as a valuable platform in the field of personalized medicine with the potential for adaptation to other infectious diseases.

GRAPHICAL ABSTRACT

Model summary and motivation. Individuals infected with SARS-CoV-2 experience a wide spectrum of clinical manifestations ranging from no symptoms to death. Using the Virus-Human Outcomes Prediction (ViHOP) algorithm, we aim to utilize the individual’s clinical characteristics, the individual’s location, and the infecting SARS-CoV-2 virus characteristics obtained by whole genome sequencing to determine their likelihood of admission to the hospital, admission to the intensive care unit (ICU), or experiencing long COVID. This model allows clinicians to identify at-risk patients for further monitoring and/or early treatment.

Introduction

The coronavirus disease of 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has resulted in devastating global health impacts [Citation1]. The Centers for Disease Control and Prevention (CDC) has reported a cumulative 6.2 million hospitalizations due to COVID-19 in the United States (US) alone [Citation2]. While some individuals experience severe health consequences including death from COVID-19, others may exhibit mild to no symptoms [Citation3, Citation4]. In addition to a wide array of acute clinical manifestations, long COVID has become a debilitating chronic complication for many, and approximately 7.5% of US adults have reported long-term sequelae from the infection at least 3 months after initial infection [Citation5].

Early outpatient interventions, including transfusion of polyclonal convalescent plasma,[Citation6] anti-SARS-CoV-2 monoclonal therapy, [Citation7] Paxlovid, [Citation8] remote monitoring, [Citation9] and early administration of supplemental oxygen [Citation10] have been shown to improve COVID-19 outcomes, indicating the value of identifying individuals who are at risk for poor outcomes. While many attempts have been made to characterize the disease severity of COVID-19, predicting clinical outcomes remains challenging due to the wide heterogeneity and complexity of host and virus factors. A variety of human factors [Citation11–14] such as age, comorbidities, immune status, and urban-rural classification, [Citation15–24] and virus factors [Citation25, Citation26] including SARS-CoV-2 variants and mutations have been associated with outcomes. However, most studies analyze only host [Citation27–29] or virus factors independently or incorporate limited human and virus combined data [Citation30–33]. Further, while public databases provide a wealth of information and are often used, they suffer from inconsistencies in data collection, reporting, and breadth, limiting such analyses [Citation34, Citation35]. Without the integration of host, virus, and environment factors, understanding how to most effectively predict patient outcomes will continue to remain incomplete.

We conducted a retrospective cohort study to curate an integrated dataset that uniquely incorporates the epidemiologic triad containing virus, human, and environmental data. Our primary objective was to utilize this unique dataset and develop machine learning algorithms that can predict an individual’s likelihood of hospitalization, intensive care unit (ICU) admission, and long COVID. From this study, we constructed the COVID-19 Virus-Human Outcomes Prediction (ViHOP) model, a personalized clinical risk assessment tool for predicting COVID-19 clinical outcomes. Overall, our models have the potential to assist clinicians in identifying high-risk COVID-19 patients, facilitating a more personalized approach for earlier interventions and monitoring.

Methods

Summary of study design

We conducted a retrospective human cohort study, using secondary SARS-CoV-2-positive nasopharyngeal swab samples and host information collected from electronic health records (EHR). SARS-CoV-2-positive specimens were collected from University of Missouri Health Care (UMHC) and CoxHealth between July 1, 2020, and July 31, 2022. Overall, 4471 total samples comprising 4,450 unique individuals were collected. We performed whole genome sequencing on these viruses to extract virus characteristics, which included cycle threshold (ct)-value, amino acid substitutions, and variant name, and chart reviews from the EHR to extract human characteristics, which included demographics, comorbidities, vaccination history, and disease course (), environmental information (patient’s residential ZIP Code that was assigned a 4-level rural-urban commuting area (RUCA) classification: urban, rural as large, small and isolated town) [Citation36]. The combination of these virus, human, and environment data was used to develop a machine learning model to predict three key primary outcomes of COVID-19: hospitalization, ICU admission, and long COVID (). We use the CDC’s broad definition of long COVID as the continuation of signs, symptoms, and conditions at least four weeks after the initial COVID-19 infection [Citation37].

Figure 1. Features used for model training and selection. The model inputs included host, virus, and environment data. Host features were collected from electronic health records, virus features were collected using whole genome sequencing, and environmental features included the individual’s urban-rural classification based on their home ZIP Code. The model outcomes included hospital admission, ICU admission, long COVID, evaluated separately.

Figure 1. Features used for model training and selection. The model inputs included host, virus, and environment data. Host features were collected from electronic health records, virus features were collected using whole genome sequencing, and environmental features included the individual’s urban-rural classification based on their home ZIP Code. The model outcomes included hospital admission, ICU admission, long COVID, evaluated separately.

Ethical approval

This study has been reviewed by the University of Missouri Institutional Review Board (#2025449, #2049364).

Problem formulation

Utilizing the combination of host, virus, and environmental data, we developed a machine learning model, the ViHOP algorithm, to predict clinical outcomes. We formulated this as a classification type of machine learning problem, where the task was to classify whether or not an individual with COVID-19 was going to become hospitalized, admitted to the ICU, or experience long COVID. We defined an individual experiencing long COVID as those with a follow-up appointment that reported sustained COVID-19 symptoms between 4-8 weeks after the nasopharyngeal swab date in consistency with the CDC’s broad definition of long COVID as the continuation of signs, symptoms, and conditions at least four weeks after the initial COVID-19 infection [Citation37]. These outcomes were encoded as binary target variables where yes = 1 and no = 0. The models learned from training datasets (see Model Training and Validation) which included individuals who experienced these outcomes and those who did not. Predictor variables included host features such as demographics (age, sex, race, and ethnicity), comorbidities (pregnancy, immunosuppression, having any factors associated with metabolic syndrome (i.e. obesity, hypertension, diabetes, and hyperlipidaemia), cardiovascular disease, hepatic disease, chronic lung disease, or mental health disorder), vaccination (influenza or COVID-19 defined as vaccinated at least two weeks prior to swab collection), and COVID-19 symptoms and complications (i.e. ventilation or cardiopulmonary resuscitation required for management, sepsis, acute respiratory distress syndrome, or co-infection with another bacteria or virus during hospitalization), virus features including cycle threshold (ct)-values, variant, and amino acid substitutions, and environment features including urban-rural ZIP Code classification. These features are further detailed in Data Pre-Processing. Performance was evaluated using the area under the receiver operating characteristic (AUROC) score, which reflects how well the classifier distinguished between the positive and negative classes.

Data collection

We collected clinical, virus, and environmental data in association with secondary SARS-CoV-2-positive nasopharyngeal swab samples.

Sample collection

SARS-CoV-2-positive nasopharyngeal swab samples were collected from University of Missouri Health Care (UMHC) and CoxHealth inpatient, ambulatory, and emergency department encounters between July 1, 2020, and July 31, 2022. Samples were tested for SARS-CoV-2 in clinical diagnostic laboratories using SARS-CoV-2 RNA, molecular tests, rapid antigen tests, and nucleic acid amplification tests. Sample collection targeted three swabs per ZIP Code per week for temporospatial representation based on availability. Inclusion criteria involved a positive SARS-CoV-2 diagnostic test and available nasopharyngeal swab sample. Overall, 4,471 total samples from 4,450 individuals were collected including 20 individuals with two or more swabs. For the 20 individuals who had more than one swab within 60 days, the first swab with sufficient RNA for genome sequencing was included in the analyses. One individual had two swabs during two separate infections, and each was analyzed as a separate encounter.

Clinical and environmental data

Clinical and environmental data were extracted by the University of Missouri NextGen Biomedical Informatics Center from the electronic health records (EHR) of individuals associated with the collected samples and manually validated and further supplemented by the study team. All clinical and environment data were captured in a secure REDCap (Research Electronic Data Capture) database.

Virus data

Virus genome data were collected by performing whole genome sequencing on the viruses extracted from the nasopharyngeal swabs. The Juno system (Fluidigm Corporation, CA, USA) was used with 47 pairs of custom-designed specific primers targeting the reference sequence, Wuhan-Hu-1 (Accession Number: NC_045512.2) and paired with real-time reverse transcription polymerase chain reaction test (RT–PCR) to amplify whole virus genomes [Citation38]. Amplicon libraries were prepared using the Illumina DNA Prep kit, followed by sequencing with the MiSeq Reagent Kit v3 (600-cycle) and MiSeq sequencing system or the NovaSeq Reagent Kit SP v1.5 (300-cycle) on NovaSeq 6000 instrument (Illumina, San Diego, California, USA). Adapters were trimmed using the BBDuk software from the BBMap v.39.01 package [Citation39]. A k-mer of 17 was used with a maximum Hamming distance of 1, and trimming was restricted to the 30 terminal bases. Trimmed sequences were then assembled using the Iterative Refinement Meta-Assembler (IRMA)[Citation40] CoV module with a quality score of 20. The amended global alignments to the profile hidden Markov model (HMM) sequences (a2 m) were extracted as the consensus sequences, and sequences with a coverage breadth of ≥ 50% were included for subsequent analyses. Of the 4471 swabs collected, 2015 samples were adequately sequenced, representing 2014 individuals. The sequence quality data of whole genome sequencing for each sample are shown in Supplementary Table 1. Amino acid substitutions were extracted from the assembled consensus sequences using Nextclade v2.14.0 and SARS-CoV-2 reference dataset v2023-05-10T12:00:00Z [Citation41]. Sequences determined by NextClade to be “mediocre” or “poor” quality frameshifts were re-assembled using CLC Genomics Workbench v22.0.1, where reads were trimmed with a quality limit of 0.5. After reassembly, sequences still determined by NextClade to be of “poor” quality frameshifts were excluded from analyses. Variant names as defined by the World Health Organization for each sequence were also extracted from NextClade, and variant frequencies detected among study samples are summarized in Supplementary Table 2. Ct-values were determined using reverse transcription-polymerase chain reaction.

Data preprocessing and feature engineering

Datasets were preprocessed using feature encoding, imputation, and re-sampling.

Variables and feature encoding

Host features include demographics (age, sex, race, ethnicity, and urban-rural ZIP Code classification), comorbidities (pregnancy, immunosuppression, any factors associated with metabolic syndrome (i.e. obesity, hypertension, diabetes, and hyperlipidaemia), cardiovascular disease, hepatic disease, chronic lung disease, or mental health disorder), vaccination (influenza or COVID-19 defined as vaccinated at least two weeks prior to swab collection), and COVID-19 complications (i.e. ventilation or cardiopulmonary resuscitation required for management, sepsis, acute respiratory distress syndrome, or co-infection with another bacteria or virus during hospitalization). Individuals were considered partially vaccinated against COVID-19 if they began but did not complete the primary COVID-19 vaccine series, fully vaccinated if they completed the primary vaccine series, and boosted if they received additional shots after completing the primary COVID-19 vaccine series.

Urban-rural ZIP Code classifications were determined using secondary Rural-Urban Commuting Area (RUCA) codes defined by the Federal Office of Rural Health Policy (FORHP), U.S. Department of Agriculture’s Economic Research Service (ERS), and Washington, Wyoming, Alaska, Montana and Idaho (WWAMI) Rural Health Research Center [Citation42]. Each United States ZIP Code is assigned one of 33 RUCA codes based on population density, levels of urbanization and journey-to-work commuting. The WWAMI Rural Health Research Center further aggregates these RUCA codes into four categories: urban, large rural, small rural town, and isolated small rural town [Citation36]. Feature encoding strategies for each feature are detailed in Supplementary Table 3. Individuals who were hospitalized were screened for the reason for admission. Briefly, binary variables were left unchanged, age and ct-values were binned into ordinal categories through label encoding, and COVID-19 vaccinations and urban-rural classification, which we classified as ordinal categories were also encoded using label encoding. Individuals with COVID-19 reported as incidental findings were excluded from the analyses.

Virus features include amino acid substitutions throughout the consensus viral genome and viral load, represented by ct-value. Amino acid substitutions were identified and encoded as binary features (0 = absent, 1 = present). Missing or ambiguous bases were imputed as 0.5. Features and feature encoding are further defined and detailed in Supplementary Table 3. Further, the variant name of the virus was also used as a feature and analyzed separately from the amino acid substitutions to avoid collinearity.

We also sought to determine whether the co-occurrence of Spike gene substitutions may have affected outcomes and thus incorporated them by using pairwise, 1-norm, and 2-norm methods to generate additional datasets to evaluate optimal encoding of virus features. These feature combinations were analyzed as interaction terms and encoded as follows: Let i and j be two feature indices and Xi and Xj be their representing single features. We constructed the co-features of sites i and j by three strategies: pairwise product of vector Xi and Xj, 1-norm of vector Xi and Xj, and 2-norm of vector Xi and Xj. For example, if the vector of each single feature is Xi=(Xi1,Xi2,,XiK)T and Xj=(Xj1,Xj2,,XjK)T, then the corresponding pairwise product co-feature is represented as Xij=(Xi1Xj1,Xi2Xj2,,XiKXjK)T, the 1-norm co-feature is Xij=(abs(Xi1+Xj1),abs(Xi2+Xj2),,abs(XiK+XjK))T, the 2-norm co-feature is Xij=(sqrt((Xi1)2+(Xj1)2),;sqrt((Xi2)2+(Xj2)2),,sqrt((XiK)2+(XjK)2))T. For these co-features, we concentrated on residues that are in physical proximity under the assumption that these amino acids are more likely to interact and collaboratively influence the protein’s function. Our primary focus was on residues in the Spike protein. Feature interactions were determined based on the proximity of residues as mapped in the PDB 6VXX spike glycoprotein structure. Specifically, positions within 7 Angstroms (Å) of one another, denoting proximity,[Citation43] were included as interaction terms in our model. At 7 Å, the following interaction residue pairs were identified: 212-213, 221-222, 307-308, 371-373, 373-375, 375-376, 405-408, 439-440, 439-498, 452-493, 452-494, 493-494, 494-496, 496-498, 498-501, 501-505, 653-655, 950-954, and 981-982. The interaction pairs were analyzed as separate features.

Datasets

Clinical features included in the hospital admission datasets were demographics, vaccination history, urban-rural classification, comorbidities, and symptoms. Clinical features included in the ICU admission datasets were demographics, vaccination history, urban-rural classification, comorbidities, symptoms, ventilation requirements, sepsis, acute respiratory distress syndrome, and the presence of a co-infection with another virus and bacteria. Finally, clinical features included in long COVID datasets were demographics, vaccination history, urban-rural classification, comorbidities, symptoms, admission to the hospital, and admission to the ICU. To account for potential biases in feature selection, variables with fewer than 5 events-per-variable were eliminated [Citation44]. A total of seven datasets were generated and analyzed for each outcome: host-only, virus-only (amino acid substitutions), combined virus (amino acid substitutions) and host, combined virus (amino acid substitutions) and host with pairwise spike substitution pairs, combined virus (amino acid substitutions) and host with 1-norm spike substitution pairs, combined virus (amino acid substitutions) and host with 2-norm spike substitution pairs, and combined clinical and virus variant.

Missing data

Clinical data often suffer from missing data, and most robust machine learning models cannot handle missing data without additional processing. Filtering missing data introduces bias and decreases statistical power [Citation45]. To account for missing data while preserving all data points, we employed a state-of-the-art Multivariate Imputation by Chained Equations (MICE) [Citation46, Citation47] technique to impute clinical data using the assumption that the missing data were missing at random (MAR) [Citation45]. Imputed variables include sex, race, ethnicity, urban-rural classification, flu vaccination, COVID-19 vaccination, pregnancy, immunosuppression, metabolic syndrome factors, liver disease, mental health disorder, chronic lung disease, and symptom categories (respiratory, gastrointestinal, cardiac, and constitutional). Factors with minimal missing data, including age, sex, race, ethnicity, urban-rural classification, payer, flu vaccination, COVID-19 vaccination, pregnancy, immunosuppression, metabolic syndrome factors, liver disease, mental health disorder, chronic lung disease, and symptom categories (respiratory, gastrointestinal, cardiac, constitutional) were used to impute those variables. Ct-values were imputed using the average value for each lineage. Specific imputation techniques for each feature are further detailed in Supplementary Table 3. Imputation was performed on RStudio 2023.06.2 + 561 with R v4.2.3.

Balancing the datasets

Due to the low number of individuals experiencing poor outcomes (hospitalized, admitted to the ICU, or long COVID) compared with those who did not, we evaluated the effectiveness of re-sampling techniques on the training data [Citation48] to minimize the impacts of class imbalance: [Citation49, Citation50]. No resampling, random oversampling, and Synthetic Minority Oversampling TEchnique (SMOTE) [Citation51]. Re-sampling was performed using imblearn v0.11.0 in Python v3.7.10.

Feature selection

Each model was tested with and without feature selection. Feature selection is a strategy for dimensionality reduction and involves selecting the most significant predictors to construct a model. The Least Absolute Shrinkage and Selection Operator (Lasso) is a popular feature selection technique because it can select features while also imposing a penalty on the regression coefficients to regularize the model and mitigate overfitting [Citation52]. Lasso accomplishes this by shrinking coefficients towards zero, and the corresponding predictors are then removed from the model, enabling the identification of significant predictors for each model.

In our pipeline, we trained our models with and without feature selection. We used the Lasso class from the Scikit-learn v 1.2.1 linear_model module in Python v 3.10.9. We defined a range of alpha values (0.001, 0.01, 0.1, 1.0) to test different regularization strengths and performed hyperparameter tuning over these alpha values using GridSearchCV with a 10-fold cross-validation. Model performance was assessed using the area under the Receiver Operating Characteristic (AUROC) curve, then fitted to the training data. The AUROC score ranges from 0 to 1, with a higher AUROC score signifying better model performance. Predictors with non-zero coefficients from the best-performing LASSO estimator were subsequently used for model selection described in the next section. The same features were extracted from the testing dataset for model evaluation described in later sections.

For the best-performing model that did not employ feature selection using Lasso (hospitalization), feature importance was analyzed using SHapley Additive exPlanations (ShAP) values and visualized with a beeswarm plot. To identify the most predictive features for each outcome, we assessed feature importance with ShAP,[Citation53] which quantifies each feature’s contribution to a model’s performance. In the context of our models, a positive ShAP value indicates a positive prediction (i.e. presence of the tested outcome), whereas a negative ShAP value denotes a negative prediction (i.e. absence of the outcome). The magnitude of the ShAP value signifies the impact of the feature on the prediction. We computed and visualized feature importance using Python v3.7.10 and the shap library v0.42.1.

Model selection

We evaluated seven commonly used machine learning classification models for predicting binary clinical outcomes. Mathematically, we aimed to identify a function Y=f(X:Θ), where Y represents the discrete outcomes for one of three clinical outcomes, X is the input feature data, and Θrepresents the model parameters.

Model selection

The tested classifiers included Random Forest, Gradient Boosting, Logistic Regression (including LASSO and Ridge Regression), Naïve Bayes (Gaussian), Decision Tree, Neural Network (Multi-layer Perceptron (MLP)), K-Nearest Neighbors, Support Vector, and Extra Trees. The algorithms were selected based on their demonstrated utility in healthcare prediction models [Citation54–57] and evaluated using Scikit-Learn [Citation58]. Random Forest, which has been commonly used for predicting hospitalizations [Citation59] and ICU admission [Citation60–62], is a decision tree-based model that can run multiple parallel trees. This technique averages predictions of all trees generated, allowing for a reduction of overfitting [Citation63]. Simple decision trees and extra trees were also tested [Citation64]. Gradient Boosting, which has been shown to be effective for predicting ICU admission in the emergency department [Citation65], utilizes multiple weak prediction models, typically decision trees, and combines them to generate a more robust model [Citation66]. Logistic regression predicts the dependent categorical variable as a function of the independent variables. Regularization techniques for logistic regression such as Lasso and Ridge regressions can help with overfitting and were also tested [Citation67, Citation68]. Lasso in particular has been used by multiple groups for predicting the disease severity of hospitalized patients [Citation69, Citation70]. Ridge regression has been used to predict hospitalization outcomes for COVID-19 [Citation71] among other clinical applications. Naïve Bayes (Gaussian) utilizes the Bayes theorem with the “naïve” assumption of feature independence [Citation72]. Support vector classification [Citation73] identifies a hyperplane that optimizes the separation of classes in a feature space and can handle high-dimensional data and manage overfitting. K-Nearest Neighbors [Citation74] (KNN) classifies a data point based on the majority class among its nearest neighbours in a feature space. Finally, multi-layer Perceptron (MLP) is a neural network-based technique that forms multiple layers of interconnected neurones to learn complex relationships and can learn non-linear models [Citation75].

Parameter tuning

We tuned hyperparameters using GridSearchCV in sklearn with 10-fold cross-validation, allowing us to test all parameter combinations in a predefined parameter grid (Supplementary Table 4) [Citation58]. We adapted parameter grids in consistency with Subudhi et al. [Citation65].

Model training and evaluation

Each of the datasets was split into stratified (for balanced output classes) and shuffled 80% training and 20% testing datasets. After feature selection and resampling of the training data, we evaluated the machine learning classifiers described in the preceding section to identify and optimize the best models for predicting each outcome. Model development and evaluation were performed using Python v3.7.10, numpy v1.21.6, pandas v1.3.5, and sklearn v1.0.2 [Citation58]. Visualizations for AUROCs were generated using matplotlib v3.3.3.

Model training

Each model was subjected to a grid search over a range of hyperparameters listed in Supplementary Table 4 using GridSearchCV with 10-fold cross-validation and optimized for the AUROC. To evaluate the performance of the optimized model, we used the predictors from the test dataset to predict the outcomes and compare them with the target variables.

Model evaluation

The primary evaluation metric used to select the best performing model was the Area Under the Receiver Operating Characteristic (AUROC) score,[Citation76] which measures a model’s ability to distinguish between classes with an AUROC score of 0.5 suggesting no discrimination ability, 0.7-0.8 is considered acceptable, 0.8-0.9 is excellent, and >0.9 as outstanding discrimination ability. Additional performance metrics included accuracy, sensitivity/recall, specificity, precision, and F1-score were calculated. Performance scores for each combination of models are reported in Supplementary Table 5 for hospitalization, Supplementary Table 6 for ICU admission, and Supplementary Table 7 for long COVID. The best-performing model was selected based on the model with the highest AUROC score and specificity and sensitivity > 0.6.

Statistical analysis

To improve the interpretability of our model, we performed multivariable logistic regression analyses for each outcome by examining features selected from the best-performing model using the “sm” module from the Python statsmodels.api package and visualized as forest plots described below. The odds ratios (OR) and adjusted odds ratios (aOR) were calculated with 95% confidence intervals (CI). Significance was defined as p-value < 0.05 and 95% confidence intervals that do not include the value of 1. Subsequent forest plots were constructed using the “plt” module from the Python matplotlib.pyplot package.

Results

Study population

During the study period between July 2020 and July 2022, we collected 4,471 COVID-19-positive nasopharyngeal specimens representing 4,450 individuals. The characteristics of our study population are summarized in . The average age of our study population was 38.3 years (standard deviation [SD] = 21.4). Females represented 55.2% (2,453 individuals). The vast majority (N = 3,830, 88.6%) of our study population was White, with 350 (8.1%) Black, and 43 (1%) Asian, representative of the statewide demographics [Citation77]. Only 454 (17.9%) individuals had completed the primary COVID-19 vaccine series prior to their swab date. Individuals living in urban areas comprised 63.0% (2,782) of the study population, those in large rural towns comprised 17.5% (771), small rural towns comprised 10.4% (458), and isolated small rural towns comprised 9.2% (407).

Table 1. Study population.

Between July 15, 2020–July 31, 2022, Missouri reported a total of 1,054,060 hospitalizations and 241,979 ICU admissions with hospital encounters peaking in December 2020, August 2021, and January 2022 (Supplementary Figure 1A) [Citation78]. Our study population followed the statewide trends; 223 individuals were hospitalized and 100 were further admitted to the ICU (Supplementary Figure 1A). The predominant variants detected among sequences submitted to GISAID [Citation79, Citation80] and among our study samples were wild type, Alpha, Delta, and Omicron, and the peaks of each variant among our study samples were likewise consistent with those in the public database (Supplementary Figure 1B, Supplementary Table 2), highlighting the representative and generalizable nature of our dataset. Overall, our study population represents individuals from two hospital systems in Missouri with disease and virus characteristics that reflect statewide trends, allowing us to use this study population to generate generalizable predictive models.

ViHOP model and risk factors associated with COVID clinical outcomes

Using this representative study population, we identified 4450 individuals with clinical data, 2128 individuals with sufficient virus sequencing for hospitalization and ICU admission prediction, and 1798 individuals who had sufficient genome data to predict long COVID. We applied ViHOP to these datasets and obtained AUROC scores of 0.87, 0.98, and 0.79 for hospitalization, ICU admission, and long COVID outcomes, respectively. Detailed performance metrics for these models are presented in , and statistical analyses of each selected feature are represented in . Additional details for all tested models are available in Supplementary Figure 2 and Supplementary Tables 5–7.

Figure 2. Statistical Analyses of Predictors. Evaluation of statistical associations between the selected predictive features and (A) hospitalization, (B) ICU admission, and (C) long COVID. Each dot represents the odds ratio, and each line represents the 95% confidence interval (CI) for each feature. Statistically significant associations are highlighted in red and defined as a 95% CI that does not overlap 1 (illustrated by the vertical dotted line). The left panel illustrates unadjusted odds ratios, and the right panel illustrates odds ratios adjusted for age, sex, and COVID-19 vaccination. ARDS, acute respiratory distress syndrome.

Figure 2. Statistical Analyses of Predictors. Evaluation of statistical associations between the selected predictive features and (A) hospitalization, (B) ICU admission, and (C) long COVID. Each dot represents the odds ratio, and each line represents the 95% confidence interval (CI) for each feature. Statistically significant associations are highlighted in red and defined as a 95% CI that does not overlap 1 (illustrated by the vertical dotted line). The left panel illustrates unadjusted odds ratios, and the right panel illustrates odds ratios adjusted for age, sex, and COVID-19 vaccination. ARDS, acute respiratory distress syndrome.

Table 2. Best performing model metrics.

Our first objective was to construct the best model to predict hospitalization due to COVID-19 for an individual infected with SARS-CoV-2. The best-performing algorithm was a Support Vector Classifier (SVC) trained on the combined host, virus substitutions (normalized using 1-norm, see Methods), and environment dataset after random oversampling, which performed with an AUROC score of 0.87 (). A set of clinical factors were identified to be positively associated with hospitalization, including immunosuppression (aOR = 5.66; a 95% CI [3.10, 10.33]), cardiovascular disease (3.24; [2.01, 5.22]), older age (2.48; [2.02, 3.04]), and the presence of cardiac (3.00; [1.77, 5.06]), gastrointestinal (2.48; [1.68, 3.67]), and constitutional (2.68; [1.59 to 4.51]) symptoms ((A)). Rural residence (based on ZIP Code) (1.23; [1.04, 1.47]), was positively associated with hospitalization, compared to urban residence. Genetic factors such as amino acid substitutions membrane (M):I82 T (2.13; [1.28, 3.56]), spike (S):T19R (2.11; [1.35, 3.31]), open reading frame(ORF)1a:V2930L (2.79; [1.66, 4.70]), ORF7a:V82A (2.02; [1.33, 3.06]), and ORF8:S24L (0.60; [0.36, 1.00]) were also positively associated with hospitalization.

Next, we optimized our model to predict ICU admission among individuals testing positive for COVID-19. The best-performing algorithm was NaiveBayes on the combined host, environment, and virus variant dataset after feature selection on the training data and no resampling, which performed with an AUROC score of 0.98 (). For ICU admissions, the clinical factors most strongly associated with a positive correlation include experiencing acute respiratory distress syndrome (ARDS) during hospitalization (257.75; [116.88, 568.40]), requiring ventilation (151.49; [82.35, 278.70]), and co-infection with bacteria (16.49; [8.88, 30.62]) ((B)). Similar to hospital admissions, residing in more rural areas was associated with a higher likelihood of ICU admission (1.44; [1.21, 1.72]) compared to residing in more urban areas. Interestingly, SARS-CoV-2 wild-type viruses were negatively associated with ICU admissions (0.24; [0.14 to 0.43]) in comparison to the other variants (Alpha, Beta, Delta, Gamma, Omicron, and Epsilon).

Finally, we predicted long COVID among individuals testing positive for COVID-19. The best-performing model utilized the combined host, environment, and virus variant dataset after feature selection on the training data, random oversampling, followed by ridge regression. The AUROC score was 0.79 (). Significant predictive features of long COVID included hospital admission (5.02; [3.36, 7.48]), ventilation (4.74; [3.12, 7.20]) and male sex (0.69; [0.50, 0.96]). The COVID-19 epsilon variant (aOR = 0.40, 95% CI = 0.18, 0.86) was less associated with long COVID ((C)). In contrast to hospitalization and ICU admission, urban-rural classification did not show a statistically significant association with long COVID.

By combining these findings, our results showed common significant features across outcomes (). For example, factors such as age and comorbidities associated with metabolic syndrome are common attributes of SARS-CoV-2 infections influencing hospitalization, ICU admission, or long COVID. Chronic lung disease is associated with increased likelihood of hospitalization and long COVID; immunosuppression and rural residence are associated with increased likelihood of both hospitalization and ICU admission; virus variant and ventilation requirements are associated with long COVID and ICU admission.

Figure 3. Significant predictors associated with disease outcomes. Features selected from the best-performing models for each outcome (hospitalization, ICU admission, and long COVID) were further analyzed using conventional logistic regression, and significant features were shown. Significant predictive features of multiple outcomes are shown between the respective outcome panels. ARDS, acute respiratory distress syndrome; Symptoms listed under hospitalization included gastrointestinal, constitutional, and cardiac symptoms.

Figure 3. Significant predictors associated with disease outcomes. Features selected from the best-performing models for each outcome (hospitalization, ICU admission, and long COVID) were further analyzed using conventional logistic regression, and significant features were shown. Significant predictive features of multiple outcomes are shown between the respective outcome panels. ARDS, acute respiratory distress syndrome; Symptoms listed under hospitalization included gastrointestinal, constitutional, and cardiac symptoms.

Discussion

Personalized medicine, a precision medicine approach, integrates human and environmental factors to tailor disease prevention and treatment and improve patient outcomes. Predominantly implemented in oncology, personalized medicine has had limited implementation in infectious disease [Citation81]. The widely heterogenous constellation of symptoms and complications of microbial infections depends on a multitude of host, virus, and environmental factors, making predicting at-risk individuals challenging. In this study, we developed COVID-19 ViHOP, an ensemble of machine learning algorithms that effectively assesses an individual’s clinical outcome risks, of which we focused on hospitalization, ICU admission, and long COVID. Our findings suggest that the ViHOP portfolio could be useful for clinicians to identify at-risk COVID-19 patients for targeted interventions and lay the groundwork at the frontiers of personalized medicine in infectious diseases.

It is well documented that host factors (e.g. age, comorbidities, and immune status), [Citation11–14] urban-rural classification, [Citation15–24] and viral factors (e.g. SARS-CoV-2 variants and mutations) [Citation25, Citation26, Citation82–85] have been linked to patient outcomes. However, most studies tend to focus exclusively on either host or virus factors, or they incorporate only a limited combination of human and viral data [Citation27–33]. In this study, we developed and applied the ViHOP model and attempted to assess whether integrating clinical, viral, and environmental characteristics enhances the predictive capabilities for three key outcomes: hospitalization, ICU admission, and long COVID. Our holistic analyses revealed that all three types of factors – clinical, viral, and environment – are associated with hospitalization and ICU admission, and both clinical and viral factors are associated with all three outcomes, although the specific combination of factors varied depending on the outcome ( and ). Immunosuppression was associated with hospitalization and ICU admission, whereas cardiovascular disease and the appearance of COVID-19 symptoms uniquely predicted hospitalization. Chronic lung disease predicted hospitalization and long COVID. Hospitalization, mental health disorders, female sex, and respiratory and cardiac symptoms were associated with long COVID. Finally, requiring ventilation predicted both ICU admission and long COVID, whereas having ARDS or a bacterial co-infection uniquely predicted ICU admission. The clinical and environmental factors we identified align with those previously reported in the literature [Citation7, Citation86, Citation87]. The consistency of our findings with previously identified factors further supports the strength and validity of our models.

Through further analyses, the amino acid substitutions associated with hospitalization risks, such as ORF1a:V2930L, ORF7a:V82A, M:I82 T, and S:T19R, were predominantly located within Delta variants and ORF8:S24L was predominantly located within Wild Type and Epsilon variants in our dataset although these amino acid substitutions were still circulating in human populations until late 2023 (Supplementary Figure 3) Of these amino acid substitutions, S:T19R is mapped to the “supersite” of the N-terminal domain (NTD), which is targeted by anti-NTD neutralizing antibodies, suggesting a potential role in immune evasion [Citation88]. Of interest, the occurrence of amino acid substitutions coincides with the number of hospitalizations in Missouri (Supplementary Figures 1 and 3). Nevertheless, additional studies are needed to understand the mechanisms by which these mutations influence clinical outcomes. In contrast to predicting hospitalization, the COVID-19 variant was sufficient for predicting long COVID and ICU admission. Overall, these findings indicate that, by combining with clinical and residential location factors, basic variant subtyping may be effective for predicting ICU admissions and long COVID, whereas high-resolution whole genome sequencing provides more benefits in assessing hospitalization risks.

Of importance, we saw that urban-rural classification was selected as a predictor for all three outcomes, and living in more urban areas was less associated with both hospital and ICU admission compared to rural residency, supporting that residential aspects such as less access to healthcare and a culture of less healthcare-seeking behaviours may serve an important role in disease outcomes [Citation89, Citation90]. Rural populations have experienced higher incidences of COVID-19 disease and mortality [Citation18–24, Citation89, Citation91, Citation92]. Rural communities have also been shown to be an important source of virus evolution and spread [Citation93]. Our findings indicate the need for targeted attention and interventions specifically tailored to rural communities.

Our study has potential limitations inherent in the use of electronic health data, such as small sample sizes, possible misclassification, and instances of missing data [Citation94]. To mitigate these potential issues, we employed a combination of automatic data extraction and manual validation to enhance the robustness and reliability of the data used in the analyses. In addition, the majority of data used in this study are from the state of Missouri, and the total number of patients, particularly ICU cases, is relatively small. To minimize the impacts of imbalances in datasets, we implemented oversampling and data imputation methods during the model development phase, as detailed in the Supplementary Methods. Additionally, it is important to note that our study relied on secondary sampling of COVID-19-positive nasopharyngeal swabs, leading to a bias towards individuals who sought healthcare. This group often includes those who are symptomatic, have sufficient healthcare coverage, and reside near healthcare centres. Consequently, this approach may lead to an underestimation of the total number of cases particularly from rural areas. Further refinement of the ViHOP models would benefit from additional longitudinal and prospective multi-centre cohort studies. These studies could provide more comprehensive data, enhancing the accuracy and generalizability of the models.

In summary, this study introduces COVID-19 ViHOP, a portfolio of personalized clinical risk assessment models that can predict the probability of hospitalization, ICU admission, and long COVID for individuals who test positive for COVID-19. This information has the potential to allow clinicians to identify patients with COVID-19 who are at high risk of developing poor outcomes and make informed and personalized decisions for additional monitoring and interventions. While we focused on three primary outcomes, this approach has the potential to be expanded to additional outcome measures such as mortality. Furthermore, this approach can be applied to emerging COVID-19 variants, and we expect to continue updating the models as new SARS-CoV-2 variants emerge. Finally, this model may have utility for other infectious agents. With its ability to integrate multiple data types to predict an individual’s probability of poor outcomes, this tool may serve as a valuable platform in the field of personalized medicine for infectious diseases.

Data sharing statement

All virus genome data is publicly available on GISAID as the following accession IDs: [Submission in progress]. De-identified clinical data used in this study, the data dictionary, and relevant codes are accessible in the supplementary files.

Supplemental material

Supplemental Material

Download Zip (11.7 MB)

Supplemental Material

Download MS Word (1.1 MB)

Supplemental Material

Download MS Excel (144.7 KB)

Acknowledgements

CRediT Author Statement: Conceptualization: Xiu-Feng Wan, Cynthia Y. Tang; Methodology: Cynthia Y. Tang, Xiu-Feng Wan; Software: Cynthia Y. Tang, Cheng Gao; Validation: Cynthia Y. Tang, Xiu-Feng Wan; Formal analysis: Cynthia Tang, Cheng Gao; Investigation: Cynthia Tang, Cheng Gao, Kritika Prasai, Tao Li, Shreya Dash; Resources: Xiu-Feng Wan; Data curation: Cynthia Y. Tang; Writing-Original Draft: Cynthia Y. Tang; Writing-Review & Editing: Xiu-Feng Wan, Jane A. McElroy, Cheng Gao, Kritika Prasai, Shreya Dash, Tao Li, Jun Hang; Visualization: Cynthia Y. Tang; Supervision: Xiu-Feng Wan; Project administration: Xiu-Feng Wan, Cynthia Y. Tang, Jun Hang, and Jane A. McElroy; Funding acquisition: Xiu-Feng Wan, Cynthia Y. Tang.

All authors had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. The authors have no conflicts of interest to report.

We would like to acknowledge Detlef Ritter, Richard D. Hammer, Christopher Schulze, Robin Trotman, Michelle Beckwith, and Simone Camp for their assistance in sample identification and sample collection. The authors are also grateful to Naser Ashiekh, Paige Beauparlant, Valerie Brownfield, Kate Burke, Paul Cho, Natalie Collins, Sabrina Duong, Federica Federzoni, John French, Christina Frymire, Minhui Guan, Tricia Haynes, Tyler Hartman, Ying He, Nicole Hitchcock, Yaswitha Jampani, Lindsay Juriga, Kyle Kaletka, Stefan Keller, Grace Lidl, Lydiah Mpyisi, Peyton Ogle, Rebecca Patterson, Brittany Pendergraft, Samantha Peters, Hailey Ramirez, Madeline Robertson, George Sarafianos, Christopher Schulze, Karen Segovia, Jamie Smith, Lindsay Staudt, Megan Sweeney, Skylar Sutton, Breanna Tuhlei, Nuha Wareg, Qiongying Yang, and Mingyi Zhou for their help with sample collection, RNA extraction, genomic sequencing, data collection, data pre-processing, and/or visualization. Additionally, we thank David Gozal, Daniel Hoft, Eduardo Simoes, and Timothy Matisziw for their discussion and support.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This project was supported by startup funding to Xiu-Feng Wan from the University of Missouri-Columbia. Cynthia Tang was supported by the National Institute of Allergy and Infectious Diseases and National Library of Medicine of the National Institutes of Health under award numbers F30AI172230 and 5T32LM012410. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors have no conflicts of interest to report.U.S. National Library of Medicine

References

  • WHO Coronavirus (COVID-19) Dashboard. Updated June 28, 2023. [accessed July 4, 2023]. https://covid19.who.int/.
  • COVID Data Tracker. Updated June 24, 2023. [accessed 2023 July 4]. https://covid.cdc.gov/covid-data-tracker/#datatracker-home.
  • Hu B, Guo H, Zhou P, et al. Characteristics of SARS-CoV-2 and COVID-19. Nat Rev Microbiol. 2021;19(3):141–154. doi:10.1038/s41579-020-00459-7.
  • da Rosa Mesquita R, Francelino Silva Junior LC, Santos Santana FM, et al. Clinical manifestations of COVID-19 in the general population: systematic review. Wien Klin Wochenschr. 2021;133(7-8):377–382. doi:10.1007/s00508-020-01760-4.
  • Long COVID. Updated May 17, 2023. [accessed 2023 July 4]. https://www.cdc.gov/nchs/covid19/pulse/long-covid.htm.
  • Sullivan DJ, Gebo KA, Shoham S, et al. Early outpatient treatment for COVID-19 with convalescent plasma. N Engl J Med. 2022;386(18):1700–1711. doi:10.1056/NEJMoa2119657.
  • Yek C, Warner S, Wiltz JL, et al. Risk factors for severe COVID-19 outcomes Among persons aged ≥18 years Who completed a primary COVID-19 vaccination series — 465 health care facilities, United States, December 2020–October 2021. MMWR Morb Mortal Wkly Rep. 2022;71(1):19–25. doi:10.15585/mmwr.mm7101a4.
  • Shah MM, Joyce B, Plumb ID, et al. Paxlovid associated with decreased hospitalization rate among adults with COVID-19 — United States, April–September 2022. American Journal of Transplantation: Official Journal of the American Society of Transplantation and the American Society of Transplant Surgeons. 2023;23(1):150–155. doi:10.1016/j.ajt.2022.12.004.
  • Crotty BH, Dong Y, Laud P, et al. Hospitalization outcomes among patients with COVID-19 undergoing remote monitoring. JAMA Network Open. 2022;5(7):e2221050, doi:10.1001/jamanetworkopen.2022.21050.
  • Goyal DK, Mansab F, Iqbal A, et al. Early intervention likely improves mortality in COVID-19 infection. Clin Med (Lond). 2020;20(3):248–250. doi:10.7861/clinmed.2020-0214.
  • Russo A, Pisaturo M, Zollo V, et al. Obesity as a risk factor of severe outcome of COVID-19: a pair-matched 1:2 case-control study. J Clin Med. 2023;12(12):4055, doi:10.3390/jcm12124055.
  • Jacob J, Tesch F, Wende D, et al. Development of a risk score to identify patients at high risk for a severe course of COVID-19. Z Gesundh Wiss. 2024: 1–10. doi:10.1007/s10389-023-01884-7.
  • Ko JY, Danielson ML, Town M, et al. Risk factors for coronavirus disease 2019 (COVID-19)-associated hospitalization: COVID-19-associated hospitalization surveillance network and behavioral risk factor surveillance system. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America. 2021;72(11):e695–e703. doi:10.1093/cid/ciaa1419.
  • Rathod D, Kargirwar K, Patel M, et al. Risk factors associated with COVID-19 patients in India: a single center retrospective cohort study. J Assoc Physicians India. Jun 2023;71(6):11–12. doi:10.5005/japi-11001-0263.
  • Gina Turrini DKB, Chen L, Conmy AB, et al. Access to affordable care in rural America: Current trends and key challenges (Research ReportNo. HP-2021-16). July 2021. [accessed 2022 November 1]. https://aspe.hhs.gov/sites/default/files/2021-07/rural-health-rr.pdf.
  • Kadri SS, Simpson SQ. Potential implications of SARS-CoV-2 delta variant surges for rural areas and hospitals. JAMA. 2021;326(11):1003–1004. doi:10.1001/jama.2021.13941.
  • Cuadros DF, Moreno CM, Musuka G, et al. Association between vaccination coverage disparity and the dynamics of the COVID-19 delta and omicron waves in the US. Front Med (Lausanne). 2022;9:898101, doi:10.3389/fmed.2022.898101.
  • Cuadros DF, Branscum AJ, Mukandavire Z, et al. Dynamics of the COVID-19 epidemic in urban and rural areas in the United States. Ann Epidemiol. 2021;59:16–20. doi:10.1016/j.annepidem.2021.04.007.
  • Mueller JT, McConnell K, Burow PB, et al. Impacts of the COVID-19 pandemic on rural America. Proc Natl Acad Sci USA. 2021;118(1):2019378118, doi:10.1073/pnas.2019378118.
  • Huang Q, Jackson S, Derakhshan S, et al. Urban-rural differences in COVID-19 exposures and outcomes in the South: a preliminary analysis of South Carolina. PLoS One. 2021;16(2):e0246548, doi:10.1371/journal.pone.0246548.
  • Melvin SC, Wiggins C, Burse N, et al. The role of public health in COVID-19 emergency response efforts from a rural health perspective. Prev Chronic Dis. 2020: 17, doi:10.5888/pcd17.200256.
  • Dunne EM, Maxwell T, Dawson-Skuza C, et al. Investigation and public health response to a COVID-19 outbreak in a rural resort community—Blaine County, Idaho, 2020. PLoS One. 2021;16(4):e0250322, doi:10.1371/journal.pone.0250322.
  • Ramírez IJ, Lee J. COVID-19 emergence and social and health determinants in Colorado: a rapid spatial analysis. Int J Environ Res Public Health. 2020;17(11):3856, doi:10.3390/ijerph17113856.
  • Sylvia KO, Chigozie AO, Seoyon K, et al. SARS-CoV-2 transmission potential and rural-urban disease burden disparities across Alabama, Louisiana, and Mississippi, March 2020–May 2021. Ann Epidemiol. 2022;71:1–8. doi:10.1016/j.annepidem.2022.04.006.
  • SARS-CoV-2 variants of concern as of 4 August 2022. European Centre for Disease Prevention and Control. Updated August 5, 2022. [accessed 2022 August 9]. https://www.ecdc.europa.eu/en/covid-19/variants-concern.
  • Lin L, Liu Y, Tang X, et al. The disease severity and clinical outcomes of the SARS-CoV-2 variants of concern. Front Public Health. 2021;9:775224, doi:10.3389/fpubh.2021.775224.
  • Xiong Y, Ma Y, Ruan L, et al. Comparing different machine learning techniques for predicting COVID-19 severity. Infect Dis Poverty. 2022;11(1):19, doi:10.1186/s40249-022-00946-4.
  • Iaccarino G, Grassi G, Borghi C, et al. Age and multimorbidity predict death among COVID-19 patients. Hypertension. 2020;76(2):366–372. doi:10.1161/HYPERTENSIONAHA.120.15324.
  • Chieregato M, Frangiamore F, Morassi M, et al. A hybrid machine learning/deep learning COVID-19 severity predictive model from CT images and clinical data. Sci Rep. 2022;12(1):4329, doi:10.1038/s41598-022-07890-1.
  • Nagy Á, Ligeti B, Szebeni J, et al. COVIDOUTCOME-estimating COVID severity based on mutation signatures in the SARS-CoV-2 genome. Database: The Journal of Biological Databases and Curation. 2021: 2021, doi:10.1093/database/baab020.
  • Nagy Á, Pongor S, Győrffy B. Different mutations in SARS-CoV-2 associate with severe and mild outcome. Int J Antimicrob Agents. 2021;57(2):106272, doi:10.1016/j.ijantimicag.2020.106272.
  • Huang F, Chen L, Guo W, et al. Identifying COVID-19 severity-related SARS-CoV-2 mutation using a machine learning method. Life (Basel, Switzerland). 2022;12(6):806, doi:10.3390/life12060806.
  • Beguir K, Skwark MJ, Fu Y, et al. Early computational detection of potential high-risk SARS-CoV-2 variants. Comput Biol Med. 2023;155:106618, doi:10.1016/j.compbiomed.2023.106618.
  • Rinette B, Kierste M, Chris P, et al. Challenges in reported COVID-19 data: best practices and recommendations for future epidemics. BMJ Global Health. 2021;6(5):e005542, doi:10.1136/bmjgh-2021-005542.
  • Gozashti L, Corbett-Detig R. Shortcomings of SARS-CoV-2 genomic metadata. BMC Res Notes. 2021;14(1):189), doi:10.1186/s13104-021-05605-9.
  • RUCA Data [accessed June 12, 2023]. https://depts.washington.edu/uwruca/ruca-uses.php
  • Long COVID or Post-COVID Conditions. Updated December 16, 2022. [accessed July 4, 2023]. https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html.
  • Li T, Chung HK, Pireku PK, et al. Rapid high-throughput whole-genome sequencing of SARS-CoV-2 by using one-step reverse transcription-PCR amplification with an integrated microfluidic system and next-generation sequencing. 2021;59(5):e02784–20. doi:10.1128/JCM.02784-20
  • BBtools. Version 39.01. Joint Genome Institute [accessed 2023 February 21]. sourceforge.net/projects/bbmap/.
  • Shepard SS, Meno S, Bahl J, et al. Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler. BMC Genomics. 2016;17(1):708, doi:10.1186/s12864-016-3030-6.
  • Hadfield J, Megill C, Bell SM, et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34(23):4121–4123. doi:10.1093/bioinformatics/bty407.
  • Rural-Urban Commuting Area Codes. Updated March 22, 2023. [accessed 2023 July 4]. https://www.ers.usda.gov/data-products/rural-urban-commuting-area-codes/.
  • Magazine N, Zhang T, Wu Y, et al. Mutations and evolution of the SARS-CoV-2 spike protein. Viruses. 2022;14(3):640, doi:10.3390/v14030640.
  • Cowley LE, Farewell DM, Maguire S, et al. Methodological standards for the development and evaluation of clinical prediction rules: a review of the literature. Diagnostic and Prognostic Research. 2019;3(1):16, doi:10.1186/s41512-019-0060-y.
  • Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ (Clinical Research ed). 2009;338:b2393, doi:10.1136/bmj.b2393.
  • Azur MJ, Stuart EA, Frangakis C, et al. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011;20(1):40–49. doi:10.1002/mpr.329.
  • Mera-Gaona M, Neumann U, Vargas-Canas R, et al. Evaluating the impact of multivariate imputation by MICE in feature selection. PLoS One. 2021;16(7):e0254720, doi:10.1371/journal.pone.0254720.
  • Santos MS, Soares JP, Abreu PH, et al. Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [research frontier]. IEEE Comput Intell Mag. 2018;13(4):59–76. doi:10.1109/MCI.2018.2866730.
  • Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter. 2004;6(1):20–29. doi:10.1145/1007730.1007735.
  • He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–1284. doi:10.1109/TKDE.2008.239.
  • Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–357. doi:10.1613/jair.953.
  • Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996;58(1):267–288. doi:10.1111/j.2517-6161.1996.tb02080.x.
  • Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Presented at: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017; {Red Hook, NY, USA}.
  • Mavrogiorgou A, Kiourtis A, Kleftakis S, et al. A catalogue of machine learning algorithms for healthcare risk predictions. Sensors. 2022;22(22):8615, doi:10.3390/s22228615.
  • Keser SB, Keskin K. A gradient boosting-based mortality prediction model for COVID-19 patients. Neural Computing and Applications. 2023;35(33):23997–24013. doi:10.1007/s00521-023-08997-w.
  • Uddin S, Khan A, Hossain ME, et al. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 2019;19(1):281, doi:10.1186/s12911-019-1004-8.
  • Zapata RD, Huang S, Morris E, et al. Machine learning-based prediction models for home discharge in patients with COVID-19: development and evaluation using electronic health records. PLoS One. 2023;18(10):e0292888, doi:10.1371/journal.pone.0292888.
  • Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. The Journal of Machine Learning Research. 2011;12:2825–2830.
  • Fau LC, Lyu T, Weissman S, et al. Early prediction of COVID-19 associated hospitalization at the time of CDC contact tracing using machine learning: towards pandemic preparedness. LID - rs.3.rs-3213502 [pii]. doi:10.21203/rs.3.rs-3213502/v1
  • Fernandes FT, de Oliveira TA, Teixeira CE, et al. A multipurpose machine learning approach to predict COVID-19 negative prognosis in São Paulo, Brazil. (2045-2322 (Electronic)).
  • Cheng FY, Joshi HA-O, Tandon P, et al. Using machine learning to predict ICU transfer in hospitalized COVID-19 patients. LID - 1668. (2077-0383 (Print)). doi:10.3390/jcm9061668
  • Huang HF, Liu Y, Li JX, et al. Validated tool for early prediction of intensive care unit admission in COVID-19 patients. (2307-8960 (Print)).
  • Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324.
  • Breiman L, Jerome HF, Richard AO, et al. Classification and regression trees. Biometrics. 1984;40:874.
  • Subudhi S, Verma A, Patel AB, et al. Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19. NPJ Digit Med. 2021;4(1):87, doi:10.1038/s41746-021-00456-x.
  • Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001: 1189–1232.
  • Kim SJ, Koh K, Lustig M, et al. An interior-point method for large-scale $\ell_1$-regularized least squares. IEEE J Sel Top Signal Process. 2007;1(4):606–617. doi:10.1109/JSTSP.2007.910971.
  • Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67. doi:10.1080/00401706.1970.10488634.
  • Fu Y, Zhong W, Liu T, et al. Early Prediction Model for Critical Illness of Hospitalized COVID-19 Patients Based on Machine Learning Techniques. (2296-2565 (Electronic)).
  • Zhang S, Huang S, Liu J, et al. Identification and validation of prognostic factors in patients with COVID-19: a retrospective study based on artificial intelligence algorithms. Journal of Intensive Medicine. 2021;1(2):103–109. doi:10.1016/j.jointm.2021.04.001.
  • Koutroulos MV, Bakola SA, Kalpakidis S, et al. The MaD-CLINYC score: an easy tool for the prediction of the outcome of hospitalized COVID-19 patients. Hippokratia. 2021;25(3):119–125.
  • Harry Z. The optimality of naive bayes. 2004; https://api.semanticscholar.org/CorpusID:8891634.
  • Chang C-C, Lin C-J. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):1–27. doi:10.1145/1961189.1961199.
  • Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–27. doi:10.1109/TIT.1967.1053964.
  • David ER, Geoffrey EH, Ronald JW. Learning representations by back-propagating errors. Nature. 1986;323:533–536. doi:10.1038/323533a0.
  • Jayawant NM. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010;5(9):1315–1316. doi:10.1097/JTO.0b013e3181ec173d.
  • QuickFacts Missouri. [accessed July 15, 2023]. https://www.census.gov/quickfacts/fact/table/MO/PST045222
  • Data from: COVID-19 Hospitalization Metrics, by Day. Weekly COVID-19 Activity Report: Data for Download.
  • Gangavarapu K, Latif AA, Mullen JL, et al. Outbreak.info genomic reports: scalable and dynamic surveillance of SARS-CoV-2 variants and mutations. Nat Methods. 2023;20(4):512–522. doi:10.1038/s41592-023-01769-3.
  • Data from: Missouri, United States Variant Report.
  • Equils O, Bakaj A, Wilson-Mifsud B, et al. Restoring trust: the need for precision medicine in infectious diseases, public health and vaccines. Hum Vaccin Immunother. 2023;19(2):2234787, doi:10.1080/21645515.2023.2234787.
  • Patone M, Thomas K, Hatch R, et al. Mortality and critical care unit admission associated with the SARS-CoV-2 lineage B.1.1.7 in England: an observational cohort study. Lancet Infect Dis. 2021;21(11):1518–1528. doi:10.1016/s1473-3099(21)00318-2.
  • Iacobucci G. COVID-19: New UK variant may be linked to increased death rate, early data indicate. BMJ. 2021;372:n230. doi:10.1136/bmj.n230.
  • Volz E, Mishra S, Chand M, et al. Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England. Nature. 2021;593(7858):266–269. doi:10.1038/s41586-021-03470-x.
  • Dey L, Chakraborty S, Mukhopadhyay A. Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins. Biomed J. 2020;43(5):438–450. doi:10.1016/j.bj.2020.08.003.
  • Notarte KI, de Oliveira MHS, Peligro PJ, et al. Age, sex and previous comorbidities as risk factors not associated with SARS-CoV-2 infection for long COVID-19: a systematic review and meta-analysis. J Clin Med. 2022;11(24):7314), doi:10.3390/jcm11247314.
  • Gentilotti E, Górska A, Tami A, et al. Clinical phenotypes and quality of life to define post-COVID-19 syndrome: a cluster analysis of the multinational, prospective ORCHESTRA cohort. (2589-5370 (Electronic))
  • Planas D, Veyer D, Baidaliuk A, et al. Reduced sensitivity of SARS-CoV-2 variant Delta to antibody neutralization. Nature. 2021;596(7871):276–280. doi:10.1038/s41586-021-03777-9.
  • Anzalone AJ, Horswell R, Hendricks BM, et al. Higher hospitalization and mortality rates among SARS-CoV-2-infected persons in rural America. J Rural Health. 2023;39(1):39–54. doi:10.1111/jrh.12689.
  • Dixon BE, Grannis SJ, Lembcke LR, et al. The synchronicity of COVID-19 disparities: statewide epidemiologic trends in SARS-CoV-2 morbidity, hospitalization, and mortality among racial minorities and in rural America. PLoS One. 2021;16(7):e0255063, doi:10.1371/journal.pone.0255063.
  • Peters DJ. Community susceptibility and resiliency to COVID-19 across the rural-urban continuum in the United States. J Rural Health. 2020;36(3):446–456. doi:10.1111/jrh.12477.
  • Kleynhans J, Tempia S, Wolter N, et al. SARS-CoV-2 seroprevalence in a rural and urban household cohort during first and second waves of infections, South Africa, July 2020–March 2021. Emerg Infect Dis. 2021;27(12):3020–3029. doi:10.3201/eid2712.211465.
  • Tang CY, Li T, Haynes TA, et al. Rural populations facilitated early SARS-CoV-2 evolution and transmission in Missouri, USA. Npj Viruses. 2023;1(1):7, doi:10.1038/s44298-023-00005-1.
  • Gianfrancesco MA, Tamang S, Yazdany J, et al. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544–1547. doi:10.1001/jamainternmed.2018.3763.