Development of an algorithm for finding pertussis episodes in a population-based electronic health record database

ABSTRACT

While tetanus-diphtheria-acellular pertussis (Tdap) vaccines for adolescents and adults were licensed in 2005 and immunization strategies proposed, the burden of pertussis in this population remains under-recognized mainly due to atypical disease presentation, undermining efforts to optimize protection through vaccination. We developed a machine learning algorithm to identify undiagnosed/misdiagnosed pertussis episodes in patients diagnosed with acute respiratory disease (ARD) using signs, diseases and symptoms from clinician notes and demographic information within electronic health-care records (Optum Humedica repository [2007–2019]). We used two patient cohorts aged ≥11 years to develop the model: a positive pertussis cohort (4,515 episodes in 4,316 patients) and a negative pertussis (ARD) cohort (4,573,445 episodes and patients), defined using ICD 9/10 codes. To improve contrast between positive pertussis and negative pertussis (ARD) episodes, only episodes with ≥7 symptoms were selected. LightGBM was used as the machine learning model for pertussis episode identification. Model validity was determined using laboratory-confirmed pertussis positive and negative cohorts. Model explainability was obtained using the Shapley additive explanations method. The predictive performance was as follows: area under the precision–recall curve, 0.24 (SD, 7 × 10⁻³); recall, 0.72 (SD, 4 × 10⁻³); precision, 0.012 (SD, 1 × 10⁻³); and specificity, 0.94 (SD, 7 × 10⁻³). The model applied to laboratory-confirmed positive and negative pertussis episodes had a specificity of 0.846. Predictive probability for pertussis increased with presence of whooping cough, whoop, and post-tussive vomiting in clinician notes, but decreased with gastrointestinal bleeding, sepsis, pulmonary symptoms, and fever. In conclusion, machine learning can help identify pertussis episodes among those diagnosed with ARD.

Introduction

Pertussis (whooping cough) is a highly contagious respiratory infection caused by Bordetella pertussis that affects all ages, though the greatest burden occurs among those aged <1 year.^Citation1 Vaccination has considerably reduced pertussis incidence in childhood.^Citation2 However, incomplete historical vaccine coverage combined with waning immunity induced by vaccination or previous infection have led to an increased number of the population becoming susceptible to pertussis over time and, in large part, to the potential for periodic endemic transmission and the resurgence of the disease in adolescents and adults.^{Citation3,Citation4} Moreover, close-contact adolescents and adults are a recognized source of pertussis transmission to vulnerable infants too young to be vaccinated or not fully vaccinated.^{Citation5,Citation6}

Pertussis diagnosis in practice is largely based on clinical presentation in infants and children, but the need to tailor diagnosis to age is recognized.^Citation7 Pertussis diagnosis in adolescents and adults is hindered by atypical presentation.^Citation8 The most common pertussis symptoms in children include paroxysmal cough, post-tussive vomiting, and inspiratory whoop,^{Citation9–11} though the severity of illness is reduced in those who are vaccinated.^Citation11 In adolescents and adults, pertussis infections can often be asymptomatic,^Citation12 though overt disease is usually characterized by persistent cough of ≥3 weeks’ duration, with mostly paroxysmal coughing that can severely disturb sleep.^Citation13 Definitive laboratory diagnosis is often limited because many patients delay seeking medical care.^{Citation14,Citation15} The atypical clinical characteristics of cases and the lack of laboratory confirmation have also, in part, contributed to the underreporting of pertussis in adolescents and adults.^{Citation8,Citation16} In addition, ascertainment bias in this group may be considerable: those affected are less likely to seek medical attention because of the often milder disease manifestation, and those who do may not be appropriately diagnosed because of limited physician awareness of the disease in the non-pediatric population.^Citation17 Nonetheless, pertussis may lead to severe health outcomes in those with underlying morbidities, in particular the elderly who suffer from multiple or more advanced chronic diseases.^{Citation12,Citation13}

Here, we describe the development of a machine learning algorithm to identify undiagnosed/misdiagnosed pertussis episodes in adolescent and adult patients with reported acute respiratory disease (ARD) using clinician notes and demographic information within electronic health-care records (EHR). Such an approach could facilitate the identification of undiagnosed/misdiagnosed pertussis episodes through improved characterization of the disease in adolescents and adults. This project was further used to quantify incidence rates of pertussis, their age and time trends, and risk factors for pertussis disease and complications (Macina et al., published separately in the same issue of Human Vaccines & Immunotherapeutics).

Methods

Data sources

This retrospective, observational, cohort study was conducted in the Optum Humedica longitudinal clinical repository, a proprietary claims and enrollment database. Optum Humedica combines data from more than 50 US integrated delivery networks and includes more than 700 hospitals and 7000 clinics. It comprises about 80 million patients from all US census regions and with all types of insurances (i.e., commercial insurance, public insurance, or other insurance) and those uninsured. The EHR data are obtained directly from providers, integrating multiple EHRs from across the continuum of care, both inpatient and ambulatory.

Source populations

The study source population included patients with EHR data in the Optum Humedica repository from January 2007 to December 2019. Those included in the analysis were aged 11 years or older at the date of the first diagnosis (index date) with available information regarding case features in the physicians’ notes. Two cohorts were initially built. The first, the positive pertussis cohort, encompassed those with pertussis International Classification of Diseases ninth and tenth (ICD9 and ICD10) codes (Table S1). Diagnoses within a time gap of 6 weeks (42 d) were considered as the same pertussis episode, the first diagnosis date being the index date. The 6 weeks before and after the index date was considered as the positive pertussis episode. A 6-week period was chosen because a typical pertussis episode includes a catarrhal phase of 1–2 weeks, a paroxysmal phase at weeks 3–6, and a convalescent phase at >6 weeks, and it was our intention to capture the pertussis symptoms during catarrhal and paroxysmal phases.^Citation18 The second cohort was built using a list of ICD codes for ARDs other than pertussis (Table S2). This cohort was named negative pertussis (ARD) cohort. Negative pertussis (ARD) episodes were defined by acute upper respiratory infections, influenza, and acute lower respiratory infections diagnoses (i.e. other ARDs) with corresponding ICD 9/10 codes. Diagnoses with a time gap of up to 7 d were considered as belonging to the same episode, the first diagnosis date being the index date. The week before and after the index date was considered as the negative pertussis episode. This definition does not definitively exclude those with undiagnosed/misdiagnosed pertussis.
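The episode rules above (a 42-day gap for pertussis, a 7-day gap for other ARDs, with the first diagnosis as the index date) can be sketched as follows. This is a minimal illustration, not the study's code: the function names and toy dates are hypothetical, and it assumes the gap is measured between consecutive diagnoses rather than from the index date, which the text does not specify.

```python
from datetime import date, timedelta

def group_into_episodes(diagnosis_dates, max_gap_days):
    """Group diagnosis dates into episodes.

    Assumption: a diagnosis joins the current episode when it falls within
    `max_gap_days` of the previous diagnosis in that episode.
    Returns a list of (index_date, [dates in the episode]).
    """
    episodes = []
    for d in sorted(diagnosis_dates):
        if episodes and (d - episodes[-1][1][-1]).days <= max_gap_days:
            episodes[-1][1].append(d)
        else:
            episodes.append((d, [d]))  # the first diagnosis is the index date
    return episodes

def episode_window(index_date, half_width_days):
    """Observation window around the index date (e.g. 42 d or 7 d each side)."""
    delta = timedelta(days=half_width_days)
    return index_date - delta, index_date + delta

# Toy diagnosis dates for one hypothetical patient.
dx = [date(2015, 1, 1), date(2015, 1, 20), date(2015, 4, 1)]
pertussis_episodes = group_into_episodes(dx, max_gap_days=42)  # 2 episodes
ard_episodes = group_into_episodes(dx, max_gap_days=7)         # 3 episodes
```

The same dates collapse into fewer episodes under the 42-day rule than under the 7-day rule, which is why the two cohorts use different symptom windows.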

The study population for developing the pertussis event finding algorithm comprised two cohorts, those with positive pertussis or negative pertussis (ARD) episodes (Figure 1). The negative pertussis (ARD) cohort comprised those diagnosed with ARD episodes; one randomly chosen episode per patient was considered. To ensure the two cohorts were mutually exclusive, patients with an episode diagnosed as positive pertussis were excluded from the negative pertussis (ARD) cohort (these were removed before randomly choosing episodes). As the pertussis clinical picture varies by age and sex, taking these two demographic characteristics into account was important. Thus, episodes representing patients with “Unknown” sex were removed. In addition, information on patients with a pertussis diagnosis who were later confirmed to be pertussis-negative with a laboratory test was excluded.

Figure 1. Cohort identification flow chart for developing a pertussis event finding algorithm.

To improve the contrast between positive pertussis episodes and episodes in the negative pertussis (ARD) cohort, only well-described episodes (with at least seven symptoms) were selected. A threshold of seven symptoms was chosen to provide sufficient contrast/differentiation without losing too many episodes, which would have limited the validity of the algorithm. Episodes with the presence of at least one cough symptom were used for model training and testing; cough is considered to be a typical characteristic symptom of pertussis.
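The two inclusion filters described above (at least seven symptoms, and at least one cough symptom) amount to a simple predicate per episode. The sketch below is illustrative only: the cough feature names are hypothetical stand-ins for the list in Table S4, and it assumes symptoms are counted as distinct features.

```python
# Hypothetical cough feature names; the study's actual list is in its Table S4.
COUGH_FEATURES = frozenset(
    {"cough", "persistent cough", "severe, frequent cough", "whooping cough"}
)

def is_well_described(symptoms, min_symptoms=7, cough_features=COUGH_FEATURES):
    """Return True when an episode has >= min_symptoms distinct symptoms
    and at least one of them is a cough symptom."""
    distinct = set(symptoms)
    has_cough = not distinct.isdisjoint(cough_features)
    return len(distinct) >= min_symptoms and has_cough

episode = ["cough", "fever", "earache", "headache", "fatigue", "wheezing", "nausea"]
keep = is_well_described(episode)  # seven distinct symptoms, one is a cough
```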

To further validate the machine learning algorithm, laboratory-confirmed positive pertussis and laboratory-confirmed negative pertussis cohorts were developed (Figure S1) using the Logical Observation Identifiers Names and Codes (LOINC) codes associated with Bordetella pertussis laboratory assay in the Claims Lab Results table (Table S3). Multiple laboratory test results for the same patient that occurred on the same day were considered as a single event. If the results of at least one of these tests within 6 weeks of the index date was positive, the event was considered positive; otherwise, the event was considered negative.
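The labelling rule for laboratory events can be written as a short function. This is a sketch under stated assumptions: the function name and data layout are hypothetical, and "within 6 weeks of the index date" is read as 42 days on either side.

```python
from datetime import date

def classify_lab_event(index_date, results, window_days=42):
    """Label a laboratory event from (test_date, is_positive) pairs.

    The event is "positive" if at least one result within the window is
    positive and "negative" otherwise; None means no result in the window.
    """
    in_window = [
        is_positive
        for test_date, is_positive in results
        if abs((test_date - index_date).days) <= window_days
    ]
    if not in_window:
        return None
    return "positive" if any(in_window) else "negative"

# Toy results: two tests on the same day, one positive, plus one late test.
results = [
    (date(2016, 3, 1), False),
    (date(2016, 3, 1), True),
    (date(2016, 9, 1), False),
]
label = classify_lab_event(date(2016, 3, 1), results)
```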

Feature development

The signs, diseases, and symptoms (SDS) table was generated by Optum using Named Entity Recognition (NER) natural language processing (NLP) of clinician notes, and was leveraged to capture details of the symptoms that characterize pertussis episodes. The information from provider notes allowed us to capture the clinical picture as described by the physician in routine clinical practice. A symptom list was developed using the information from the SDS tables, supplemented by additional pertussis-related symptoms identified in a literature review. The symptoms derived from the literature review included persistent cough, paroxysmal cough, whooping, vomiting, notably, post-tussive vomiting, and cyanosis, which are used in the existing clinical case definitions of pertussis.^Citation7 The most prevalent SDS terms and combinations of the SDS terms, attributes, and locations were analyzed separately for the positive pertussis cohort, negative pertussis (ARD) cohort as a whole and by each ARD within the negative pertussis (ARD) cohort. Cramer’s V test was used to identify SDS terms that were the most differentiating between the positive pertussis cohort and negative pertussis (ARD) cohort. Medically relevant symptoms identified were added to the list of pertussis symptoms. Synonyms and clinically close symptoms were grouped, e.g. the term ‘paroxysmal cough’ was grouped with terms such as ‘severe cough,’ ‘frequent cough,’ and ‘significant cough’ to form the feature ‘severe, frequent cough.’ The list of pertussis symptoms constructed is summarized in Table S4.
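For readers unfamiliar with Cramér's V, the statistic measures the strength of association in a contingency table on a 0 to 1 scale. The sketch below computes the uncorrected version from a table of counts (for example, term present/absent by cohort). The counts are toy values, and whether the study applied a bias or continuity correction is not stated, so this should be read as an illustration of the statistic rather than the authors' procedure.

```python
import math

def cramers_v(table):
    """Uncorrected Cramér's V for an r x c table of counts.

    Assumes every row and column total is non-zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Rows: positive cohort, negative cohort; columns: term present, term absent.
perfectly_separating = cramers_v([[10, 0], [0, 10]])
uninformative = cramers_v([[5, 5], [5, 5]])
```

A term that appears only in one cohort scores 1, and a term equally frequent in both scores 0.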

The following variables of interest were created per symptom: symptom’s presence (Boolean) and days since the index date of symptoms occurrence. The variables of symptom presence/absence in the course of the pertussis positive or negative episode were constructed based on the combination of SDS terms, locations, and attributes from the SDS tables (with positive or absent SDS attribute) for all symptoms apart from fever. The presence/absence of fever was constructed based on the SDS table and the Observations and Measurements table.

The symptom date of occurrence was defined based on the earliest observation date and occurrence dates registered in the SDS tables (the date of the maximum body temperature during the episode for fever, if available). Symptoms occurring in the time frame of the pertussis episode (6 weeks before and after the index date) and negative pertussis episodes (1 week before and after the index date) were captured. The number of symptoms per episode was calculated (continuous). Demographic features assessed included age (years, and age subgroup [11–18, 19–44, 45–64, and ≥65 y]); sex (female, male, unknown); race (Caucasian, African American, Asian, other/unknown); ethnicity (Hispanic, not-Hispanic, other/unknown); region (Midwest, South, Northeast, West, other/unknown); and coverage (years).

Pertussis event finding algorithm

Several machine learning algorithms were selected and compared for detection of pertussis episodes; these included Logistic Regression, Random Forest, and Light Gradient Boosting Machine (LightGBM). The LightGBM (Python LightGBM, package version 3.2.1) performed better than the two other models based on area under the precision and recall curve (AUPRC) scores (see ALGORITHM BENCHMARKING section in supplementary materials), and thus was used as the machine learning algorithm for pertussis identification described here.^Citation19

The data set including the positive pertussis and negative pertussis (ARD) cohorts was randomly divided into a training set (70% of the dataset) to guide the building of the model, and a test set (the remaining 30% of the dataset) to assess its performance. The training and test sets were compared with the data set as a whole in terms of demographic and symptom features in order to ensure no bias was introduced by the split. Model performance was evaluated using the AUPRC. The AUPRC is considered particularly informative for assessing binary classifiers on imbalanced outcome variables.^Citation20 Additional performance metrics assessed included specificity, precision (aka positive predictive value) and recall (aka sensitivity).

$$\text{specificity} = \frac{\text{Algorithm-identified negative episodes from the negative pertussis (ARD) cohort}}{\text{All episodes from the negative pertussis (ARD) cohort}}$$

$$\text{precision} = \frac{\text{Algorithm-identified positive episodes from the positive pertussis cohort}}{\text{All episodes identified as positive by the algorithm}}$$

$$\text{recall} = \frac{\text{Algorithm-identified positive episodes from the positive pertussis cohort}}{\text{All episodes from the positive pertussis cohort}}$$
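In confusion-matrix terms, the three definitions above reduce to ratios of true/false positive and negative counts. A minimal sketch, with toy counts that are not taken from the study:

```python
def classification_metrics(tp, fp, tn, fn):
    """Specificity, precision and recall from confusion-matrix counts.

    tp: positive-cohort episodes flagged positive by the algorithm
    fn: positive-cohort episodes flagged negative
    fp: negative (ARD) cohort episodes flagged positive
    tn: negative (ARD) cohort episodes flagged negative
    """
    return {
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Toy counts for illustration only.
m = classification_metrics(tp=8, fp=12, tn=88, fn=2)
```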

The hyper-parameters of the model were optimized with a grid search by K-fold cross-validation before their performance was evaluated on the test set. Grid search allows for the assessment of different combinations of hyper-parameters in order to identify the combination with the highest performance. To select model parameters, fivefold cross-validation with a 1:10 matching ratio was conducted to identify the combination with the highest averaged AUPRC. At each training iteration of the cross-validation, the negative pertussis (ARD) episodes of the training set were under-sampled by matching with the positive pertussis episodes on sex and age because of the original imbalance. This allowed bias removal and improved the learning of the model. Several matching ratios were also evaluated by grid search (1:1, 1:10, 1:100, and 1:1000). If the training set remained unbalanced after matching, the loss function of the model was modified in order to give more weight to errors on positive pertussis episodes (less frequent) and less weight to errors on negative pertussis (ARD) episodes (more frequent).
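The matched under-sampling step can be sketched as stratified random sampling of negative episodes. This is a hypothetical illustration, not the study's implementation: it assumes matching on sex and age subgroup (the text says only "sex and age"), and the field names `sex` and `age_group` are invented for the example.

```python
import random
from collections import defaultdict

def matched_undersample(positives, negatives, ratio, seed=0):
    """Draw up to `ratio` negatives per positive from the same
    (sex, age_group) stratum, without replacement."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for episode in negatives:
        pool[(episode["sex"], episode["age_group"])].append(episode)
    for stratum in pool.values():
        rng.shuffle(stratum)  # random order so that pop() samples at random
    sampled = []
    for episode in positives:
        stratum = pool[(episode["sex"], episode["age_group"])]
        for _ in range(min(ratio, len(stratum))):
            sampled.append(stratum.pop())
    return sampled

# Toy data: two positives in one stratum, negatives in two strata.
positives = [{"sex": "F", "age_group": "11-18"}] * 2
negatives = (
    [{"sex": "F", "age_group": "11-18", "id": i} for i in range(50)]
    + [{"sex": "M", "age_group": "45-64", "id": i} for i in range(50, 100)]
)
train_negatives = matched_undersample(positives, negatives, ratio=10)
```

Changing `seed` changes which negatives are drawn, which is one way to repeat training with different matched samples.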

Once the optimal hyper-parameters were selected, the model was trained on the training set and evaluated on the test set. Of note, the matching (on age and sex) was performed on the training set (with a ratio of 1:10 identified via hyperparameters tuning) but not on the test set. To evaluate the robustness of the model, training and evaluation were performed 20 times, each time changing the sampling of the negative pertussis (ARD) episodes matched (on age and sex) in the training set. The performance metrics were averaged over the 20 iterations and their standard deviations computed. Model validation (via the assessment of specificity, precision and recall) was performed using cohorts with laboratory-confirmed positive pertussis and laboratory-confirmed negative pertussis episodes.

Feature importance in the models was evaluated using the Shapley additive explanations (SHAP), which estimates SHAP values of the effect each feature has on a prediction probability for an outcome. Each feature of the model was considered an actor and its contribution to the model output was evaluated (both effect size and direction of contribution). Boolean features (presence/absence) were included for each symptom; only the impact of the presence of the symptoms was assessed because the absence of a symptom in the Optum Humedica database cannot be assumed to be definitive because of inconsistency in reporting.
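To make the idea of a Shapley value concrete, the sketch below computes exact Shapley values by brute force for a toy scoring function. It is not the SHAP library and not the study's model: the feature names and effect sizes are invented, and the enumeration over all subsets is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values: the weighted average marginal contribution of
    each feature over all subsets of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

def toy_score(present):
    """Hypothetical additive score for the set of symptoms present."""
    score = 0.10
    if "whoop" in present:
        score += 0.40
    if "post_tussive_vomiting" in present:
        score += 0.20
    if "fever" in present:
        score -= 0.05
    return score

features = ["whoop", "post_tussive_vomiting", "fever"]
phi = shapley_values(toy_score, features)
```

For an additive score like this one, each Shapley value equals the feature's own effect, and the values sum to the difference between the score with all features present and the score with none.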

Identification of undiagnosed/misdiagnosed pertussis episodes

The developed pertussis identification algorithm rendered a probability score of the likelihood of each episode being pertussis. A decision boundary for the probability score was obtained by examining the recall and false-positive rate across various thresholds. A threshold was chosen such that it maximized recall while minimizing the false-positive rate. The developed pertussis identification algorithm was applied to the negative pertussis (ARD) cohort to identify those with a high probability of pertussis as predicted by the model (i.e. undiagnosed/misdiagnosed pertussis episodes). The episodes with a probability score above the identified threshold were considered pertussis episodes. Descriptive analysis of the identified undiagnosed/misdiagnosed pertussis episodes was performed, including the description of the demographic and clinical characteristics (symptoms). The identified undiagnosed/misdiagnosed pertussis episodes were compared with the diagnosed pertussis episodes.
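Examining recall and false-positive rate across thresholds can be sketched as a simple sweep over probability scores. The scores and labels below are toy values, and the function name is hypothetical.

```python
def recall_and_fpr(scores, labels, threshold):
    """Recall and false-positive rate when episodes with a score strictly
    above `threshold` are classified as pertussis (label 1)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

# Toy scores and labels (1 = pertussis, 0 = other ARD).
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]
sweep = {t: recall_and_fpr(scores, labels, t) for t in (0.35, 0.5, 0.65)}
```

Raising the threshold lowers both quantities, so the chosen boundary is a trade-off between missed pertussis episodes and false alarms.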

Statistical calculations were performed using Python version 3.6.

Results

Study populations

The positive pertussis cohort comprised 14,086 episodes reported for 10,816 patients, and the negative pertussis (ARD) cohort comprised 39,874,115 episodes reported for 15,544,519 patients. The positive pertussis cohort had a mean age of 48 ± 25 y and was predominantly female (64%) and Caucasian (85%). In comparison, the negative pertussis (ARD) cohort had a mean age of 44 ± 21 y and was also predominantly female (63%) and Caucasian (82%). The most common causes of episodes in the negative pertussis (ARD) cohort were acute pharyngitis (24%), acute sinusitis (22%), acute upper respiratory infections of multiple and unspecified sites (21%), acute bronchitis (13%), and pneumonia (11%). Other diseases, such as influenza and acute epiglottitis were the cause of less than 5% of episodes.

After applying the inclusion criteria (to focus on well-described episodes with at least seven symptoms), the cohort with positive pertussis episodes comprised 4,515 episodes reported for 4,316 patients and the cohort with negative pertussis (ARD) episodes comprised 4,573,445 episodes and patients (only one episode per patient included) (Figure 1). Compared with the entire ARD cohort, the cohort with negative pertussis (ARD) episodes had fewer acute pharyngitis episodes (17% vs 24% in the entire cohort) and more pneumonia episodes (16% vs 11%).

The cohort with positive pertussis episodes was found to be significantly younger at index date than the ARD cohort (43 ± 21 y vs. 47 ± 21 y, p < .001). The highest proportion of episodes in the positive pertussis cohort was in the 11–18-y age group. There were regional differences between the two cohorts; the cohort with positive pertussis episodes had a notable overrepresentation of episodes occurring in the West (14% vs. 9%), whereas episodes were underrepresented in the South (22% vs. 27%) (p < .001 for both). Coverage in the database was found to be longer by 2 y in the cohort with positive pertussis episodes than in the cohort with negative pertussis (ARD) episodes (9 ± 3 y vs. 7 ± 4 y, p < .001). There was a higher number of symptoms per episode in the cohort with positive pertussis episodes than in the cohort with negative pertussis (ARD) episodes (15 ± 7 vs. 12 ± 6, p < .001).

The frequency of the nine symptoms representing cough and the two related disease symptoms (see Table S4) were significantly different (p < .001 for all) between the positive pertussis and negative pertussis (ARD) cohorts. Among the most represented symptoms in the positive pertussis episodes compared with the cohort with negative pertussis (ARD) episodes were known pertussis symptoms (see Table S5): whoop (3% vs 0%), whooping cough (42% vs 0%) and post-tussive vomiting (13% vs 1%). In contrast, gastrointestinal bleeding (1% vs 2%), sepsis (4% vs 7%), earache (15% vs 25%) and fever (9% vs 14%) were the most represented symptoms in the cohort with negative pertussis (ARD) episodes compared with the positive pertussis episodes. The most common causes of episodes in the negative pertussis (ARD) cohort were acute upper respiratory infections of multiple and unspecified sites (22%), acute sinusitis (18%), acute pharyngitis (17%), pneumonia (16%) and acute bronchitis (15%). Splitting the dataset to training (70%) and testing (30%) sets did not introduce bias between the two sets with respect to the variables.

Model optimization, training and testing

Hyper-parameters of the LightGBM model were optimized with a grid search by K-fold cross-validation. Averaged cross-validation performance with the optimized hyper-parameters, along with performance on the training set (under-sampled with a 1:10 matching ratio) and test set (with no under-sampling), is summarized in Table 1. The training and test sets were matched on age and sex. The average cross-validation performance (computed over 20 iterations) on the training set was consistent with that of the test sets. Given that the training set was much larger than the test set and changed with each iteration, the small standard deviations across iterations on the training sets and the consistency with the test set results demonstrate that the model was robust.
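The low precision on the unbalanced test set follows largely from the rarity of diagnosed pertussis. As a back-of-the-envelope check (not part of the study's analysis), the sketch below derives the precision implied by the recall and specificity reported in the abstract and by the prevalence of positive episodes in the selected cohorts, assuming the test split preserves that prevalence.

```python
def expected_precision(recall, specificity, prevalence):
    """Precision implied by recall, specificity and the positive prevalence."""
    true_positive_rate = recall * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

# 4,515 positive episodes vs 4,573,445 negative (ARD) episodes, roughly 1:1000.
prevalence = 4515 / (4515 + 4573445)
implied = expected_precision(recall=0.72, specificity=0.94, prevalence=prevalence)
# implied is about 0.012, in line with the precision reported in the abstract
```

Even with 94% specificity, the 6% of negative episodes flagged in error outnumber the true positives by a wide margin when negatives are about 1,000 times more frequent.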

Table 1. Summary of model performance: the model was trained on an under-sampled training set with a 1:10 ratio, and then tested on the dataset with a real-world ratio of 1:1000.

Model validation and interpretation (SHAP values)

The cohorts with laboratory-confirmed positive pertussis and laboratory-confirmed negative pertussis episodes included only 13 positive pertussis episodes in 13 patients and 1,416 negative pertussis (ARD) episodes in 1,271 patients (Figure S1). Even before the application of filters related to the algorithm application, only 273 positive pertussis episodes were identified compared with 4,405 negative pertussis episodes. Indirectly, this might indicate that the specificity of health-care professionals in pertussis diagnosis is rather low (the majority of patients suspected to have Bordetella pertussis and prescribed a laboratory test by their physicians then test negative). The assumed precision of the physicians was 0.058 (only 5.8% of all the episodes are positive).
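The assumed physician precision quoted above follows directly from the pre-filter counts of laboratory-tested episodes (273 positive and 4,405 negative):

```latex
\text{assumed precision}
  = \frac{273}{273 + 4{,}405}
  = \frac{273}{4{,}678}
  \approx 0.058
```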

The algorithm, when applied to the cohorts with laboratory-confirmed positive pertussis and laboratory-confirmed negative pertussis episodes, had a precision of 0.031 (i.e. 3.1% of episodes detected as pertussis by the algorithm were indeed pertussis episodes) and a recall of 0.538 (i.e. the algorithm identified 53.8% of all laboratory-confirmed positive pertussis episodes). These two indicators, however, were not interpretable as the number of laboratory-confirmed positive pertussis episodes was disproportionately low. The specificity was 0.846 (i.e. the algorithm identified 84.6% of all laboratory-confirmed negative pertussis episodes as negative). This high specificity suggests that our model was of sufficient quality.

The developed algorithm for pertussis episode identification was interpreted using SHAP values. According to this interpretation, the presence of whooping cough, whoop, and post-tussive vomiting appeared to be the most predictive symptoms of pertussis. Similarly, the presence of persistent, severe, frequent cough as well as the notion of cough improvement or worsening (which likely indicates long-term cough) and cough suppression had a positive impact on the prediction of an episode as a pertussis case. The presence of sleep disturbance (supposedly indicating nocturnal cough), as well as air-trapping, also increased the algorithm-predicted probability of pertussis. The presence of spasms, nodules, cramps, and numbness, as well as falls, also had a positive impact on the pertussis probability predicted by the algorithm (Figure 2). In contrast, the presence of gastrointestinal bleeding, sepsis, pulmonary symptoms (e.g. fluid in lungs), and fever had a negative impact on the probability of the episode being pertussis (Figure 3).

Figure 2. Presence of symptoms with a positive impact on the probability of a pertussis episode.

Figure 3. Presence of symptoms with a negative impact on the probability of a pertussis episode.

Decision threshold and characteristics of algorithm-identified episodes

Figure 4 shows model performance according to the decision threshold, characterized by different levels of recall and false-positive rate when applied to the test dataset. Increasing the decision threshold decreased the number of episodes classified as pertussis. Thus, fewer pertussis cases are identified as such and the recall drops, whereas the number of negative pertussis (ARD) cases classified as pertussis decreases and the precision increases. The decision threshold directly affects the assessment of the incidence of undiagnosed/misdiagnosed pertussis. Taking into consideration the criteria of recall and false-positive rate, the threshold of 0.5 was chosen as a good compromise. Indeed, the threshold interval of 0.5–0.6 provides limited gain in false-positive rate for the loss in recall, whereas the threshold interval of 0.4–0.5 provides better gain in false-positive rate compared with recall. The choice of the 0.5 threshold was also linked to the fact that it can be interpreted as the probability of an episode being pertussis according to the algorithm. Thus, episodes with probability >0.5 were chosen. Of note, the threshold can be redefined based on the algorithm application.

Figure 4. Recall and precision as a function of the decision threshold.

The developed pertussis identification algorithm was applied to the ARD cohort to identify undiagnosed/misdiagnosed pertussis episodes. At a probability threshold of 0.5, 1,053,946 episodes in 924,304 patients were identified as undiagnosed/misdiagnosed pertussis episodes (Figure S2). The demographic characteristics of algorithm-identified episodes were similar to those of the negative pertussis (ARD) episodes (Table 2). However, there were more adolescent and elderly patients in the cohort with positive pertussis episodes; 69% of the algorithm-identified pertussis episodes occurred in patients aged 19–64 y. The prevalence of symptoms in the pertussis episodes identified by the algorithm was overall higher than that of the positive pertussis cohort. For instance, the prevalence of cough was 97% in episodes identified by the algorithm, and 44% in the positive pertussis cohort. This difference was largely driven by the filters applied to the cohort for algorithm testing on cough and the requirement of seven symptoms. The patients with algorithm-identified episodes were more likely than those in the negative pertussis (ARD) cohort to have had a diagnosis of acute bronchitis (30% vs. 18%) or pneumonia (17% vs. 12%). Conversely, 8% and 11% of the patients with algorithm-identified episodes had a previous diagnosis of acute pharyngitis and acute sinusitis, respectively, versus 16% and 21% in the negative pertussis (ARD) cohort.

Table 2. Demographic characteristics of the algorithm-identified cases with pertussis episodes compared with those with negative pertussis (ARD) and positive pertussis episodes.

Discussion

We used real-world data to develop an algorithm to identify pertussis episodes, with the intention of better estimating the disease burden and risk in adolescents and adults (Macina et al., published separately in the same issue of Human Vaccines & Immunotherapeutics), which can inform estimation of the cost-effectiveness of immunization strategies to optimize protection through vaccination.^Citation21 The presence of whoop, whooping cough, other types of cough (especially persistent, severe, and long-lasting), alongside post-tussive vomiting were found to be among the key drivers for prediction of a pertussis episode. These findings are largely consistent with the existing literature. The suggested pertussis clinical case definition for surveillance purposes in patients aged ≥10 y with cough or illness with minimal or no fever includes nonproductive paroxysmal cough of ≥2 weeks’ duration and whoop, apnea or sweating episodes between paroxysms, post-tussive vomiting, and worsening of symptoms at night.^Citation7 Indeed, in adults, pertussis is frequently accompanied by vomiting,^Citation18 and the presence of whooping cough and post-tussive vomiting has been shown to increase the likelihood of Bordetella pertussis.^Citation22 Adults with pertussis are commonly reported to suffer from mild cough or prolonged cough, sleep disturbance, weight loss, pharyngeal symptoms, influenza-like symptoms and sneezing attacks.^Citation17

In contrast, the presence of gastrointestinal bleeding, sepsis, pulmonary symptoms (e.g. fluid in the lung), as well as fever, had a negative impact on the identification of pertussis episodes. The absence of fever as a sign of pertussis is consistent with the existing clinical evidence. Although apnea is included in the suggested clinical case definition of pertussis in patients aged ≥10 y,^Citation7 and reported at a frequency of 19–37%,^Citation17 in our study other breathing issues including apnea appeared to decrease the algorithm probability of identifying pertussis episodes. Nonetheless, the impact of other breathing issues on the algorithm prediction was low (−0.01 probability points).

We implemented relevant techniques to test and validate the algorithm despite the high class imbalance of the data (positive 14,086 episodes in 10,816 patients vs. negative 39,874,115 episodes in 15,544,519 patients) by training it on an under-sampled data set with a 1:10 ratio, and then testing it on the dataset with a real-world ratio of 1:1000. Such high class imbalance is a common occurrence in real-world studies. Validation using a cohort of laboratory-confirmed pertussis positive and negative episodes was also performed to assess the model quality on a new dataset. Finally, considering patients’ age and gender as variables allowed us to limit the potential associated bias.

Yet, our study has a number of limitations that need to be considered when interpreting and using our results. In Optum Humedica, missing symptom information is interpreted as an absence of the symptom. This is a considerable limitation, as misreporting or missing information in real-world data is frequent. However, it is assumed that missing information affects the pertussis and ARD groups in our study equally. The temporal granularity of the Optum EHR data was also a limiting factor. Symptoms were reported via health-care provider notes and captured in SDS tables. Although diagnosis and note dates were reported, the actual date of occurrence (of signs/diseases/symptoms) was almost never reported, which has an impact on the definition of pertussis event onset. False-positive and false-negative episodes are possible, as some patients may have had a pertussis diagnosis outside the networks captured in Optum Humedica, or their pertussis may have been misdiagnosed or not diagnosed. Thus, the negative pertussis (ARD) cohort likely contained undiagnosed/misdiagnosed pertussis events (with no means available to verify), which limits the distinction between pertussis and non-pertussis ARD episodes.

In addition, pertussis, acute upper respiratory infections, influenza, and acute lower respiratory infections can be defined differently under ICD 9 and ICD 10 codes, leading to inconsistencies in how patient cases are classified. Adding other features, such as physician specialty, recurrent visits because of persistent cough, medications, and procedures, could help differentiate episodes of pertussis from those of ARD and improve algorithm performance. The inclusion of minimum cough duration might also improve the algorithm’s ability to identify undiagnosed/misdiagnosed pertussis episodes, since cough duration is an important clinical feature of the disease. It is also possible that restricting algorithm development to episodes described by at least seven symptoms biased the model toward more severe episodes, so that it may not be representative of the real-world scenario. Finally, the algorithm, developed using a US population EHR database, is likely to be biased toward clinical practice in the USA, which may make generalization to other countries difficult.

In summary, our machine learning algorithm for predicting pertussis episodes in patients diagnosed with ARDs, developed using comprehensive EHRs and claims data collected in routine practice, shows promise. Such an approach may facilitate identification of pertussis episodes in adolescents and adults, which may lead to better estimates of the disease burden and risk in this group. Further, because assumptions about correcting for underreporting have been identified as a crucial determinant of cost-effectiveness estimates for pertussis immunization strategies, such an approach would also inform strategies for optimizing protection through vaccination.^Citation21

Author contributions

CD, SM, MD, SE, and PH contributed to the conceptual design of the study and/or data analyses. All authors (CD, SM, MD, SE, PH, MM, and DM) contributed to the interpretation of the data and participated in the drafting and critical revision of the article, approved the final version and are accountable for its accuracy and integrity.

Acknowledgments

Editorial assistance with the preparation of the manuscript was provided by Richard Glover, inScience Communications, Springer Healthcare Ltd, Chester, UK, and was funded by Sanofi. The authors also thank Roopsha Brahma, PhD, for editorial assistance and manuscript coordination on behalf of Sanofi.

Disclosure statement

CD, SM, and DM are employees of Sanofi and may hold shares and/or stock options in the company. MD, SE, PH and MM are employees of Quinten who were contracted by Sanofi to conduct this research.

Supplemental data

Supplemental data for this article can be accessed on the publisher’s website at https://doi.org/10.1080/21645515.2023.2209455.

Additional information

Funding

This study was funded and sponsored by Sanofi. Sanofi was involved in the study design; accessing the electronic health-care records database; analysis and interpretation of data; the writing of the report; and the decision to submit the paper for publication. All authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

References

1. Centers for Disease Control and Prevention. Pertussis (whooping cough) surveillance and reporting. 2019 [accessed 2021 May 26]. https://www.cdc.gov/pertussis/surv-reporting.html.
2. Havers FP, Moro PL, Hariri S, Skoff T. Pertussis. In: Hamborsky J, Kroger A, and Wolfe C, editors. Centers for Disease Control and Prevention epidemiology and prevention of vaccine-preventable diseases. 13th ed. Washington (DC): Public Health Foundation; 2015. p. 239–10.
3. Tan T, Dalby T, Forsyth K, Halperin SA, Heininger U, Hozbor D, Plotkin S, Ulloa-Gutierrez R, Wirsing von König CH. Pertussis across the globe: recent epidemiologic trends from 2000 to 2013. Pediatr Infect Dis J. 2015;34:e222–32. doi:10.1097/INF.0000000000000795.
4. Domenech de Celles M, Magpantay FMG, King AA, Rohani P. The impact of past vaccination coverage and immunity on pertussis resurgence. Sci Transl Med. 2018;10(434). doi:10.1126/scitranslmed.aaj1748.
5. Wendelboe AM, Njamkepo E, Bourillon A, Floret DD, Gaudelus J, Gerber M, Grimprel E, Greenberg D, Halperin S, Liese J, et al. Transmission of Bordetella pertussis to young infants. Pediatr Infect Dis J. 2007;26:293–9. doi:10.1097/01.inf.0000258699.64164.6d.
6. Katfy K, Diawara I, Maaloum F, Aziz S, Guiso N, Fellah H, Slaoui B, Zerouali K, Belabbes H, Elmdaghri N. Pertussis in infants, in their mothers and other contacts in Casablanca, Morocco. BMC Infect Dis. 2020;20:43. doi:10.1186/s12879-019-4680-1.
7. Cherry JD, Tan T, Wirsing von Konig CH, Forsyth KD, Thisyakorn U, Greenberg D, Johnson D, Marchant C, Plotkin S. Clinical definitions of pertussis: summary of a global pertussis initiative roundtable meeting, February 2011. Clin Infect Dis. 2012;54:1756–64. doi:10.1093/cid/cis302.
8. Gabutti G, Rota MC. Pertussis: a review of disease epidemiology worldwide and in Italy. Int J Environ Res Public Health. 2012;9:4626–38. doi:10.3390/ijerph9124626.
9. Narkeviciute I, Kavaliunaite E, Bernatoniene G, Eidukevicius R. Clinical presentation of pertussis in fully immunized children in Lithuania. BMC Infect Dis. 2005;5:40. doi:10.1186/1471-2334-5-40.
10. Wu DX, Chen Q, Yao KH, Li L, Shi W, Ke JW, Wu A-M, Huang P, Shen K-L. Pertussis detection in children with cough of any duration. BMC Pediatr. 2019;19:236. doi:10.1186/s12887-019-1615-3.
11. Preziosi MP, Halloran ME. Effects of pertussis vaccination on disease: vaccine efficacy in reducing clinical severity. Clin Infect Dis. 2003;37:772–9. doi:10.1086/377270.
12. Ward JI, Cherry JD, Chang SJ, Partridge S, Keitel W, Edwards K, Lee M, Treanor J, Greenberg DP, Barenkamp S, et al. Bordetella pertussis infections in vaccinated and unvaccinated adolescents and adults, as assessed in a national prospective randomized Acellular Pertussis Vaccine Trial (APERT). Clin Infect Dis. 2006;43:151–7. doi:10.1086/504803.
13. De Serres G, Shadmani R, Duval B, Boulianne N, Dery P, Douville Fradet M, Rochette L, Halperin S. Morbidity of pertussis in adolescents and adults. J Infect Dis. 2000;182:174–9. doi:10.1086/315648.
14. Senzilet LD, Halperin SA, Spika JS, Alagaratnam M, Morris A, Smith B, Sentinel Health Unit Surveillance System Pertussis Working Group. Pertussis is a frequent cause of prolonged cough illness in adults and adolescents. Clin Infect Dis. 2001;32:1691–7. doi:10.1086/320754.
15. de Graaf H, Gbesemete D, Gorringe AR, Diavatopoulos DA, Kester KE, Faust SN, Read RC. Investigating Bordetella pertussis colonisation and immunity: protocol for an inpatient controlled human infection model. BMJ Open. 2017;7:e018594. doi:10.1136/bmjopen-2017-018594.
16. Kandeil W, Atanasov P, Avramioti D, Fu J, Demarteau N, Li X. The burden of pertussis in older adults: what is the role of vaccination? A systematic literature review. Expert Rev Vaccines. 2019;18:439–55. doi:10.1080/14760584.2019.1588727.
17. Rothstein E, Edwards K. Health burden of pertussis in adolescents and adults. Pediatr Infect Dis J. 2005;24:S44–7. doi:10.1097/01.inf.0000160912.58660.87.
18. von Konig CH, Halperin S, Riffelmann M, Guiso N. Pertussis of adults and infants. Lancet Infect Dis. 2002;2:744–50. doi:10.1016/S1473-3099(02)00452-8.
19. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
20. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10:e0118432. doi:10.1371/journal.pone.0118432.
21. Fernandes EG, Rodrigues CCM, Sartori AMC, De Soarez PC, Novaes HMD. Economic evaluation of adolescents and adults’ pertussis vaccination: a systematic review of current strategies. Human Vacc Immunother. 2019;15:14–27. doi:10.1080/21645515.2018.1509646.
22. Ebell MH, Marchello C, Callahan M. Clinical diagnosis of Bordetella pertussis infection: a systematic review. J Am Board Fam Med. 2017;30:308–19. doi:10.3122/jabfm.2017.03.160330.
