Original Research

Deriving a Standardised Recommended Respiratory Disease Codelist Repository for Future Research

Pages 1-8 | Published online: 16 Feb 2022

Abstract

Background

Electronic health record (EHR) databases provide rich, longitudinal data on interactions with healthcare providers and can be used to advance research into respiratory conditions. However, since these data are primarily collected to support health care delivery, clinical coding can be inconsistent, resulting in inherent challenges in using these data for research purposes.

Methods

We systematically searched existing international literature and UK code repositories to find respiratory disease codelists for asthma from January 2018, and for chronic obstructive pulmonary disease (COPD) and respiratory tract infections from January 2020, based on prior searches. Medline searches using key terms provided article lists. Full-text articles, supplementary files, and reference lists were examined for codelists, and codelist repositories were searched. A reproducible methodology for codelist creation was developed, with recommended lists for each disease created based on multidisciplinary expert opinion and previously published literature.

Results

Medline searches returned 1126 asthma articles, 70 COPD articles, and 90 respiratory infection articles, with 3%, 22% and 5% including codelists, respectively. Repository searching returned 12 asthma, 23 COPD, and 64 respiratory infection codelists. We have systematically compiled respiratory disease codelists and from these derived recommended lists for use by researchers to find the most up-to-date and relevant respiratory disease codelists that can be tailored to individual research questions.

Conclusion

Few published papers include codelists and, where they were published, diverse codelists were used, even when answering similar research questions. Whilst some advances have been made, greater consistency and transparency across studies using routine data to study respiratory diseases are needed.

Introduction

Electronic health record (EHR) databases include rich, longitudinal data on an individual’s interactions with health care providers. They comprise part of the clinical information systems which health care providers use during clinical consultations across primary, secondary and tertiary care. From these systems, data can then be extracted to enhance patient care through clinical research, healthcare planning, decision-making, and clinical audit. These routine data have been used to make significant advances in research into the epidemiology, burden, and natural history of respiratory disease, leading to improved prevention, detection and management, and to inform health service planning and policy.Citation1,Citation2 The scale of these data facilitates a wide range of research with high statistical power due to the in-depth variety of variables recorded and the number of patients contributing to the data, particularly as they are increasingly linked with other data sources.Citation28 However, EHRs are primarily populated to support health care delivery rather than research. This gives rise to challenges, including high volume and irregularly collected,Citation2 informatively observed (where data collection is driven by clinical requirements), missing, and incorrectly coded data.Citation3,Citation4

To study a health condition in EHR databases, an operational definition based on clinical codes is often used. Clinical codes are alphanumerical codes ascribed to specific clinical events or descriptions. Numerous code systems exist, and each diagnosis can have multiple clinical codes associated with it. Therefore, in order to search for a particular diagnosis, multiple clinical codes are required, constituting a clinical codelist.Citation5 The choice of codes requires clinical and epidemiological expertise and knowledge about data quality and provenance in addition to knowledge about the databases being interrogated.Citation6 However, there is often significant variation in the clinical codes used to define respiratory conditions.Citation7,Citation8 This can result in considerable differences in study findings,Citation9 such as incidence and prevalence estimates, across studies and limits the generalisability and comparability of findings.Citation10,Citation11 As an example, Mukherjee et al examined UK asthma prevalence and reported that annual prevalence of clinician-reported-and-diagnosed asthma was 5.7% (3.6 M individuals) when derived from primary care databases and 6.8% (4.3 M individuals) when derived from the financial incentive-based Quality and Outcomes Framework in UK primary care, whereas annual prevalence of patient-reported clinician-diagnosed-and-treated asthma was 9.6% (6.0 M individuals) when derived from national health surveys.Citation12 Therefore, it is important that standardised codelists exist to support research reproducibility, enable translation of findings between institutions, and reduce duplication of work.Citation13
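The effect of codelist choice on case counts can be illustrated with a minimal sketch. The codes, the `Event` record and the function name below are hypothetical placeholders (they are not real Read or SNOMED CT codes and do not follow any particular database schema). The sketch only shows that a case definition is a membership test against a codelist, so a narrower list can return fewer patients from the same records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One coded clinical event (hypothetical, simplified record layout)."""
    patient_id: int
    coding_system: str  # e.g. "read_v2" or "snomed_ct"
    code: str


# Placeholder codes for illustration only; NOT real clinical codes.
BROAD_CODELIST = {
    ("read_v2", "X001"),
    ("read_v2", "X002"),
    ("snomed_ct", "900001"),
}
NARROW_CODELIST = {("read_v2", "X001")}


def patients_with_condition(events, codelist):
    """Return IDs of patients with at least one event whose
    (coding_system, code) pair appears in the codelist."""
    return {e.patient_id for e in events if (e.coding_system, e.code) in codelist}


events = [
    Event(1, "read_v2", "X001"),
    Event(2, "snomed_ct", "900001"),
    Event(3, "read_v2", "Z999"),  # code not in either codelist
    Event(1, "read_v2", "X002"),  # second matching event for patient 1
]

broad_cases = patients_with_condition(events, BROAD_CODELIST)
narrow_cases = patients_with_condition(events, NARROW_CODELIST)
```

In this toy data the broad list identifies two patients and the narrow list one, which mirrors how differing codelists can produce differing prevalence estimates.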

Previously published codelists for respiratory diseases have included specific codes relating to study-specific research questions, and few validation studies have been published.Citation14,Citation15 We sought to respond to this knowledge gap by developing a systematically derived collection of published codes from EHRs for three common respiratory disease categories: asthma, chronic obstructive pulmonary disease (COPD),Citation1 and selected respiratory infections (Box S1). We aimed to amalgamate all codelists into one document and from that produce a recommended list for each disease, which can be used by researchers to identify relevant respiratory-related codes. We describe our methodology in detail so that researchers can replicate it as appropriate, to ensure transparency and reproducibility of codes used in EHR research.Citation16

Methods

We systematically searched the literature and existing code repositories to identify all codes relating to asthma, COPD, and respiratory infections. We used a similar approach to previously published respiratory-related validation studies and built on this work.Citation7,Citation14–Citation17 Reviewers were split into three groups to search for codes related to asthma, COPD, and respiratory infections, respectively. Each group comprised at least three epidemiological researchers and one clinician researcher to evaluate disease codes.

Search for Codes Published in the Literature

The Medline database was used because of its comprehensive coverage of clinical medicine research. Searches were performed using key search terms for asthma, COPD, and respiratory infections separately (Table 1), and abstracts and full-text articles were screened by at least two researchers for each disease. We included full-text studies published in the English language from January 2020 that reported codelists for asthma, COPD, and respiratory infections (Figure 1). These dates were chosen in order to identify up-to-date codes that could be added to existing systematic reviews and codelists already available in the Health Data Research UK (HDR UK) Phenotype Library.Citation18 Our objective was to identify which codelists are being used in research, rather than to comment on their validity; therefore, a risk of bias analysis was not performed. Supplementary material from the included studies was also reviewed to ensure all codes were identified. Specific codelists of interest included: Read version 2 codes, Clinical Terms version 3 (CTV3), SNOMED CT codes or Clinical Practice Research Datalink (CPRD) medcodeid codes, International Classification of Primary Care (ICPC) codes, International Classification of Diseases (ICD) 9, ICD 10, and ICD 11, and UK Biobank self-diagnosis codes.

Table 1 Search Terms Used to Identify EHR-Related Articles on Asthma, COPD and Respiratory Infections

Figure 1 Study selection, PRISMA diagram.

Abbreviation: PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-analyses.

For asthma, studies that were published from 1 January 2018 were also included to capture work published since the last primary care validation study.Citation6 For COPD and respiratory infections, we searched full texts from 1 January 2020. We excluded codes related to SARS-CoV-2 infections because of the rapidly evolving evidence base in this area. There were no restrictions on study design or on the age of populations studied. Studies that met the search criteria were screened by at least two researchers using reference management and screening tools (Covidence, Zotero and Excel) for the three diseases separately, and codelists that were published were compiled into a Microsoft Excel spreadsheet.

Search for Codes Published in Code Repositories

In addition to a literature search, we searched existing UK code repositories: CALIBER HDR UK Phenomics portal, the Cambridge University repository, LSHTM Data Compass, QResearch, Oxford-RCGP RSC, Manchester Clinical Codes, and OpenSAFELY. Relevant codelists for asthma, COPD, and respiratory infections were added to the list of codes identified from the literature search.

Once all codes were compiled, recommended lists of codes for asthma, COPD, and respiratory infections were created based on validation studies, previous publications and multidisciplinary clinical expertise and consensus. The recommended lists of codes were labelled “BREATHE recommended codes” for diagnosis codes and were further categorised into phenotypes as appropriate. Phenotype lists were created by two non-clinical authors, and a second review was performed by a clinical author. Asthma-related phenotypes included incident and prevalent asthma and exacerbation of asthma. COPD-related phenotypes included emphysema, chronic bronchitis, incident COPD, prevalent COPD, and exacerbations of COPD. Finally, respiratory infections included 20 different types of infection (supplement).
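The compilation step can be sketched in code. The following is a minimal illustration, not the procedure used in this study: the source names and codes are hypothetical, and the `min_sources` threshold is an assumption introduced here only as one possible way to order candidate codes for clinical review. The recommended lists themselves were derived by expert consensus, as described above.

```python
from collections import defaultdict


def merge_codelists(sources):
    """Combine codelists from several sources, recording provenance.

    sources: dict mapping a source name to an iterable of
             (coding_system, code) pairs.
    Returns a dict mapping each unique (coding_system, code) pair to a
    sorted list of the sources in which it appeared.
    """
    provenance = defaultdict(set)
    for name, codes in sources.items():
        for system, code in codes:
            # Strip stray whitespace so "X001 " and "X001" are deduplicated.
            provenance[(system, code.strip())].add(name)
    return {key: sorted(names) for key, names in provenance.items()}


def shortlist(merged, min_sources=2):
    """Codes appearing in at least `min_sources` sources (illustrative
    heuristic only; final inclusion would still need clinical review)."""
    return sorted(key for key, names in merged.items() if len(names) >= min_sources)


# Hypothetical sources and placeholder codes.
sources = {
    "literature": [("read_v2", "X001"), ("read_v2", "X002")],
    "repository_a": [("read_v2", "X001 "), ("snomed_ct", "900001")],
    "repository_b": [("read_v2", "X001"), ("snomed_ct", "900001")],
}

merged = merge_codelists(sources)
candidates = shortlist(merged, min_sources=2)
```

Keeping the provenance of each code alongside the merged list makes it straightforward to report which publications or repositories a recommended code came from.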

Results

For asthma, 1126 articles were identified, of which 37 (3%) included published asthma codes and were therefore included. The COPD search retrieved 240 articles of which 53 (22%) included published COPD codes. The respiratory infection searches retrieved 75 articles for pneumonia, of which four (5%) included published codes, and 15 articles for acute bronchitis, of which one (1%) included published codes; no articles on aspergillosis included published codes.

In terms of codes identified in repositories, 12 codelists were identified for asthma from Cambridge University,Citation19 Keele University,Citation20 LSHTM Data Compass,Citation21 Manchester Clinical Codes,Citation22 NHS England,Citation23 OpenSAFELY,Citation24 and UK Biobank.Citation25 Twenty-three codelists were identified for COPD from the Cambridge University repository, LSHTM Data Compass, Manchester Clinical Codes, and OpenSAFELY. In total, 64 codelists were identified for respiratory infections (pneumonia: 21, acute bronchitis: 12, aspergillosis: 0) from the Manchester Clinical Codes, CALIBER HDR UK Phenomics portal, OpenSAFELY, LSHTM Data Compass and Oxford-RCGP RSC repositories.

Figure 2 illustrates the total number of codes published in the included articles for each disease. Most diagnosis codes for asthma were Read v2, whereas the majority of diagnosis codes for all other diseases were SNOMED CT codes or the corresponding CPRD medcodeid codes. For all diseases, very few Biobank and ICPC codes were found.

Figure 2 Number of codes according to code set terms published in included articles for asthma, COPD, and respiratory infections.

All codelists for asthma, COPD, and respiratory infections can be found on the HDR UK Phenomics portal: https://phenotypes.healthdatagateway.org/about/breathe/#collections.

Discussion

We undertook a systematic literature search of articles that published codelists along with their manuscript and searched codelist repositories to create a comprehensive list of codes used in previous studies related to specific respiratory diseases. From this, we derived a recommended list for each disease for future use either to use as they are or as a starting point for derivation of a list for future work.

Relatively few published papers include or reference published codelists. The majority of included studies used asthma codelists, which is likely to relate to worldwide asthma prevalence being more than twice that of COPD.Citation26 Of all codelists identified in the literature and repositories, the range of codes used to define specific disorders varied, and it was common for research groups to reuse their own codes in each study. Codelists for a variety of databases are continuously updated and made available to the wider scientific community to allow up-to-date and transparent epidemiological research. Codes are added to and removed from specific terminologies (such as SNOMED CT) over time, and researchers should be aware of this and update their codelists as needed. A further nuance is that SNOMED CT codes and Read V2 codes do not always map directly to one another, so independent code searches should be conducted for each code type in the database being used for a study in order to identify all possible codes. Researchers also need to be aware that local SNOMED CT codes and Read V2 codes exist, so not all code browsers may include all possible codes. Overall, our work highlights the importance of systematic search strategies and clinical input to identify codes relevant to specific diseases.
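The recommendation to search each code type independently can be sketched as follows. This is an illustrative example under stated assumptions: the two small dictionaries, their codes and descriptions are invented placeholders rather than extracts from Read V2 or SNOMED CT, and simple substring matching stands in for whatever search a real code browser provides.

```python
def search_dictionary(dictionary, include_terms, exclude_terms=()):
    """Return {code: description} for entries whose description contains
    any include term and none of the exclude terms (case-insensitive)."""
    hits = {}
    for code, description in dictionary.items():
        text = description.lower()
        included = any(term in text for term in include_terms)
        excluded = any(term in text for term in exclude_terms)
        if included and not excluded:
            hits[code] = description
    return hits


# Hypothetical code dictionaries; codes and descriptions are placeholders.
read_v2_dictionary = {
    "R1": "Asthma",
    "R2": "Cardiac asthma",
    "R3": "Chronic bronchitis",
}
snomed_ct_dictionary = {
    "S1": "Asthma (disorder)",
    "S2": "Exercise-induced asthma",
    "S3": "Family history of asthma",
}

include = ["asthma"]
exclude = ["cardiac", "family history"]

# Each coding system is searched on its own rather than mapping one
# system's results onto the other.
read_hits = search_dictionary(read_v2_dictionary, include, exclude)
snomed_hits = search_dictionary(snomed_ct_dictionary, include, exclude)
```

Candidate codes returned by such a search would still need clinical review before being added to a codelist.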

Limitations and Use of Codes for Research

Not only do authors often not publish codelists with their work but there are also few validation studies of codelists. To date, our team has created inclusive codelists for COPD, asthma, and respiratory infections that can be used to find these respiratory diseases within UK datasets of routinely collected electronic health records, including sources such as the CPRD GOLD and Aurum, Hospital Episode Statistics, and the SAIL Databank. These codes are relatively broad and cover a range of phenotypes as well as incident and prevalent disease codes.

Researchers must be cautious when using these codes to identify specific populations of individuals with these diseases, and the choice of codes will depend on the researcher’s specific research question. For example, researchers should only consider incident codes when identifying an incident disease population, and if the study group wishes to identify, for example, emphysema, a specific emphysema codelist should be used rather than a more general COPD codelist. Furthermore, EHRs are primarily maintained to support health care rather than research: specific codes may be preferred by clinicians, some codes may not be coded correctly, and some may not be used at all.Citation27 A study examining the usage of disease codes in primary care found that, in two million consultations performed over a seven-year period, 50% of EHRs were populated with only eight codes out of the 352 (2.3%) possible codes, and in 95% of cases only 36 codes out of 352 (10.2%) were used. Twenty-one percent of all possible allergy codes were never used.Citation28 This highlights the challenges of using EHR data and the importance of creating robust codelists to identify all possible events. The choice of codes for the same clinical condition may also need to be reviewed locally and tailored to the population or dataset, as coding practices and data quality may differ between UK regions. In addition, these codes may have limited utility outside the UK, and knowledge of local healthcare systems is essential to appreciate why certain codes are used and when.
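The concentration of code usage described above can be checked in a given dataset with a short calculation. The sketch below is an assumption-laden illustration: the usage counts are invented, and the function name is hypothetical. It finds the smallest number of codes (most frequently used first) needed to account for a given share of coded records, and counts codes that were never used.

```python
def codes_needed_for_coverage(usage_counts, coverage):
    """Smallest number of codes, taken in descending order of use,
    whose combined count reaches `coverage` (0-1) of all coded records."""
    total = sum(usage_counts.values())
    if total == 0:
        raise ValueError("no coded records")
    running = 0
    ranked = sorted(usage_counts.values(), reverse=True)
    for n_codes, count in enumerate(ranked, start=1):
        running += count
        if running >= coverage * total:
            return n_codes
    return len(ranked)


# Invented usage counts for seven placeholder codes.
usage_counts = {"A": 30, "B": 20, "C": 15, "D": 15, "E": 10, "F": 10, "G": 0}

half = codes_needed_for_coverage(usage_counts, 0.50)
most = codes_needed_for_coverage(usage_counts, 0.95)
never_used = sum(1 for count in usage_counts.values() if count == 0)

# The percentages quoted in the text follow from the same arithmetic.
share_eight = round(8 / 352 * 100, 1)
share_thirty_six = round(36 / 352 * 100, 1)
```

A codelist restricted to the handful of frequently used codes would capture most records in such a dataset, but rarely used codes still need to be reviewed if all possible events are to be identified.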

In addition to the use of specific codes for case definition, other important parameters should be considered depending on the clinical condition and database. This is because specific codes for various conditions may be underused by health care providers and the condition may instead be captured by other variables in the database, such as prescriptions, tests (such as spirometry to diagnose COPD) and symptoms. One example is the Quality and Outcomes Framework (QOF) indicator for asthma, AST001 (suspended in 2021), which uses a 12-month lookback period for prescriptions (in addition to an ever-recorded diagnosis) to identify individuals with active/treated asthma. Similarly, when identifying patients with exacerbations of COPD, other parameters such as prescriptions for respiratory-related oral corticosteroids and antibiotics, and symptoms, should be used in addition to exacerbation and lower respiratory tract infection codes to identify all possible events and patients.
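A definition that combines a diagnosis code with a prescription lookback can be sketched as below. This is a simplified illustration and not the official QOF business rules: the function name, the 365-day window and the example dates are assumptions, and the inputs are taken to be dates already extracted using diagnosis and prescription codelists.

```python
from datetime import date, timedelta


def has_active_asthma(diagnosis_dates, prescription_dates, index_date,
                      lookback_days=365):
    """True if the patient has any asthma diagnosis on or before the index
    date AND an asthma-related prescription within the lookback window."""
    ever_diagnosed = any(d <= index_date for d in diagnosis_dates)
    window_start = index_date - timedelta(days=lookback_days)
    recently_treated = any(
        window_start <= d <= index_date for d in prescription_dates
    )
    return ever_diagnosed and recently_treated


index = date(2021, 4, 1)

# Diagnosed in 2015 and prescribed treatment within the last 12 months.
treated = has_active_asthma([date(2015, 3, 1)], [date(2020, 11, 15)], index)

# Diagnosed in 2015 but last prescription was more than 12 months ago.
lapsed = has_active_asthma([date(2015, 3, 1)], [date(2018, 1, 1)], index)

# Recent prescription but no recorded diagnosis.
undiagnosed = has_active_asthma([], [date(2020, 11, 15)], index)
```

The same pattern (a code-based event combined with a dated supporting record) could be adapted to other definitions mentioned above, such as COPD exacerbations identified from codes together with oral corticosteroid and antibiotic prescriptions.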

Transparency of Codes Used for Research

Transparency in research is vital and increasingly expected by funding organisations. Whilst initiatives such as RECORD have helped to increase transparency in reporting of studies undertaken using observational routinely collected data and have led to an extension of STROBE for this purpose, journals do not often mandate that codes or methodologies used for deriving codes for use in routine sources of data are published. It is important for studies to disclose the codelists used to allow the methods to be fully understood, findings interpreted clearly and for analysis to be replicated. One way in which to do this is to include or reference exact lists of codes in published manuscripts rather than only including vague code ranges. We aim to build on this work and create codelists for specific phenotypes (as well as incident and prevalent codes) for asthma, COPD, and respiratory infections with input from respiratory clinicians. These codes could be used to identify sub-populations of patients with respiratory diseases (such as severe asthma) or specific disease-related events (such as an exacerbation of COPD).

Conclusions

We have compiled codelists with the intention of helping researchers find the most up-to-date codes relevant to their study, which will ultimately help comparative respiratory research. Our standardised codelists for respiratory diseases address the inconsistency and lack of transparency described above by providing comprehensive lists that can be used to research respiratory disease, leading to new and clinically important research insights to improve respiratory health. Since lists of codes vary by research question, these lists might need to be tailored to the exact research question being addressed and can be seen as a starting point for defining respiratory diseases in EHRs. More transparency in reporting is needed, as are validation studies for phenotypes that have not yet been validated, given that these data will be used increasingly frequently and by more people.

Author Contributions

JKQ conceptualised the study and all authors contributed to study design, searching the literature and collating and deriving recommended codelists. CM, HW and MM drafted the original manuscript, with critical revision of the manuscript by all authors. All authors approved the final manuscript. All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. The corresponding author is also the guarantor for this manuscript and accepts full responsibility for the work, had access to all the data and was responsible for the decision to publish.

Acknowledgments

We thank the SAIL team for creating the website pages.

Disclosure

CM, HW, MM, LD, AM, CI, MA, EV, EOR, ATW, PWS have nothing to declare. JKQ reports grants from AUK-BLF, The Health Foundation, MRC, grants and personal fees from AZ, BI, GSK, Bayer, grants from Chiesi, outside the submitted work. AS reports grants from AUK-BLF and HDR UK.

Additional information

Funding

This work is supported by BREATHE – The Health Data Research Hub for Respiratory Health [MC_PC_19004]. BREATHE is funded through the UK Research and Innovation Industrial Strategy Challenge Fund and delivered through Health Data Research UK. The funder had no role in study design, data collection, analysis or interpretation, or manuscript writing. All authors had full access to all the data in the study and had final responsibility for the decision to submit for publication.

References