Editorial

Data Science Using the Human Epigenome for Predicting Multifactorial Diseases and Symptoms

Pages 273-276 | Received 08 Sep 2023, Accepted 18 Jan 2024, Published online: 05 Feb 2024

Tweetable abstract

This article reviews machine learning models that leverage epigenomic data to predict multifactorial diseases and symptoms, and how such models can be used to explore new research questions.

Data science for developing machine-learning models that classify diseases and predict biomarkers has become a growing field of study in the life sciences and other research fields. This trend was pioneered by the DNA methylation age model developed by Horvath in 2013 [Citation1]. Subsequently, epigenetic drift, in which DNA methylation becomes more likely with advancing age, leading to vulnerability in gene regulation, was also identified [Citation2], providing further biological evidence for the methylation age model. It is increasingly recognized that the biological basis of how healthy or unhealthy we are involves a substantial amount of epigenomic information, which is influenced by environmental factors and our innate genetic sequence. This article examines the emerging applications of methylation risk scores (MRS) as biomarkers of common multifactorial and lifestyle-related diseases and symptoms. It also describes how such models can be used to explore new research questions. The article is intended to draw the interest of a wide range of potential users of epigenomic data science, including medical and life science researchers and psychologists, rather than being directed at experts in epigenomic research.

Genetic sequence variations have been associated with many congenital diseases, and establishing a diagnosis is crucial in determining the therapeutic strategy for a patient. However, in many cases it is difficult to make a definitive diagnosis for patients suspected of having a specific congenital disease if they do not carry known mutations or if they carry variants of uncertain significance (VUS), whose pathological significance is unknown. In this context, attention has turned to the fact that epigenomic modifications, such as DNA methylation, are influenced by genetic sequences, change the 3D structure of chromatin and play a role in regulating transcriptional activity. In such cases, the distinctive features of DNA methylation patterns can serve as an intermediate between genetic sequences and phenotypes. Consequently, large-scale case-control studies of specific congenital diseases have been conducted, and the refined methylation clusters have been described as the episignature of each disease [Citation3]. Building on these accumulated results, machine learning algorithms have been implemented to classify diseases using the episignatures of various diseases, and an adjunctive diagnostic service named EpiSign is already available for congenital (Mendelian) neurodevelopmental diseases. The service is based on a microarray of blood-derived DNA and can comprehensively classify the presence or absence of more than 90 congenital neurodevelopmental disorders. This is a breakthrough technology in data science using epigenomic data for clinical practice. As more patients use the service and more real-world data accumulate, classification accuracy is expected to improve further. Although this technology focuses on qualitative performance for classification and diagnosis, it is expected to be applicable to the study of multifactorial and lifestyle-related diseases that cannot be explained by single-gene mutations.

Data science using epigenomic data is now also being applied to common multifactorial and lifestyle-related diseases and symptoms, covering various clinical and behavioral factors. For example, MRS have been developed for predicting disease risk or severity for Alzheimer’s disease [Citation4], schizophrenia [Citation5], depression [Citation6], osteoarthritis progression [Citation7], and coronary heart disease [Citation8]. For Alzheimer’s disease, in addition to the classification model constructed from cross-sectional data, a model that predicts symptom progression from longitudinal data has been proposed based on its potential clinical utility [Citation4]. Because DNA methylation patterns are tissue-specific, many studies attempt to build predictive models from proxy tissues. For example, Gunasekara and colleagues built a classification model for schizophrenia based on 9,926 correlated regions of systemic interindividual variation (CoRSIVs), which are not considered to be tissue-specific [Citation5]. As with schizophrenia, a classification model for major depression has been constructed from a large dataset and is comparatively effective relative to a polygenic risk score (PRS), demonstrating the value of combining genetic and epigenetic data for classification [Citation6]. For osteoarthritis of the knee, a model constructed from baseline methylation data was investigated for predicting structural abnormality and pain progression, which are typically assessed through radiographic images and clinical evaluations [Citation7]. This could help monitor the progression of osteoarthritis and guide treatment decisions in the future.

In addition to disease classification models, there are now predictive models of lifestyle or behavioral habits such as smoking history (DNAmPACKYRS) [Citation9], alcohol consumption [Citation10], obesity [Citation11], and psychological resilience [Citation12], which can enable more precise modeling of disease risk. MRS have also been developed to predict plasma hormone and cytokine levels. For example, in the development of the GrimAge clock, Lu and colleagues developed methylation-based models for seven plasma proteins (adrenomedullin, beta-2 microglobulin, cystatin C, growth differentiation factor 15, leptin, plasminogen activator inhibitor 1, and tissue inhibitor of metalloproteinases 1) that were defined as clinical indicators of lifespan [Citation9]. Recently, Thompson and colleagues leveraged electronic health records across a large cohort (n = 831) of diverse samples in the UCLA Health biobank to construct MRS for 607 phenotypes, including clinical lab tests, medication use and medical diagnoses [Citation13], demonstrating the exceptional potential of MRS as clinical and research tools.

A key advantage of such data science classifications and predictions is their use in secondary analysis of archived data. In a prospective study, hormones, cytokines, plasma proteins, and more can be measured with enzyme-linked immunosorbent assay (ELISA), high-performance liquid chromatography (HPLC), and luminescence. In contrast, when conducting retrospective studies using public archival data, such as a dataset available on the Gene Expression Omnibus (GEO) or similar platforms, the data available to researchers are limited to whatever is provided in the archive. However, MRS can now predict additional information, enabling researchers to analyze archived data retrospectively from a new perspective. For example, the epigenome recovered from ancient human DNA can be used to predict an individual's age at death [Citation14].
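As a rough illustration of this kind of secondary analysis, the sketch below applies a published set of model weights to one archived sample. It assumes the model is a simple weighted sum of methylation beta values plus an intercept, which is a common form for an MRS but is not specified in this article. The probe IDs, weights and the function name `score_sample` are hypothetical. The function also reports CpG sites that the archive does not provide, since an archived dataset may not cover every site a model expects.

```python
from typing import Dict, List, Tuple


def score_sample(betas: Dict[str, float],
                 weights: Dict[str, float],
                 intercept: float = 0.0) -> Tuple[float, List[str]]:
    """Return (score, missing_probes) for one archived sample.

    betas   : CpG probe ID -> methylation beta value (0..1) from the archive
    weights : CpG probe ID -> coefficient taken from a published model
    Assumption: the score is intercept + sum(weight * beta); probes absent
    from the archive are skipped and listed so the user can judge coverage.
    """
    missing = sorted(p for p in weights if p not in betas)
    total = sum(w * betas[p] for p, w in weights.items() if p in betas)
    return intercept + total, missing


# Hypothetical three-probe model and one hypothetical archived sample.
toy_weights = {"cg_A": 2.0, "cg_B": -1.0, "cg_C": 0.5}
toy_sample = {"cg_A": 0.50, "cg_B": 0.20, "cg_X": 0.90}

score, missing = score_sample(toy_sample, toy_weights, intercept=1.0)
print("score:", round(score, 3))
print("probes missing from archive:", missing)
```

How to handle missing probes (skipping, imputing, or refusing to score) is a design choice that depends on the model. This sketch only skips and reports them.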

Progressive data-sharing policies have resulted in publicly available datasets ranging from large epidemiological cohorts to well-controlled intervention trials. This offers an unprecedented opportunity to leverage data science to accelerate hypothesis generation, perform rigorous replication studies, and develop novel methodologies. Here is a specific example. Hack and colleagues developed an MRS for 17β-estradiol (E2), the major female sex steroid hormone, based on epigenomic data from blood [Citation15]. We applied it to a recently published dataset (GEO: GSE176394) (https://github.com/snishit/Hacks_E2_predictor) (Supplementary Data 1) [Citation16]. This dataset comes from a longitudinal study of DNA methylation in blood before and during gender-affirming hormone therapy (GAHT) [Citation17]. Thirteen transgender women (assigned male at birth) received estradiol therapy, while 13 transgender men (assigned female at birth) received testosterone therapy. In all cases, the hormone therapy dosage and delivery method (transdermal, oral, etc.) varied from patient to patient, and the public dataset alone provides limited phenotype information. E2 concentrations are shown as averages for each group and time point in the manuscript [Citation17], but the values for individual patients are not publicly available. This is also the case for age. To calculate E2 levels by data science and examine their change over time in this cohort, we first used Horvath’s method [Citation1] to impute age (because the E2 model requires age as a variable in addition to methylation sites) and then applied Hack’s method [Citation15] to calculate E2 from 35 CpG sites and the predicted methylation age (methylation age was used here, but actual age should normally be used if available). The predicted E2 (pg/ml) was plotted for each group at baseline, 6 months, and 12 months (Supplementary Figure 1). The results show that the predicted E2 calculated from the model decreases at 6 and 12 months for transgender men and increases for transgender women. This suggests that the model developed by Hack and colleagues is able to predict E2 levels in the body over the course of a gender transition. The E2 prediction model was originally developed in adult cisgender women [Citation15], but it appeared to capture the longitudinal E2 changes associated with GAHT in transgender women and the decrease in E2 associated with masculinizing therapy in transgender men. In the future, larger datasets will enable the development and refinement of more MRS and inspire new applications for data science in this field.
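The two-step procedure described above (impute age from methylation, feed that age into the E2 model, then compare groups over time) can be sketched as follows. This is a minimal sketch under stated assumptions and not the code from the linked repository. The back-transformation with an adult age of 20 years follows the published description of Horvath's clock as best understood here, while the linear form of the E2 model, all coefficients, probe IDs and function names are hypothetical placeholders. The real models use their own published CpG sites and weights.

```python
import math
from collections import defaultdict
from statistics import mean

ADULT_AGE = 20  # constant in the clock's age transformation (assumed from Horvath 2013)


def horvath_anti_transform(x, adult_age=ADULT_AGE):
    """Map the clock's linear predictor back to an age in years."""
    if x < 0:
        return (1 + adult_age) * math.exp(x) - 1
    return (1 + adult_age) * x + adult_age


def linear_predictor(betas, weights, intercept):
    """intercept + sum(weight * beta) over the model's CpG sites."""
    return intercept + sum(w * betas[cpg] for cpg, w in weights.items())


def predict_e2(betas, clock_weights, clock_intercept,
               e2_weights, e2_intercept, e2_age_coef):
    """Step 1: impute methylation age. Step 2: E2 from CpGs plus that age.

    Assumption: the E2 model is linear in its CpG sites and in age.
    """
    dnam_age = horvath_anti_transform(
        linear_predictor(betas, clock_weights, clock_intercept))
    e2 = linear_predictor(betas, e2_weights, e2_intercept) + e2_age_coef * dnam_age
    return dnam_age, e2


def mean_by_group_and_time(records):
    """records: iterable of (group, timepoint, value) -> {(group, timepoint): mean}."""
    buckets = defaultdict(list)
    for group, timepoint, value in records:
        buckets[(group, timepoint)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical one-probe "clock" and one-probe "E2 model" on a toy sample.
toy_betas = {"cg_1": 0.5, "cg_2": 0.4}
age, e2 = predict_e2(toy_betas,
                     clock_weights={"cg_1": 1.0}, clock_intercept=-0.5,
                     e2_weights={"cg_2": 100.0}, e2_intercept=10.0,
                     e2_age_coef=0.5)
print("imputed age:", round(age, 2), "predicted E2:", round(e2, 2))

# Hypothetical predicted values, summarised per group and time point.
toy_records = [("trans women", "baseline", 10.0), ("trans women", "baseline", 20.0),
               ("trans men", "6 months", 5.0)]
print(mean_by_group_and_time(toy_records))
```

When chronological age is available, it would replace the imputed `dnam_age` in the second step.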

Building predictive models for multifactorial diseases and symptoms using epigenomic data will be challenging but is quite promising in terms of potential impact. As many investigators have already done, making model weights and code available to the public will be important so that others can use and even improve on new methods. While many funders and journals require archiving genome-wide data in public databases such as GEO, sharing analysis models and code through platforms like GitHub is equally critical for accelerating research across biological and medical fields.

Author contributions

Conceptualization and design: S Nishitani; writing and editing of the manuscript: S Nishitani, AK Smith; data analysis: S Nishitani; critical revisions and supervision: AK Smith, A Tomoda, TX Fujisawa.

Competing interests disclosure

The authors have no competing interests or relevant affiliations with any organization or entity with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Writing disclosure

No writing assistance was utilized in the production of this manuscript.

Supplementary data

To view the supplementary data that accompany this paper please visit the journal website at: www.futuremedicine.com/doi/suppl/10.2217/epi-2023-0321

Financial disclosure

This study was supported by the JSPS KAKENHI Fund for the Promotion of Joint International Research (Fostering Joint International Research (A), 20KK0280 to TX Fujisawa), Scientific Research (B) (23H00947 to TX Fujisawa), Research Grants from the University of Fukui (FY 2023 to S Nishitani) and the National Institute of Mental Health (R01MH108826 to AK Smith) in the USA. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

References

  1. Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 14(10), R115 (2013).
  2. Shah S, McRae AF, Marioni RE et al. Genetic and environmental exposures constrain epigenetic drift over the human life course. Genome Res. 24(11), 1725–1733 (2014).
  3. Aref-Eshghi E, Kerkhof J, Pedro VP et al. Evaluation of DNA Methylation Episignatures for Diagnosis and Phenotype Correlations in 42 Mendelian Neurodevelopmental Disorders. Am. J. Hum. Genet. 106(3), 356–370 (2020).
  4. Chen L, Saykin AJ, Yao B, Zhao F, Alzheimer’s Disease Neuroimaging Initiative (ADNI). Multi-task deep autoencoder to predict Alzheimer’s disease progression using temporal DNA methylation data in peripheral blood. Comput. Struct. Biotechnol. J. 20, 5761–5774 (2022).
  5. Gunasekara CJ, Hannon E, Mackay H et al. A machine learning case-control classifier for schizophrenia based on DNA methylation in blood. Transl. Psychiatry 11(1), 412 (2021).
  6. Barbu MC, Shen X, Walker RM et al. Epigenetic prediction of major depressive disorder. Mol. Psychiatry 26(9), 5112–5123 (2021).
  7. Dunn CM, Sturdy C, Velasco C et al. Peripheral Blood DNA Methylation-Based Machine Learning Models for Prediction of Knee Osteoarthritis Progression: Biologic Specimens and Data From the Osteoarthritis Initiative and Johnston County Osteoarthritis Project. Arthritis Rheumatol. 75(1), 28–40 (2023).
  8. Zhang X, Wang C, He D et al. Identification of DNA methylation-regulated genes as potential biomarkers for coronary heart disease via machine learning in the Framingham Heart Study. Clin. Epigenetics 14(1), 122 (2022).
  9. Lu AT, Quach A, Wilson JG et al. DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging (Albany NY) 11(2), 303–327 (2019).
  10. Maas SCE, Vidaki A, Teumer A et al. Validating biomarkers and models for epigenetic inference of alcohol consumption from blood. Clin. Epigenetics 13(1), 198 (2021).
  11. Lee YC, Christensen JJ, Parnell LD et al. Using Machine Learning to Predict Obesity Based on Genome-Wide and Epigenome-Wide Gene-Gene and Gene-Diet Interactions. Front. Genet. 12, 783845 (2021).
  12. Lu AK, Hsieh S, Yang CT, Wang XY, Lin SH. DNA methylation signature of psychological resilience in young adults: constructing a methylation risk score using a machine learning method. Front. Genet. 13, 1046700 (2022).
  13. Thompson M, Hill BL, Rakocz N et al. Methylation risk scores are associated with a collection of phenotypes within electronic health record systems. NPJ Genom. Med. 7(1), 50 (2022).
  14. Pedersen JS, Valen E, Velazquez AM et al. Genome-wide nucleosome map and cytosine methylation levels of an ancient human genome. Genome Res. 24(3), 454–466 (2014).
  15. Hack LM, Nishitani S, Knight AK et al. Epigenetic prediction of 17β-estradiol and relationship to trauma-related outcomes in women. Compr. Psychoneuroendocrinol. 6, 100045 (2021).
  16. Nishitani S. Hacks_E2_predictor. https://github.com/snishit/Hacks_E2_predictor (2023).
  17. Shepherd R, Bretherton I, Pang K et al. Gender-affirming hormone therapy induces specific DNA methylation changes in blood. Clin. Epigenetics 14(1), 24 (2022).