Review

The landscape for epigenetic/epigenomic biomedical resources

Pages 982-986 | Published online: 09 Aug 2012

Abstract

Recent advances in molecular biology and computational power have seen the biomedical sector enter a new era, with corresponding development of Bioinformatics as a major discipline. Generation of enormous amounts of data has driven the need for more advanced storage solutions and shared access through a range of public repositories. The number of such biomedical resources is increasing constantly and mining these large and diverse data sets continues to present real challenges. This paper attempts a general overview of currently available resources, together with remarks on their data mining and analysis capabilities. Of interest here is the recent shift in focus from genetic to epigenetic/epigenomic research and the emergence and extension of resource provision to support this both at local and global scale. Biomedical text and numerical data mining are both considered, the first dealing with automated methods for analyzing research content and information extraction, and the second (broadly) with pattern recognition and prediction. Any summary and selection of resources is inherently limited, given the spectrum available, but the aim is to provide a guideline for the assessment and comparison of currently available provision, particularly as this relates to epigenetics/epigenomics.

Introduction

The Human Genome Project (HGP), completed in 2003, led to the identification of more than 20,000 genes and determined the sequence of the three billion chemical base pairs of human DNA. Tremendous advances in medical technologies, together with corresponding developments in computational power, storage capacity, inter-connectivity and cost effectiveness, have resulted in explosive growth in the generation and collection of all kinds of biomedical data and, in the past decade, the importance of bioinformatics has been recognized.Citation1 Data warehousing,Citation2 as a way of dealing with large data sets, combines databases across an entire enterprise, whereas independent or federated systems seek to integrate multiple autonomous databases into a single federation, with constituent databases interconnected via a network and often geographically decentralized.Citation3,Citation4 One example is the set of bioinformatics data sources linked by the Entrez Life Sciences search engine.Citation5

Biomedical data cover a wide range, from patient records to information from pharmaceutical studies, specific disease research and different ‘omics’ studies, including genomics, proteomics and transcriptomics. Resource types can be classified by two key features: first, the means or method by which access is provided to entities; second, the nature of the entities themselves. The repository or web service that provides access to these data is a vital component of biomedical data resourcing.Citation6 An example is PubMed, the NLM’s web-based interface to MEDLINE, the premier bibliographic index to journal articles in the Life Sciences. In general, resource providers, such as PubMeth and MutationDB, review research papers from the domain and mine these for information relevant to the scientific audience. Typically, non-profit research institutes, such as the Sanger Institute, University of California Santa Cruz (UCSC), National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), European Molecular Biology Laboratory (EMBL) and European Bioinformatics Institute (EBI), among others, make such data publicly available over the internet so that these can be further analyzed/mined for knowledge discovery.
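As an illustration of web-service access to such a repository, the sketch below builds a search request for the NCBI E-utilities ESearch endpoint, which exposes Entrez databases such as PubMed programmatically. It only constructs the URL and does not send it; the function name `build_esearch_url`, the default values and the example search term are assumptions made for illustration and are not taken from the resources reviewed here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch URL for `term` in the Entrez database `db`.

    `retmax` caps the number of record identifiers returned and
    `retmode=json` asks for a JSON response. (Hypothetical helper.)
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Illustrative query only; the search term is an assumption.
    print(build_esearch_url("DNA methylation AND colorectal cancer"))
```

The returned URL can be fetched with any HTTP client; keeping URL construction separate from the network call makes the query easy to inspect and log before it is sent.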

Biological/biomedical resources may be one of several types: primary, secondary or composite. Primary databases hold information on the biological entities themselves, such as sequence or structure, e.g., SwissProt and PIR (protein sequences), GenBank and DDBJ (genome sequences). Secondary resources contain information derived from primary sources; examples include eMOTIF (Stanford) and SCOP (Cambridge). Composite resources typically draw information from a variety of different databases, such as those of the NCBI genome browser and Genecards.Citation7 The most popular genome browsers today are Ensembl, NCBI Map Viewer and UCSC, which act as gateways for access to genetic and epigenetic information.

Following completion of the Human Genome Project, increased attention has been paid to processes that lead to heritable changes in gene expression, during development or across generations, without altering the nucleotide sequence within the DNA. Both epigenetics and epigenomics, the genome-wide distribution of epigenetic changes, have become major areas of research focus. Principal epigenetic phenomena encompass DNA methylation, histone modification (methylation/demethylation, acetylation/deacetylation, phosphorylation, ubiquitylation and sumoylation), gene silencing, genomic imprinting and X-chromosome inactivation. Recently-launched large-scale initiatives include, among others, IHEC (International Human Epigenome Consortium),Citation8 which plans to map up to 1,000 reference epigenomes within a decade, and the Human Epigenome Project (HEP),Citation9 which aims to identify, catalog and interpret genome-wide DNA methylation patterns of all human genes in all major tissues.Citation10

Epigenetics, cancer and other diseases

Epigenetic abnormalities have been found to be causative factors of cancer, genetic disorders and pediatric syndromes, as well as contributory factors of autoimmune diseases and aging.Citation11 The recent intensive research on cancer epigenetics has also led to the discovery of many epigenetic markers that play an important role in disease initiation. As a consequence, cancer-related epigenetic resources preponderate over others. Two of the large-scale project initiatives for cancer research are ICGC (see “ICGC” section below) and TCGA (The Cancer Genome Atlas). TCGA has achieved comprehensive sequencing, characterization and analysis of the genomic changes in various cancers and intends to chart the genomic changes involved in more than 20 types of cancers.Citation12 The epigenetic resources are outlined in the following sections, with additional assessment of their data mining capabilities, intrinsic or externally accessed, and their adequacy provided where possible.

DNA methylation, the first and most common epigenetic alteration to be observed, can induce “epigenetic silencing” (or loss of expression) of tumor suppressor genes, causing normal cells to be transformed into cancer cells.Citation13,Citation14 A direct link also exists between DNA methylation and histone modification, since a number of proteins involved in DNA methylation (e.g., DNMTs and MBDs) directly interact with histone modifying enzymes, such as histone methyltransferases (HMTs) and histone deacetylases (HDACs).Citation15 Epigenetic resources incorporating methylation signatures are described in the “Methylation” section below.

Resources for Epigenetic/Epigenomic Signatures

Epigenetic/epigenomic resources are inevitably less comprehensive to date but can be broadly categorized in terms of type of data content, tools and access, and are described below.

Methylation

PubMeth,Citation16 a cancer methylation database, provides a sorted, annotated and summarized overview of genes reported to be methylated in various cancers, with user queries based on gene or cancer type. PubMeth draws on text mining of Medline/PubMed abstracts, combined with manual annotation of pre-selected abstracts. The text mining approach in PubMeth is fast, enabling search of multiple aliases and textual variants of these aliases, and querying of multiple keyword lists simultaneously. PubMeth also provides the facility to browse a pre-computed gene list, without having to query the database directly.
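The general idea of alias-aware abstract screening can be sketched as follows. This is a minimal illustration, not PubMeth's actual implementation: the variant rules (hyphen, space or no separator), the function names and the example aliases and keyword lists are all assumptions chosen for clarity.

```python
import re


def alias_variants(alias):
    """Generate simple textual variants of an alias (joined, hyphenated, spaced)."""
    parts = re.split(r"[-\s]+", alias.strip())
    return {sep.join(parts) for sep in ("", "-", " ")}


def build_pattern(aliases):
    """Compile one case-insensitive regex matching any variant of any alias."""
    variants = set()
    for alias in aliases:
        variants |= alias_variants(alias)
    # Longest variants first so they are preferred over shorter alternatives.
    ordered = sorted(variants, key=len, reverse=True)
    body = "|".join(re.escape(v) for v in ordered)
    # Require non-alphanumeric boundaries so aliases are not matched inside words.
    return re.compile(r"(?<![A-Za-z0-9])(?:" + body + r")(?![A-Za-z0-9])",
                      re.IGNORECASE)


def screen_abstract(text, gene_aliases, keyword_lists):
    """Return the keywords hit in each list if a gene alias occurs, else None."""
    if not build_pattern(gene_aliases).search(text):
        return None
    lowered = text.lower()
    return {name: sorted(k for k in keywords if k.lower() in lowered)
            for name, keywords in keyword_lists.items()}


if __name__ == "__main__":
    # Illustrative inputs only (assumed, not drawn from PubMeth).
    abstract = ("Promoter hypermethylation of p16 INK4a was frequent "
                "in colorectal carcinoma.")
    lists = {"methylation": ["hypermethylation", "methylated"],
             "cancer": ["carcinoma", "tumor"]}
    print(screen_abstract(abstract, ["CDKN2A", "p16-INK4a"], lists))
```

Abstracts that pass such a screen would still need manual annotation, as the text notes, since keyword co-occurrence alone does not establish that the gene is reported as methylated in the cancer concerned.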

MethDBCitation17 is also a major source of experimentally confirmed DNA methylation data but is general, more sample-oriented and not optimized for cancer-related queries. The database is designed to store and annotate information on the occurrence of methylated cytosines in DNA. It currently contains 19,905 methylation content data items and 5,382 methylation patterns or profiles for 48 species, 1,511 individuals, 198 tissues and cell lines and 79 phenotypes. MethDB also has a public online submission system available.Citation18 The resource forms part of an integrated network of biological databases through DAS (Distributed Annotation System), enabling the epigenetic data to be viewed as a layer in the human genome, and is also connected to Ensembl (for DNA sequences with available MethDB data aligned to NCBI Refseq).

A subset resource, MethPrimerDB,Citation19 is a database of primer sequences used in PCR-based methylation methods. The database depends on submissions by users and administrators, which guarantees the required quality of the database but not necessarily its completeness. To date, there are 29 primer sets. In 2006, the MethBLAST feature was added to MethPrimerDB oligonucleotide sequences. No further updates since 2006, however, were found for this resource.

MethyCancerCitation20 is a disease-oriented database, specifically of human DNA methylation and cancer, that aims to integrate methylation databases and has developed a meta-data format for data standardization, with manual curation still used for noisy data. Four main types of data are included in MethyCancer, namely, (1) CGI clones and global CGI predictions, (2) DNA methylation data, (3) cancer information, genes and mutations, and (4) correlations of DNA methylation, gene expression and cancer. MethyView, a visualization tool from MethyCancer, is used to facilitate the browsing of methylation data in the context of existing human genome annotations. A search engine to query different data types and interactions from the MethyCancer database provides simple keyword search and also offers advanced options, namely, “methylation,” “gene,” “cancer,” “clone” and “repeat” searches. For example, the methylation search enables the user to specify and combine query options, such as methylation type (pattern, profile, content, domain), data source (BIG/UHN, MethDB,Citation17 HEP,Citation9 Columbia University), experimental methods, sample information (tissue, sex, age, phenotype) and chromosomal positions.

On similar lines, MethylogixCitation21 provides a high density DNA methylation database of human chromosomes 21 and 22, a CpG island DNA methylation database for male germ cells, enabling comprehensive analysis of DNA methylation variation between and within the germ lines of normal males, and a targeted DNA methylation database of late-onset Alzheimer disease. Similarly, Methtools is a collection of software tools for handling and analysis of DNA methylation data, generated by the Bisulfite Genomic Sequencing method.Citation22

Genomic imprinting related resources

Genomic imprinting is an important epigenetic phenomenon whereby inherited genes are ‘imprinted’ due to one copy of the gene being epigenetically marked in either the egg or the sperm. Thus, the allelic expression of an imprinted gene depends on whether it is inherited maternally or paternally. Imprinted expression can also vary between tissues, developmental stages and species.Citation23 The Geneimprint databaseCitation24 includes genes and related information on genomic imprinting for different animals, including humans, gathered from NCBI. Genes are listed by species, sorted by chromosomal location, name and imprinting status, and are provided through the web interface. Similarly, an imprinted gene and parent-of-origin effect databaseCitation25 presents imprinted genes and related effects. This consists of two sections: (i) a catalog of current literature on imprinted genes in humans and animals and (ii) a catalog of reports of parental origin of de novo mutations in humans alone. The addition of (ii), showing a parent-of-origin effect, expands the scope of the database and provides a useful tool for examining parental origin trends for different types of spontaneous mutations. This second section currently includes more than 1,700 mutations, found in 59 different disorders. The 85 imprinted genes are described in 152 entries from several mammalian species. In addition, more than 300 other entries describe a range of reported parent-of-origin effects in animals.Citation26 A further resource, containing information on mouse gene imprinting,Citation27 also includes an imprinting catalog, as well as chromosome anomalies in mutant mouse lines. This represents integration of curated information from the MRC Harwell stock resource and other Harwell databases, with additional information from external data resources such as IMSR (International Mouse Strain Resource).

Histone and chromatin-related resources

The Histone database,Citation28 of the National Human Genome Research Institute, provides a complete set of histone protein sequences. Nucleosomes, through various core histone post-translational modifications and incorporation of diverse histone variants, can serve as epigenetic markers to control processes such as gene expression and recombination. The Histone Sequence Database is a curated collection, assembled from major public databases, of sequences and structures of histones and non-histone proteins containing histone folds. A substantial increase in the number of sequences and taxonomic coverage for histone and histone fold-containing proteins is available. The database also provides comprehensive multiple sequence alignments for each of the four core histones (H2A, H2B, H3 and H4), the linker histones (H1/H5) and the archaeal histones. Also included is current information on solved histone fold-containing structures. The database is thus an inclusive resource for the analysis of chromatin structure and function.

Chromatin.us is another web portal that includes information on chromatin proteins, histones and nucleosome structures and non-histone chromatin protein structures, and provides links to the protein data bank (PDB) site, which provides further details.Citation29 ReplicationDomainCitation30 is an online database for storing, sharing and visualizing DNA replication timing and transcription data, along with other numerical epigenetic data types. Data are typically obtained from DNA microarrays or DNA sequencing.

Gene silencing

An important epigenetic phenomenon, gene silencing, has also attracted attention and has been well reported in the literature. Collected papers are available on Bio-Tech Info-Net.Citation31 Similarly, RNA induced epigenetics related papers on imprinting by non-coding RNAs are collated.Citation32

Other epigenetic biomedical resources

The evolution of epigenetic resources is still in its early stages, with provision associated with several specific research efforts and groups. Nevertheless, in line with genetic/genomic data examples, efforts are being made to connect information, even as new targets are emerging. The Epigenetics DatabaseCitation33 includes all known epigenetics genes/proteins discovered to date. The database is arranged in hierarchical format, based upon gene ontology. While still in its developmental (β) phase, it is expected that future developments will include user-submitted meta-data, which will be freely available for use in database and flat file format. Some sites, e.g., Epigenie,Citation34 also provide bioinformatics tools (e.g., CpG Viewer, CpG and GC Plotter and tools for CpG Island detection). NCBI supported efforts include the Epigenetics Antibody Database,Citation35 providing antibody information for researchers working in the field of epigenetics/epigenomics, and Unigene,Citation36 containing same locus-of-origin transcription sequences, protein similarities, gene expression, cDNA clone reagents, genomic location and associated epigenetic information. NARNA,Citation37 supported by Newcastle University, incorporates relationships between epigenetic events, DNA methylation, gene imprinting and X-chromosome inactivation with natural antisense RNAs. Other, locally developed or supported, current resources include StatEpigen,Citation38 with an initial focus on colon cancer, although incorporating some information on other pathologies for comparison. Data are provided on simple and conditional molecular events, since many genetic and epigenetic alterations are expected to be mutually correlated and synergistic, and drive model input at the micro-layer.Citation39 Specialized resources also exist for plant data.Citation40
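The kind of calculation behind CpG and GC plotting tools can be illustrated with a short sketch. It computes GC content and the observed/expected CpG ratio in sliding windows and flags windows meeting the widely cited Gardiner-Garden and Frommer thresholds (window of at least 200 bp, GC content above 50% and observed/expected ratio above 0.6). This is an assumed, simplified procedure for illustration; it is not necessarily what the tools named above implement, and the function names and the synthetic test sequence are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def cpg_island_windows(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows exceeding both thresholds."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        if gc_content(win) > min_gc and cpg_obs_exp(win) > min_ratio:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Synthetic sequence (assumed): AT-rich flanks around a CpG-rich core.
    sequence = "AT" * 100 + "CG" * 100 + "AT" * 100
    print(cpg_island_windows(sequence, window=200, step=50))
```

In practice, overlapping windows that pass the thresholds are usually merged into contiguous candidate islands, and dedicated tools differ in how they handle window size, step and ambiguous bases.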

Large-Scale Epigenetic Project Initiatives

European project initiatives including HEP

A number of European initiatives exist for centralized projects on DNA methylation. The Human Epigenome Project (HEPCitation9) will provide an epigenetic resource of chromosomal DNA methylation reference profiles in human tissues and cell lines. Other initiatives include chromatin profiling (HEROIC, High-Throughput Epigenetic Regulatory Organization In Chromatin), treatment of neoplastic disease (EPITRON, Epigenetic Treatment Of Neoplastic DiseaseCitation41) and the SMARTERCitation42 initiative, which aims to develop small inhibitors of chromatin-modifying enzymes. Another effort to provide structure to the epigenetic research landscape in Europe is that of the Epigenetic Network of Excellence, now known as Epigenesys, which aims to advance epigenetics toward Systems Biology.Citation43

Roadmap epigenomics program

The Roadmap Epigenomics Program (also known as the Epigenomics Roadmap initiative), launched by NIH in 2008, seeks to create a series of epigenome maps to study epigenetic mechanisms, develop new epigenetic analytics, generate a repository and long-term data archive, standardize procedures and practices in epigenomics and support new technologies for these. As part of the $190 million, 5-year initiative, the Roadmap Epigenomics Mapping ConsortiumCitation44 was formed to provide a public database for human epigenomics data, the Human Epigenome Atlas.Citation45 The current release, Epigenome Atlas Release 7, includes human reference epigenomes and the results of their integrative and comparative analyses.

The NIH Roadmap Epigenomics Program has also established IHEC (International Human Epigenome Consortium),Citation8 which aims to coordinate epigenome mapping and characterization worldwide, in order to ensure high data quality standards, coordination of data storage, management and analysis and free access to the epigenomes produced. To attain substantial coverage of the human epigenome, IHEC aims to decipher at least 1,000 epigenomes within the next 7–10 years. Officially launched in Paris (Jan 2010), with an initial (first phase) budget target of $130 million, IHEC intends to coordinate the mapping of epigenomes from not only the NIH’s Epigenomics Mapping Consortium but also from international efforts such as the European Epigenome Network of Excellence, the Danish National Research Foundation Centre for Epigenetics, and the Australian Epigenetic Alliance. The IHEC web portal provides links to databases, such as GEO, ARRAYEXPRESS and DDBJ, where epigenetic sequencing data will be made available.

Another significant large-scale program in epigenetics is the Encyclopedia of DNA Elements (ENCODE).Citation46 This is supported by the ENCODE Consortium, an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). This initiative aims to identify all functional elements, both at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active, in the human genome sequence.

ICGC

Genomic changes that occur in various types of cancer are being investigated by the International Cancer Genome Consortium (ICGC).Citation47 The goal is to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes. Many samples from one tumor type or subtype will be analyzed in detail so that this initiative promises to provide crucial insights on genetic-epigenetic links.

Discussion and Conclusions

The biomedical resources relating, primarily, to epigenetic data that were surveyed here are numerous and range from small- to large-scale, with considerable ongoing integration and new links still being forged. In common with many newly identified research targets, early-stage resources are often very specific and are supported locally, and this is still the case for much useful epigenetic data. Many such databases and their software tools are publicly accessible from academic/research institutions, while others are commercially available (Table S1). Major issues remain quality assurance, effective annotation and overall management, but appropriate analysis must also keep pace and is typically uneven (Table S2). Clearly, the generation of a centralized repository for epigenetics-related data is desirable and currently lacking, but new technologies offer increased potential for processing solutions down the line. Notably, biomedical needs are an important focus for federated database development, health-grid technology and, of course, Cloud computing.

Major initiatives to ensure quality and standards for genetic and epigenetic research do exist and some, such as IHEC and HEP, are described in this review. With improved technology, these should lead to improved data mining tools; those currently available for epigenetic/epigenomic analyses are limited and predominantly sequence-oriented, ranging from identification, through PCR, to initial pattern matching (Table S2 presents the current summary).

Abbreviations:
BIG = Beijing Institute of Genomics
BRO = biomedical resource ontology
DDBJ = DNA Data Bank of Japan
EBI = European Bioinformatics Institute
ENCODE = Encyclopedia of DNA Elements
EPITRON = epigenetic treatment of neoplastic disease
HEP = Human Epigenome Project
HEROIC = high-throughput epigenetic regulatory organization in chromatin
HGP = Human Genome Project
ICGC = International Cancer Genome Consortium
IHEC = International Human Epigenome Consortium
NCBI = National Center for Biotechnology Information
NHGRI = National Human Genome Research Institute
NIH = National Institutes of Health
NLM = National Library of Medicine
PDB = protein data bank
SCOP = structural classification of proteins
UHN = University Health Network

Supplemental material

Additional material


Acknowledgments

The authors would like to acknowledge funding from the Daniel O’Hare Scholarship program at DCU, which made it possible to carry out this study.
