1,412
Views
0
CrossRef citations to date
0
Altmetric
Research Paper

Whole human genome 5’-mC methylation analysis using long read nanopore sequencing

ORCID Icon, ORCID Icon, ORCID Icon, ORCID Icon & ORCID Icon
Pages 1961-1975 | Received 22 Oct 2021, Accepted 29 Jun 2022, Published online: 20 Jul 2022

ABSTRACT

Methylation microarray and bisulphite sequencing are often used to study 5’-methylcytosine (5’-mC) modification of CpG dinucleotides in the human genome. Although both technologies produce trustworthy results, the evaluation of the methylation status of CpG sites suffers from the potential side effects of DNA modification by bisulphite and/or the ambiguity of mapping short reads in repetitive and highly homologous genomic regions, respectively. Nanopore sequencing is an attractive alternative for the study of 5’-mC since it allows sequencing of native DNA molecules, whereas the long reads produced by this technology help to increase the resolution of those genomic regions. In this work, we show that nanopore sequencing with 10X coverage depth, using DNA from a human cell line, produces 5’-mC methylation frequencies consistent with those obtained by 450k microarray, digital restriction enzyme analysis of methylation, and reduced representation bisulphite sequencing. High correlation between methylation frequencies obtained by nanopore sequencing and the other methodologies was also noticeable in either low or high GC content regions, including CpG islands and transcription start sites. We also showed that a minimum of five reads per CpG yields strong correlations (>0.89) in replicate nanopore sequencing runs and an almost uniform linearity of the methylation frequency variation between zero and one. Furthermore, nanopore sequencing was able to correctly display methylation frequency patterns based on genomic annotations of CpG regions. These results demonstrate that nanopore sequencing is a fast, robust, and reliable approach to the study of 5’-mC in the human genome with low coverage depth.

Introduction

The study of 5’-mC profile was performed over many years through chemical treatment of DNA with bisulphite, followed by amplification with strand-specific and bisulphite-specific primers, and sequencing by the dideoxynucleotide chain-termination method[Citation1]. Due to the strong development that was witnessed in the microarray and next-generation sequencing technologies, and in the bioinformatics methods for high-throughput data analysis, the study of chemical changes at the DNA sequence level has moved from an approach focused on individual loci to a genome-wide scale in recent years[Citation2–4]. Bisulphite converts each unmethylated cytosine into uracil, while 5’-methylcytosine is resistant to conversion. During amplification, each uracil residue is amplified as thymine while the 5’-mC residues are amplified as cytosines, which allows the distinction of methylated cytosines from the unmethylated ones through sequencing of the amplified fragments. Although bisulphite DNA modification technique is widely used in different methylation study protocols[Citation5], and bisulphite sequencing is considered as a ‘gold standard’ in DNA methylation research, chemical modification of DNA has several side effects on sequence structure and integrity that can potentially affect the results[Citation6–8]. When testing different bisulphite conversion kits, Leontiou et al. (2015)[Citation9] demonstrated that although the conversion efficiency reaches values close to 100% with all kits, the conversion specificity is variable and the yield of DNA recovered after the conversion is of about 50% maximum. Moreover, it was also estimated that in the human genome about 10% of CpG sites could not be accurately mapped after modification by bisulphite using short-read technologies, so the option for long-read sequencing may be beneficial[Citation9]. Taking advantage of the long-read sequence lengths provided by the PacBio technology, Yang et al. (2015)[Citation10] developed a technique called single-molecule real-time bisulphite sequencing that uses bisulphite treatment and amplification of DNA to sequence ~1.5 kb long amplicons that target the vast majority of CpG islands in the human genome.

More recently, nanopore sequencing technology, which generates very long reads using native DNA molecules as input, has emerged as an alternative approach to the study of methylation. The electrical signals registered when the DNA strand crosses the nanopore are also sensitive to the presence of modifications in the DNA molecule, and can be used to differentiate methylated cytosines from unmethylated cytosines using a hidden Markov model[Citation11]. However, there are still very few studies in which nanopore sequencing was used to analyse the 5’-mC profile in the human genome[Citation11–14]. In the present work, we describe the complete genome sequencing of a human cell line with nanopore technology using low coverage depth, and present a benchmark comparison of 5’-mC methylation quantification to other methods, including the Illumina Infinium HumanMethylation450 BeadChip microarray assay (450k microarray)[Citation15], Digital Restriction Enzyme Analysis of Methylation (DREAM)[Citation16], and reduced representation bisulphite sequencing (RRBS)[Citation17]. To the best of our knowledge, there are yet no studies that have sought to evaluate the applicability of nanopore sequencing for the detection and quantification of 5’-mC compared to the other three methylation detection approaches. The 450k microarray uses bisulphite-treated DNA and allows the analysis of 485,512 CpG sites, which constitute ~1.7% of the existing CpGs in the human genomic sequence, with a focus in CpGs located in genes and in CpG islands[Citation18]. The DREAM method is based on short read sequencing of DNA fragments resulting from sequential digestion with a pair of neoschizomeric restriction enzymes that recognize the CCCGGG sequence. It allows quantifying 374,170 CpG sites, corresponding to ~1.3% of the CpGs present in the hg18 human genome reference sequence[Citation16], of which approximately 10% are located in CpG islands. RRBS relies on DNA restriction enzyme digestion and bisulphite treatment to sequence ~1% of the human genome in regions enriched for CpG dinucleotides[Citation19]. We obtained methylation frequencies for more than 28 million CpG sites based on nanopore sequencing and show that these are highly correlated with those obtained by 450k microarray, DREAM, and RRBS at the level of whole genome and/or specific genomic contexts. We conclude that this approach has a high potential for the study of 5’-mC changes in the human genome.

Materials and methods

Cell line

The human erythroleukemia (HEL) cell line, established from the peripheral blood of a patient with erythroleukemia developed after treatment for Hodgkin’s disease[Citation20], was purchased directly from DSMZ – German Collection of Microorganisms and Cell Cultures GmbH (Braunschweig, Germany).

Preparation of genomic DNA

Cell culture and lysis, and phenol-chloroform extraction and dialysis of DNA were performed according to standard procedures (Supplementary File 1).

Library preparation and sequencing

Genomic DNA was placed for 15 minutes at 55°C in the dry bath without shaking immediately before library preparation. Four genomic libraries were prepared using the Rapid Sequencing Kit SQK-RAD004 (Oxford Nanopore Technologies, ONT). Two libraries were prepared with an initial DNA input of 400 ng, as indicated by the manufacturer, whereas the other two were prepared with an input of 280 ng and 1500 ng (7.5 μl of solution after vacuum drying). Sequencing runs were performed on the MinION device (ONT) using one FLO-MIN106D flow cell per library for a duration of 45 or 48 hours. The MinION device (ONT) was connected via USB to a portable computer with an Intel i7 processor and 16 GB of RAM. The MinKNOW software (MinION Release 19.06.8, ONT) was used to program and configure run parameters, and to acquire the raw data (FAST5 files) during sequencing.

Sequencing data processing and analysis

The FAST5 files were transferred to a Linux server with 48 CPUs and 384 GB of RAM. The raw data was analysed with NanoPlot[Citation21] (v1.30.1) to obtain various run statistics. Methylation analysis was performed using the Nanopype pipeline[Citation22] (v1.0.0). Briefly, basecalling was carried out with Guppy (ONT)[Citation23], read alignment was performed using ngmlr[Citation24] and minimap2[Citation25] against the hg38 reference human genome sequence (Genome Reference Consortium Human Build 38 patch 12 – GRCh38.p12, Ensembl Release 96) and the calling of 5’-mC was performed with nanopolish11. The Nanopype pipeline was applied to each run independently (using minimap2-base mappings only) and to the combined data of the four runs (using ngmlr and minimap2 mappings). Six frequency files containing methylation frequencies (corresponding to the ratio between the number of methylated sites and the number of called sites) for each mapped CpG dinucleotide were obtained. Coverage statistics were obtained with the bamqc tool of Qualimap2[Citation26] software (v2.2.1), after merging, sorting, and indexing the BAM files with samtools[Citation27] (v1.9).

Comparative analysis of methylation frequencies

The methylation frequencies obtained using nanopore sequencing were compared with previously available data of 450k microarray[Citation15], DREAM[Citation16] and RRBS [Citation17] for the HEL cell line. The 450k microarray and DREAM data are available in the Gene Expression Omnibus series records GSE68379 and GSE39787 (samples GSM1669874 and GSM979022, respectively). The RRBS data for the HEL cell line are available within the Cancer Cell Line Encyclopaedia 2019 (CCLE 2019) release at the Dependency Map (DepMap) portal[Citation28]. The data are split into four files comprising CpG methylation frequencies of CpG islands, enhancers, or transcription start sites (full regions or 1 kb regions upstream of the transcription start site). The LiftOver tool[Citation29] was used to convert the genomic coordinates of the DREAM data from hg18 to hg19, and from hg19 to hg38, and from the 450k microarray and RRBS data from hg19 to hg38 genome builds. The genomic positions of CpGs in the microarray data were modified to n-1, and in the DREAM and RRBS data to n-2, for matching the positions obtained in the methylation frequency files generated by nanopolish, and then randomly confirmed in the hg38 reference sequence between all files using the flanking sequences of the CpG sites. The methylation frequency in the DREAM dataset was calculated by dividing the number of methylated reads by the total number of reads for each CpG site. The BEDTools[Citation30] (v2.27.1) intersect function was used to intersect the coordinates of the CpG sites between the various methylation data files. The resulting intersection files were analysed with several R packages (V4.0.2). The ‘corrr’ (v0.4.3)[Citation31] and ‘corrgram’ (v1.13)[Citation32] packages were used to calculate the product moment Pearson correlation between the methylation values obtained in nanopore sequencing, DREAM, RRBS, and 450k microarray for CpG sites distributed throughout the genomic sequence. In RRBS methylation files, the methylation frequencies associated with clustered CpG sites were not considered in the correlation analysis. In addition, the ‘corrr’ package was used to calculate the correlation between nanopore and 450k methylation frequencies for different genomic stratifications[Citation33]. These were based on BED files for GC content, coding, and mappability regions obtained from the Global Alliance for Genomics and Health Benchmarking Team and the Genome in a Bottle Consortium[Citation34]. The methylation data files obtained by nanopore sequencing were also filtered according to a minimum read support per CpG site, using the sed and awk commands in the Linux shell, and the resulting files were then intersected with the 450k microarray data file to assess the pairwise correlation of methylation values for CpG sites. The genomic annotations of CpG sites (CpG islands, shores, shelves, and open sea regions) were obtained for chromosome 1 using the ‘annotatr’[Citation35] (version 1.14.0) R package.

The workflow that was used in the various analysis steps is represented as a diagram in .

Figure 1. Diagrammatic representation of the general workflow used for benchmarking 5’-mC methylation values obtained by nanopore sequencing versus 450k microarray, DREAM, and RRBS. The numbers indicate the total number of CpG sites available at each step within the workflow. The analysis tools (liftOver[Citation29], Bedtools[Citation30], corrr[Citation31], corrgram[Citation32], AnnotatR[Citation35], VennDiagram[Citation36], and ggplot2[Citation37]) used in the workflow are represented by grey boxes.

Figure 1. Diagrammatic representation of the general workflow used for benchmarking 5’-mC methylation values obtained by nanopore sequencing versus 450k microarray, DREAM, and RRBS. The numbers indicate the total number of CpG sites available at each step within the workflow. The analysis tools (liftOver[Citation29], Bedtools[Citation30], corrr[Citation31], corrgram[Citation32], AnnotatR[Citation35], VennDiagram[Citation36], and ggplot2[Citation37]) used in the workflow are represented by grey boxes.

Results

Sample and sequencing quality analysis

The genomic DNA of the HEL cell line showed a single band with a minimum molecular size of 23 kb and no evidence of smear. The 260 nm/280 nm ultraviolet absorption ratio was identical (1.95) in the sample without dialysis and in the dialysed sample, while the 260 nm/230 nm ratios were, respectively, 2.06 and 2.14, indicating a potential benefit of dialysis in removing traces of phenol. Three nanopore sequencing runs were performed with DNA without dialysis (runs 1 to 3) and one with dialysed DNA (run 4). The four runs had a total yield of 33.31 gigabases (Gb), ranging between 5.66 and 12.59 Gb per run, encompassing a total of 4249514 reads. The longest read had a length of 292,471 base pairs (bp). The read length N50 was 14.94 kilobases, and the mean and median of the length of the reads were 7841 bp and 4366 bp, respectively (Supplementary Table I). The mean read quality value (QV) was of 10.1 and the majority of the reads (67.4%) had a very high quality (QV > 10) (Supplementary Figure 1). The mean read quality was higher in run 4 (10.3) compared to any of the other runs (10), which may be attributed to the dialysis procedure. These results demonstrate that the quality of the DNA prepared by the phenol-chloroform method associated with the rapid method of library preparation allows the generation of a large amount of high-quality sequencing data.

Genomic coverage and annotation of CpG sites

The reads of the HEL cell line mapped in the hg38 reference sequence of the human genome, using minimap2 and ngmlr, constituted 93.6% and 88.6% of the total reads, which resulted in an average coverage depth of 10.6X (standard deviation: 28.6X) and 10.0X (standard deviation: 20.9X), respectively. Methylation frequencies were obtained for a total of 29,093,016 and 28,114,726 CpG dinucleotides, corresponding to an average coverage of 9.5X and 7.6X in CpG sites mapped with minimap2 and ngmlr, respectively. These results indicate that it was possible to obtain information on the status of 5’-mC in practically all CpGs (~28 M) present in the hg19 human genome reference sequence[Citation38]. The genomic annotation of the CpG dinucleotides located in the chromosome 1 sequence, as representative of the entire genomic sequence, revealed the expected pattern according to the predicted length of the annotated regions. These are characterized by a similar proportion of CpG dinucleotides located in the CpG islands, CpG shores, and CpG shelves, and a vast majority of CpG dinucleotides situated in the inter-CG island (inter-CGI) regions, also known as open sea (Supplementary Figure 2). The mapping of long reads with minimap2 or ngmlr did not affect the distribution of CpG sites according to the respective genomic annotations. Contrary to the results produced by nanopore sequencing, the 450 K microarray and DREAM methodologies showed a tendency for CpG annotations in CpG islands, shores, and shelves. This is because, in the former, the CpG sites were chosen manually when the array probes were designed and, in the latter, there is a biological bias towards specific restriction enzymes in these regions.

Distribution of methylation frequencies in distinct genomic contexts

An analysis of the methylation values in different genomic contexts allows a deeper biological insight into the data produced by the different methodologies and could help to identify differences in the methylation profiles between the various datasets. For this end, we compared the distribution of methylation values in each methodology according to the genomic context (CpG islands, shores, shelves, and inter CpG island regions) using 21 bins between values of zero and one. We chose to use only the CpGs on chromosome 1 as representative of the distribution of methylation values at the whole genome level, in order to reduce the computational effort associated with the analysis of the complete datasets. Comparative analysis of the methylation frequency distribution obtained with minimap2- and ngmlr-based nanopore sequencing data showed an almost complete overlap of profiles for each genomic context, characterized by a predominance of completely methylated and completely unmethylated CpG sites (). In contrast, the 450k microarray profile showed a more balanced distribution of methylation values, with a decrease in both extreme bins compared to the profiles obtained with both sequencing approaches.

Figure 2. Distribution of methylation values obtained in nanopore sequencing, 450k microarray and DREAM, according to the CpG genomic annotations. a, nanopore_ngmlr; b, nanopore_minimap2; c, 450k microarray; d, DREAM. In CpG islands the majority of interrogated CpGs are unmethylated. In CpG shores, there is a balanced distribution between unmethylated and methylated CpG sites, whereas in CpG shelves and inter-CGI regions there is a predominance of methylated CpG sites. Plots were generated using package ggplot2[Citation37] in R.

Figure 2. Distribution of methylation values obtained in nanopore sequencing, 450k microarray and DREAM, according to the CpG genomic annotations. a, nanopore_ngmlr; b, nanopore_minimap2; c, 450k microarray; d, DREAM. In CpG islands the majority of interrogated CpGs are unmethylated. In CpG shores, there is a balanced distribution between unmethylated and methylated CpG sites, whereas in CpG shelves and inter-CGI regions there is a predominance of methylated CpG sites. Plots were generated using package ggplot2[Citation37] in R.

We also observed in the various methodologies a continuum in the variation of the distribution of methylation values from CpG islands to inter-CGI regions. In CpG islands, the vast majority of CpGs are in unmethylated state, whereas in the CpG shelves and inter-CGI regions, there is a predominance of methylated CpGs. In contrast, there is a more balanced distribution between complete unmethylated and complete methylated sites in CpG shores among all methodologies. In particular, nanopore sequencing showed a tendency towards a predominance of methylated CpGs, whereas DREAM revealed the opposite scenario. This profile analysis also emphasized some details about the calculation of methylation in the various methodologies. In the case of DREAM, almost 43% of all mapped CpG sites have five or less support reads (data not shown), indicating that for many CpGs there was not enough depth of coverage to sufficiently sample each site. The consequence of this is that there may be an excess of completely methylated and unmethylated sites, relative to sites where there is coexistence of methylated and unmethylated CpGs. In the 450k microarray, the number of CpG sites in the extreme bins is much lower than the corresponding bins in the other methodologies. This difference is because the Infinium II HumanMethylation450 methylation array assays demonstrate a right shift for fully unmethylated CpGs and a left shift for fully methylated CpGs[Citation18].

Correlation analysis of the methylation status between nanopore sequencing, 450k microarray and DREAM in the whole genome sequence

We performed a correlation analysis between the methylation frequencies obtained in this study with the corresponding values obtained in the 450k microarray and DREAM studies for the HEL cell line. The methylation status of each CpG site is calculated for each method using different formulas and can vary between zero (each copy of the CpG site is unmethylated) and one (each copy of the CpG site is methylated). In nanopolish, the frequency of methylation is calculated for each CpG site as ‘called_sites methylated’/’total called sites.’ In the case of DREAM, we used an equivalent approach, that is, we calculated the frequency of methylation as the ‘number of methylated reads’/’number of total reads’ that cover each CpG site. In the case of 450k microarray, the beta-value was used to quantitatively describe the methylation status of each CpG site, using the following formula: Beta-value = ‘methylated_signal intensity’/(‘methylated_signal_intensity’ + ‘unmethylated_signal_intensity’ + α), where α (which has a fixed value of 100) is used when the intensities of the methylation and non-methylation signals are low[Citation18]. Although the beta-value does not constitute a methylation frequency, as well as the M-value that is also used for calculating the methylation of CpG sites in arrays[Citation39], it can be interpreted as the proportion of methylated fragments in a given CpG site and, in this way, be legitimately compared with the methylation frequencies calculated based on sequencing data.

Initially, the methylation frequencies produced based on the mapping of nanopore reads, using minimap2 or ngmlr, and the methylation values obtained by the 450k microarray or DREAM, were correlated with each other at the level of whole genome. For 28,098,525 CpG dinucleotides mapped by minimap2 and ngmlr (which include 99.9% of CpGs mapped by ngmlr and 96.6% of CpGs mapped by minimap2), the Pearson correlation was 0.98 (), showing that either mapping tool does not globally impact the results of methylation-calling carried out by nanopolish. The intersection of the methylation frequency files obtained based on ngmlr and minimap2, with the 450k microarray data file after conversion of coordinates from hg19 to hg38, produced a total of 478,633 and 480,281 overlapping CpGs, corresponding to 99.2% and 99.6% of the CpGs contained in the methylation array, respectively. In the case of DREAM, after converting the genomic coordinates with methylation values from hg18 to hg19, and then from hg19 to hg38, the intersection with the methylation data obtained by ngmlr and minimap2 mapping resulted in a total of 148,346 and 152,686 overlapping CpGs, corresponding to 94.8% and 97.6% of the CpGs contained in DREAM, respectively. The lower percentage of overlap in the coordinates of DREAM in relation to those of the 450k microarray may be because two rounds of conversion between genome coordinates were performed, which potentially led to a lower number of correctly converted CpG sites. High correlations were observed between the methylation values of nanopore sequencing and 450k microarray (0.84 and 0.85), whereas lower correlations were obtained between the methylation values of nanopore sequencing and DREAM (0.79 and 0.80).

Figure 3. Correlation analysis of methylation values between nanopore sequencing, DREAM and 450k microarray. The CpG sites interrogated by nanopore_ngmlr, nanopore_minimap2, 450k microarray, and DREAM were intersected pairwise as well as in a serial manner to generate a set of 5416 CpG sites common to all datasets. Pearson correlations were determined for each pair of methylation datasets. The p-value obtained in each correlation was <2.2e-16. a, Venn diagram showing the number of CpG sites resulting from the intersection of the various datasets. b, Heat plot indicating Pearson correlation values (left) and the corresponding number of CpGs (right) for pairwise comparisons of distinct datasets. c, Heat plot indicating Pearson correlation values between pairwise datasets for the same set of overlapping CpG sites (5416). The plots were generated using packages VennDiagram[Citation36] (v1.6.20) and ggplot2[Citation37] in R.

Figure 3. Correlation analysis of methylation values between nanopore sequencing, DREAM and 450k microarray. The CpG sites interrogated by nanopore_ngmlr, nanopore_minimap2, 450k microarray, and DREAM were intersected pairwise as well as in a serial manner to generate a set of 5416 CpG sites common to all datasets. Pearson correlations were determined for each pair of methylation datasets. The p-value obtained in each correlation was <2.2e-16. a, Venn diagram showing the number of CpG sites resulting from the intersection of the various datasets. b, Heat plot indicating Pearson correlation values (left) and the corresponding number of CpGs (right) for pairwise comparisons of distinct datasets. c, Heat plot indicating Pearson correlation values between pairwise datasets for the same set of overlapping CpG sites (5416). The plots were generated using packages VennDiagram[Citation36] (v1.6.20) and ggplot2[Citation37] in R.

These correlations could suggest that were the result of comparing, on the one hand, a different number of CpGs in each case and, on the other hand, a significant proportion of different CpG sites, because of the intersection between the different methylation datasets. To test this possibility, the coordinates of the CpG sites of the various datasets were serially intersected until a set of 5,416 overlapping CpG sites was obtained. The resulting correlations calculated for each paired datasets revealed that most values were similar to those obtained at the whole genome level, indicating that the methylation values are well correlated even when comparing only ~0.02% of the total CpGs in the human genome. Likewise, the nanopore data obtained with the minimap2-mapped reads showed correlation values slightly higher than those obtained with ngmlr. In contrast, DREAM produced slightly stronger correlation values than the 450k microarray when compared to nanopore data. One possible explanation for this finding is a sample effect size since the 5,416 CpGs represent 3.5% of CpGs present in the DREAM data in contrast to only 1.1% of those present in the microarray. Taken together, these data indicate that methylation-calling based on low coverage nanopore sequencing produces methylation values comparable to those obtained with other methodologies.

Correlation analysis of the methylation status between nanopore sequencing and 450k microarray in selected genomic regions

In the previous section, a high correlation between the methylation frequencies obtained with nanopore sequencing and the beta-values obtained by the 450k microarray was achieved for CpG sites independently of the genomic sequence context or its GC content. It is therefore important to understand whether the correlation remains high for regions in which DNA methylation plays an important regulatory role in gene expression. In this perspective, we intersected the overlapping CpG sites of the two nanopore datasets and 450k microarray data with genomic regions of the hg38 reference sequence that contain coding sequences or variable GC content. Using a total of 6,242 and 67,378 CpG sites, the correlation between the nanopore and the 450k microarray data was also high (≥0.83) for either GC-poor or GC-rich regions (). Similarly, high correlation values (0.85) were obtained in coding regions for a total of 44,609 CpG sites investigated. In contrast, the correlation dropped by ~10% for 11,468 CpG sites present in short read low mappability regions, i.e., regions that have other homologous regions in the reference genome for a small read length and high error rate, thus reflecting a lower performance of sequencing relative to hybridization arrays.

Figure 4. Correlation analysis of methylation frequencies between nanopore sequencing and 450k microarray resulting from the overlapping CpG sites within different genomic regions of the hg38 human genome reference sequence. Pearson correlation values and corresponding plots for four different genomic contexts are presented. The plots were generated using package corrgram[Citation32] in R.

Figure 4. Correlation analysis of methylation frequencies between nanopore sequencing and 450k microarray resulting from the overlapping CpG sites within different genomic regions of the hg38 human genome reference sequence. Pearson correlation values and corresponding plots for four different genomic contexts are presented. The plots were generated using package corrgram[Citation32] in R.

Correlation between nanopore sequencing and RRBS in genomic regions involved in the regulation of gene expression

Bisulphite sequencing is a widely used methodology for obtaining the methylation profile of CpG sites at the whole genome level (WGBS) or selected genomic regions (RRBS)[Citation17]. Given the unavailability of WGBS data for the HEL cell line, we focused on evaluating the correlation between the data obtained by nanopore sequencing (with ngmlr mapping) and RRBS, for CpG sites that are part of CpG islands (cgi) or that associate with transcription start sites (tss), using the RRBS methylation data available at the DepMap portal[Citation28]. After selection of single CpG sites with corresponding methylation frequency values, and conversion of the respective genomic coordinates to the hg38 reference sequence, a total of 10,167 and 10,512 useful CpG sites in the cgi and tss regions were obtained, respectively. Analysis of the distribution of methylation frequencies for each set of CpG sites showed a majority of sites tending towards a total absence of methylation (~0) or complete methylation (~1), and a continuous decrease in methylation frequencies to intermediate values at the remaining CpG sites (Supplementary Figure 3). Likewise, the distribution of methylation frequencies obtained through nanopore sequencing for the same CpG sets also showed an excess of sites with absence of methylation or complete methylation. In contrast, the distribution of intermediate frequency values did not show a steady variation as in the case of RRBS, probably because of the low coverage of CpG sites, which affects the quantification of 5’-mC mainly at partially methylated sites. The correlation of methylation frequencies obtained by RRBS and nanopore sequencing, with data obtained after ngmlr mapping, for the cgi and tss regions, was 0.83 and 0.84, respectively. Similarly, the correlation of the methylation values between RRBS and 450k microarray was 0.85 for either of the two regions. The high correlation obtained between the RRBS and nanopore sequencing data could suggest that it was uniquely due to the presence of a high proportion of CpG sites with extreme values (0 or 1) concurrently obtained in both methodologies. To assess the possibility of correlation also at intermediate methylation levels, we separated the methylation frequencies obtained by RRBS into four bins according to the percentage of 5’-mC at each CpG site, i.e., low methylated (0–25%), preferably unmethylated (25–50%), preferably methylated (50–75%) and highly methylated (75–100%). Subsequently, we plotted the distribution of methylation frequencies of corresponding CpG sites obtained in nanopore sequencing according to each bin. Furthermore, we replicated this analysis by filtering the CpG sites according to a minimum read support of 5, 10, and 15 reads, for both cgi and tss regions, to access consistency of the data (Supplementary Figure 4). The results showed that the median value of the methylation frequencies obtained by nanopore sequencing, with a single bin exception, are within the respective intervals of the RRBS methylation frequencies, indicating the existence of a linear relationship between the overall spectrum of methylation frequencies obtained by both sequencing approaches.

Correlation of methylation values as a function of read support and depth

We asked if the correlation of methylation values between nanopore sequencing and the other methods could be stronger because of an increase in minimum coverage depth. To simulate this effect, we filtered the positions of CpG sites in the methylation frequency files according to a read support between 2X and 20X, at 1X unit intervals. Pearson correlation was calculated for each minimum coverage level and plotted as a function of read support (Supplementary Figure 5). This analysis showed that the correlation between nanopore and 450k microarray data increased by up to 0.89 and 0.90 at 17X read support, using minimap2- and ngmlr-based methylation frequencies, respectively. From this point on, the corresponding correlation values began to decrease until the maximum read support of 20X. This likely results from the fact that the number of available CpGs approaches zero when the read support is highest and that the corresponding methylation frequencies are calculated on likely incorrectly mapped reads and/or secondary read alignments.

Conversely, we also sought to see the impact of a diminished average depth of coverage on the correlation values. For this purpose, we analysed separately the data obtained in each of the four nanopore sequencing runs, which produced an average coverage depth of 1.8X/1.7X, 2.3X/2.2X, 2.5X/2.4X, and 4.0X/3.8X, using minimap2/ngmlr, respectively. The total number of CpGs mapped is directly proportional to the average depth of coverage, whereas the size of the intersection CpG sets is also proportional to the correlation values between each pair of runs (Supplementary Table II). When the number of CpGs was lowest, the correlation between nanopore data (based on minimap2-mapping) and DREAM or 450k microarray was of 0.73 or 0.76, respectively (Supplementary Table III). The correlation increased as a function of the number of available CpGs, reaching the maximum values of 0.79 and 0.76, respectively. However, because of the low coverage depth in each of the sequencing runs, these results suggest that the correlation values are mostly influenced by the existence of CpG sites that show methylation frequency of one (fully methylated) or zero (completely unmethylated) between each pair of methodologies. Thus, a more reliable quantitative assessment of the methylation frequency in regions characterized by partial CpG methylation will require higher coverage depths as shown above.

Reproducibility and linearity of methylation frequency data

When implementing nanopore sequencing technology to study 5’-mC, it is important to evaluate the reproducibility of the methylation data. For this purpose, we consecutively intersected the coordinates of the CpG sites aligned with minimap2 in the four sequencing runs, and correlated the corresponding methylation frequencies between each pair of runs, so that the various correlations were directly comparable. Moreover, in order to avoid comparing CpG sites with a very different number of mapped reads, due to the different average coverage depth obtained in each run, we used only the methylation frequencies of the sites that were supported by a minimum of 5 reads, resulting in a total of 255,176 CpG sites. Although the average coverage depth in the four runs was very low (1.8–4.0X), high correlation values were still obtained between different sequencing runs (Supplementary Figure 6).

The existence of technical replicates of the sequencing runs also allowed us to analyse the linearity of the methylation frequency data between zero and one. The purpose of this analysis was to determine if the low depth of coverage introduced a bias in the methylation frequencies in the ends of the scale. For this aim, we used the sets of 255,176, 81,287, and 53,083 CpG sites overlapping to all runs and that were supported by at least 5, 10, or 15 reads, respectively. For each of these sites, we calculated the mean and standard deviation of the methylation frequencies. We then calculated the median of the standard deviations for each of 10 bins between the values of zero and one and plotted these values against the scatter plot of mean versus standard deviation of the methylation frequencies. The results showed that there is an almost linearity of the data between zero and one with only a slight compression at the ends. The difference between the maximum and minimum median values at 15X was half than that at 5X read support (0.03 versus 0.06 respectively), indicating that complete linearity may be reached at higher depths of coverage.

Discussion

Several technologies such as methylation microarrays and bisulphite sequencing were already well established when it was shown that nanopore sequencing allowed the detection of 5’-mC directly in native DNA molecules[Citation11]. Compared to other sequencing technologies, nanopore sequencing still faced a similar disadvantage given that sequencing the entire human genome (~3 gigabases) at a reasonable depth implies a high sequencing cost, despite the fact that bisulphite treatment and/or restriction enzyme digestion is no longer necessary for methylation studies. We thus hypothesized that human genome sequencing in the MinION, using a low depth of coverage, would be able to produce reliable and robust 5’-mC data. This approach has the advantages of speed and simplicity of the library preparation protocol and does not require sophisticated laboratory conditions or expensive equipment. In this context, we carried out a benchmark study to compare the performance of nanopore sequencing with other methodologies such as methylation microarray, DREAM, and RRBS.

At a depth of coverage of approximately 10X, the correlation between the methylation frequencies obtained by nanopore sequencing and the methylation values of matched CpG sites distributed throughout the genome sequence, obtained by DREAM or the 450k microarray, varied between 0.79 and 0.85, respectively. Comparison of nanopore sequencing data with data obtained by RRBS for genomic regions susceptible to methylation changes (e.g., CpG islands) also revealed a high correlation of methylation frequencies (≥0.83). Moreover, high correlations were obtained when comparing between nanopore-based methylation frequencies and corresponding values in 450k microarray data for either low (25–30%) or high (55–60%) content GC regions. The simulation of higher coverage depths, based solely on CpG sites with a minimum number of support reads, confirmed the progressive and linear increase in correlation up to 0.90 at 17X, demonstrating that methylation-calling based on low coverage depth nanopore sequencing provides reliable methylation frequency data. Conversely, we showed that even at a coverage level <2X, the lowest correlation obtained was 0.73 between nanopore sequencing and DREAM. Although high correlations have been obtained between the various methylation datasets in all cases, it is important to emphasize that exposure of HEL cells to different cell culture conditions, as well as the use of different methods for the extraction and preparation of genomic DNA, carried out in distinct laboratories at different time points, may have contributed to minor changes in the methylation pattern of the HEL cells. Moreover, the error rate of nanopore sequencing in the MinION, using the same flow cell chemistry and basecaller that we used in the current study, was shown to vary between 6% and 8% depending on the GC content of reads[Citation40]. It is envisaged that future improvements in sequencing chemistry and basecaller algorithms will provide more reliable methylation frequencies for 5’-mC.

We observed a very high correlation (≥0.89) between methylation frequencies of HEL cells obtained between any of replicate nanopore sequencing assays. In addition, we observed an almost linearity of the methylation frequencies between the minimum (fully unmethylated CpGs) and maximum (fully methylated CpGs) values at low levels of read support (5–15X). These results contrast with the intrinsic lack of homoscedasticity reported to occur in the 0–0.2 and 0.8–1.0 intervals of beta-values in methylation arrays[Citation37].

The most used methodologies for studying 5’-mC in the human genome require modification of DNA prior to sequencing. In the case of 450k microarray, WGBS[Citation41] and RRBS, genomic DNA is treated with bisulphite, which converts unmethylated cytosines to uracils. This converting action may not be completely efficient, leaving some unconverted cytosines in an unmethylated state that are later interpreted as being methylated. This phenomenon is particularly relevant when the genomic DNA is not fully denatured before bisulphite treatment because only the cytosines that are found in single-stranded DNA are susceptible to conversion[Citation6]. In addition, the high temperature (99°C) used in some bisulphite treatment protocols can lead to the degradation of genomic DNA, which is manifested by random breaks in the sequence as a result of depurinations[Citation7]. These breaks can give rise to low molecular size fragments, which can be lost in the next steps of the procedure, reducing the sequence representativeness of the original sample. Additional steps of sample manipulation can also cause a representation bias through the phenomena of preferential sequence selection during PCR, loss of amplified fragments during purification and incomplete denaturation of the DNA. Moreover, bisulphite treatment cannot distinguish between 5’-mC and 5’-hydroxymethylcytosine epigenetic modifications. Bisulphite transforms 5’-hydroxylmethylcytosine into cytosine-5-methylsulfonate, which is read as a C during sequencing, making the presence of cytosines indistinguishable with either form of modification[Citation8]. Moreover, in short-read sequencing technologies such as Illumina, sequence diversity is one of the critical aspects that allows high-quality sequences to be obtained. In WGBS or RRBS, unmethylated cytosines are converted into uracils, thus artificially reducing sequence diversity through a diminished proportion of cytosines that may affect data quality. Short reads, such as those used in the DREAM method, are also less adapted than long reads to map repetitive and highly homologous regions that exist in the human genome.

Compared to the methylation array, RRBS and DREAM, which only query a small fraction of CpG sites distributed essentially by gene regions and CpG islands, nanopore sequencing and WGBS offer a global view of the methylation profile of the human genome. The analysis of CpGs at a global level translates into an important advantage in terms of knowledge of epigenetic mechanisms, as it may allow the detection of CpG sites with novel functional relevance. In the present study, we were able to demonstrate that 10X coverage nanopore sequencing can provide methylation frequency data for more than 28 million CpG sites. In addition to the long-read capabilities of nanopore sequencing, genomic DNA can be sequenced in its native form without any type of in vitro modification that can alter its integrity, sequence, and representativeness, thus maintaining the original epigenetic landscape of the sample. However, to achieve the desired sequencing coverage that we obtained in this study, more than one sequencing run may be necessary, thus affecting the overall cost of nanopore sequencing compared to the other traditional methodologies. However, the higher cost may be partially overcome with future improvements on sample quality, library preparation, and/or sequencing chemistry, thereby being able to generate a larger amount of sequence data with few sequencing runs.

Another important advantage of nanopore sequencing is the fact that it has the potential to follow novel scientific developments in the field of epigenomics, without necessarily implying upgrades in the technological platform or changes in the sequencing methods. This reflects the fact that different types of epigenetic modification of DNA or RNA, including various forms of methylation, can in theory be detected using the same primary signal data, without the need to repeat the sequencing of the samples. To achieve this, several computational methods have been developed capable of identifying different types of epigenetic modifications[Citation42]. These can generally be divided into two main types, which include methods based on statistics and methods based on models. The former do not require the existence of a training model for DNA or RNA modifications, but they require the sequencing of a matched control sample to detect the various types of methylation[Citation43,Citation44]. Non-statistical methods can further be divided into mapping-dependent methods[Citation11,Citation45–47] and basecalling-only methods such as Guppy[Citation23] or Flappie[Citation48] (https://github.com/nanoporetech/flappie). Model-based methods may be more advantageous as they reduce the extra costs of sequencing a control sample, but they need to train a model based on the species genome, type of epigenetic modification and flow cell version. In this work, nanopolish has shown a good performance in calling 5’-mC in the human genome using R9.4 MinION flow cells. It is nonetheless desirable to carry out benchmarking studies using other methylation callers, in order to identify the best tool for determining 5’-mC in human genome samples. In summary, we showed in this study that methylation calling, based on nanopore sequencing, is a fast, robust, and sensitive approach for determining the status of 5’-mC in the human genome. Sequencing the human genome in MinION at low coverage depth is an alternative to be taken into account in studies aiming to assess methylation changes associated with human diseases[Citation49–54].

Data availability

The datasets presented in this study are available from the European Nucleotide Archive (ENA) at EMBL-EBI under the accession number PRJEB45078 (https://www.ebi.ac.uk/ena/browser/view/PRJEB45078).

Supplemental material

Supplemental Material

Download Zip (4.4 MB)

Disclosure statement

No potential conflict of interest was reported by the author(s).

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/15592294.2022.2097473

Additional information

Funding

This work is a result of the GenomePT project (POCI-01-0145-FEDER-022184), supported by COMPETE 2020 – Operational Programme for Competitiveness and Internationalisation (POCI), Lisboa Portugal Regional Operational Programme (Lisboa2020), Algarve Portugal Regional Operational Programme (CRESC Algarve2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), and by Fundação para a Ciência e a Tecnologia (FCT). This work was also supported by Fundos FEDER through the Programa Operacional Factores de Competitividade – COMPETE and by Fundos Nacionais through the FCT within the scope of the project UID/BIM/00009/2019 (Centre for Toxicogenomics and Human Health-ToxOmics).

References

  • Frommer M, McDonald LE, and Millar DS. A genomic sequencing protocol that yields a positive display of 5- methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A. 1992;89:1827–1831.
  • Schatz MC. Nanopore sequencing meets epigenetics. Nat Methods. 2017;14:347–348.
  • Chatterjee A, Rodger EJ, Morison IM, et al. Tools and strategies for analysis of genome-wide and gene-specific DNA methylation patterns. Methods Mol Biol. 2017;1537:249–277.
  • Yong WS, Hsu, FM, Chen, PY. Profiling genome-wide DNA methylation. Epigenetics & Chromatin. 2016;9:26. DOI:10.1186/s13072-016-0075-3.
  • Gouil Q, Keniry A. Latest techniques to study DNA methylation. Essays Biochem. 2019;63:639–648.
  • Fraga MF, Esteller M. Review DNA methylation: a profile of methods. Biotechniques. 2002;33:632–649.
  • Ehrich M, Zoll S, Sur S, et al. A new method for accurate assessment of DNA quality after bisulfite treatment. Nucleic Acids Res. 2007;35:e29.
  • Huang Y, Pastor, WA, Shen, Y, et al. The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLoS One. 2010;5:1–9.
  • Leontiou CA, Hadjidaniel MD, Mina P, et al. Bisulfite Conversion of DNA: performance comparison of different kits and methylation quantitation of epigenetic biomarkers that have the potential to be used in non-invasive prenatal testing. PLoS One. 2015;10(8):e0135058.
  • Yang Y, Sebra R, Pullman BS, et al. Quantitative and multiplexed DNA methylation analysis using long-read single-molecule real-time bisulfite sequencing (SMRT-BS). BMC Genomics. 2015;16:350.
  • Simpson JT, Workman RE, Zuzarte PC, et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods. 2017;14:407–410.
  • Jain M, Koren S, Miga KH, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36:338–345.
  • Liu Q, Fang L, Yu G, et al. Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nat Commun. 2019;10:2449.
  • Liu Y, Rosikiewicz W, Pan Z, et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation. Genome Biol. 2021;22:295.
  • Iorio F, Knijnenburg TA, Vis DJ, et al. A landscape of pharmacogenomic interactions in cancer. Cell. 2016;166:740–754.
  • Jelinek J, Liang S, Lu Y, et al. Conserved DNA methylation patterns in healthy blood cells and extensive changes in leukemia measured by a new quantitative technique. Epigenetics. 2012;7:1368–1378.
  • Ghandi M, Huang FW, Jané-Valbuena J, et al. Next-generation characterization of the cancer cell line encyclopedia. Nature. 2019;569(7757):503–508.
  • Bibikova M, Barnes B, Tsan C, et al. High density DNA methylation array with single CpG site resolution. Genomics. 2011;98:288–295.
  • Meer A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 2005;33(18):5868–5877.
  • Martin P, Papayannopoulou T. HEL cells: a new human erythroleukemia cell line with spontaneous and induced globin expression. Science. 1982;216:1233–1235.
  • De Coster W, D’Hert S, Schultz DT, et al. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018;34:2666–2669.
  • Giesselmann P, Hetzel S, Müller F-J, et al. Nanopype: a modular and scalable nanopore data processing pipeline. Bioinformatics. 2019;35:4770–4772.
  • Guppy protocol. Accessed 14 05 2021. https://community.nanoporetech.com/protocols/Guppy-protocol/
  • Sedlazeck FJ, Rescheneder, P, Smolka, M, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15:461–468.
  • Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100.
  • Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016;32:292–294.
  • Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
  • Accessed 08 03 2022 https://depmap.org/portal/download/all/
  • Kuhn RM, Haussler D, James Kent W. The UCSC genome browser and associated tools. Brief Bioinform. 2013;14:144–161.
  • Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842.
  • Kuhn M, Jackson S, Cimentada J Corrr: correlations in R. R package version 0.4.3. 1–15 (2020).
  • Wright K corrgram: plot a correlogram 1.13. 1–9 (2018).
  • Olson N, Wagner J, McDaniel J, et al. precisionFDA truth challenge V2: calling variants from short- and long-reads in difficult-to-map regions. BioRxiv. 2021. DOI:10.1101/2020.11.13.380741.
  • Zook J. Genome in A bottle - v3.0 genome stratifications. National Institute of Standards and Technology, 2021. cited 2022 March 29. https://doi.org/10.18434/mds2-2499
  • Cavalcante RG, Sartor MA. Annotatr: genomic regions in context. Bioinformatics. 2017;33:2381–2383.
  • Chen H, Boutros PC. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics. 2011;12:35.
  • Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag; 2016. DOI:10.1007/978-3-319-24277-4.
  • Youk J, An Y, Park S, et al. The genome-wide landscape of C:g>T: a polymorphism at the CpG contexts in the human population. BMC Genomics. 2020;21:270.
  • Du P, Zhang X, Huang -C-C, et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11:587.
  • Delahaye C, Nicolas J. Sequencing DNA with nanopores: troubles and biases. PLoS One. 2021;16(10):e0257521.
  • Suzuki M, Liao, W, Wos, F, et al. Whole-genome bisulfite sequencing with improved accuracy and cost. Genome Res. 2018;28:1364–1371.
  • Xu L, Seki M. Recent advances in the detection of base modifications using the Nanopore sequencer. J Hum Genet. 2020;65:25–33.
  • Stoiber AM, Quick, J, Egan, R, et al. De novo Identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv. 2017. DOI:10.1101/094672.
  • Liu Q, Georgieva DC, Egli D, et al. NanoMod: a computational tool to detect DNA modifications using Nanopore long- read sequencing data. BMC Genomics. 2019;20:78.
  • McIntyre ABR, Alexander, N, Grigorev, K, et al. Single-molecule sequencing detection of N6-methyladenine in microbial reference materials. Nat Commun. 2019;10:1–11.
  • Rand AC, Jain M, Eizenga JM, et al. Mapping DNA methylation with high-throughput nanopore sequencing. Nat Methods. 2017;14:411–413.
  • Ni P, Huang N, Zhang Z, et al. DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics. 2019;35:4586–4595.
  • Flappie. Accessed 14 05 2021. https://github.com/nanoporetech/flappie.
  • Ferguson-Smith AC. Genomic imprinting: the emergence of an epigenetic paradigm. Nat Rev Genet. 2011;12:565–575.
  • van Dijk SJ, Tellam RL, Morrison JL, et al. Recent developments on the role of epigenetics in obesity and metabolic disease. Clinical Epigenetics. 2015;7(1). DOI:10.1186/s13148-015-0101-5
  • Liu C, Jiao C, and Wang K, et al. DNA methylation and psychiatric disorders. In: Grayson, Dennis R.ed. Progress in molecular biology and translational science. Vol. 157. Academic Press; 2018. p. 175–232.
  • Rosa-Garrido M, Chapski DJ, Vondriska TM. Epigenomes in cardiovascular disease. Circ Res. 2018;122:1586–1607.
  • Wu H, Chen Y, Zhu H, et al. The pathogenic role of dysregulated epigenetic modifications in autoimmune diseases. Front Immunol. 2019;10:1–11.
  • Wang H, Li, Y, Lv, N, et al. Predictors of clinical responses to hypomethylating agents in acute myeloid leukemia or myelodysplastic syndromes. Ann Hematol. 2018;97:2025–2038.

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.