Editorial

The DNA Sequencing Renaissance and its Implications for Epigenomics

Pages 5-8 | Published online: 01 Oct 2009

The Human Genome Project (HGP) was a massive undertaking, costing a few hundred million US dollars and involving a consortium of hundreds of interdisciplinary researchers around the world Citation[1]. Nonetheless, optimism and scepticism about the utility of the first draft sequence were evenly balanced. The scepticism was somewhat unfair, because the output of the project, a human genome reference sequence, was originally envisaged only as a substrate for scientific study from which further discoveries would be made.

In the late 1990s, when the completion of the HGP seemed imminent, a small number of visionary researchers around the world were already thinking about what would follow in the years after its publication. It was clear to a handful of such people that the appetite for DNA sequence-based science would be huge, with applications across the genomics and genetics domain, replacing or augmenting existing techniques and, most importantly, enabling new and superior experimental types that were either impossible or uneconomical at that time. The optimists embodied this ambition in the phrase ‘the $1000 Human Genome’, which captures the required fall in the cost of sequencing a human genome of over five orders of magnitude, implying that at this price point routine sequencing and analysis of such data would be accessible to all researchers and clinicians Citation[2].
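The scale of that required cost reduction is easy to verify with a back-of-envelope calculation. The US$300 million figure below is an assumption standing in for ‘a few hundred million’, not a figure from the text:

```python
import math

# Assumed HGP-era cost versus the US$1000 target.
hgp_cost_usd = 300e6      # 'a few hundred million' US dollars (assumed)
target_cost_usd = 1000

# How many orders of magnitude separate the two price points?
orders_of_magnitude = math.log10(hgp_cost_usd / target_cost_usd)
print(round(orders_of_magnitude, 1))  # roughly 5.5
```

Any figure in the low hundreds of millions gives a ratio just over five orders of magnitude, consistent with the phrase used above.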

A handful of biotechnology companies were founded entirely on this premise, each with its own novel DNA sequencing technology. Notable examples are 454 Inc. (CT, USA), Solexa Ltd (Cambridge, UK) and Agencourt Biosciences (MA, USA), each of which solved the intrinsic, and very different, technical problems specific to its platform. They launched commercially successful DNA sequencers in around 2005, 2006 and 2008, respectively, and researchers within these companies were subsequently able to publish their technologies Citation[2–6].

All of these new technologies parallelize sequential sequencing reactions on a massive scale. All also use optical readouts, detecting the chemical incorporation of fluorescently labeled nucleotides or probes that are complementary to the target DNA substrate Citation[6,7,101,201]. A single ‘run’ of such a system can produce over 2 gigabases of DNA sequence data per day (for comparison, a diploid human genome is approximately 6 gigabases).

By the middle of 2009, over 900 such ‘second-generation’ DNA sequencers had been sold, and the claimed cost for a human genome was around US$50,000–100,000 Citation[7,8]. Many high-impact publications have been written since the launch of these systems, and researchers have applied them in new experimental designs, revealing hitherto undiscovered complexity in the genome and the transcriptome (for examples, see Citation[8,9,102,103]). Genome centers have ‘re-tooled’ with next-generation sequencers, often installing many tens of systems and phasing out the older technologies on which the original HGP was executed Citation[101]. In many cases, such centers now produce a terabase of DNA sequence data every 3 weeks or so Citation[104]. To put that in perspective, over 99.99% of all the DNA sequence data ever produced has been generated within the last few months. It is tempting to think that this statement may be perpetually true as the current second-generation systems evolve and are optimized for throughput Citation[104]. However, they have intrinsic limits in throughput and cost, set largely by their reliance on costly optical detection methods and step-wise bulk chemistries.
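The claim that recent output dwarfs the historical total can be sanity-checked with a short calculation. The historical total and the number of re-tooled centers below are illustrative assumptions; only the terabase-per-3-weeks rate comes from the text:

```python
# Back-of-envelope check: does recent output dominate the cumulative total?
historical_gb = 100.0      # assumed cumulative output of the pre-2nd-generation era
gb_per_3_weeks = 1000.0    # one terabase per center every ~3 weeks (from text)
centers = 10               # assumed number of re-tooled genome centers
weeks = 26                 # 'the last few months' taken as ~half a year

recent_gb = gb_per_3_weeks / 3 * centers * weeks
fraction_recent = recent_gb / (recent_gb + historical_gb)
print(f"{fraction_recent:.4%}")  # well over 99%
```

Under these assumptions the recent fraction exceeds 99.8%; with a larger number of centers or a longer window it rapidly approaches the 99.99% quoted above.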

Already, a new crop of companies and academic groups is pioneering another generation of sequencing technologies intended to bring the cost of sequencing below US$1000. These include the current vendors of second-generation systems, as well as new players such as Oxford Nanopore Technologies (Oxford, UK), Pacific Biosciences (CA, USA) and several others. In one case, that of Complete Genomics (CA, USA), low-cost sequencing is to be offered as a service rather than by selling an instrument system. In all cases it is useful to start thinking about a ‘$0 Human Genome’, because the cost at the point of data generation, typically derived from the price of reagents, may become free or be offset elsewhere.

Some of these technologies are based on so-called nanopores: small holes, either biological or solid-state, through which DNA or nucleotides can be passed and detected in a controllable way Citation[105]. Generally, these rely on detecting single molecules, and in some cases no additional labeling of the substrate to be sequenced is required. Such sequencers could operate at advanced levels of sensitivity (down to a single cell) and very low cost, requiring neither complex reagents nor expensive optical subsystems. It is also possible that the length of DNA molecule that can be read would be greater, at comparable or even higher accuracy, than with current systems. In some technologies, detection of methylcytosine and other natural base analogues has already been demonstrated Citation[10]. Clearly, this would be of interest to epigenomic researchers, who currently rely on chemical modification of the substrate, which often compromises data quality.

Over the past decade or so we have seen a common trend in the evolution of genomic technologies and their applications. Early studies of the association of mutations with disease were underpowered and based on optimistic assumptions about common variants; the International HapMap Project was then conceived to remedy this Citation[11,12]. Early researchers would have asked for a detection efficiency of one mutation per ten kilobases of genome; later the requirement was raised to one per three kilobases, and subsequently to more than one per kilobase, now covering rarer variants and culminating in the 1000 Genomes Project Citation[106]. It now seems likely, with the new sequencing methods, that brute-force detection of all mutations, in as many samples as is economically feasible, will become the aspiration Citation[8,9,102]. Similarly, with microarray-based detection of genome rearrangements and measurement of gene transcript levels and diversity, there was a steady increase in resolution and sensitivity, often deemed sufficient at the time by researchers and grant applicants. Generally, these array technologies probe for the presence of known things. All such assumptions and limits have now been challenged by the new sequencing technologies, which can operate in a hypothesis-free way, discovering and counting across the full range of biological phenomena present in a cell Citation[13]. Again, here we see a relentless trend towards cheap, brute-force, hypothesis-free measurement, with discoveries made subsequently in silico Citation[102]. Surveying this interplay of technological advancement and subsequent scientific discovery, it would seem that the deeper one digs into a biological system, the more one is likely to find, and that early assumptions about inherent complexity tend to be simplistic or even naive.

Epigenomics/genetics has grown in significance over the past 30 years, since the inception of the field following the discovery of the 5-methylcytosine, or so-called ‘5th base’, modification Citation[14]. Indeed, the prospect of other significant base analogues has recently been raised Citation[15]. Even before the advent of the new sequencing technologies, the potential for epigenomics in medicine was widely recognized Citation[16,17]. Its role in cancer development, aging, gene regulation, embryogenesis and the modulation of genetic factors has been well described Citation[16,18].

The most immediate impact of the new sequencing technologies has been on so-called ‘ChIP-seq’ experiments, in which the locations of histone proteins are mapped to the genome, identifying epigenetic control of chromatin structure and gene expression Citation[19]. These proteins leave a footprint on the DNA that protects it from shearing during sample preparation. This is a relatively straightforward experiment, utilizing the same genomic fragmentation and sequence remapping techniques used for mutation detection in re-sequencing experiments Citation[2,20]. The significance of the identified regions can then be determined Citation[107]. This technique replaces an array-based method, ChIP-on-chip, and is generally considered hypothesis-free, more sensitive and thus superior.
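The counting step at the heart of such an analysis can be illustrated with a minimal sketch: bin mapped read positions into fixed genomic windows and report windows where ChIP coverage exceeds a control by some fold. Real ChIP-seq peak callers use far more sophisticated statistical models, and all positions and thresholds below are invented for illustration:

```python
from collections import Counter

def windowed_coverage(read_positions, window=200):
    """Bin mapped read start positions into fixed-width genomic windows."""
    return Counter(pos // window for pos in read_positions)

def enriched_windows(chip_positions, control_positions, window=200, fold=2):
    """Report windows where ChIP read counts exceed the control by `fold`.

    A pseudocount of 1 on the control avoids division by zero in
    windows with no control reads.
    """
    chip = windowed_coverage(chip_positions, window)
    control = windowed_coverage(control_positions, window)
    return sorted(
        w for w, n in chip.items()
        if n / (control.get(w, 0) + 1) >= fold
    )

# Toy example: a cluster of ChIP reads near position 100-190 stands out
# against a flat control, while the isolated read near 900 does not.
peaks = enriched_windows([100, 150, 180, 190, 900], [120, 905, 910])
print(peaks)  # [0] -> the window covering positions 0-199
```

The same bin-and-compare logic underlies the determination of significant regions mentioned above, with the fold threshold replaced by a proper statistical test.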

Detecting the modification of cytosine, and its location on the DNA of a given sample, can currently be performed using any sequencing system following bisulfite treatment of the DNA. Earlier embodiments required pulling down a subset of the genome for analysis using an antibody precipitation method known as ‘methylated DNA immunoprecipitation’ (MeDIP), often followed by an array-based analysis Citation[21]. Many other methylation-site subsetting techniques have been described Citation[102]. Bisulfite treatment leaves 5-methylcytosine intact but changes cytosine (denoted C) to a uracil analogue; during sequencing, using technologies based on complementary synthesis or probe ligation, this is read as the base thymine (denoted T). A minor complication is remapping the resulting sequences to the genome in order to locate the sites of modification. Others are the amount of DNA required for bisulfite treatment, and any biases or artefacts the treatment may introduce Citation[16]. Nonetheless, genome-wide surveys of methylation have recently been performed using such techniques on second-generation sequencers.
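The logic of bisulfite sequencing can be sketched in a few lines: simulate the C-to-T conversion of unmethylated cytosines, then recover the methylation state by comparing a (already remapped) read back to the reference. The sequences here are invented toy data; real analyses must also handle strand, incomplete conversion and the remapping complication noted above:

```python
def bisulfite_convert(read, methylated_positions):
    """Simulate bisulfite treatment of one strand: unmethylated C becomes
    a uracil analogue, read as T during sequencing; 5-methylcytosine is
    protected and remains C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(read)
    )

def call_methylation(reference, converted_read):
    """Compare a converted read (assumed already mapped) to the reference:
    a reference C still read as C was methylated; one now read as T was not."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted_read)):
        if ref == "C":
            calls[i] = "methylated" if obs == "C" else "unmethylated"
    return calls

ref = "ACGTCCGA"
read = bisulfite_convert(ref, methylated_positions={4})  # only position 4 is 5mC
print(read)                        # ATGTCTGA
print(call_methylation(ref, read)) # {1: 'unmethylated', 4: 'methylated', 5: 'unmethylated'}
```

The remapping difficulty mentioned above arises because the converted read (`ATGTCTGA`) no longer matches the reference exactly, so aligners must tolerate, or be specifically adapted to, systematic C/T mismatches.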

Again, in epigenomics, the scale of experimentation has increased to exploit the available bandwidth of the new technologies for generating data. We have also seen a shift towards more hypothesis-free experimental design, although it is clear that the field of epigenomics has been somewhat conservative, tending to cling to techniques that generate relatively small subsets of data, rather than brute force whole-genome surveys.

This will change with the advent of the third-generation DNA sequencers. The ability to detect all bases from single molecules, without chemical modification or amplification, will streamline and simplify most of the experimental designs currently executed on second-generation systems Citation[102]. More importantly, lower costs and the arrival of mature software will enable large-scale whole-genome surveys to be performed routinely, obviating the need for ‘pull-down’ or genome-subsetting techniques. This will require a change of approach on the part of epigenetic researchers, who will rely more on in silico discovery and less on front-end experimental design and molecular biology. This trend is already seen elsewhere in genetics/genomics, where the new sequencing has been embraced.

The new technologies will enable, in any laboratory, the sequencing of complete genomes, transcriptomes and epigenomes in one run of the instruments. Data can thus be analyzed and interpreted as an ensemble. This facilitates a convergence of the somewhat separate disciplines of genetics, transcriptomics and epigenomics, leading to a more integrated approach to understanding cellular processes.

The epigenome is a little unusual in that many changes appear to be tissue- or disease-specific, and perhaps less diverse and chaotic than those seen in cancer development. These epigenetic profiles, perhaps accessible through free DNA in body fluids, could be used as diagnostics or biomarkers once they have been mapped and catalogued using some of the sequencing-enabled techniques described above Citation[108].

Coupled with negligible costs and smaller, more clinically accessible sequencing devices, epigenomics may thus become the most important translational discipline, giving genomic technologies broad utility in the clinic Citation[8].

Financial & competing interests disclosure

The author is currently on the executive team of Oxford Nanopore Technologies, a privately owned company developing a DNA sequencing platform based on nanopores. The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.

Bibliography

  • International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
  • Bennett ST, Barnes C, Cox A, Davies L, Brown C: Toward the US$1000 human genome. Pharmacogenomics 6, 373–382 (2005).
  • Margulies M, Egholm M, Altman WE et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005).
  • Harris TD, Buzby PR, Babcock H et al.: Single-molecule DNA sequencing of a viral genome. Science 320, 106–109 (2008).
  • Wheeler DA, Srinivasan M, Egholm M et al.: The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
  • Bentley DR, Balasubramanian S, Swerdlow HP et al.: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
  • Korbel JO, Urban AE, Affourtit JP et al.: Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426 (2007).
  • House of Lords Science and Technology Committee: Genomic Medicine. 2nd Report of Session 2008–2009. The Stationery Office Ltd, Norwich, UK.
  • Campbell PJ, Stephens PJ, Pleasance ED et al.: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008).
  • Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H: Continuous base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265–270 (2009).
  • Sachidanandam R, Weissman D, Schmidt SC et al.: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
  • International HapMap Consortium: A haplotype map of the human genome. Nature 437, 1299–1320 (2005).
  • Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 585–587 (2008).
  • Ehrlich M, Wang RY: 5-Methylcytosine in eukaryotic DNA. Science 212, 1350–1357 (1981).
  • Kriaucionis S, Heintz N: The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 324(5929), 929–930 (2009).
  • Beck S, Olek A (Eds): The Epigenome. Wiley-VCH, Weinheim, Germany (2003).
  • Egger G, Liang G, Aparicio A et al.: Epigenetics in human disease and prospects for epigenetic therapy. Nature 429, 457–463 (2004).
  • Bernstein B, Meissner A, Lander E: The mammalian epigenome. Cell 128(4), 669–681 (2007).
  • Johnson DS, Mortazavi A et al.: Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1497–1502 (2007).
  • Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11), 1851–1858 (2008).
  • Frommer M, McDonald LE, Millar DS et al.: A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc. Natl Acad. Sci. USA 89, 1827–1831 (1992).

▪ Websites

▪ Patent

  • Milton J, Wu X, Smith M et al.: Modified nucleotides. World Intellectual Property Organization WO/2004/018497 (2004).
