Original Articles

How does sequence variability affect de novo assembly quality?

Pages 901-910 | Received 10 Oct 2011, Accepted 08 Oct 2012, Published online: 03 Jan 2013

Abstract

Molecular genetic tools have become standard in biological studies of both model and non-model species. This has created a growing need for sequence information, a resource hitherto limited for many species. With new sequencing technologies this is rapidly changing, and whole genome shotgun sequencing has become a realistic goal for many species. However, present sequencing protocols require more DNA than can be extracted from single individuals of many small metazoans, potentially forcing sequencing projects to perform sequencing on samples derived from several individuals. A pertinent question thus arises: can wild samples be used or is inbreeding necessary? In the present study we compare assemblies generated using sequence data from inbred and wild Lepeophtheirus salmonis. The results indicate not only that measures to reduce the genetic variability may significantly improve the final assemblies but also that deeper coverage to some extent can compensate for the detrimental effects of natural sequence variability.

Introduction

In the biological sciences, molecular methods are being applied at an ever-increasing rate and have become standard approaches, also when working with non-model organisms. As a consequence, considerable time and resources are used to obtain sequence data for in silico analyses and as baseline information for downstream applications such as Northern blotting, quantitative real-time polymerase chain reaction and in situ hybridization. The advent of new sequencing technologies, with the associated decrease in sequencing costs and increase in speed, has made whole genome shotgun sequencing (WGS) feasible for a large variety of projects (Ekblom and Galindo 2011).

The objective for de novo genome sequencing will often be to generate an assembly that can be annotated and analysed, and subsequently used to identify single nucleotide polymorphisms (SNPs), design primers and probes etc. The quality of the generated assemblies is of great importance for subsequent annotations (Florea et al. 2010) and consequently also for downstream analyses. The assembly quality in turn depends on the assembly algorithm used as well as the amount, type and quality of the data entered into the analyses (Dalloul et al. 2010; Florea et al. 2010; Lin et al. 2011). Consequently, a number of papers comparing assembly tools and sequencing platforms have been published (Harismendy et al. 2009; Bao et al. 2011; Glenn 2011; Lin et al. 2011; Suzuki et al. 2011).

Next-generation sequencing protocols generally require 1–20 μg of high-quality DNA for construction of sequencing libraries, and sequencing projects often require construction of more than one library (e.g. libraries for different sequencing platforms, paired-end libraries etc.). Even for a large arthropod, such as the up to 11 mm long ectoparasitic marine copepod Lepeophtheirus salmonis (Johnson and Albright 1991), it may be challenging to obtain sufficient DNA for library construction from a single individual. A requirement for more libraries, isolation from smaller species, or the desire to use dissected tissues to reduce the risk of contamination all increase the number of individuals necessary for sequencing library construction. As the need for extraction from several individuals arises, it becomes relevant to ask whether natural sequence variation in a population could affect the quality of the final assembly. However, studies directly addressing the effect of sequence variability on assemblies are absent, despite the fact that such studies could be a valuable reference when selecting sources of DNA (e.g. inbred cultures versus wild specimens) for de novo sequencing of small organisms.

The salmon louse, L. salmonis, is an economically important copepod ectoparasite with a genome size between 550 and 600 megabase pairs (Mbp) according to the animal genome size database (www.genomesize.com). We are presently sequencing the genome of an inbred strain of L. salmonis and simultaneously generating a resource of SNPs by sequencing wild specimens of L. salmonis sampled across several regions in the North Atlantic. These two data sets, which contain sequences from the same species with different degrees of genetic variability, are comparable because they have been generated using the same sequencing platform (Illumina HiSeq 2000; Illumina Inc., San Diego, CA, USA). To address the effect of genetic variability on sequence data assembly, equally sized data sets representing different starting materials were constructed from the available sequence data and assembled. Here we present statistics for the resulting assemblies that may serve as an information baseline when designing projects for de novo sequencing of small organisms.

Material and methods

Sequencing and tissue sampling

The raw sequence data used in the present study were obtained from two projects with different aims. Material from whole, untreated wild L. salmonis was sampled in the field for an SNP detection project, and material from inbred, sterilized L. salmonis was sampled from experimental facilities for a genome-sequencing project as previously described (Hamre et al. 2009).

Material for SNP analysis and DNA isolation

Eight adult female L. salmonis were sampled from each of five localities. Four of these have been described previously (Glover et al. 2011): C858 (Canada), S856 (Shetland), I852 (Ireland), N849 (northern Norway). The fifth sample, consisting of eight females, was collected in September 2008 from an emamectin-benzoate-desensitized population in Austevoll, western Norway. For all samples, DNA was isolated in a 96-well format using the DNeasy kit according to the manufacturer's instructions (Qiagen, Hilden, Germany). Equal amounts of DNA from each of the eight individuals from each station were pooled to meet concentration demands and were sequenced by Fasteris SA using the Illumina HiSeq 2000 platform following their standard protocols.

Material for genome sequencing and DNA isolation

Inbred adult female L. salmonis were sampled after 27 generations of inbreeding of the Ls1a culture as previously described (Hamre et al. 2009). To reduce the amount of non-salmon louse contamination before sequencing, DNA was purified from starved (2 days) individuals treated with 3% Virkon® in sterilized seawater. The specimens were digested using ample lyophilized proteinase K in 400 μl 100 mM NaCl, 10 mM Tris–HCl pH 8, 25 mM EDTA and 5% sodium dodecyl sulphate at 37°C for 2–4 hours. DNA was extracted by addition of 400 μl phenol:chloroform:isoamyl alcohol (25:24:1) before gentle homogenization and phase separation by centrifugation at maximum r.p.m. (16,100 g) for 5 min at room temperature. The aqueous supernatant was thereafter transferred to new tubes and 2.5 volumes of ice-cold 90% ethanol was added. The DNA was then precipitated by addition of 0.1 volumes of 3 M sodium acetate at pH 5.2. When visibly precipitating, the high molecular weight (HMW) DNA was spooled on shepherds' crooks prepared from glass Pasteur pipettes. The HMW-DNA was then cleaned in 70% ethanol, dried at room temperature and resuspended in water. The HMW-DNA was sequenced by Fasteris SA using the Illumina HiSeq 2000 platform following the same protocols as for sequencing of wild specimens. Additional 454 Life Sciences sequencing (not described in detail) was performed on ovaries from the same inbred strain and used to generate a draft genome assembly.

Data set preparation and analyses

Generating sequence sets from inbred and wild L. salmonis

An inherent challenge in comparing the data from the two studies was the different sources of DNA. In addition to the reduction in variability caused by inbreeding, the material chosen for sequencing of the L. salmonis genome was expected to contain less contamination than the libraries prepared for SNP detection, because the inbred lice had been starved and treated with Virkon®. To further eliminate sequencing reads from contaminants, both data sets were mapped against a draft genome assembly based on inbred 454 reads (data not published). The mapping was performed using Burrows–Wheeler Aligner (BWA; Li and Durbin 2009) with default parameters. We then used a custom program (available from the authors upon request) to extract only read pairs where at least one read mapped to the genome.
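The pair-level filter described above can be illustrated with a minimal sketch. This is not the authors' custom program, which is not shown in the text; it assumes the alignments are available as SAM-formatted text lines and uses the standard SAM FLAG bit 0x4 (read unmapped). The function name and the toy records are hypothetical.

```python
# Hypothetical sketch of "keep a pair if at least one of its reads mapped".
# Not the authors' program; input is assumed to be SAM text lines.

UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped


def pairs_with_a_mapped_read(sam_lines):
    """Return the names of read pairs in which at least one mate is mapped."""
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if not flag & UNMAPPED:  # this mate aligned to the reference
            keep.add(name)
    return keep


# Toy records (invented for illustration only)
example = [
    "@SQ\tSN:contig1\tLN:1000",
    "pairA\t99\tcontig1\t10\t60\t100M\t=\t200\t290\t*\t*",    # both mates mapped
    "pairA\t147\tcontig1\t200\t60\t100M\t=\t10\t-290\t*\t*",
    "pairB\t73\tcontig1\t50\t60\t100M\t=\t50\t0\t*\t*",       # mapped, mate unmapped
    "pairB\t133\tcontig1\t50\t0\t*\t=\t50\t0\t*\t*",          # the unmapped mate
    "pairC\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",                   # neither mate mapped
    "pairC\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
kept = pairs_with_a_mapped_read(example)
```

In this toy input, pairA and pairB would be retained and pairC discarded as putative contamination.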

Data set generation

To construct comparable data sets, we extracted a fixed number of random read pairs from each of the sequence sets. The smallest data set size was chosen to correspond to the smallest of the wild-type data sets, containing 34,393,766 read pairs, or approximately 12× genome coverage. We therefore extracted an equal number of random read pairs from each of the other wild-type data sets, and also extracted five sets of random read pairs from the inbred runs so that the sequence reads for each of the five sets came from a single sequencing run. Similarly, we extracted 73,571,936 read pairs (∼24×) from the three largest wild-type data sets, and the same amount from individual sequence runs of inbred data. As for the 12× data sets, all data in an inbred 24× data set came from the same sequencing run. Finally, we pooled all wild-type data and used bootstrapping to extract five sets of 108,000,000 read pairs (∼36×), and similarly generated five data sets from a pool of the inbred data. These 36× data sets contained reads from all sequencing runs.
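As a rough check on the coverage figures quoted above, and as an illustration of drawing a fixed number of random read pairs, consider the following sketch. The read length of 100 bp per read is an assumption (it is not stated in the text), the genome size is the approximate 600 Mbp used elsewhere in this paper, and the function names are hypothetical. Trimming and filtering would lower the effective coverage.

```python
import random

GENOME_SIZE_BP = 600_000_000  # approximate genome size used in the text
READ_LENGTH_BP = 100          # ASSUMPTION: read length is not given in the text


def approx_coverage(read_pairs, read_length=READ_LENGTH_BP,
                    genome_size=GENOME_SIZE_BP):
    """Nominal coverage = total sequenced bases / genome size."""
    return read_pairs * 2 * read_length / genome_size


def subsample_pairs(n_total, n_wanted, seed=0):
    """Draw n_wanted distinct read-pair indices out of n_total (no replacement)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_wanted))


# Nominal coverage for the three data set sizes named in the text
coverages = {pairs: round(approx_coverage(pairs), 1)
             for pairs in (34_393_766, 73_571_936, 108_000_000)}

# Toy subsample: 10 pairs out of 1,000
chosen = subsample_pairs(1000, 10)
```

Under these assumptions the three data set sizes come out close to the 12×, 24× and 36× labels used in the text. Sampling here is without replacement; a bootstrap in the strict sense would sample with replacement (for example with `rng.choices`), and the text does not specify which variant was used.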

Sequence variation within and between data sets

To compare the diversity in the data sets, a simplified variant calling procedure was used. We aligned the generated data sets against the reference using BWA. We then performed variant calling using Samtools pileup -vcf (Li et al. 2009). The output was filtered to remove variants that were called only because the reads disagreed with the reference sequence while the read data themselves showed a unanimous consensus, and the remaining variants were counted.
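The filtering step can be sketched as follows, under the assumption that per-site base counts have already been extracted from the pileup output. The data structure, the function name and the threshold parameter are hypothetical; the text does not state which thresholds, if any, were applied.

```python
# Hypothetical sketch: count sites where the reads themselves disagree,
# dropping sites that differ from the reference but are unanimous in the reads.

def is_variable_site(base_counts, min_reads_per_allele=1):
    """True if at least two different bases are each supported by enough reads."""
    supported = [base for base, n in base_counts.items()
                 if n >= min_reads_per_allele]
    return len(supported) >= 2


# Toy per-site base counts (invented for illustration only)
sites = {
    ("contig1", 101): {"A": 7, "G": 5},  # reads disagree: counted
    ("contig1", 250): {"T": 12},         # unanimous reads: dropped
    ("contig2", 40): {"C": 9, "T": 1},   # counted at a threshold of 1 read
}
n_variable = sum(is_variable_site(counts) for counts in sites.values())
n_variable_strict = sum(is_variable_site(counts, min_reads_per_allele=2)
                        for counts in sites.values())
```

A threshold of a single supporting read, as in the default here, would count sequencing errors as variants, which is consistent with the caveat in the Discussion that such simple counts are likely to be inflated.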

Sequence data assemblies

The 26 data sets generated as described above were imported into CLC Genomics Workbench® v. 4.6.1 and trimmed using default settings. Subsequent assembly was performed in the same software using standard settings, except that the maximum distance for paired reads was adjusted to 450 to accommodate the larger than default insert size. Contig N50 values were calculated from an approximated genome size of 600 Mbp.
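An N50 value calculated relative to a fixed genome size, rather than to the total assembly length, is commonly computed as in the following sketch (this variant is often called NG50). It is an illustration only; how the values reported here were computed in practice is not described beyond the sentence above, and the function name is hypothetical.

```python
def n50_for_genome_size(contig_lengths, genome_size):
    """Return (N50 length, number of contigs contributing to it).

    Contigs are summed from longest to shortest until half of genome_size
    is reached. Returns (None, None) if the assembly is too short for that.
    """
    target = genome_size / 2
    total = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        total += length
        if total >= target:
            return length, count
    return None, None


# Toy example: contig lengths sum to 150, "genome size" is 200
toy_contigs = [50, 40, 30, 20, 10]
n50, n_contigs = n50_for_genome_size(toy_contigs, genome_size=200)
```

In the toy example the three longest contigs (50 + 40 + 30) are needed to reach half the genome size, so the N50 is 30 and three contigs contribute to it. Fixing the denominator at 600 Mbp makes N50 values comparable between assemblies of different total length.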

Results

Read mapping

Mapping of the original sequence reads to the best available genome assembly, and discarding all reads that did not map, resulted in variable fractions of the sequencing reads being retained from the different sequencing runs (Table 1). Even when omitting the Austevoll 1st run, a significantly larger average fraction of the reads from the wild samples was discarded compared with the inbred samples. Furthermore, the results showed a higher variability in the fraction of reads classified as contamination (i.e. discarded reads) from the wild samples compared with the inbred samples.

Table 1. Overview of the fraction of reads retained after mapping to the best available salmon louse genome assembly

Genetic variability of data sets generated for assembly

As a proxy for genetic variability we used a simple count of variable sites generated using Samtools pileup for all the generated data sets (Table 2). The results suggest that significant genetic variability remained in the inbred data sets after 27 generations of semi-intensive inbreeding (see Hamre et al. 2009 for a description of the inbreeding regimen). The number of identified variant sites in the inbred data sets increased with data set size from 12× to 24× and then appeared to remain stable when the size of the data set was increased to 36× coverage. In contrast, the variability of the wild data sets continued to increase with increasing size. It is noteworthy that the variation in variable site count in the bootstrapped 36× data sets was extremely low compared with the variation in the smaller 12× and 24× data sets. Regardless of data set size, the polymorphism density derived for the data sets from sequencing of eight wild L. salmonis was significantly higher than the density found in equally sized data sets derived from inbred lice.

Table 2. Overview of generated data sets, their variability (see text for details) and assembly statistics

Assembly statistics

The results showed that increasing the data set size improved assembly statistics (Table 2). This improvement was clearly seen in higher N50 values and a reduced number of contigs contributing to the N50. Notably, the results also showed that the inbred data sets, with their lower variation, generated better assemblies than the wild data sets. Although the size of the largest contigs increased with the amount of data, this figure was sufficiently variable to overlap significantly between assemblies of equally sized inbred and wild data sets. The results furthermore indicate that the effect of adding data is not saturated at 36× coverage for Illumina sequencing, suggesting that assemblies will improve further at increased coverage.

Discussion

Molecular approaches based on sequence data have created an increasing need for acquiring sequence resources. For smaller organisms DNA may have to be isolated from several individuals to meet the concentration requirements of sequencing protocols. However, little information is available on the effect of reducing sequence variation on assemblies. Here we present results from assemblies of sequence data from a WGS sequencing project on inbred L. salmonis and an SNP-detection project on wild L. salmonis. The assemblies were generated from data sets of approximately 12, 24 and 36 times coverage of the L. salmonis genome, which represents a reasonable sequencing range for a small-scale WGS project on a non-model species.

The data sets were assembled with CLC Genomics Workbench. The assembly results showed that the average contigs in assemblies from data based on inbred L. salmonis were larger than average contigs generated from data from samples of eight L. salmonis individuals collected in the wild (Table 2). Hence, assembly statistics indicate that reduction of genetic variation is highly desirable. It should be noted that the population size of L. salmonis is very large and that other organisms with smaller populations are expected to exhibit lower sequence variation, which in turn could improve the results from wild samples. The results furthermore showed that adding sequence data improved common assembly quality parameters such as N50 and the number of contigs in the N50 (Table 2). It should be noted that the largest contig size is considerably more variable and consequently a less reliable indicator of assembly quality. Although the effect of adding sequence reads saturates when coverage is sufficiently high, the results indicate that expanding a data set may compensate for sequence variability in the sampled material.

The data sets were generated by mapping Illumina paired-end reads from wild and inbred sources (see Material and methods) to an L. salmonis genome assembly based on 454 Life Sciences sequencing reads of inbred ovaries, and discarding all reads that did not map, to ensure that the data sets were comparable. The assembly against which we mapped the reads was generated from sequences derived from dissected inbred ovaries only (454 WGS sequence reads not used for other purposes in the present study), so we are confident that the vast majority of the assembly represents genuine L. salmonis sequence. The SNP-detection project sequencing was performed on untreated whole lice, and we therefore expected that these data sets would contain more contaminants than the inbred data sets. This was supported in the mapping step, where 16% of the reads derived from wild L. salmonis did not map to the best inbred genome assembly as opposed to 4% of the inbred reads. Although it cannot be ruled out that fractions of the wild population may contain genetic material not present in the inbred strain, the results strongly suggest that a significant proportion of the reads may be expected to be contamination if measures are not implemented to counter this. To this end, the results indicate that simple measures such as starvation and Virkon® treatment can significantly lower the risk of contamination.

Sequence variability in the constructed data sets was evaluated from simple counts of variable sites (Table 2), which showed that the natural sequence variation in pools of eight individuals sampled at the same site was higher than in the inbred data sets. This appears to confirm the reduced sequence variation in the inbred strain reported earlier (Hamre et al. 2009) and later supported by analyses using an expanded set of microsatellites (12 out of 13 microsatellites were fixed, the last had two alleles, data not shown) following methods previously reported (Glover et al. 2011). The apparently large remaining sequence variability in the inbred strain seems surprising if microsatellite variability is a reliable proxy of genetic variation. However, these estimates are based on a draft genome and a simplified variant calling procedure, and so are likely to be inflated compared with the true number of variants. For instance, sequencing errors are likely to contribute to false variant calls, and genomic repeats may be collapsed in the draft genome, causing any differences between repeat copies to be counted as variants. Nevertheless, 9× and 20× data sets originating from inbred ovaries from more than 30 adults contained 2.7 and 3.1 million variable sites, respectively, which is significantly higher than the numbers of variable sites in the generated inbred 24× data sets originating from only three inbred individuals (approximately 1.5 million variable sites; Table 2), indicating that considerable residual variation is still present in the inbred Ls1a strain after 27 generations of inbreeding. Hence, despite the uncertainties pertaining to the absolute numbers of variants in the data sets, these results suggest that the loss of variation in microsatellite markers is disproportionately high in comparison to the loss of genetic variation in general, and that the inbreeding regimen described by Hamre et al. (2009) has not been optimal for reducing genetic variability.

The variability counts also revealed that increasing the sample size resulted in a larger number of variant sites. However, the increase in variability from the wild 24× data sets to the wild 36× data sets was surprisingly low considering that the wild 24× data sets were generated from single localities whereas the wild 36× data sets contained reads from all localities. This suggests that most of the variable sites are found throughout the North Atlantic, supporting earlier studies indicating that L. salmonis displays a high degree of gene flow, consistent with a species that can disperse at both planktonic and adult stages (Glover et al. 2011). The homogeneous level of sequence variability in the 36× data sets compared with the smaller data sets, which all stem from single sequencing runs, is probably the result of the bootstrapping procedure averaging variability among sequencing runs, which in turn indicates that the error rate variation between sequencing runs is noticeable.

Sequencing platforms exhibit different coverage biases, i.e. some sequence regions will be under-represented in reads from one platform but exhibit normal coverage when sequenced with another (Harismendy et al. 2009; Dalloul et al. 2010). Therefore a combination of sequencing platforms is recommended to improve results. The results presented here are based on paired-end Illumina sequencing only, and employing several sequencing platforms may influence the effect of sequence variability on assemblies.

The conclusion from the present study is that measures that can be taken to reduce sequence variability, e.g. inbreeding or reducing the number of individuals used, will result in better assemblies. Furthermore, the results indicate that, in small sequencing projects, the beneficial effect of adding data may compensate for the adverse effect of using samples with some genetic variation.

Acknowledgements

We acknowledge the financial contributions from Marine Harvest and The Fishery and Aquaculture Industry Research Fund. We also appreciate our discussions with Dr James Emmanuel Bron, which resulted in restructuring of the analyses.

References

  • Bao, SY, Jiang, R, Kwan, WK, Wang, BB, Ma, X and Song, YQ. 2011. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet, 56: 406–414.
  • Dalloul, RA, Long, JA, Zimin, AV, Aslam, L, Beal, K, Blomberg, L, Bouffard, P, Burt, DW, Crasta, O, Crooijmans, RPMA et al. 2010. Multi-platform next generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol, 8.
  • Ekblom, R and Galindo, J. 2011. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity, 107: 1–15.
  • Florea, L, Souvorov, A and Salzberg, SL. 2010. Genes and genomes, an imperfect world: comparison of gene annotations of two Bos taurus draft assemblies. Genome Biol, 11(1): P13.
  • Glenn, TC. 2011. Field guide to next-generation DNA sequencers. Mol Ecol Resour, 11: 759–769.
  • Glover, KA, Stolen, AB, Messmer, A, Koop, BF, Torrissen, O and Nilsen, F. 2011. Population genetic structure of the parasitic copepod Lepeophtheirus salmonis throughout the Atlantic. Marine Ecol Prog Ser, 427: 161–172.
  • Hamre, LA, Glover, KA and Nilsen, F. 2009. Establishment and characterisation of salmon louse (Lepeophtheirus salmonis (Kroyer 1837)) laboratory strains. Parasitol Int, 58: 451–460.
  • Harismendy, O, Ng, PC, Strausberg, RL, Wang, X, Stockwell, TB, Beeson, KY, Schork, NJ, Murray, SS, Topol, EJ, Levy, S and Frazer, KA. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol, 10: R32.
  • Johnson, SC and Albright, LJ. 1991. The developmental stages of Lepeophtheirus salmonis (Kroyer, 1837) (Copepoda, Caligidae). Can J Zool-Rev Canad Zool, 69: 929–950.
  • Li, H and Durbin, R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25: 1754–1760.
  • Li, H, Handsaker, B, Wysoker, A, Fennell, T, Ruan, J, Homer, N, Marth, G, Abecasis, G and Durbin, R. 2009. The sequence alignment/map format and SAMtools. Bioinformatics, 25: 2078–2079.
  • Lin, Y, Li, J, Shen, H, Zhang, L, Papasian, CJ and Deng, HW. 2011. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 27: 2031–2037.
  • Suzuki, S, Ono, N, Furusawa, C, Ying, BW and Yomo, T. 2011. Comparison of sequence reads obtained from three next-generation sequencing platforms. PLoS One, 6: e19534.