5,728
Views
11
CrossRef citations to date
0
Altmetric
Research Paper

Quasispecies of SARS-CoV-2 revealed by single nucleotide polymorphisms (SNPs) analysis

, , , , , , , , ORCID Icon, , , , , , , & show all
Pages 1209-1226 | Received 18 Nov 2020, Accepted 23 Mar 2021, Published online: 25 May 2021

ABSTRACT

New SARS-CoV-2 mutants have been continuously indentified with enhanced transmission ever since its outbreak in early 2020. As an RNA virus, SARS-CoV-2 has a high mutation rate due to the low fidelity of RNA polymerase. To study the single nucleotide polymorphisms (SNPs) dynamics of SARS-CoV-2, 158 SNPs with high confidence were identified by deep meta-transcriptomic sequencing, and the most common SNP type was C > T. Analyses of intra-host population diversity revealed that intra-host quasispecies’ composition varies with time during the early onset of symptoms, which implicates viral evolution during infection. Network analysis of co-occurring SNPs revealed the most abundant non-synonymous SNP 22,638 in the S glycoprotein RBD region and 28,144 in the ORF8 region. Furthermore, SARS-CoV-2 variations differ in an individual’s respiratory tissue (nose, throat, BALF, or sputum), suggesting independent compartmentalization of SARS-CoV-2 populations in patients. The positive selection analysis of the SARS-CoV-2 genome uncovered the positive selected amino acid G251V on ORF3a. Alternative allele frequency spectrum (AAFS) of all variants revealed that ORF8 could bear alternate alleles with high frequency. Overall, the results show the quasispecies’ profile of SARS-CoV-2 in the respiratory tract in the first two months after the outbreak.

Introduction

RNA viruses have exhibited high mutation rates primarily due to the low-fidelity of RNA polymerases [Citation1] and the absence of post-replication nucleotide repair mechanisms [Citation2,Citation3]. Therefore, RNA viruses always exist as populations of viral variants containing different mutations between the hosts (inter-host) or within an individual host (intra-host), referred to as quasispecies [Citation4,Citation5]. Quasispecies is believed to be a strategy of virus evolution [Citation3] and has previously been reported in the SARS-CoV [Citation6] and MERS-CoV [Citation7] viruses.

The phylogenetic analysis of the SARS-CoV-2 strains spread in different regions of the world confirmed its frequent recombination with other human coronaviruses or coronaviruses from pangolins and bats [Citation8,Citation9]. According to amino acid sequence mutations, an early survey classified the various virus strains across the world into three clusters: cluster A, cluster B (T8782C/C28,144T), and cluster C (G26,144T/G251V, ORF3a)[Citation10]. The mutation rate of SARS-CoV-2 has been estimated as ~6 × 10−4 (CI: 4 × 10−4 ~ 7 × 10−4) nucleotides/genome/year[Citation11]. So far, typically acquired SARS-CoV-2 only goes through one to two mutations per month[Citation12], which is largely unremarkable for an RNA virus [Citation11,Citation13,Citation14]. SARS-CoV-2 strains have also been widely distributed in the global phylogeny with high diversity. According to GISAID nomenclature, the mutations in SARS-CoV-2 genomes divide into different clades, such as S clade (L84S in ORF8, U28,144C in the genome), V clade (G251V in ORF3a, G26,144U), and G clade (characterized by D614G in the S protein, A23,403G)[Citation15]. Recently, a dynamic nomenclature proposal for SARS-CoV-2 lineages has been suggested. The sequences that share nucleotides at position 8782 (U) and 28,144 (C) with the closest known bat virus RaTG13 are defined as lineage A, while the sequences with 8782 (C) and 28,144 (T) were defined as lineage B, namely the lineage A/B nomenclature[Citation16]. A new variant, however, was just discovered in the United Kingdom, named B.1.1.7, which is a fast-spreading variant with increased interaction force between Spike-ACE2 caused by the viral N501Y mutation [Citation17,Citation18]. These new variants are alarming, suggesting that the virus is rapidly evolving and adapting to transmit rapidly in the population.

Although the evolutionary history and transmission dynamics of SARS-CoV-2 have been gradually clarified, more work is still needed to explore the inter-host and intra-host variations of SARS-CoV-2, as such variations may point to the direction of the evolution of the viral genome for the adaptation to the host’s immune response[Citation19], and can help in the development of antiviral drugs[Citation20], and other selective pressures, for instance, the widespread use of vaccines, X-ray radiation therapy[Citation21], and public health intervention strategies. Moreover, such variations contribute significantly to the design of effective strategies for disease control and prevention.

The immune system plays a vital role in the defense against SARS-CoV-2 infection[Citation22], since none of the drugs used to treat coronavirus disease 2019 (COVID-19) can directly clear the virus invivo[Citation23]. Therefore, the host’s immune system is also an important factor in determining the virus’s evolution. The humoral response and cellular response work together to inhibit the virus’s replication and prevent damage caused by excessive immunity to the host [Citation24–26]. Virus-specific serum antibodies (Abs) are important adaptive humoral immune responses against viral infection[Citation23]. Protection-specific Abs, including immunoglobulin G (IgG) Abs and neutralizing Abs (NAbs), are produced by B cells after infection with the virus, blocking the virus from entering the host cells and defending against viral reinfection[Citation27]. The structural proteins of SARS-CoV-2 are potential epitopes of Abs, especially the S protein. The S-specific NAbs have been detected in recovered COVID-19 patients’ serum [Citation28–31]. T-cell responses are also essential for adaptive immunity against viral infections in vivo. CD8+ CTL can kill virus-infected host cells by recognizing MHC-presented viral peptides from virus-infected cells. CTL epitopes have been identified in SARS-CoV-2 surface glycoprotein[Citation32]. Thus, it is rational to hypothesize that as the virus interacts with the host’s immune system, the virus may evolve and viral mutation sites may be the epitopes of the host’s immune system.

During infection, further exploration may be needed to identify whether a viral quasispecies infects the host or the virus evolves in-vivo to form quasispecies after invading the host. It is well known that genetic diversity in pathogen’s quasispecies is influenced by pathogen-host interaction to adapt to different hosts and tissues, which have been observed both in viruses and bacteria. For instance, in Helicobacter pylori in mouse samples, multiple single nucleotide polymorphisms were found in its virulence factor region through intra- and inter-genomic variation analysis[Citation33]. The quasispecies composition of Influenza A in the membrane glycoproteins hemagglutinin (HA) region changed during adaptation processes from Vero cells to Madin-Darby Canine Kidney (MDCK) cells[Citation34]. The dynamics of rabies virus quasispecies during serial passage have also been identified in heterologous hosts[Citation35]. As for SARS-CoV-2, it can cause infection in both the upper respiratory tract (URT) and the lower respiratory tract (LRT) and the viral kinetics of SARS-CoV-2 infection are related to infectiousness and disease progression[Citation36]. More so, the URT is specialized in eliminating inhaled pathogens to prevent viral invasion in the lower respiratory tract[Citation37]. Research has also aimed to identify the different protective mechanisms between the upper and lower respiratory tracts[Citation38]. In the URT, this is mainly mediated by specific IgA- and IgG2a-producing B cells. In contrast, ex-vivo active effector memory CTL was found in the LRT[Citation38]. Thus, it is necessary to study the genetic diversity of SARS-CoV-2 in the upper and lower respiratory tracts.

This study aimed to reveal the genetic diversity in SARS-CoV-2 quasispecies in human specimens and obtain new insights into the impact of distinct environments on virus evolution. To this end, we took advantage of meta-transcriptomic sequencing to perform a comparative analysis of genomic diversity of SARS-CoV-2, sampled at different time points during symptom onset in the upper (nasal swabs, throat swabs, and sputum) and lower (Bronchoalveolar Fluid (BALF)) respiratory tract of patients. These analyses assessed the microevolution profile of SARS-CoV-2 during infection and indicated mutations useful for viral adaptation.

Materials and methods

Ethics and patient sampling

This study was conducted according to the principles expressed in the Helsinki Declaration. Ethical approval was obtained from the Research Ethics Committee of the Shenzhen Third People’s Hospital (2020–192). Written consent was obtained from patients or their guardian when samples were collected. Patients were informed before providing written consent and data directly related to disease control were collected and de-identified for analysis.

Sampling, RNA extraction, and RT-qPCR

All 48 COVID-19 patients in this study were enrolled from the Shenzhen Third People’s Hospital from January 20–30, 2020. The severity of the disease was classified as mild or severe based on the Diagnosis and Treatment Scheme of SARS-CoV-2, released by the National Health Commission of China. Throat swabs, sputum, nasal swabs, and supernatant of bronchoalveolar lavage fluid (BALF) were collected from patients at various time points and were sent to the diagnostic laboratory. Total nucleic acid was extracted from different samples using QIAamp Viral RNA Mini Kit (Qiagen, Cat. No. 52,904). Real-time reverse transcription PCR (qRT-PCR) targeting ORF-1a/b and N of SARS-CoV-2 was performed using a China Food and Drug Administration (CFDA) approved commercial kit following the manufacture’s protocol (GeneoDX Co., Ltd., Shanghai, China).

Meta-transcriptome libraries preparation and sequencing

The positive nucleic acid extractions were treated by DNase I (NEB, Cat. No. M0303S) to remove the host DNA. The concentrations of all isolated RNA samples were measured with Qubit RNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA). Libraries were prepared using the MGIEasy RNA Library preparation Kit v2 (MGI, Cat. No. 1000005,953) as follows: (1) RNA was fragmented by incubating with fragmentation buffer at 87 °C for 6 minutes; (2) ds cDNA was synthesized using random hexamers with fragmented RNA; (3) ds cDNA was subjected to end repair, adaptor ligation, and 18-cycle PCR amplification; and (4) PCR products were unique dual indexed (UDI) before going through circularization and rolling circle replication (RCR) to generate DNA nanoball (DNB)-based libraries. Negative controls were prepared from nuclease-free water. DNB preps of clinical samples were sequenced on the ultra-high-throughput DNBSEQ-T7 platform (MGI, Shenzhen, China) with a paired-end 100 nt strategy, generating, on average, 321 Gb sequencing data for each sample.

SNP calling analysis

Raw reads were trimmed by trim galore v 0.6.4 [Citation39] with default parameters to remove low-quality bases with a score of less than 25. Trimmed reads were mapped to the reference SARS-CoV-2 genome (Accession: MN908947.3) using BWA-MEM[Citation40]. About 99.9% of reads that could not be aligned were removed. Duplicates in the bam files were removed by Samtools v1.10[Citation41]. Variants were called by HaplotypeCaller in the Genome Analysis Toolkit (GATK, v4.1.4.1)[Citation42], the vcf files for each separate sample were combined by GATK CombineGVCFs, and then genotyped by GATK GenotypeGVCFs. Single nucleotide variants (SNVs) were further filtered by GATK SelectVariants with the parameter of QD < 2.0, FS > 200.0, SOR > 10.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0. The VCF output generated by GATK were then parsed by a home-made R script. The allele frequencies of SNP were calculated using only those with the sequencing depth > 10x and altered base > 2x. SNP is filtered based on the following criteria: minor allele is sequenced at least 2 times and has a minimum frequency (alternate frequency) of 1%.

Positive selection analysis of SARS-CoV-2 genes

The nucleotide sequences of S, M, N, ORF3, ORF8, and ORF10 genes, which were obtained from sample sequencing were aligned using Mega5.0[Citation43]. Duplicate gene sequences were removed. Phylogenetic analysis of each gene was performed using an approximate maximum likelihood method implemented in FastTree 2.1 [Citation44] with the WAG+CAT model. Likelihood ratio test of positive selection was performed by comparing M7 (beta) and M8 (beta & ω) models using PAML4.9 software package [Citation45,Citation46]. Sites under positive selection were identified using the Bayes empirical Bayes (BEB) procedure. For evaluation of dN/dS ratio at the individual gene level, the SNP data in our dataset was used to construct the alternative sequence for each gene of SARS-CoV-2 using MN908947.3 as the reference. The homologous sequences were aligned using MUSCLE. The dN and dS were calculated using the Nei-Gojobori method [Citation47], with the JC substitution model implemented in DnaSP 5[Citation48].

The Alternative Allele Frequency Spectrum (AAFS) of SARS-CoV-2 genes

We selected the SNPs annotated as non-synonymous mutation in the SNP calling analysis. Then we grouped these SNPs according to their genomic locations on the different genes. The alternative allele frequencies of the SNPs were obtained from the SNP calling analysis and the cumulative distribution of the alternative allele frequencies of all SNPs for each gene through the stat_ecdf function in R package ggplot2 was calculated [Citation49] Additionally, the mean value of the alternative allele frequencies for each SNP shared in different samples was calculated in order to represent the frequency of each unique SNP and to avoid the dominant influence that the shared SNPs in some samples may have on the cumulative curve. Lastly, the cumulative distribution for these SNPs based on the average frequency among samples was calculated.

Network analysis

Networks of the samples sharing SNPs were constructed using Cytoscape[Citation50], with each node representing a sample. Two nodes were connected by an edge if they shared at least one SNP. The layout was determined using an edge-weighted spring-embedded model. There are 63 nodes and 1047 edges in the network. Among the 70 samples in which SNPs were identified, 7 samples did not share any SNPs. The samples collected from the same patient were treated as independent samples, as they were collected from different tissues of the respiratory tract. Among these commonly shared SNPs, SNPs at 22,638 were identified in 16 samples collected from 11 patients; SNPs at 28,144 were identified in 42 samples collected from 28 patients; and SNPs at 8782 were identified in 36 samples collected from 24 patients. The number of patients sharing the same SNPs is shown in ().

Figure 1. Distribution of SNPs among COVID-19 patients

A. Sequencing depth of SARS-CoV-2 for 94 samples. B. The plot of SARS-CoV-2 sequencing depth against genome coverage for the 94 samples. The y-axis represents genome coverage which is calculated as the propotion of length covered by mapped reads accounting for the whole length of the reference genome MN908947.3. C. The plot of SARS-CoV-2 sequencing depth against RT-PCR Ct value for the 94 samples from different organs of patients with different severity. The y-axis represents CT values of samples which reflect viral load (The CT value is inversely proportional to the viral load). The x-axis represents sequencing depth of every sample which is calculated as bases of all mapped reads divided by bases of reference genome MN908947.3. The sequencing depth of the virus in the sample is proportional to the viral load. D. Severe and mild patients have SNPs. E. The number of patients share the same SNPs. F. Box plot of the SNPs in mild patients and severe patients. The number of SNPs possessed by mild and severe patients is compared to each other with the Student’s t-test (t.test in R 3.6.2, two-sided, unadjusted for multiple comparisons).
Figure 1. Distribution of SNPs among COVID-19 patients

Coimmunoprecipitation (Co-IP) to detect the interaction of ORF8 and HLA-A2

The pierce classic IP kit (Thermo scientific 26,146) was used to perform Co-IP assay. In brief, HEK293T cells were lysed with IP lysis buffer (0.025 M Tris, 0.15 M NaCl, 0.001 M EDTA, 1% NP-40, 5% glycerol, 1% protease inhibitor cocktail, pH 7.4) (TransGen Biotech) for 15 minutes on ice with brief vertaxing every 5 minutes. Then pre-clear lysates were prepared using the control agarose resin. The pre-clear lysates were collected and incubated with 2–10 μg anti-FLAG antibody overnight at 4 °C with rotating to form immune complexes. Protein A/G plus agarose was added to the antibody/lysate sample and the incubated mixture was gentle shook for an hour. The immunoprecipitates were washed 4 times with ice-cold STN buffer, eluted by boiling SDS loading buffer and separated by SDS-PAGE for western blotting.

Western blotting for detection of ORF8 and HLA-A2

Cell lysates were prepared 24 hours after co-transfect (Bio-Rad). The samples were boiled for 5 minutes, separated by SDS-PAGE, and transferred to PVDF membranes. The membranes were blocked with 5% nonfat dry milk in Tris-buffered saline supplemented with 0.5% Tween 20, and proteins were detected by incubation with primary antibodies diluted in blocking buffer, followed by incubation with secondary antibodies (raised in goat against the appropriate species) conjugated to horseradish peroxidase (HRP) and diluted in blocking buffer. GST was detected using a rabbit polyclonal anti-GST antibody (catalog no. ab19256; Abcam), FLAG was detected using mouse polyclonal anti-FLAG antibodies (catalog no. HT201-01; TransGen Biotech), and GAPDH (glyceraldehyde-3-phosphate dehydrogenase) was detected using mouse monoclonal anti-GAPDH antibody 6 C5 (Calbiochem). Horseradish peroxidase (HRP) was detected using an enhanced chemiluminescence (ECL) kit (Bio-Rad).

Cell surface staining and flow cytometry analysis to detect the binding ability of site mutated S proteins to human ACE2

Spike expressor and site-mutated Spike expressor were transfected into 1 × 106 293 T cells with a weight of 3 μg, respectively. Cells were pre-incubated with 500 nM ACE2-biotyl after 24h post-transfection. ACE2 binding was detected using PE-conjugated streptavidin (Invitrogen). The percentage of ACE2 binding cells (PE+ cells) was determined by gating the living cell population based on viability dye staining (Aqua Vivid, Invitrogen). Samples were acquired on an LSRII cytometer (BD Biosciences, Mississauga, ON, Canada), and data analysis was performed using FlowJo vX.0.7 (Tree Star, Ashland, OR, USA).

Production and titration of SARS-CoV-2 S pseudoviruses

SARS-CoV-2 pseudotyped viruses were produced as previously described with some modifications [Citation51]. Briefly, 5 × 106 HEK 293 T cells were co-transfected with 6 μg each of pNL4-3. Luc. R -E- and 6 μg recombinant SARS-CoV-2 S plasmids were transfected into HEK 293 T cells using Lipofectamine 3000 Transfection Reagent (Invitrogen) according to the manufacturer’s instructions. The wild-type and mutated S protein pseudotyped viruses in supernatants were harvested 48 hours after transfection, centrifuged, filtered through a 0.45 μm filter, and stored at −80 °C. The pMD2.G was co-transfected with the pNL4-3. Luc. R-E- plasmid to package the VSV-G pseudovirus as the control. The titers of the pseudoviruses were calculated by determining the concentration of cytosolic gag p24 by ELISA.

SARS-CoV-2 S-mediated pseudo viral entry assay

To detect S variant-mediated viral entry, 293 T-ACE2 cells (2 × 104) grown on 96-well plates were infected with the same amount of wild-type S or mutated S pseudovirus. The cells were transferred to fresh DMEM medium 12 hours post-infection. After 48–72 hours post-infection, the 293T-ACE2 cells were lysed with 30 μL lysis buffer (Promega, Madison, WI, USA) to measure the pseudo viral transduction. Relative luminescence units (RLU) of Luc activity was detected using the Luciferase Assay Kit (Promega). All experiments were performed at least three times and expressed as mean ± standard deviation (SD).

Statistical analysis

Differences in the frequencies of SNPs’ alleles between mild and severe groups were compared using the Student’s t-test (t.test in R 3.6.2, two-sided, unadjusted for multiple comparisons). Effect size (Cohen’s d) of the comparisons was calculated with R lsr v.0.5.

Results

Distribution of single nucleotide variations (SNVs) among COVID-19 patients

Meta-transcriptomic sequencing was performed on 94 clinical specimens (55 nasal swabs,11 throat swabs, 21 sputa, and 7 bronchoalveolar lavage fluid samples) of 48 hospitalized patients at the Shenzhen Third People’s Hospital in late January 2020. Among these patients, 23 were severe cases and 25 were mild cases (Table S2). After mapping reads and genome assembly, we constructed 43 completely assembled SARS-CoV-2 genomes and 43 partial genomes. SARS-CoV-2 sequences represented 0.01% to 2.00% of all quality-filtered reads and 0.08% to 8.82% of the non-rRNA reads. 20x sequencing depth can yield more than 50% genome coverage (). These samples have an average sequencing depth of 63.4 on the virus genome (). For each sample, the virus’s average sequencing depth showed a strong correlation with the Ct values of the RT-qPCR performed on extracted RNA ().

Single nucleotide variants (SNVs) were identified in the SARS-CoV-2 population. To reduce false positives, SNP was filtered based on the following criteria: minor allele was sequenced at least 5 times and had a minimum frequency of 1%. After filtering out low-quality sequencing data, 180 intra-host SNVs of high confidence were identified, including 158 SNPs, 21 deletions, and 1 insertion. The 158 SNPs were identified with high confidence in 41 COVID-19 patients (Table S3). Among the 158 SNPs, 72 are synonymous and 86 nonsynonymous (Table S3). Among these 48 COVID-19 confirmed patients, 7 patients had no identified SNPs, 15 patients have less than 5 SNPs, 9 patients had more than 10 SNPs, and 17 patients has between 5 and 10 SNPs (). Among all identified intra-host SNPs, C8782T and T28,144C were the most commonly shared SNPs, found in more than 24 patients (). The number of SNPs did not differ between mild and severe patients (p-value = 0.2227) ( and ), neither did the alternate frequency of synonymous SNPs (Student’s t-test p-value = 0.9013, (). Nevertheless, the alternate allele frequency of nonsynonymous SNPs in mild patients was notably higher than in severe patients (Student’s t-test p-value = 0.02538, (). We speculate that some non-synonymous mutations probably cause reduced symptoms because competition between virus variants at the population level might create population mosaics of disease characteristics, such as infection fatality rates, transmission, and immune status[Citation52]. With such a small sample size, the results we observe can only reflect one of these symptoms.

Figure 2. Comparison of SNPs’ alternate frequencies between mild and severe patients

A. Box plot of synonymous SNPs’ alternate frequencies in mild patients and severe patients. B. Box plot of non-synonymous SNPs’ alternate frequencies in mild patients and severe patients. The alternate frequencies of syn and non-syn SNPs possessed by mild and severe patients are compared to each other with the Student’s t-test (t.test in R 3.6.2, two-sided, unadjusted for multiple comparisons).
Figure 2. Comparison of SNPs’ alternate frequencies between mild and severe patients

Genomic distribution of SNPs reveals that ORF8 can bear alternate alleles with high frequency

We compared the SNPs in our datasets with SNPs from assembled genomes deposited in the GISAID online datasets [Citation53–55] as of April 11, 2020 and found differences in the mutation spectrum (). 5ʹ-UTR is the functional region most significantly enriched with SNPs at an average of 449.06 SNPs per KB compared to an average of 104.43 SNPs per KB across the genome of SARS-CoV-2 based on GISAID SNP dataset (Odd Ratio, OR = 4.30, p-value = 1.89034E-10, (). 3ʹ-UTR had an average of 543.859 GISAID SNPs per KB (Odd Ratio, OR = 5.2078, p-value = 7.15E-11, () and no significant accumulation of SNPs at the 5ʹ-UTR or 3ʹ-UTR was observed in 94 clinical samples ().

Figure 3. Genomic distribution of SARS-CoV-2 SNPs

A. Location of SNPs detected in this study and GISAID SNPs in the SARS-CoV-2 genome. The y-axis represents the proportion of the sample that contains a certain type of SNP. Positions with coverage < 10x were excluded from the analysis. The shared SNPs of online datasets were analyzed based on 6032 SARS-CoV-2 genome sequences available online as of 11 April 2020 (2019nCoVR, https://bigd.big.ac.cn/ncov). B. The percentage of SNPs/kb in ORF regions and non-coding regions for all patients in this study. The value of the y-axis was calculated as the proportion of SNP of a viral gene accounting for the total SNP of the whole viral genome, then normalized by gene length. The calculation formula is “Percentage of SNPs per kb = SNPs of a gene/(Total SNPs of the virus genome×gene length)”. C. The cumulative distribution of the alternative allele frequency for the non-synonymous SNPs of different genes among all samples. The y-axis represents the cumulative frequency of occurrence for non-synonymous SNPs from a specific gene, which means that the total frequency of all non-synonymous SNPs in a specific gene is set to 1. According to the value of the alternate frequency of these SNPs, the frequency of occurrence for SNPs with the same alternate frequency are accumulated in ascending order. The x-axis represents the alternative allele frequency of non-synonymous SNPs. D. The cumulative distribution of the average alternative allele frequency for each non-synonymous SNP of a specific gene among all samples. To avoid the dominance of certain shared SNPs in the samples, the alternative allele frequency shown here is the mean value of the alternative allele frequencies of the same SNPs among all samples. The y-axis represents the cumulative frequency of occurrence for non-synonymous SNPs from a specific gene. The x-axis represents the alternative allele frequency of SNPs.
Figure 3. Genomic distribution of SARS-CoV-2 SNPs

For virus genomes under positive selection, the ratio of nonsynonymous substitution to synonymous substitution (dN/dS) was greater than 1.0 [Citation3,Citation56]. To test whether the genes of SARS-CoV-2 underwent positive selection during host–virus interaction, we performed positive selection analysis in each gene of the SARS-CoV-2 base on the SNP data in each sample to construct the alternate sequence for each gene using MN908947.3 as the reference. Due to the small sample size, few SNPs in ORF6, ORF7a, E, and a series of NSP genes were identified, so the analysis is not reliable. None of the other genes (M, N, S, ORF3, ORF8, or ORF10) were subjected to significant positive selection according to the likelihood ratio test of positive selection performed by comparing M7 (beta) and M8 (beta&ω) models using PAML4.9 (). We detected amino acid residues under positive selection for these genes. The G251V of ORF3a used to determine virus V clade[Citation15], was identified as a positively selected site by using the Bayes empirical Bayes (BEB) procedure (). Furthermore, alleles under positive selection can reach a higher allele frequency in the population than alleles selected against. Thus, a shift of the Alternative Allele Frequency Spectrum (AAFS) of all variants in a particular gene toward higher frequency indicates a positive selective pressure on the gene as a whole. To seek genetic signatures of adaptation of SARS-CoV-2 to the host’s intracellular environment, we examined the AAFS for all SARS-CoV-2 genes ( and ). All the AAFS for SNPs of the ORF8 gene showed a drastic shift toward higher frequency. The function of the ORF8 gene was largely unclear until recently. An unpublished study showed that the ORF8 protein expressed by SARS-CoV-2 could disrupt the antigen presentation of virus-infected cells by binding directly to the MHC-I molecules and induce the degradation of MHC-I by lysosomes[Citation57]. Here, Co-IP proves the direct intracellular interaction between ORF8 and HLA-A2 (). With the increase of ORF8 expression, the intracellular concentration of HLA-A2 decreased (). It is tempting to speculate that the high variation and positive selection in the viral ORF8 gene may be linked to the extraordinary diversity of human MHC-I genes. ORF8 represents the rapid adaptation of the SARS-CoV-2 genome to the host immune environment.

Table 1. Positive selection analysis in the genes of SARS-CoV-2 based on the dataset in this study

Figure 4. The direct intracellular interaction between ORF8 and MHC I molecular, and ORF8 degrades MHC I in a dose-dependent manner

A. ORF8 was co-immunoprecipitated with the overexpressed HLA-A2. HEK293T Cells were transfected with HLA-A2-FLAG expressing plasmid together with ORF8-GST expressing plasmid or vector for 24 hours before harvest. Cells were collected for co-IP with the anti-FLAG-tag beads. An anti-GPDAH antibody was used as a loading control. A representative result for three independent experiments displayed. B. Dose-dependent decreasing of HLA-A by ORF8. HEK293T cell was co-transfected with 0.5 μg pCMV3-HLA-A-Flag plasmid and pCMV-GST-ORF8 of 0, 0.25, 0.5, 1.0, 1.5, and 2.0 μg for 24 hours before harvest. Then equal cell lysis was used for western blotting analysis to detect the expression level of HLA-A. An anti-GPDAH antibody was used as a loading control. A representative result for three independent experiments displayed.
Figure 4. The direct intracellular interaction between ORF8 and MHC I molecular, and ORF8 degrades MHC I in a dose-dependent manner

The allele composition and frequency of each SNP site was also analyzed. Among 12 single nucleotide polymorphisms types, A > G, G > A, T > C, and G > T were most common. The ratio of C > T was 40% in our SNPs datasets, which is lower than 50% C > T for GISAID SNPs datasets but similar to 40% C > T iSNV for Houston samples in a recent survey [Citation58] (Figure S1). In humans and many other species, the rate of C > T substitutions is higher than that of other types of substitutions, as methylated Cytosine (C) can be replaced by thymine (T) during DNA replication if the amino group is removed from the methylated cytosine. However, unlike in humans, where the majority of the C > T substitutions occur on the canonical methylated CpG sites, C > T substitutions occur more on CpA, CpT, and CpC sites than on CpG sites in SARS-CoV-2, suggesting a potential distinct mechanism underlying the methylation of the virus genome[Citation59].

Dynamic of the quasispecies component in SARS-CoV-2 during the period of symptom onset

Analyzing the allele frequency of viral intra-host SNPs at different sampling sites and time points (mainly sputum, nasal swabs) from an individual might help us observe the spatial and temporal dynamics of intra-host variations of SARS-CoV-2 during infection (). However, our sampling time point is too close. Only 9 (22%) patients were detected with varying intra-host SNP frequency over time among all patients ( and Table S3). For instance, the alternate frequency of SNPs (G25,540A, C2536T, and C12,036T) in the nasal swabs of Patient 32 increased quickly from the second day to the fourth day after symptom onset. G25,540A substitution leads to V50I mutation in ORF3a, C12,036T substitution leads to A3924V mutation in ORF1ab and Nsp3, and the substituted bases eventually took over (). From the fourth day to the seventh day after symptom onset, the alternate frequency of C24,034T and T28,144C substitutions in Patient 18 was significantly increased ( and ). C24,034T mutation leads to synonymous mutation at S protein, T28,144C leads to L84S mutation in ORF8, and 84S eventually dominated the upper respiratory tract of Patient 18 (). Moreover, the alternate frequency of C2334T(A690V, ORF1ab, and Nsp2) and A9162G (N2966S, orf1ab, and Nsp4) of Patient 12 decreased from day 5 to day 6 after symptom onset ( and ). These data suggest the occurrence of purifying selection of C2334T (A690V, ORF1ab, and Nsp2)/A9162G (N2966S, ORF1ab, and Nsp4) in Patient 12, and positive selection on C12,036T (A3924V, ORF1ab, and Nsp3)/G25,540A (V50I, ORF3a) in Patient 32 and T28,144C (L84S, ORF8) in Patient 18.

Figure 5. Dynamic of quasispecies’ composition identified in three COVID-19 patients

A. Symptoms and results of RT-qPCR testing for SARS-CoV-2 of the three COVID-19 patients (P12, P18, and P32) by day of the investigation. Gradient shading indicates an unclear period of symptom onset from the patient report. RT-qPCR means Real-time quantitative PCR, inc means inconclusive result. SARS-CoV-2, severe acute respiratory syndrome coronavirus 2. The sampling dates are denoted by red arrows. B, C. Allele frequency of C23,34T(B), A9162G(C) substitutions in nasal swabs collected from Patient 12 (P12) by time course. D, E. Allele frequency of C24,034T(D), T28,144C (E) substitutions in sputum and nasal swabs collected at time points from Patient 18 (P18). F. Allele frequency of G24,430A, C2536T, C12,036T substitutions in different types of samples (sputum and nasal swab) collected from Patient 32 (P32). Allele frequency of G24,430A, C2536T, and C12,036T substitutions in nasal swabs collected from Patient 32 at different time points.The allele frequency was analyzed using Wuhan-hu-1(MN908947.3) as a reference. The alternate allele depth/total depth for each sample is shown up the bar. G. The percentage of patients with dynamic quasispecies diversity.
Figure 5. Dynamic of quasispecies’ composition identified in three COVID-19 patients

To study the correlation of mutation with the disease severity, we checked the patients’ clinical records and found the sampling date almost followed the onset date. As shown in , samples were collected 5 days, 4 days, and 2 days after symptom onset date for Patients12, 18, and 32, respectively, suggesting that the virus begins to evolve early during the onset of symptoms. We hypothesized that the virus evolved rapidly in the acute phase of infection due to the human body’s intense immune response. It is important to note that although Patients 12 and 18 were classified as mild, the RT-qPCR results for SARS-CoV-2 remained positive for a long time. For example, Patient 18 tested positive even after two months of home isolation after discharge. We hypothesize that mutations in the virus resulted in its persistence within Patients 12 and 18 (Table S2 and S3).

Co-occurrence of SNPs was associated with the severity of disease

To explore the dominant SNPs in the study population, shared SNPs of the virus were further analyzed. A SNP network connecting patients with shared SNPs was constructed. In this network, nearly all patients are connected by at least one shared SNP and the most abundant five SNPs formed four major clusters (). SNPs at 8782(C-T) and 28,144(T-C) coexisted in the patients, forming the largest cluster. The clustering heatmap of allele frequency of different samples collected from mild and severe patients also revealed that SNPs at 8782 and 28,144 frequently occur together and are the most common variants among samples (). Interestingly, SNPs at sites 22,246(T-G), 22,638(G-A), 28,077(G-C), 24,034(C-T), 13,394(A-G), and 26,729(T-C) with high alternating frequencies can be detected simultaneously in a sample (). In this cluster, 7 (70%) patients came from the severe group and 3 (30%) patients were in the mild group. This cluster of patients did not belong to the same family or community, nor did they have direct contact with each other, suggesting that the variations at these loci were randomly distributed and dominant in the population. Among the most abundant SNPs, SNPs at sites 22,246, 22,638, and 24,034 were all distributed in the S glycoprotein region and Site 28,144 was located at ORF8. The T-to-G substitution at site 22,246 leads to Asp228Glu amino-acid substitution of S protein; the G-to-A nucleotide substitution at site 22,638 causes a Ser359Asn amino acid substitution of S protein; and the C-to-T substitution at 28,144 leads to the Leu84Ser amino-acid substitution of ORF8. In the early phases of the epidemic, the L84S mutation in ORF8 resulted in two subgroups of the virus[Citation60].

Figure 6. Co-occurring SNPs were found in S glycoprotein among COVID-19 patients

A. Network analysis of patients with shared SNP sites. Nodes denote patients, and edges indicate that two patients have the same SNP sites. Clusters of the network are highlighted by colored shading, and the shared SNP sites leading to the clusters are shown. B. Box plot of the alternate frequency of T22,246G SNP in mild and severe patients. The location of SNP in the genome was shown. S, S glycoprotein. Nsyn, Non-synonymous. The reference was MN908947.3. C. The alternate frequency of G22,638A SNP in mild and severe patients. S, S glycoprotein. Nsyn, Non-synonymous. D. The alternate frequency of C24,034T SNP in mild and severe patients. S, S glycoprotein. Syn, synonymous. E. The alternate frequency of T28, 144C SNP in mild and severe patients. ORF8, T28,144C SNP located in ORF8 region of SARS-CoV-2 genome. Boxes represent the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively). Lines inside denote the median. T-test was used to compare alternate frequencies of SNPs between the mild and severe groups. F. The cluster heatmap of SNP in different samples collected from mild and severe patients by time point. Sampling date, organ, and population group are shown by a different color in right-side boxes; the samples with shared SNPs and relative patients were highlighted. The x-axis represents the identified SARS-CoV-2 SNPs in this study, which clustered according to its alternate frequency. Alternate frequencies of SNPs are showed with a gradient color marker. White represents the SNP alternate frequency value of 0; the genomic regions failed to pass our criteria used in identifying SNPs. Dark red indicates an alternate frequency approaching 1. The y-axis represents samples clustered by different mutation types. P means patient. BALF, bronchoalveolar lavage fluid.
Figure 6. Co-occurring SNPs were found in S glycoprotein among COVID-19 patients

Subsequently, whether the allele frequency of these dominant SNPs was associated with disease severity was analyzed. The alternate frequency of SNP at sites 22,246, 22,638, 24,034, and 28,144 was analyzed in patients with different severity. The alternate frequency distribution of SNP at sites 22,638 was statistically lower in severe patients than in mild patients (p = 0.042, cohensD = 0.9045, (). No differences were found for the alternate frequency of SNPs at sites 22,246, 24,034, and 28,144 between the mild and severe groups (p = 0.1358, 0.95, and 0.7 for sites 22,246, 24,034, and 28,144, respectively, (, , and ). The S359N (G22,638A) mutation on the S protein seems to be more likely to appear in mild patients based on our small sample. However, we need more samples and experiments to determine whether the mutation of S359N will affect the fitness of the virus in the host.

Site 359 of S protein does not affect its binding to ACE2 but is a candidate epitope for CTL response

To investigate the effect of intra-host genomic variations of SARS-CoV-2 on its biological function and pathogenicity, the function of SNPs with high alternate frequency was further analyzed. According to the high-resolution cryo-EM structure information of SARS-CoV-2 S glycoprotein (PDB: 6LZG), the receptor-binding domain (RBD) interacts with the human ACE2 receptor to mediate the binding of the virus to the target cell. The receptor-binding motif (RBM) in RBD forms the interface between S glycoprotein and human ACE2 [Citation61,Citation62]. The Ser359Asn substitution was located in the RBD domain but outside the RBM region (), implicating amino acid 359 does not directly interact with ACE2[Citation63]. As the region outside the RBM also plays an important role in maintaining the RBD’s structural stability[Citation64], whether the S359N mutation altered the virulence of the virus and its fitness in the host was explored. A series of s glycoprotein site mutations expression vectors were constructed, then transfect 293 T cell () and ACE2 binding ability to mutated S proteins were detected by cell surface staining and flow cytometry analysis. Compared to the F486A and N487A in RBM motif-2 and T500A in RBM motif-1, the S359N mutation led to a slight decrease of ACE2-binding (), but this slight change was not enough to affect the efficiency of the pseudovirus infection () and is not an important site for maintaining ACE2 binding. Recently, a study reported that the top eight neutralizing antibodies maintained their potency against S359N variants[Citation65], so the S359N variant is not an escape mutant from neutralizing antibodies. As in the host, the virus will face cellular immune responses as well as humoral immune responses. Thus, different MHC binding and T cell responses should be considered as triggers for viral mutations. The peptides of S protein for binding to most frequent MHC class I (A&B) alleles in Europe, Asia, and Africa were analyzed (Table S5). The peptides with S359 inside can bind to the HLA-A30 allele (consensus_percentile_ranks<1) and S359 is an HLA-A30 restricted epitope.

Figure 7. Functional and structural insights into S359N variant of SARS-CoV-2 S glycoprotein

A. Overall structure of SARS-CoV-2 RBD bound to human ACE2. The interface of RBD and ACEs are displayed as RBM motif-1 and RBM motif-2. The amino acid S359 and amino acids in the receptor-binding motif are shown as sticks. B. Detection of the expression of site-mutated S glycoprotein in 293 T cell line. C. Cell staining and flow cytometry analysis of site-mutated S glycoprotein binding to ACE2. The experiment was repeated three times, and similar results were obtained. D. Detection site-mutation derived SARS-CoV-2 pseudovirus infection. The data are expressed as mean relative luciferase units (RLU) ± standard deviation (SD) of 4 parallel wells in 96-well culture plates. The relative luminescence units (RLU) detected 72 hours post-infection (hpi). The experiment was repeated three times and similar results were obtained.
Figure 7. Functional and structural insights into S359N variant of SARS-CoV-2 S glycoprotein

L84S mutation is an candidate for HLA-A02 restricted CTL epitope

ORF8 directly interacts with human MHC I molecular. The high frequency of SNP28,144 (L84S) was widespread and detected in most samples in this study, suggesting that this site may be selected ( and , Table S3). It is also predicted that peptides of ORF8 bind to the MHC I molecule, based on the IEDB analysis resource using a consensus method[Citation66]. The wild-type peptides containing site L84 can bind to HLA-A*02:06, common in Asian populations. HLA-A*02:06 accommodates the wild-type, but not the SARS-CoV-2 ORF8 (L84S) variant (). To show the accuracy of these predictions, the literature was reviewed for HA peptides of H5N1 influenza viruses that bind to HLA-A*02:01, which have been experimentally verified as HLA-A*02:01-restricted peptides for H5 HA peptides[Citation67]. The ORF8 L84S variants changed peptide binding to MHC class I alleles, meaning the site 84 of ORF8 may be a potentially cytotoxic T-lymphocyte epitope.

Table 2. ORF8 L84S variants and differential peptide binding to the HLA-A allele

Discussion

RNA viruses replicate in-vivo as a quasispecies, a dynamic distribution of divergent but closely related genomes subjected to a continuous process of genetic variation, competition, and selection[Citation3]. This genomic heterogeneity confers a remarkable advantage to the viral population allowing for a rapid adaptation to a changing environment. Although SNVs have been identified in the SARS-CoV-2 genome and some of them contribute to viral pathogenicity[Citation68], more viral genomic data from clinical specimens needs to be analyzed in order to study intra-host SNV dynamics, which may directly reflect virus-host interaction.

This study addresses this gap by using metatranscriptomic sequencing to find SNVs from 94 sequenced clinical samples of 48 COVID-19 patients with varying disease severity. One hundred fifty-eight SNPs were identified with high confidence in 70 samples of 41 COVID-19 patients after removing the low-quality reads and samples with low coverage and less than 10-fold sequencing depth. The positive selection analysis of genes in the SARS-CoV-2 genome based on our datasets did not result in any genes being subjected to positive selection. Our result is different from an earlier study, which reported that ORF3a and ORF8 were under positive selection and exhibited higher dN/dS ratios than other genes at the level of the individual gene [Citation69], but this may be due to the small sample size of this study. As no synonymous SNP in ORF3 were detected, it is difficult to conclude a more widely representative dN/dS ratio at the level of individual genes. According to our dataset, ORF3a has a positively selected amino acid G251V, consistent with the previous literature[Citation69]. Additionally, in this study, ORF8 seemed to be more tolerant of some high-frequency alternate allele. It is worth noting that 84(L > S) is a high-frequency non-synonymous mutation found on ORF8 in our data set, which is contrary to an earlier survey that found L84 lineage is more prevalent than S84 lineage[Citation60]. The function of SARS-CoV-2 ORF8 has not yet been clarified, though the SARS-CoV-2 382-nt deletion variant with truncated ORF8 was detected in Singapore and Taiwan [Citation70,Citation71]. The SARS-CoV-2 382-nt deletion viruses showed significantly higher replicative fitness in vitro than the wild type[Citation70]. In one recent report, ORF8 was determined to be a protein secreted by infected cells into serum and is highly immunogenic in COVID-19 patients [Citation72]. Thus, we predict ORF8 peptides bind to the MHC class I allele. Some specific MHC class I alleles accommodate L84 peptides but not S84 peptides (). Under host selective pressure, ORF8 deletions or site mutations may occur to evade antibody neutralization and cellular immunity, allowing SARS-CoV-2 to escape clearance from the host’s immune system.

The correlation between viral quasispecies evolution and pathogenicity has been confirmed in prior studies [Citation68]. Our analysis of shared SNPs among patients with different severity suggest that mild patients had higher alternate frequency SNPs at 22,638 (S359N). Because of the small sample size of this study, it is difficult to conclude if the S359N mutation caused reduced symptoms in patients. However, S359N, a common mutation in the population was determined. The reason for the emergence of this S359N variant is not clear. In a recent survey, eight top NAbs maintained their potency against the S359N variant [Citation65], suggesting that S359 might not be the target of humoral immunity. The TepiTool was used to predict T cell epitope candidates for S protein of SARS-CoV-2, which revealed peptide (RISNCVADY) with site S359 is an HLA-A*30 restricted T cell epitope (Table S5). Whether these circulating SARS-CoV-2 variants are associated with T-cell responses or neutralizaing antibodies still needs to be clarified. By comparing SNPs’ allele frequencies in samples collected from Patients 18 and 32 at similar time points, we found a novel intra-host viral profile of SARS-CoV-2. As seen in Patient 18, the mutation S84 (28,144 C) on ORF8 eventually replaced L84 (28,144 T) as the dominant variant in quasispecies in the patient’s sputum. This is reminiscent of rapidly changing viral lineages during acute HIV-1 infection, namely the rapid replacement of the major transmitted/founder lineage by a minor transmitted/founder lineage[Citation73]. This phenomenon may be due to the interplay between viral and host factors. The type I IFNs and amount of antibodies in serum changed dramatically at symptom onset [Citation73], which will exert selective pressure on the virus genome.

Studies on the quasispecies diversity of other RNA viruses have found more abundant intra-host genetic variations. For example, metatranscriptomic sequencing results of HIV-1 in the blood and female genital tract have identified 77 iSNVs in an individual [Citation74]. Research on intra-host dynamics of the Ebola virus during 2014 identified 710 iSNVs in 135 EBOV samples[Citation56]. Fewer SNVs were identified in SARS-CoV-2 positive samples from this study, which could in part be attributable to lower sequencing depth (due to the low viral load in the sample), but also likely reflects some true loss of diversity. As a β-coronavirus, SARS-CoV-2 encodes an RNA dependent RNA polymerases (RdRPs) with a high-fidelity nucleotide incorporation ability[Citation75]. Its CoV nonstructural protein14 (nsp14) also encodes 3ʹ-to-5ʹ exoribonuclease activity (ExoN), which performs a proofreading function and is required for high-fidelity replication[Citation75]. This would provide some evidence for the low diversity of the intra-host population for SARS-CoV-2 to some extent.

In conclusion, frequent sampling and metatranscriptomic sequencing was utilized to study SARS-CoV-2 populations present in upper and lower respiratory tracts of patients with different severity. Rapid and dramatic changes in quasispecies diversity were observed, providing new insights into the intra-host evolution of SARS-CoV-2. However, our current research has great limitations due to the small sample size. More samples are needed to investigate the intra-host genomic diversity of SARS-CoV-2. Furthermore, more experiments should be carried out to explore the host immune system’s influence on virus evolution.

Notes on contributions

Z.Z., F.L., and X.L. contributed to the study design. R.G., and X.L. contributed to the manuscript preparation. R.G., W.Z., Y.L., Y.W., Y.G., Z.L., X.L., and Y.Z. contributed to the data analysis. R.G., S.Z., L.C. performed the library construction and all experiments. H.W., J.Y., and W. L. performed the quantitative RT-PCR. All authors approved the manuscript.

Supplemental material

Supplemental Material

Download Zip (286.1 KB)

Acknowledgments

The authors thank Dr. Yuxia Gao from the China-USA Citrus Huanglongbing Joint Laboratory, National Navel Orange Engineering Research Center, Gannan Normal University for her calculation of dN/dS. We acknowledge the National supercomputing center in Shenzhen for providing free use of the High Performance Computing Platform. We thank the China National GeneBank at Shenzhen for providing sequencing service. We appreciate all the authors who have deposited and shared genome data on GISAID.

Disclosure statement

The authors declare that they have no conflict of interests.

Supplementary material

Supplemental data for this article can be accessed here.

Additional information

Funding

This work was supported by the Central Charity Fund of Chinese Academy of Medical Science [2020-PT310-009];Guangdong Basic and Applied Basic Research Foundation [2019A1515110643]; National Science Fund for Distinguished Young Scholars [82025022]; Shenzhen Science and Technology Innovation Committee [2020A1111350032, JCYJ20190809112205541]; the emergency grants for prevention and control of SARS-CoV-2 of Guangdong province [2020B111108001, 2020B1111340036, 2020B1111340029]; the China Postdoctoral Science Foundation [2019M660837].

References

  • Kunkel TA. Exonucleolytic proofreading. Cell. 1988;53(6):837–840.
  • Vaughan G, Goncalves Rossi LM, Forbi JC, et al. Hepatitis A virus: host interactions, molecular epidemiology and evolution. Infect Genet Evol. 2014;21:227–243.
  • Domingo E, Sheldon J, Perales C. Viral quasispecies evolution. Microbiol Mol Biol Rev. 2012;76:159–216.
  • Nowak MA. What is a quasispecies? Trends Ecol Evol. 1992;7(4):118–121.
  • Barik S, Das S, Vikalo H. QSdpR: viral quasispecies reconstruction via correlation clustering. Genomics. 2018;110(6):375–381.
  • Xu D, Zhang Z, Wang FS. SARS-associated coronavirus quasispecies in individual patients. N Engl J Med. 2004;350(13):1366–1367.
  • Park D, Huh HJ, Kim YJ, et al. Analysis of intrapatient heterogeneity uncovers the microevolution of middle east respiratory syndrome coronavirus. Cold Spring Harb Mol Case Stud. 2016;2(6):a001214.
  • Zhou P, Yang X, Wang X, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270–273.
  • Li X, Giorgi EE, Marichannegowda MH, et al. Emergence of SARS-CoV-2 through recombination and strong purifying selection. Sci Adv. 2020;6(27):eabb9153.
  • Forster P, Forster L, Renfrew C, et al. Phylogenetic network analysis of SARS-CoV-2 genomes. Proc Natl Acad Sci U S A. 2020;117(17):9241–9243.
  • Van Dorp L, Acman M, Richard D, et al. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect Genet Evol. 2020;83:104351.
  • Rambaut A, Loman N, Pybus O, et al. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Virological org. 2020.
  • Domingo-Calap P, Schubert B, Joly M, et al. An unusually high substitution rate in transplant-associated BK polyomavirus in vivo is further concentrated in HLA-C-bound viral peptides. PLoS Pathog. 2018;14(10):e1007368.
  • Holmes EC, Dudas G, Rambaut A, et al. The evolution of Ebola virus: insights from the 2013–2016 epidemic. Nature. 2016;538(7624):193–200.
  • Gámbaro F, Behillil S, Baidaliuk A, et al. Introductions and early spread of SARS-CoV-2 in France. Euro Surveill. 2020;25(26):2001200.
  • Rambaut A, Holmes EC, Á O, et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020;5(11):1403–1407.
  • Kupferschmidt K. Fast-spreading U.K. virus variant raises alarms. Science. 2021;371(6524):9–10.
  • Santos JC, Passos GA. The high infectivity of SARS-CoV-2 B.1.1.7 is associated with increased interaction force between Spike-ACE2 caused by the viral N501Y mutation. bioRxiv. 2020. doi: 10.1101/2020.12.29.424708
  • Gaudieri S, Rauch A, Park LP, et al. Evidence of viral adaptation to HLA class I-restricted immune pressure in chronic hepatitis C virus infection. J Virol. 2006;80(22):11094–11104.
  • Cuypers L, Li G, Libin P, et al. Selective pressure in hepatitis C virus genotypes 1–6: significance for direct-acting antiviral treatment and drug resistance. Viruses. 2015;7(9):5018–5039.
  • Ghadimi-Moghadam A, Haghani M, Bevelacqua JJ, et al. COVID-19 tragic pandemic: concerns over unintentional “directed accelerated evolution” of novel coronavirus (SARS-CoV-2) and introducing a modified treatment method for ARDS. J Biomed Phys Eng. 2020;10(2):241–246.
  • Azkur AK, Akdis M, Azkur D, et al. Immune response to SARS-CoV-2 and mechanisms of immunopathological changes in COVID-19. Allergy. 2020;75(7):1564–1581.
  • Xu X, Gao X. Immunological responses against SARS-coronavirus infection in humans. Cell Mol Immunol. 2004;1(2):119–122.
  • Hodgins B, Yam KK, Winter K, et al. A single intramuscular dose of a plant-made virus-like particle vaccine elicits a balanced humoral and cellular response and protects young and aged mice from influenza H1N1 virus challenge despite a modest/absent humoral response. Clin Vaccine Immunol. 2017;24(12):e00273–17.
  • Da Costa XJ, Brockman MA, Alicot E, et al. Humoral response to herpes simplex virus is complement-dependent. Proc Natl Acad Sci USA, 1999; 96(22):12708–12712.
  • Veron P, Leborgne C, Monteilhet V, et al. Humoral and cellular capsid-specific immune responses to adeno-associated virus type 1 in randomized healthy donors. J Immunol. 2012;188(12):6418–6424.
  • Rogers TF, Zhao F, Huang D, et al. Isolation of potent SARS-CoV-2 neutralizing antibodies and protection from disease in a small animal model. Science. 2020;369(6506):956–963.
  • Long QX, Liu BZ, Deng HJ, et al. Antibody responses to SARS-CoV-2 in patients with COVID-19. Nat Med. 2020;26(6):845–848.
  • Amanat F, Stadlbauer D, Strohmeier S, et al. A serological assay to detect SARS-CoV-2 seroconversion in humans. Nat Med. 2020;26(7):1033–1036.
  • Wu F, Wang A, Liu M, et al. Neutralizing antibody responses to SARS-CoV-2 in a COVID-19 recovered patient cohort and their implications. medRxiv. 2020; doi: 10.1101/2020.03.30.20047365.
  • Ni L, Ye F, Cheng M-L, et al. Detection of SARS-CoV-2-Specific humoral and cellular immunity in COVID-19 convalescent individuals. Immunity. 2020;52(6):971–977.
  • Baruah V, Bose S. Immunoinformatics-aided identification of T cell and B cell epitopes in the surface glycoprotein of 2019-nCoV. J Med Virol. 2020;92(5):495–500.
  • Draper JL, Hansen LM, Bernick DL, et al. Fallacy of the unique genome: sequence diversity within single helicobacter pylori strains. mBio. 2017;8(1):e02321–16.
  • Roedig JV, Rapp E, Höper D, et al. Impact of Host Cell Line Adaptation on Quasispecies Composition and Glycosylation of Influenza A Virus Hemagglutinin. Plos One. 2011;6(12):e27989.
  • Kissi B, Badrane H, Audry L, et al. Dynamics of rabies virus quasispecies during serial passages in heterologous hosts. J Gen Virol. 1999;80(8):2041–2050.
  • Ke R, Zitzmann C, Ribeiro RM, et al. Kinetics of SARS-CoV-2 infection in the human upper and lower respiratory tracts and their relationship with infectiousness. medRxiv. 2020. 2020.09.25.20201772. http://doi.org/10.1101/2020.09.25.20201772
  • Branchett WJ, Lloyd CM. Regulatory cytokine function in the respiratory tract. Mucosal Immunol. 2019;12(3):589–600.
  • Zuercher AW, Jiang H-Q, Thurnheer MC, et al. Distinct mechanisms for cross-protection of the upper versus lower respiratory tract through intestinal priming. J Immunol. 2002;169(7):3920–3925.
  • Kechin A, Boyarskikh U, Kel A, et al. cutPrimers: a new tool for accurate cutting of primers from reads of targeted next generation sequencing. J Comput Biol. 2017;24(11):1138–1143.
  • Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760.
  • Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079.
  • McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–1303.
  • Tamura K, Peterson D, Peterson N, et al. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 2011;28(10):2731–2739.
  • Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol. 2009;26(7):1641–1650.
  • Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–1591.
  • Yang Z, Nielsen R, Goldman N, et al. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000;155(1):431–449.
  • Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3(5):418–426.
  • Librado P, Rozas J. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics. 2009;25(11):1451–1452.
  • Kassambara A. ggplot2: guide to create beautiful graphics in R. Create Space Independent Publishing Platform; 2016; 1:236.
  • Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–2504.
  • Crawford KH, Eguia R, Dingens AS, et al. Protocol and reagents for pseudotyping lentiviral particles with SARS-CoV-2 spike protein for neutralization assays. Viruses. 2020;12(5):513.
  • Danchin A, Timmis K. SARS-CoV-2 variants: relevance for symptom granularity, epidemiology, immunity (herd, vaccines), virus origin and containment? Environ Microbiol. 2020;22(6):2001–2006.
  • Zhao WM, Song SH, Chen ML, et al. The 2019 novel coronavirus resource. Hereditas (Beijing). 2020;42:212–221.
  • Song S, Ma L, Zou D, et al. The global landscape of SARS-CoV-2 genomes, variants, and haplotypes in 2019nCoVR. Genomics Proteom Bioinform. 2020 S1672-0229(20)30131-5.
  • Gong Z, Zhu JW, Li CP, et al. An online coronavirus analysis platform from the national genomics data center. Zool Res. 2020;41(6):705–708.
  • Ni M, Chen C, Qian J, et al. Intra-host dynamics of Ebola virus during 2014. Nat Microbiol. 2016;1(11):16151.
  • Zhang Y, Zhang J, Chen Y, et al. The ORF8 protein of SARS-CoV-2 mediates immune evasion through potently downregulating MHC-I. bioRxiv. 2020. doi: 10.1101/2020.05.24.111823.
  • Sapoval N, Mahmoud M, Jochum MD, et al. SARS-CoV-2 genomic diversity and the implications for qRT-PCR diagnostics and transmission. Genome Res. 2021; 31(4):635-644
  • Robertson KD, Jones PA. DNA methylation: past, present and future directions. Carcinogenesis. 2000;21(3):461–467.
  • Tang X, Wu C, Li X, et al. On the origin and continuing evolution of SARS-CoV-2. Natl Sci Rev. 2020;7(6):1012–1023.
  • Wang Q, Zhang Y, Wu L, et al. Structural and functional basis of SARS-CoV-2 entry by using human ACE2. Cell. 2020;181(4):894–904.e9.
  • Morse JS, Lalonde T, Xu S, et al. Learning from the past: possible urgent prevention and treatment options for severe acute respiratory infections caused by 2019-nCoV. Chembiochem. 2020;21(5):730–738.
  • Walls AC, Park YJ, Tortorici MA, et al. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell. 2020;181(2):281–292.e6.
  • Cagliani R, Forni D, Clerici M, et al. Computational inference of selection underlying the evolution of the novel coronavirus, SARS-CoV-2. J Virol. 2020;94(12):e00411-20.
  • Baum A, Bo F, Wloga E, et al. Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science. 2020;369(6506):1014–1018.
  • Paul S, Sidney J, Sette A, et al., TepiTool: A pipeline for computational prediction of T Cell Epitope Candidates. Curr Protoc Immunol. 2016; 114: 18.19.1-18.19.24.
  • Sun Y, Liu J, Yang M, et al. Identification and structural definition of H5-specific CTL epitopes restricted by HLA-A*0201 derived from the H5N1 subtype of influenza A viruses. J Gen Virol. 2010;91(4):919–30.
  • Yao H, Lu X, Chen Q, et al. Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo. Cell Discov. 2020;6(1):76.
  • Velazquez-Salinas L, Zarate S, Eberl S, et al. Positive selection of ORF1ab, ORF3a, and ORF8 genes drives the early evolutionary trends of SARS-CoV-2 during the 2020 COVID-19 pandemic. Front Microbiol. 2020;11:550674.
  • Su YCF, Anderson DE, Young BE, et al. Discovery and genomic characterization of a 382-nucleotide deletion in ORF7b and ORF8 during the early evolution of SARS-CoV-2. mBio. 2020;11(4):e01610–20.
  • Gong YN, Tsao KC, Hsiao MJ, et al. SARS-CoV-2 genomic surveillance in Taiwan revealed novel ORF8-deletion mutant and clade possibly associated with infections in middle east. Emerg Microbes Infect. 2020;25(1):1754–1760.
  • Wang X, Lam JY, Wong WM, et al. Accurate diagnosis of COVID-19 by a novel immunogenic secreted SARS-CoV-2 orf8 protein. mBio. 2020;11(5):e02431–20.
  • Kijak GH, Sanders-Buell E, Chenine AL, et al. Rare HIV-1 transmitted/founder lineages identified by deep viral sequencing contribute to rapid shifts in dominant quasispecies during acute and early infection. PLoS Pathog. 2017;13(7):e1006510.
  • Piantadosi A, Freije CA, Gosmann C, et al. Metagenomic sequencing of HIV-1 in the blood and female genital tract reveals little quasispecies diversity during acute infection. J Virol. 2019;93(2):93.
  • Subissi L, Posthuma CC, Collet A, et al. One severe acute respiratory syndrome coronavirus protein complex integrates processive RNA polymerase and exonuclease activities. Proc Natl Acad Sci U S A. 2014;111(37):E3900–9.