2,195
Views
4
CrossRef citations to date
0
Altmetric
Research Paper

Multiomic spatial analysis reveals a distinct mucosa-associated virome

, , , &
Article: 2177488 | Received 22 Aug 2022, Accepted 02 Feb 2023, Published online: 23 Feb 2023

ABSTRACT

The human gut virome has been increasingly explored in recent years. However, nearly all virome-sequencing efforts rely solely on fecal samples and few studies leverage multiomic approaches to investigate phage–host relationships. Here, we combine metagenomics, metaviromics, and metatranscriptomics to study virome-bacteriome interactions at the colonic mucosal-luminal interface in a cohort of three individuals with inflammatory bowel disease; non-IBD controls were not included in this study. We show that the mucosal viral population is distinct from the stool virome and houses abundant crAss-like phages that are undetectable by fecal sampling. Through viral protein prediction and metatranscriptomic analysis, we explore viral gene transcription, prophage activation, and the relationship between the presence of integrase and temperate phages in IBD subjects. We also show the impact of deep sequencing on virus recovery and offer guidelines for selecting optimal sequencing depths in future metaviromic studies. Systems biology approaches such as those presented in this report will enhance our understanding of the human virome and its interactions with our microbiome and our health.

Summary

The human gut hosts many bacteria-infecting viruses, also known as phages. These phages interact with other gut microbes and affect human health, yet they are understudied. We advance the study of human gut phages on three frontiers by studying gut viruses in three patients with bowel diseases. First, nearly all phage studies use fecal sampling; however, these phages may be transient and are less relevant to our health. By obtaining samples during colonoscopy, we studied phages from the colonic wall and show that they are different than those found in our stool. Second, we use RNA sequencing to explore how phages interact with bacteria within a complex microbial ecosystem. By applying these new techniques, we advance our study from “what phages are there?” to “what are they doing?” Finally, this study reports one of the most extensive per-sample sequencing efforts to date, allowing us to make practical recommendations for other researchers aiming to design phage studies. Further studies will expand our understanding of our gut phages and how they interact with human health.

Introduction

The human gut virome, which encompasses the vast and diverse collection of viruses in our gastrointestinal tract, has been linked to several human diseases including inflammatory bowel disease, cancer, and diabetes, and may impact the efficacy of microbiome-modulating therapies.Citation1–3 Our gut viromes are highly individualized, temporally stable, and consist mainly of double-stranded DNA bacteriophages from the Caudovirales order.Citation4

Virome research has lagged behind its bacterial counterpart due to difficulties in sample processing, unannotated viral “dark matter”, and high viral heterogeneity. Yet recent advances have led to the exploration of virome assembly and development in neonates, a significant expansion of viral databases, and the characterization and isolation of several crAss-like phages which were only first described in 2014.Citation5–9 Most studies rely either on the sequencing of virus-like particle (VLP) enriched metagenomes, herein referred to as metaviromes, or the identification of viral sequences within whole metagenomes (also referred to as bulk or whole-community metagenomes). Each approach captures a distinct subset of the virome community and, when used together, can increase the recovery of viral species and provide additional context about viral populations.Citation6,Citation10,Citation11 Paired metagenomic and metaviromic sampling also enables the study of interactions between viral species and their bacterial hosts.Citation10 Efforts to integrate other experimental technologies, including metatranscriptomics and metaproteomics, are also evolving, bringing a systems biology approach to viromics and its transkingdom interactions.Citation1,Citation12

To date, metatranscriptomic approaches have been rarely applied in the context of human metaviromics. This may be due to the additional technical challenges of obtaining high-quality RNA, the lack of procedures to enrich for viral transcripts, and limited viral sequence databases. Oral viromes have been investigated using RNA-seq; one study was able to map 30% of the reads to viral populations, but lacked the sequencing depth for further downstream analysis such as assembly.Citation13 Data mining efforts in the past few years have also identified ssRNA phages in activated sludge and aquatic environments, significantly expanding the number of known ssRNA phage genomes.Citation14 Due to the challenge of studying viromes at the community level, transcriptomic analysis has thus been primarily used for more focused phage-host studies such as phage profiling of Salmonella and Yersinia,Citation15,Citation16 and more recently ΦcrAss001 and its host Bacteroides intestinalis.Citation17

Furthermore, most of our understanding of the “gut” virome is based on fecal viromes, which, like the bacteriome, likely do not represent the complex biogeography of our gastrointestinal tract.Citation18 A recent profiling of gastrointestinal tract viromes in the domestic pig and rhesus macaque show that virome composition was specific to anatomical region, with differences in viral load and diversity between luminal and mucosal samples.Citation19 The mucosal virome is hypothesized to have different ecological pressures than the primarily luminally derived stool virome and may have greater influence over the host-associated bacteriome.Citation3,Citation20–22 Rectal samples have been used to study viromes in patients with ulcerative colitis while a recent study utilized colon resections and ileostomy fluid.Citation23,Citation24 These human studies, however, did not compare the virome across multiple sites or compare colonic sampling to fecal viromes.

We recently demonstrated the use of mucosal-luminal interface samples to study the gut virome at specific sites within the gastrointestinal tract.Citation11 This sampling technique also provides sufficient microbial content for matched whole metagenomic and metatranscriptomic shotgun sequencing. Here, we leverage these advantages to demonstrate a multiomic spatial characterization of the virome in three treatment-naïve, pediatric patients with ulcerative colitis, an inflammatory bowel disease. We also highlight the power of matched metatranscriptomic sequencing to reveal viral gene transcription to better understand virome-bacteriome interactions. We did not seek to compare the IBD or ulcerative colitis virome to non-IBD controls in this study.

Results and Discussion

Multiomic sequencing of the mucosal-luminal interface microbiome

Samples were obtained from three pediatric patients at the Children’s Hospital of Eastern Ontario in Ottawa, Canada (). These participants (hereafter referred to as Participants F, G, and H) were treatment-naïve patients undergoing colonoscopy for confirmation of their clinically-suspected ulcerative colitis. Washes of the mucosal-luminal interface (MLI) from the proximal colon (PC) and distal colon (DC) were collected, along with a stool (STL) sample that was obtained between 2 days prior to endoscopy and 20 days after endoscopy. All samples were processed for metavirome and whole metagenome sequencing as previously described.Citation11 MLI samples were also subjected to metatranscriptome sequencing; H-DC had insufficient quality after library preparation and was not sequenced. After removal of low-quality, low-complexity, and host reads, a mean of 211 (ranging from 157 to 254) million paired-end metavirome reads were obtained per sample, providing one of the deepest metavirome sequencing efforts to date. Whole metagenome sequencing yielded an average of 75.4 (6.83–162) million paired-end reads per sample while metatranscriptomic sequencing yielded an average of 119 (107–139) million reads per sample.

Table 1. Participant and sample descriptions.

As noted previously, whole metagenome sequencing at the mucosal-luminal interface is prone to high host contamination at inflamed sites (Supplementary Figure 1A).Citation11 Host contamination has been suggested as a possible biomarker for inflammation, reflecting epithelial and blood cells that are more likely shed in intestinal diseases.Citation25 In this study, all aspirates taken from sites of Mayo Score 2, which is indicative of moderate disease, had >69.8% host contamination.Citation26 In contrast, aspirates from non-inflamed sites (Mayo 0) or mildly inflamed sites (Mayo 1) had no more than 3.7% host contamination. Similarly, host content in the stool metagenomes was markedly increased if Mayo Grade 2 inflammation in the proximal or distal colon was observed. These results suggest that moderate inflammation (Mayo 2) which includes erosions, complete loss of vascular pattern, and significant erythema but not mild inflammation (Mayo 1) is correlated with increased host content. Furthermore, both MLI and stool sampling are affected by mucosal inflammation.

Bacterial contamination in the metavirome sequencing data was assessed by aligning reads against a cpn60 housekeeping gene database.Citation27 Metavirome samples had a mean 172-fold decrease of mapped cpn60 reads compared to their whole metagenomic counterparts (Supplementary Figure 1B), demonstrating efficient viral purification in this dataset consistent with prior metavirome studies.Citation11,Citation28

Viral contig identification across multiomic datasets

High-quality, host-removed sequencing reads were assembled per sample and clustered (Supplementary Figure 2). Using Cenote-Taker2, we identified 1,880 putative viral contigs (VCs) from the metavirome, and an additional 785 VCs from the whole metagenome, including 242 pruned sequences with flanking host regions. These putative VCs were pooled and further clustered, resulting in a set of 2,171 VCs, of which 330 were present in both the metagenome and metavirome datasets. Most VCs (1,499) were only assembled from the metavirome, while 342 were only found in the metagenome. By integrating metagenomics and metaviromics, we could infer bacteriophage lysogeny by the presence of bacterial flanking regions. VCs were defined as integration-capable (IC-VCs) if their cluster included a metagenome-derived VC with pruned flanking regions,Citation9 or a virome-derived VC that could be clustered within a non-viral contig (296 of 2,171 VCs). These 296 IC-VCs were linked to 309 non-viral contigs representing their host genomes. Of these IC-VCs, 105 were only identified from the metagenome and were thus likely inactive (i.e. prophages), while the remaining 191 were also present in the metavirome, suggesting simultaneous prophage activation as temperate phages. The other 1,875 VCs are likely primarily composed of free phages but could also include pseudolysogenic phages; here, we define these phages as non-integrated VCs (NI-VCs). While NI-VCs may be capable of phage integration, they were not detected as prophages in our study.

Viral contigs were assessed for quality using CheckV and annotated using Demovir and vContact2.Citation29–31 94.5% of VCs (2,051/2,171) could be annotated at the viral order level () and primarily represented the order Caudovirales (n = 1,892). The remaining VCs were mainly Microviridae and Circoviridae, which were predominantly identified from metavirome sequencing. Non-viral contigs in the metagenome (n = 39,735) were taxonomically annotated and subsequently used for phage-host analysis. High-quality sequencing reads across all three datasets were mapped to the set of viral and non-viral contigs; counts were normalized for contig length, resulting in a normalized relative abundance (NRA). In all, 86.4–98.3% of metagenome reads, 67.3–97.6% of transcriptome reads, and 88.9–98.2% of metavirome reads mapped to a contig; 4.74–11.8%, 0.96–2.52%, and 52.9–91.8% of reads mapped to viral contigs, respectively. Metagenome and metaviromic datasets were further filtered for a minimum breadth of coverage of 75% with ≥1 read per bp for subsequent analysis.

Table 2. Viral genomes by viral family. A total of 2,171 viral genomes were identified from metavirome and metagenome assemblies. The table below shows their annotations by source.

VCs are plotted in by viral order, quality, contig length, maximum observed NRA in the metavirome, and dataset identified (i.e. metavirome, metagenome, or both). VCs identified in both the metavirome and metagenome were more likely to be longer (p = 6.52e-11), abundant (p < 2e-16), and less likely to be low-quality (p < 2e-16). Thus, co-presence in both datasets could be used as a marker for high-quality VCs. Caudovirales contigs were significantly longer than those belonging to other viral orders (mean length of 24.8 kb vs. 5.3 kb, p < 2e-16 by Wilcoxon rank sum test) which reflects the smaller genome sizes of Microviridae and Circoviridae species; this difference was even greater when only including complete and high-quality VCs (48.9 kb vs. 4.5 kb, p < 2e-16). Lower-quality viral contigs were also significantly more likely to be shorter. As viral databases continue to grow, we recommend including the type of source dataset (i.e. VLP-enriched vs. whole metagenome) as key annotations alongside taxonomy and quality to help evaluate the influence of sequencing methodologies on human virome profiles.

Figure 1. Viral contig annotation and alpha-diversity across metagenomic, metatranscriptome, and metaviromic datasets. (a) 2,171 viral contigs (VCs) were identified across metaviromic and whole metagenomic assemblies and are plotted by: dataset of origin, viral contig quality, maximum normalized relative abundance (NRA) in the metavirome in any given sample, contig length, and viral order. (b) NRA of viral contigs across multiomic datasets for each participant’s samples, annotated by source, taxonomy, and quality. (c) Boxplots of Chao1, Shannon diversity and population variance by multiomic dataset. *p-value < 0.05; **p < .001. DC: distal colon, PC: proximal colon, STL: stool.

Figure 1. Viral contig annotation and alpha-diversity across metagenomic, metatranscriptome, and metaviromic datasets. (a) 2,171 viral contigs (VCs) were identified across metaviromic and whole metagenomic assemblies and are plotted by: dataset of origin, viral contig quality, maximum normalized relative abundance (NRA) in the metavirome in any given sample, contig length, and viral order. (b) NRA of viral contigs across multiomic datasets for each participant’s samples, annotated by source, taxonomy, and quality. (c) Boxplots of Chao1, Shannon diversity and population variance by multiomic dataset. *p-value < 0.05; **p < .001. DC: distal colon, PC: proximal colon, STL: stool.

Viral contigs across metagenomic, metaviromic, and metatranscriptomic datasets

Viral contigs represented 75.1% of the metavirome (by NRA), 4.85% of the metagenome, and 1.90% of the metatranscriptome (). The 2.5-fold decrease from the metagenome to the metatranscriptome suggests that the virome may be less transcriptionally active than the bacteriome, consistent with hypotheses of a predominantly temperate phage-host dynamic in the intestinal microbiome.Citation3,Citation32

While only 15.8% (342/2,171) of VCs were identified from both the metagenome and metavirome, these VCs were highly abundant, representing an average of 36.4%, 35.6%, and 73.1% of the observed viral community in the metagenomic, metatranscriptomic, and metaviromic datasets by NRA, respectively. The metagenome also contained metavirome-derived VCs that were not identified from the whole metagenomic assemblies. This finding may be due to improved assembly of viral contigs in VLP-enriched samples (with greater sequencing depth per VC), and due to different thresholds employed by Cenote-Taker2 when identifying viral assemblies in whole metagenomes as compared to VLP-enriched metaviromes, the former being more stringent to reduce false positives. It remains pertinent to consider the impact of bioinformatic tools and approaches in the identification of viral sequences,Citation33,Citation34 where multiomic approaches could be utilized to corroborate and contextualize viral contig identification.

Viral contigs of the order Caudovirales represented the most abundant viruses of the metagenome and metatranscriptome while other viral taxa, primarily Microviridae, were enriched in the metavirome. Random displacement amplification, a common step employed in virome sequencing efforts, including this study, has been reported to introduce positive bias for small, circular ssDNA viruses such as Microviridae.Citation35,Citation36 However, the patterns of ssDNA virus enrichment in our dataset appear to be participant-specific: all virome samples from participant G were Caudovirales-dominant while all samples from participant F and H were Microviridae-dominant. We have also previously shown that this virome sequencing protocol does not interfere with relative quantification of a spike-in phage in mucosal-luminal interface samples.Citation11 Thus, the expansion of Microviridae in participants F and H is likely suggestive of a true relative enrichment of these viruses in comparison to participant G.

We also observed that metaviromic reads were more likely to map to complete or high-quality viral contigs compared to metagenomic or metatranscriptomic reads. This supports the use of CheckV annotations to infer viral genome completeness. Lower quality genomes may include incomplete prophages, chimeric assemblies, or low abundance phages that lacked sufficient coverage for full genome assembly. We thus chose to use these CheckV quality thresholds when examining viral gene transcription and prophage induction as described in subsequent sections.

Next, we assessed the virome’s alpha-diversity as observed across multiomic datasets; read counts to VCs were rarefied for these calculations. The Chao1 index, which evaluates species richness, was significantly lower in the metatranscriptome (p = .0015, Wilcoxon rank sum test), suggesting that only a subset of viruses are transcriptionally active (). The similar Chao1 indices between metagenome and metavirome suggest that both methods capture a similar number of VCs per viral sequencing read, as has been previously reported.Citation6 However, a metagenome sample would require a greater sequencing depth of 1–2 orders of magnitude to be comparable to metaviromic sequencing. The Shannon index and population variance show that the viral community as seen in the metavirome is less evenly distributed than in the metagenome (p = .00012) or metatranscriptome (p = .0015); the Shannon index also shows that the metatranscriptome population is less evenly distributed than the metagenome (p = .0015). The reduced evenness in these datasets may be due to prophage activation, virulent phage replication, or virulence factors, that enable select viruses to become highly abundant in the metatranscriptome and metavirome, including the Microviridae-dominance seen in select virome samples.

In all, we demonstrate that each sequencing method provides a distinct snapshot of the human gut virome. With the majority of virome studies utilizing either metagenomic or metaviromic approaches, it is important to recognize each technology’s biases on the observed taxonomy, quality, and community dynamics of identified viral populations, and to utilize multiomic approaches where possible to capture the true complexity of these viral communities.

The virome at the colonic mucosal-luminal interface is distinct from the stool virome

Several studies have investigated differences between the luminal and mucosal bacteriome along the length of the gastrointestinal tract.Citation18,Citation37,Citation38 Different ecological pressures would therefore be expected to impact the intestinal viral community along both radial and longitudinal dimensions, though there have been very few efforts for multi-site, or spatial, characterization of the human virome. Here, we provide a direct comparison between the proximal colon MLI, distal colon MLI, and stool (). Among the three participants, 24.6% to 40.4% of VCs were identified at all three sites, representing a shared viral community throughout an individual’s colonic mucosa and lumen. With 32.7 to 61.0% of VCs unique to MLI samples and 9.6 to 23.6% of VCs unique to stool, multiple site sampling substantially expands an individual’s known intestinal virome. Bray-Curtis dissimilarities between these samples show the MLI samples of participants G and H clustering apart from the stool, while participant F, the only patient to have pancolitis, has all three samples clustering together (). Multi-site sampling allows for an increased spatial resolution to understand viral-host interactions, especially in conditions like inflammatory bowel disease.

Figure 2. Beta diversity of colonic and stool metavirome communities. (a) Euler diagrams showing the number of shared viral contigs (with a minimum of 75% of breadth of coverage) between sampling sites. (b) Principal coordinate analysis showing Bray-Curtis dissimilarities of metavirome communities. (c) Bray-Curtis dissimilarities of viral and non-viral metagenomic communities between MLI and STL samples, plotted by intra-individual and inter-individual comparisons. *p-value < 0.05. **all inter-individual metagenome communities had a significantly lower Bray-Curtis distance than inter-individual metavirome communities (p < .01). DC: distal colon, PC: proximal colon, STL: stool, MLI: mucosal luminal interface.

Figure 2. Beta diversity of colonic and stool metavirome communities. (a) Euler diagrams showing the number of shared viral contigs (with a minimum of 75% of breadth of coverage) between sampling sites. (b) Principal coordinate analysis showing Bray-Curtis dissimilarities of metavirome communities. (c) Bray-Curtis dissimilarities of viral and non-viral metagenomic communities between MLI and STL samples, plotted by intra-individual and inter-individual comparisons. *p-value < 0.05. **all inter-individual metagenome communities had a significantly lower Bray-Curtis distance than inter-individual metavirome communities (p < .01). DC: distal colon, PC: proximal colon, STL: stool, MLI: mucosal luminal interface.

Consistent with other studies demonstrating personalized microbiomes and viromes, inter-participant metagenome and metavirome diversity was significantly greater than between intra-participant sites (). Bray-Curtis dissimilarities of the non-viral metagenomic community demonstrated reduced distances as compared to the virome (p ≤ .009) indicating that the virome is more individualized than the metagenome. Within the same patient, Bray-Curtis distances between the proximal and distal colon samples were smaller than those between the colonic mucosa and stool (p = .037), suggesting that despite heterogeneity in local inflammation (such as a non-inflamed proximal colon and an inflamed distal colon), the colonic mucosa provides a common niche that is distinct from the luminal, stool virome. We hypothesize that bacteriophage adherence to mucin, facilitated by interactions between mucin glycoproteins and phage capsid proteins, supports the development and persistence of a mucosa-associated virome.Citation39,Citation40 Our data also demonstrated a trend for greater species richness (i.e. Chao1 index) at both MLI sites in comparison to stool in each participant. Larger studies are required to further investigate these observations and to assess whether these findings extend beyond pediatric subjects with inflammatory bowel disease. Characterizing the mucosa-associated virome will expand the known human virome while enabling the study of microbial communities that are more likely to interact with human health and intestinal diseases.

Viral contig transcription and prophage activation in the metatranscriptome

A total of 59,217 open reading frames (ORFs) were predicted across the set of 2,171 VCs, which were further clustered into 28,112 viral genes and annotated using a combination of five protein family databases: two viral orthologous group databases VOGDB and pVOG,Citation41,Citation42 and three general-purpose databases PFAM, KEGG, and TIGRFAM.Citation43–45 This approach annotated 9,646 (34.3%) viral genes, with PFAM annotating the most viral genes (Supplementary Figure 3A). However, VOG and pVOG annotations were more effective at annotating abundant viral genes detected in the metavirome, supporting the use of more tailored viral databases (Supplementary Figure 3B). We thus assigned viral functions in the order of the databases listed above, with VOGDB being the primary database used in this study.

To assess viral gene transcription, we focused on the subset of complete and high-quality VCs (580/2,171). We classified these VCs as “present” in a sample’s metagenome and/or metavirome if detected with a minimum breadth of coverage of 75%, or as “transcriptionally active” (TA) if they had a minimum NRA ≥0.0001% in the sample’s metatranscriptome. These TA-VCs, highlighted in , were more likely (p = .016) to be present in both metagenome and metavirome datasets (64.8%) than non-transcriptionally active VCs (17.2%). We hypothesize that TA-VCs which are present in all three multi-omic datasets reflect the host-dependent production process of free viral particles. TA-VCs were less likely to be derived solely from the metavirome, which could represent viral particles without available hosts including transient, luminal VLPs.

Figure 3. Transcriptionally active viromes of the colonic mucosal-luminal interface. (a) Bar plot showing all complete and high-quality VCs by metatranscriptome activity (minimum abundance of 0.0001%), further annotated by viral contig source, taxonomy, phage integration, and integrase presence. (b) Scatter plots highlighting transcriptionally active VCs by their NRA in the metagenome, metavirome, and metatranscriptome. (c) Bar plot showing annotated ORFs of transcriptionally active VCs present in ≥30 instances (i.e. counted for every VC and every sample). Transcribed ORFs (i.e. present in that sample’s transcriptome) are counted on the right side of the plot.

Figure 3. Transcriptionally active viromes of the colonic mucosal-luminal interface. (a) Bar plot showing all complete and high-quality VCs by metatranscriptome activity (minimum abundance of 0.0001%), further annotated by viral contig source, taxonomy, phage integration, and integrase presence. (b) Scatter plots highlighting transcriptionally active VCs by their NRA in the metagenome, metavirome, and metatranscriptome. (c) Bar plot showing annotated ORFs of transcriptionally active VCs present in ≥30 instances (i.e. counted for every VC and every sample). Transcribed ORFs (i.e. present in that sample’s transcriptome) are counted on the right side of the plot.

Most TA-VCs were Caudovirales (96.0%, compared to 64.2% of non-transcriptionally active VCs, p = .018) with only a small minority of Microviridae detected in the metatranscriptome. This observation could suggest that the non-Caudovirales viruses are more likely to be transient while also reflecting the sequencing bias of small circular genomes in commonly used metavirome sequencing efforts, though future multiomic studies with greater sample sizes are required to further explore these findings. TA-VCs were also more likely to be observed with flanking host regions (47.0% vs. 7.5%, p = .0095) and contain an integrase (62.4% vs. 27.8%, p = .016); however, the latter observation was no longer significant when excluding non-Caudovirales VCs, which rarely contain an integrase.

Using metagenomic, metaviromic, and metatranscriptomic sequencing data, we plotted virome transcription profiles of each sample (). The NRA of viral contigs in the metatranscriptome was most correlated with their NRA in the metagenome (Pearson correlation = 0.395, p = 3.9e-12). The NRA of viral contigs in the metavirome was also positively correlated with their NRA in the metatranscriptome (Pearson correlation = 0.307, p = 1.1e-7), suggesting that observed increases in viral transcription may translate to increased VLP production. The NRA of VCs in the metavirome was also correlated with their NRA in the metagenome (Pearson correlation = 0.136, p = .021); the relatively weaker correlation between the metagenome and metavirome once again demonstrates the differences between whole metagenome and VLP-enriched sequencing approaches. The similarity between colonic sites is further visualized in Supplementary Figure 3C, reflecting the inter-participant variability and intra-participant stability of the virome as described in the previous section. More samples are required to investigate differences along the gastrointestinal tract and to compare viromes between subjects with and without IBD, while longitudinal studies will provide further insight into temporal variations in the virome’s transcriptomic activity.

Within the subset of complete and high-quality TA-VCs, we also examined the metatranscriptome dataset at the feature level. The most frequently occurring genes are shown in , led by repressor protein cI (n = 227), integrase (207), and packaging protein 1 (148). Gene transcription varied from 39% to 95%, with several genome processing genes (DNA helicase uvsW, 89%; integrase, 84%; reverse transcriptase 82%) showing higher transcription than structural proteins (major capsid proteins, 49–73%; baseplate proteins, 45–57%). While inferring population-level trends remains difficult given significant viral heterogeneity, we hypothesize that this pattern reflects the increased presence of early-stage bacteriophage genes, many of which are supportive of phage lysogeny, as compared to late-stage genes. The greater transcription of repressor protein cI (81%) compared to the anti-repressor protein ant (49%) may also provide further evidence of an overall temperate viral community.

Two abundant crAss-like phages identified at the MLI

Two circular Caudovirales VCs, V014264 (97.91 kb) and V016904 (99.54 kb), had high similarity to existing delta crAss-like phages: ERR844016_ms_1 and ERR844030_ms, respectively.Citation7 The comparative phage genome tool VIRIDIC showed that each VC had 95.3% and 89.9% identity with their aligned crAss-like phage, 15.3% with each other, and less than 2% with the first isolated crAssphage.Citation46,Citation47 These were the only two VCs in our dataset that contained all three markers (portal, TerL, major capsid protein) utilized in a recent profiling of crAss-like phages.Citation48 While crAss-like phages have been observed in up to 90% of metavirome samples,Citation47 the rarity of these phages in our IBD-only dataset could reflect a recent study which showed a depletion of crAss-like viruses in patients with inflammatory bowel disease.Citation48 The prevalence and diversity of crAss-like phages also appear to increase with age;Citation49 here, we profiled adolescents, a rarely studied demographic within virome studies. A larger dataset is required to better capture the prevalence of crAss-like phages in the human colonic mucosa, including in the context of human disease.

Both crAss-like phages were highly abundant in the proximal and distal colon metaviromes of participant G (Supplementary Figure 4), with V014264 being the third most abundant VC (NRA of 4.82%) in the proximal colon. We did not detect any single-nucleotide polymorphisms in either VC between the two sites.Citation50 These crAss-like phages were detected in the metagenome and were among the subset of complete and high-quality, transcriptionally active VCs (highlighted in Supplementary Figure 3C). Both crAss-like phages were also present at very low abundance (< 2 × 10Citation4 %) in the distal colon metavirome of patient H.

While abundant in MLI metaviromes, both crAss-like phages were nearly undetectable in the stool metavirome of participant G (2.6 x 10−6% for V014264; 1.4 × 10Citation7 % for V016904). These reads were below our filtering thresholds and exceeded a million-fold decrease from the MLI samples. At a more conventional sequencing depth, these crAss-like phages would be effectively undetectable by stool sampling. Both VCs were among the 61.0% of MLI-specific VCs detected in participant G (), demonstrating the importance of distinct viral niches across our complex biogeography.

Multiomic coverage maps of V014264 and V016904 are shown in and Supplementary Figure 5. Functional annotation of the crAss-like phages was performed as described above using Prokka and several viral and general-purpose databases, yet this only annotated 11.9% and 2.4% of ORFs on V014264 and V016904, respectively (Supplementary Figure 6A). Utilizing a curated set of crAss-like phage protein families described in Yutin, et al.,Citation51 we were able to increase the proportion of annotated reads to 39.6% and 32.1%, albeit including limited annotations such as “uncharacterized protein of delta crassfamily delta group phages.” Metatranscriptomic sequencing reads, when mapped to forward and reverse strands, corroborated the orientation of the predicted crAss-like phage genes and enabled gene-level analysis (). For V014624, 55.4% of the contig was transcribed in the proximal colon while 70.7% was transcribed in the distal colon (i.e. with a minimum sequencing depth of 1). Between the two sites, metavirome map depths were highly correlated (Spearman correlation = 0.983), compared to 0.585 in the metagenome and 0.504 in the metatranscriptome, suggesting that the transcription profiles of these crAss-like phages were similar in both the non-inflamed proximal colon and inflamed distal colon.

Figure 4. Multiomic sequencing map of V014624, a crAss-like phage identified in the colonic mucosal-luminal interface. Open reading frames (ORFs) were predicted over the 97.9 kb genome and colored by forward (green) or reverse (purple) orientation. Annotated ORFs are shaded in gray and labeled (excluding ‘uncharacterized’ or ‘hypothetical’ proteins). Metagenome, metavirome, and metatranscriptome sequencing depths at the proximal colon (top) and distal colon (below) are plotted using a 151 bp sliding window; metatranscriptome reads are also mapped by strand. Another crAss-like phage, V016904, is shown in Supplementary Figure 5.

Figure 4. Multiomic sequencing map of V014624, a crAss-like phage identified in the colonic mucosal-luminal interface. Open reading frames (ORFs) were predicted over the 97.9 kb genome and colored by forward (green) or reverse (purple) orientation. Annotated ORFs are shaded in gray and labeled (excluding ‘uncharacterized’ or ‘hypothetical’ proteins). Metagenome, metavirome, and metatranscriptome sequencing depths at the proximal colon (top) and distal colon (below) are plotted using a 151 bp sliding window; metatranscriptome reads are also mapped by strand. Another crAss-like phage, V016904, is shown in Supplementary Figure 5.

Considering all transcribed genes, V016904 had a higher mean gene transcription ratio (transcriptome/metagenome) than V014264 (0.62 vs. 0.23, p = 4.3e-15), while VC16904 was significantly more transcribed in the proximal colon than the distal colon (0.80 vs. 0.44, p = .00078) (Supplementary Figure 6B). The most transcribed gene clusters are highlighted in Supplementary Figure 6C, with a GIY-YIG endonuclease and major capsid protein among the highest transcribed crAss-like phage genes in this dataset. We also note the presence of highly transcribed regions without annotated ORFs (i.e. a 500 bp region at ~44.5 kb) on V014264, that could be involved in some form of regulatory role to support the stable relationship between crAss-like phages and their hosts.Citation17,Citation52 Despite significant improvement with targeted databases, virome studies continue to be limited by poor viral genome annotation. As these databases continue to grow with our understanding of bacteriophage function, metatranscriptomics have great potential in capturing the in vivo functional capabilities of our human viromes.

Host-phage prediction and transcription

While there are multiple methods to predict bacterial host-phage relationships, we utilized two approaches that leveraged our paired metavirome and whole metagenomic sequencing (). First, we used a probabilistic modeling approach (WiSH) to assign VCs to their most likely host within our set of whole metagenome-assembled, non-viral contigs. After applying a false-discovery rate adjusted p-value threshold of 10Citation5, 855 (39.4%) VCs were assigned a host contig. Nearly all VCs assigned a host were Caudovirales (). Few Circoviridae (2/86) and Microviridae (1/53) were assigned hosts. While ssDNA viruses have shorter genomes (Microviridae and Circoviridae have a mean length of 5.2 kb and 4 kb in our dataset, respectively) and there is a correlation between VC length and host assignment, short genomes do not preclude host assignment: 25.8% of all our host-assigned VCs are less than 6 kb. Moreover, ssDNA viruses have been reported to be less adaptive to their host genomes and have a higher mutation rate,Citation4,Citation53,Citation54 which could limit our ability to assign their hosts. Additionally, ssDNA virus enrichment due to random displacement amplification could result in a lack of host-assignment if its corresponding bacterial genome was below our sequencing or contig assembly detection threshold. Finally, three Phycodnaviridae VCs were assigned a bacterial host, which could represent a false prediction or an incorrect annotation, as Phycodnaviridae are known to infect algae rather than bacteria.

Figure 5. Phage-host prediction inferred from metavirome and whole metagenome assemblies. Phage host relationships as calculated using WiSH (A,C,E,G) or inferred by the presence of non-viral flanking regions of VCs (B,D,F,H). (a-b) Bar plots of all VCs by source, viral order, and their host assignment status. (c-d) VC-host or IC-VC–host combinations by viral and bacterial order. (e-f) Spearman correlations of the NRAs of matched VC-host or IC-VC host pairs, as compared to unmatched pairs. The dashed line indicates the median. (g-h) Scatter plots of transcription (i.e. metatranscriptome/metagenome NRA), with the line and formula representing the linear model and Pearson correlation between VC and host transcription.

Figure 5. Phage-host prediction inferred from metavirome and whole metagenome assemblies. Phage host relationships as calculated using WiSH (A,C,E,G) or inferred by the presence of non-viral flanking regions of VCs (B,D,F,H). (a-b) Bar plots of all VCs by source, viral order, and their host assignment status. (c-d) VC-host or IC-VC–host combinations by viral and bacterial order. (e-f) Spearman correlations of the NRAs of matched VC-host or IC-VC host pairs, as compared to unmatched pairs. The dashed line indicates the median. (g-h) Scatter plots of transcription (i.e. metatranscriptome/metagenome NRA), with the line and formula representing the linear model and Pearson correlation between VC and host transcription.

VCs identified in the metagenome were also more likely to be assigned a bacterial host than those identified solely in the metavirome (). We hypothesize that whole metagenome sequencing enables the identification of prophages and intracellular phage particles within their host bacteria that may be missed by metavirome sequencing, while VLP-enriched methods may capture transient VLPs that are unrelated to the resident bacteriome. shows the breakdown of host-viral pairs by the predicted VC family and bacterial host contig order, with most VCs assigned to bacteria in the order Clostridiales.

Our second approach for studying virome-bacteriome interactions involved our subset of IC-VCs with flanking host regions. These 296 IC-VCs were associated with 309 unique host contigs, with 320 unique VC-host pairs (23 VCs were present in 2 or 3 host contigs; 11 host contigs contained 2 VCs). Like WiSH-assigned VCs, IC-VCs were nearly all Caudovirales and more likely to be derived from the metagenome or both the metagenome and metavirome (). IC-VC hosts were also most commonly of the bacterial order Clostridiales, though compared to WIsH assignments, Bacteroidales hosts were proportionally increased from 7.3% to 18.1% ().

Spearman correlations were calculated between the NRA of each VC in the metavirome and the NRA of its predicted host in the metagenome across the nine paired datasets (). The NRA of both WiSH-assigned VC-host pairs and IC-VC-host pairs were significantly (p < 2.2e-16) more likely to be positively correlated in comparison to the NRAs of unmatched pairs, with median Spearman correlations of 0.325 and 0.713, respectively. Viral contig transcriptional enrichment (i.e. NRA of contig in the metatranscriptome/metagenome) also positively correlated with host contig transcriptional enrichment (). These results are indicative of phage-host coexistence in the mammalian gut mucosa, and suggests that the transkingdom equilibrium as demonstrated between ΦcrAss001 and its host B. intestinalis may also be reflected at the community level.Citation17 The gut mucosa is thought to enable a heterogenous distribution of phages and their bacterial hosts that supports their co-existance.Citation55 In our small IBD cohort, this apparent co-existence is maintained despite the presence of local host inflammation, though the impact of inflammation on virome-bacteriome interactions at the MLI requires further study.

The presence of an integrase is a weak predictor for observed viral lysogeny

Phage integrases are viral enzymes that facilitate site-specific recombination, allowing for the integration of a viral genome into its host. Many model bacteriophages, such as Escherichia virus Lambda, require an integrase for lysogeny.Citation56,Citation57 Additionally, integrases tend to be relatively prevalent in virome datasets including this study, where integrases were the most common viral gene annotation overall (second-most common in complete and high-quality VCs). Integrases have therefore been commonly employed as prophage markers and as indicators of a temperate lifestyle.Citation4,Citation53,Citation58,Citation59

Among our 2,171 VCs, we identified 603 integrase genes (591 VOG00035 “Integrase”, 10 pVOG0275 “integrase”, 1 VOG11667 “DDE-type integrase/transposase/recombinase”, and 1 PF13495 “Phage integrase, N-terminal SAM-like domain”) on 497 VCs. All identified PFAM-annotated integrases (PF00589), commonly used for integrase identification, were encompassed within the VOG00035 integrases. These integrases were present on 25.8% of Caudovirales VCs (489/1892) and absent in all other viral orders. Longer and higher-quality VCs were more likely to contain an integrase gene. We thus focused on the subset of complete and high-quality Caudovirales VCs, of which 54.9% (219/399) contained an integrase.

IC-VCs were more likely to contain an integrase (69.6%, 71/102) than non-integrated VCs (49.8%, 148/297); this finding was independent of contig length. While this result is expected, the majority of integrase-containing VCs were only detected as free phages, and thus suggests that many NI-VCs may be temperate phages observed only in their lytic cycle. We therefore investigated the transcription of all common gene families including integrases (present in 10% or more of high-quality or complete Caudovirales VCs) for their association with observed phage integration ().

Figure 6. Presence and transcription of common viral ORFs including lysogenic proteins on Caudovirales VCs. (a) The frequency of common viral genes (≥10%) present on complete and high-quality Caudovirales VCs based on contig source and between IC-VCs and NI-VCs. ORF counts were normalized by contig length. (b) Integration likelihood ratios of based on presence of each gene family. (c) Proportion of genes transcribed in IC-VCs and NI-VCs, sorted by difference.

Figure 6. Presence and transcription of common viral ORFs including lysogenic proteins on Caudovirales VCs. (a) The frequency of common viral genes (≥10%) present on complete and high-quality Caudovirales VCs based on contig source and between IC-VCs and NI-VCs. ORF counts were normalized by contig length. (b) Integration likelihood ratios of based on presence of each gene family. (c) Proportion of genes transcribed in IC-VCs and NI-VCs, sorted by difference.

Repressor protein cI (VOG06177) and integrase (VOG00035), both known to be involved in phage lysogeny, were by far the most common gene annotations. However, both were similarly present in IC-VCs and NI-VCs when normalized by contig length and had modest integration likelihood ratios of 1.11 and 1.31, respectively (). In comparison, the highest observed integration likelihood ratios were for ParB-like nuclease domain protein (2.77), packaging protein 1 (2.50), and insertion sequence IS21 putative ATP-binding protein (2.29). We thus suggest that the presence of an integrase gene alone is a weak predictor of a VC lysogeny, as it is neither a sensitive nor specific indicator of phage integration. Existing databases may limit our capacity to annotate the diverse array of enzymes that enable phage recombination, which may include various integrases, transposases, and recombinases.Citation57 Our own annotation sensitivity would have been further reduced if we had relied only on the commonly used PFAM integrase family PF00389, which missed 15.9% of integrase genes identified with VOG00035. Successful identification of flanking host regions is also dependent on accurate metagenomic contig assembly and subsequent annotation, and thus limits the utility of using these regions as a gold standard for lysogeny.

Given that the presence of integrase and repressor protein merely infers lysogenic potential but may not clearly differentiate phages in their lytic or lysogenic cycle, we hypothesized that transcription would be a better indicator of phage lysogeny. We thus examined the transcription of these common viral genes (). Integrases and repressor proteins were both highly transcribed, which supports the overall lysogenic nature of the intestinal virome community. Both integrases and repressor protein cI were slightly more transcribed in VCs with flanking host regions (), increasing from 81.4 to 83.9% and 81.8% to 87.3%, respectively, which supports the hypothesis that some temperate prophages are being detected during a lytic cycle. Packaging protein 1, putative nuclease p44, and two uncharacterized proteins were most likely to be transcribed in IC-VCs (+14.3–29.3%), while several terminases and major capsid protein genes were more transcribed in NI-VCs, consistent with these phages being in their lytic phase. Improved viral annotation and further phage studies will assist with the prediction of phage lifestyle, especially since phage-host relationships outside of model organisms remain poorly understood. Notably, crAss-like phages display features of temperate phages including high-level long-term persistence without plaque formation in its host, yet both ΦcrAss001 and ΦcrAss002 do not contain lysogeny modules.Citation17,Citation60 We also did not detect integrases within crAss-like phages V014264 and V016904, though integrases were noted within a larger subset of crAss-like phages.Citation51 Ultimately, phage integration is likely among other lysogenic mechanisms including episomes, pseudolysogeny, phage carrier states, and other uncharacterized means, that are able to sustain stable viral populations.Citation61

We thus demonstrate the potential for integrating metagenomics, metaviromics, and metatranscriptomics to evaluate viral lysogeny in a cohort of pediatric, IBD subjects. Through metatranscriptomics, we revealed that lysogenic proteins are highly transcribed at the community level, while the presence and transcription of these proteins could be used to predict phage lysogeny at the genome level. We recommend against inferring phage lifestyle based on the presence of an integrase alone, and instead include additional factors such as integrase transcription, alternative integration-associated genes, and the presence of flanking host DNA. To help overcome limitations in viral gene databases, machine-learning classification tools and alignment-free K-mer frequency models are also being developed to assist in phage lifestyle prediction.Citation62,Citation63

Metatranscriptomic sequencing does not reveal a significant population of RNA phages

We also searched our metatranscriptomic data for RNA phages, which may be missed in DNA-based approaches for metagenomics and metaviromics. Predicted ORFs on metatransciptomic assemblies were searched against two protein profile databases: a set of RNA-dependent RNA polymerases (RdRps) and capsid genes curated for Cenote-Taker2, and a set of ssRNA marker genes that were used in a recent identification of 15,611 non-redundant environmental ssRNA phages.Citation1,Citation14 Both MLI samples in participant F and G contained an RdRp originally identified in a dsRNA human picobirnavirus.Citation64 There were no other RdRps or ssRNA phage markers detected in our metatranscriptomic dataset. Thus, like prior studies, we report a scarcity of RNA phages in the human virome, though these observations remain hampered by limited RNA phage databases and a small, IBD cohort with no control subjects.Citation65 Additionally, our metatranscriptome protocol, which like our metagenome protocol involves pelleting bacterial cells prior to DNA/RNA extraction, may exclude viral VLPs that remain in the supernatant.Citation11 Alternative methods to capture RNA-based VLPs would be required to assess their true contribution to the mucosa-associated virome.

The impact of sequencing depth on virus discovery

Virome sequencing has substantially improved over the past decade, from read counts in the 10,000s to over ten million paired-end (PE) reads. To investigate the impact of sequencing depth on virome discovery, we subsampled our metaviromic sequencing reads at increments spanning 10 k and 100 million PE reads to examine how sequencing depth affects VC identification (). Empirically, the yield for new contigs per million bases dropped from 38.6 VCs at 1 million reads to 9.4 VCs at 10 million reads to 2.3 VCs at 100 million reads (). Mean contig length peaked at 20 million PE reads (18.1 kb) but was overall stable between 17.1 and 18.1 kb above 10 million reads (). Maximum contig length appeared to stabilize at around 10 million reads but did continue to increase as the sequencing depth reached 100 million reads. At all sequencing depths above 100,000 reads, VC quality remained remarkedly stable, with 8.1 to 10.3% of all VCs annotated as “complete” by CheckV (). Taxonomic annotations of VCs also remained relatively stable at sequencing depths greater than 2 million PE reads (). This study is one of the deepest sequenced datasets to date and allowed us to explore the impact of sequencing depth on virome identification. We show that after about 10–20 million PE reads, there were no further gains in mean contig length, viral contig quality, or contig annotation. Thus, given the existing bioinformatic tools currently available, it is unlikely that extensive sequencing depth will ultimately result in a set of “complete genomes”. As such, it may be more practical for virome researchers to design experiments to maximize the information obtained from a particular sample, as the “holy grail” of assembling all phages that may be present within a complex virome community is likely unfeasible. We also found that increasing the read depth leads to a decrease in the percent contigs which could have an annotation assigned. While shorter DNA lengths may contribute to this, it is equally likely that rarer, low-abundant viruses are less likely to be characterized and are thus more difficult to classify.

Figure 7. Effects of sequencing depth on viral contig identification. The impact of sequencing depth on (a) number of VCs identified, (b) VC length, (c) contig quality, and (d) annotation. Error bars show standard deviation. HQ: high-quality; MD: medium-quality; LQ: low-quality; ND: not-determined.

Figure 7. Effects of sequencing depth on viral contig identification. The impact of sequencing depth on (a) number of VCs identified, (b) VC length, (c) contig quality, and (d) annotation. Error bars show standard deviation. HQ: high-quality; MD: medium-quality; LQ: low-quality; ND: not-determined.

Conclusions

In summary, we present an in-depth, multiomic spatial analysis of the human virome in a small cohort of pediatric patients with ulcerative colitis. We demonstrated that the mucosal-luminal interface virome was distinct from stool and were able to isolate two highly abundant, mucosa-associated crAss-like phages, highlighting the importance of studying the virome across our complex host biogeography. We then developed practical approaches to leverage metagenomics, metaviromics, and metatranscriptomics to study the virome at the community and contig level. Because each technique yields a different viral subset, multiomic approaches help to corroborate and contextualize observations while revealing biases that are present in metagenome or metavirome only studies. Viral contigs present across all three datasets were more likely to represent higher-quality Caudovirales genomes. Many of our observations were consistent with an overall temperate virome, including reduced virome representation in the metatranscriptome, co-existence of predicted phage-host pairs, and high transcription of lysogeny-associated genes. We also assessed integrases and the presence of flanking host DNA regions and show that metatranscriptomic analysis improves our ability to infer phage lysogeny.

While these findings were described in a pediatric IBD cohort and may not extend to the general population, we describe techniques that advance our ability to characterize the human virome. Providing further biogeographic resolution of our virome and employing multiomic approaches are two key frontiers in understanding how the virome interacts with the bacteriome in the context of human health and disease.

Patients and Methods

Resource Availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Alain Stintzi [email protected].

Materials availability

This study did not generate new unique reagents.

Date availability

The datasets generated during this study are available at NCBI PRJNA818303.

Experimental Model and Participant Details

Sample collection from pediatric patients was approved by the Research Ethics Board of the Children’s Hospital of Eastern Ontario (CHEO) in Ottawa, Canada with informed consent/assent obtained from parents and/or participants. Mucosal luminal interface samples from three treatment-naïve patients were obtained during diagnostic endoscopy following a standard colonoscopy preparation protocol.Citation66 Demographic and clinical information is shown in . Stool samples were collected either before or after endoscopy and stored at −80°C.

Method Details

Sample Collection

The collection of mucosal-luminal interface (MLI) aspirates has been described previously.Citation18 In brief, sterile water was used to wash the bowel wall at the proximal and distal colon during colonoscopy to remove the loosely adherent mucous layer. The wash was then aspirated into a sterile container and stored at −80°C. Aliquots of 10 ml were used for virus-like particle purification and viral DNA extraction; 20 ml were used for RNA extraction, and 2 ml was used for whole metagenomic DNA extraction.

Virus-like Particle Purification and Nucleic Acid Extraction

We have previously described a protocol to purify virus-like particles from mucosal aspirates.Citation11 In summary, 10 ml aliquots of mucosal aspirates or 0.5 g of stool homogenized in 10 ml saline-magnesium buffer were subjected to centrifugation and sequential filtration at 5.0 and 0.45 µm filters (Sigma-Aldrich, SLSV025LS and SLHV033RB) to remove debris and cellular content. Virus-like particles were precipitated by overnight incubation at 10% w/v PEG-8000 (Fisher Scientific, BP233) and resuspended in saline-magnesium buffer. Remaining bacterial cells were lysed by treatment with 1 mg/ml lysozyme (Sigma, L4919) for 30 min at 37°C followed by 0.2 volumes of chloroform (10 min, room temperature). After centrifugation (5 min, 2500 g), the aqueous mixture was treated with TURBO DNase (Thermo Scientific, AM2238) and RNaseI (Life Technologies, EN0602) in a buffer of 1 mM CaCl2 and 5 mM MgCl2 for 1 h at 37°C to degrade remaining bacterial nucleic acids. Enzymes were inactivated at 70°C for 10 minutes. Virus-like particles were lysed with Proteinase K (3.2 µg/ml; Fisher Scientific, BP1700) in 3.2% SDS for 20 minutes at 55°C, then treated by 2.5% cetyltrimethylammonium bromide (Fisher Scientific, O3042) with 0.5 M NaCl for 10 minutes at 65°C. Viral DNA was then extracted by adding 1 volume of phenol-chloroform-isoamyl alcohol (25:24:1, pH 6.7) to each mixture, which was vortexed and subjected to centrifugation (10 min, 8000 g); this step was repeated with chloroform to remove trace phenol. Nucleic acids were purified from the aqueous layer using the Dneasy Blood and Tissue Kit (QIAGEN, 69506) and eluted in 50 µl of water. DNA was concentrated using an Eppendorf™ Vacufuge™ Concentrator to 3 µl to maximize the input DNA for the GenomiPhiTM V2 DNA Amplification polymerase kit (GE Life Science, 25660032). Reactions using 1 µl of input DNA were run in triplicate, then pooled and purified with the Dneasy Blood and Tissue Kit. DNA was quantified fluorescently using the Qubit dsDNA HS Assay Kit (Thermo Fisher, Q32854).

Whole Metagenomic and Metatranscriptomic Extraction

Whole metagenomic DNA was extracted using the FastDNA Spin Kit for DNA Isolation (MP Biomedicals, 116540600), eluted in water, and quantified using the Qubit High Sensitivity dsDNA Assay Kit as previously described.Citation67

Total RNA was freshly extracted from pellets obtained from MLI aspirates, using a modified phenol/chloroform extraction protocol adapted for samples with high mucin content.Citation68 Briefly, fresh 20 ml MLI aliquots were subjected to two sequential centrifugations at 4°C (700 g for 5 min; 13000 rpm for 20 min) to remove cellular debris and pellet bacterial cells. The pellets were resuspended in denaturing buffer (4 M guanidine thiocyanate, 25 mM sodium citrate, 0.5% N-lauroylsarcosine, 1 M 2-mercaptoethanol, 1% N-acetyl cysteine) and 0.1 volumes of 1 M sodium acetate pH 5.2. The samples were incubated at 65°C for 2 min to lyse the cells, preheated buffer-saturated phenol pH 4.3 was added at a 1:1 ratio and incubated at 65°C for 10 min with frequent inverting. The samples were then chilled by placing them directly on ice for 15 min, chloroform was added (0.5 volumes) and the tubes inverted 10 times. The aqueous phase was separated by centrifugation at 13000 rpm for 25 min at 4°C and the RNA precipitated at −80°C by adding 2.5 volumes of ethanol, 0.1 volumes 3 M sodium acetate pH 5.2 and EDTA added to a final concentration of 1 mM. Precipitated RNA extractions were further cleaned by washing the pellets 5 times with chilled 80% ethanol with 5 min 13000 rpm centrifugations at 4°C. The RNA was then resuspended in Rnase free water (Ambion, AM9937) and stored at −80°C. Contaminating DNA was removed using sequential Dnase I (Thermo Scientific, EN0521) treatments and further purified and concentrated using the RNA Clean & Concentrator-25 (Zymo Research, R1017). The absence of human and bacterial DNA was then assessed using PCR with primers targeting human actin (hACTBf2: AGTCCTACGGAAAACGGCAG, hACTBr2: CACCCTGAAGTACCCCATCG) and the bacterial V4-16S rRNA hypervariable region (V4_bc45: CCATCTCATCCCTGCGTGTCTCCGACTCAGTAACGCGTTCGAYTGGGYD TAAAGNG, V4_revs: CCTCTCTATGGGCAGTCGGTGATTA CNVGGGTATCTAATCC).Citation67 PCR products were analyzed on a 1% agarose gel and the Dnase I treatment repeated if a band was present. RNA quantity and quality were evaluated using the Qubit RNA HS Assay Kit (Thermo Fisher, EN0602) and the RNA 6000 Nano Kit (Agilent Technologies, 5067–1511) on the 2100 Bioanalyzer. Samples with an RNA Integrity Number (RIN) ≥ 7 were deemed suitable for sequencing.

DNA sequencing, quality-filtering, and host-read removal

DNA and RNA sequencing were performed at the Génome Québec CES using the NEB Ultra II and NEB rRNA-depleted stranded library preparation kits, respectively, with subsequent sequencing on the Illumina NovaSeq 6000 platform. Cutadapt 2.10 was used to trim Illumina’s universal adapters (AGATCGGAAGAG; AGATCGGAAGAG) while Trimmomatic 0.36 was used in paired-end mode to retain high-quality sequences with the settings SLIDINGWINDOW:4:20, MINLEN:60, and HEADCROP:10. Reads mapping to the human genome (GRCh38 with bowtie2ʹs ultra-sensitive mode)Citation69 were removed from further analysis using samtools 1.7.Citation70 Finally, low complexity reads were removed using Komplexity.Citation71 Bacterial contamination was assessed by aligning reads to the cpn60 database with bowtie2ʹs ultra-sensitive mode.Citation27

An average of 230 million paired end reads (173–280 million) were obtained for each metavirome sample, totaling 34 Gbps, while an average of 154 million paired end reads (137–177 million) were obtained for each metagenome sample, totaling 23 Gbps. Quality filtering resulted in the removal of 6.16% of reads (4.91–9.10%) across all 18 samples, with no significant difference between metavirome and metagenome sequencing. Host read filtering removed 0.020–0.319% of high-quality metavirome reads and 0.0849–94.3% of high-quality metagenomic reads (p = .0005 between the two groups). Low complexity sequence filtering, which removes human microsatellite DNA and aids downstream assembly and analysis, was performed using Komplexity, which removed another 0.235–5.79% high-quality reads.Citation71 Overall, this resulted in a mean of 211 (157–254) million paired-end metavirome reads and 75.4 (6.83–162) million paired-end metagenomic reads per sample.

Metatranscriptomic analysis

Metatranscriptome sequencing reads were subject to adapter trimming and quality filtering using cutadapt and Trimmomatic as described above. TopHat2 was used to identify reads aligning to human transcripts and these were removed using samtools 1.9.Citation70,Citation72 SortMeRNA and Komplexity was then used to remove ribosomal and low complexity reads, resulting in remaining high-quality, non-human, and non-ribosomal reads.Citation73 For RdRp and ssRNA marker identification, MEGAHIT v1.2.7 was used to assemble these reads at default settings (minimum contig length of 200 bp). ORFs were predicted using Prodigal 2.6.3 in its metagenomic mode,Citation74 and searched against databases of RNA-dependent RNA polymerases and capsids used by Cenote-Taker 2.1.3 and ssRNA marker genes using hmmsearch with a minimum E-value of 1E-05.Citation1,Citation14,Citation75

Metagenomic assembly, viral contig identification, and viral gene annotation

The bioinformatic pipeline for VC identification is summarized in Supplementary Figure 2. Host-decontaminated, quality-filtered reads from each metagenome and metavirome sample were assembled using MEGAHIT v1.2.7 with a minimum contig length of 3000 bp.Citation76 Metagenome and metavirome assemblies were each pooled and clustered using ClusterGenomes at 95% identity across 90% of the length of the smaller contig.Citation77 Viral contigs were identified using Cenote-Taker2 using default settings for whole metagenomic assembly (including pruning of flanking host regions) and virus-like particle preparation assembly, respectively. To examine the relationship between sequencing depth, this process was repeated with quality-filtered, metaviromic sequencing reads subsetted at increments from 10,000 to 100 million paired end reads per sample to simulate various sequencing depths.

All putative viral contigs were further pooled and clustered, resulting in 2,171 VCs. During this step, all metavirome-derived VCs were also clustered against the set of metagenomic non-viral contigs: non-viral contigs clustering within a viral contig were removed from the set of non-viral contigs, while metavirome-derived viral contigs clustering within non-viral contigs were considered prophages (along with host DNA flanked prophages identified in metagenome). Non-viral contigs that contained one or more prophages had these regions masked. Whole metagenomic assemblies that were not identified as viral contigs (n = 39,735) were annotated using the Contig Annotation Tool.Citation78

VCs were assessed using CheckV,Citation30 annotated using Demovir,Citation29 then clustered with vContact2.Citation31 Members of a viral cluster were annotated by majority vote. As Demovir has not yet been updated to match the latest viral taxonomy established by ICTV, we manually removed reference to the families Siphoviridae, Myoviridae, and Podoviridae, but opted to retain references to the order Caudovirales (now class Caudoviricetes) to allow for straightforward comparisons to previous published datasets. Open reading frame (ORF) prediction was performed using Prokka 1.14.5 with – meta enabled and – kingdom Bacteria or – Viruses against the set of non-viral and viral contigs, respectively.Citation79 All open reading frames were clustered using Mmseqs2 13.45111,Citation80 and annotated using hmmsearch (minimum E-value of 1E-05), a part of HMMER 3.3,Citation75 against the following databases: VOG (release 206),Citation41 pVOG,Citation42 PFAM (34.0),Citation43 KEGG (98.0),Citation44 and TIGRFAM (15.0).Citation45 Genes were preferentially annotated in the order of the databases listed above. crAss-like phages were additionally annotated using a curated set of crAss-like phage protein profiles.Citation51 SNP analysis was performed on un-clustered crAss-like phages using Snippy 4.6.0.Citation50 A viral contig was considered to contain an integrase if it contained one or more of the 603 genes annotated with the term “integrase”; the majority of these (591/603) were annotated by VOG00035.

All metagenomic, metatranscriptome, and metaviromic reads were mapped to the pooled set of viral and non-viral contigs using bowtie2,Citation81,Citation82 with counts and breadth of coverage calculated using samtools depth.Citation70 A breadth of coverage threshold of ≥75% of the contig should have a minimum depth of ≥1 read per bp was applied to metagenome and metavirome count tables; counts under this threshold was set to 0. The normalized relative abundance (NRA) of each VC was calculated as xixi where xi is the number of reads mapping to a contig divided by the contig length. Gene counts were calculated using subread featureCounts.Citation83

Viral gene transcription and host prediction

A viral contig was considered to be transcriptionally active if its NRA in the metatranscriptome was greater than 0.0001%. Gene features within transcriptionally active VCs with >1 counts were considered to be transcribed. Gene transcription was approximated using the ratio of NRA in the metatranscriptome to NRA in the metagenome.

WiSH 1.1 was run using the set of non-viral contigs against VCs.Citation84 Only virus-host scores with an adjusted p-value < 10Citation5 were selected for downstream analysis. Correlation with their viral hosts was calculated using a Spearman correlation.

Statistical analysis

Alpha-diversity and beta-diversity analysis, host-phage correlations, statistical analysis, and plotting were performed in R 4.1.3 using phyloseq 1.36.0.Citation85 Figures were created with the following packages: ggplot2 3.3.0, ggthemes 4.2.0, ggpubr 0.4.0, gggenes 0.4.1, ggalluvial 0.12.3, eulerr 6.1.0, and UpSetR 1.4.0. Alpha diversity analysis was performed using read counts rarefied to 1,104,709 reads (the lowest number of viral reads identified in a sample across the dataset). The Chao1 index was calculated as S0+a122a2 where S0 was the total observed species, a1 the number of species seen once and a2 the number of species seen twice; and the Shannon index was calculated as iSPilnPi where S was the total number of species and Pi the proportion of the total population composed of species i. The likelihood ratio was calculated as Sensitivity/(1 – Specificity).

Abbreviations

DC, distal colon; MLI, mucosal-luminal interface; IC, integration-capable; NI, non-integrated; NRA, normalized relative abundance; ORF: open reading frame; PC, proximal colon; STL: stool; TA, transcriptionally active; VC: viral contig; VLP: virus-like particle

Author contributions

AY and AS designed the experiments. AY performed all virome extractions, data analysis, and wrote the initial manuscript. Whole genome extractions were performed by AY, while LS performed the metatranscriptome extractions. DM was responsible for sample collection and clinical data generation. JB assisted in bioinformatic analysis and data interpretation. All authors reviewed and provided comments on the final manuscript.

Supplemental material

Supplemental Material

Download Zip (4.3 MB)

Acknowledgments

The authors would like to acknowledge the patients and their families for their participation in our study. We also thank Ruth Singleton for her help in enrolling patients and assistance in collecting intestinal aspirate samples and Dr. Kendra Hodgkinson for her proofreading and review of this manuscript. The multiomic analyses presented herein were enabled in part by Compute Ontario, and the Digital Research Alliance of Canada (https://alliancecan.ca/en).

AY is supported by the Frederick Banting and Charles Best Canada Graduate Scholarships Doctoral Award from the Canadian Institutes of Health Research. DRM is supported in part through a Distinguished Clinical Chair in Pediatric IBD through the Faculty of Medicine, University of Ottawa. This work was supported by the Government of Canada through Genome Canada and the Ontario Genomics Institute under Grant OGI-149; the Canadian Institutes of Health Research under Grant ECD-144627; and the Ontario Ministry of Economic Development and Innovation under Project 13440. The funders had no role in study design, data collection and analysis, or preparation of the manuscript.

Disclosure statement

AS and DM are co-founders of MedBiome, a clinical microbiomics company. The other authors have no competing interests to declare.

Data Availability Statement

Host-removed, high-quality sequencing reads are available under BioProject PRJNA81830 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818303.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19490976.2023.2177488

Additional information

Funding

This work was supported by the Canadian Institutes of Health Research [ECD-144627]; Genome Canada [OGI-149]; Ontario Ministry of Research and Innovation [Project 13440]; Frederick Banting and Charles Best [Scholarship]; Faculty of Medicine, University of Ottawa [Distinguished Clinical Chair].

References

  • Tisza MJ, Buck CB. A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc Natl Acad Sci. 2021;118(23):e2023202118. doi:10.1073/pnas.2023202118.
  • Liang G, Bushman FD. The human virome: assembly, composition and host interactions. Nat Rev Microbiol. 2021;19(8):514–23. doi:10.1038/s41579-021-00536-5.
  • Shkoporov AN, Hill C. Review Bacteriophages of the Human Gut: the “ Known Unknown “ of the Microbiome. Cell Host Microbe. 2019;25(2):195–209. doi:10.1016/j.chom.2019.01.017.
  • Shkoporov AN, Clooney AG, Sutton TDS, Ryan FJ, Daly KM, Nolan JA, McDonnell SA, Khokhlova EV, Draper LA, Forde A, et al. The Human Gut Virome Is Highly Diverse, Stable, and Individual Specific. Cell Host Microbe. 2019;26(1):527–541.e5. doi:10.1016/j.chom.2019.09.009.
  • Liang G, Zhao C, Zhang H, Mattei L, Sherrill-Mix S, Bittinger K, Kessler LR, Wu GD, Baldassano RN, DeRusso P, et al. The stepwise assembly of the neonatal virome is modulated by breastfeeding. Nature. 2020;581(7809):470–474. doi:10.1038/s41586-020-2192-1.
  • Gregory AC, Zablocki O, Zayed AA, Howell A, Bolduc B, Sullivan MB. The Gut Virome Database Reveals Age-Dependent Patterns of Virome Diversity in the Human Gut. Cell Host Microbe. 2020;28(5):724–740.e8. doi:10.1016/j.chom.2020.08.003.
  • Guerin E, Shkoporov A, Stockdale SR, Clooney AG, Ryan FJ, Sutton TDS, Draper LA, Gonzalez-Tortuero E, Ross RP, Hill C, et al. Biology and Taxonomy of crAss-like Bacteriophages, the Most Abundant Virus in the Human Gut. Cell Host Microbe. 2018;24(1):653–664.e6. doi:10.1016/j.chom.2018.10.002.
  • Koonin EV, Yutin N. The crAss-like Phage Group: how Metagenomics Reshaped the Human Virome. Trends Microbiol. 2020;28(5):349–359. doi:10.1016/j.tim.2020.01.010.
  • Nayfach S, Páez-Espino D, Call L, Low SJ, Sberro H, Ivanova NN, Proal AD, Fischbach MA, Bhatt AS, Hugenholtz P, et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nature Microbiol. 2021;6(7):960–970. doi:10.1038/s41564-021-00928-6.
  • Johansen J, Plichta DR, Nissen JN, Jespersen ML, Shah SA, Deng L, Stokholm J, Bisgaard H, Nielsen DS, Sørensen SJ, et al. Genome binning of viral entities from bulk metagenomics data. Nat Commun. 2022;13(1):965. doi:10.1038/s41467-022-28581-5.
  • Yan A, Butcher J, Mack D, Stintzi A. Virome Sequencing of the Human Intestinal Mucosal–Luminal Interface. Front Cell Infect Microbiol. 2020;10:10. doi:10.3389/fcimb.2020.00010.
  • Garmaeva S, Sinha T, Kurilshikov A, Fu J, Wijmenga C, Zhernakova A, Kurilshikov A, Fu J, Wijmenga C, Zhernakova A, et al. Studying the gut virome in the metagenomic era: challenges and perspectives. BMC Biol. 2019;17:17. doi:10.1186/s12915-019-0636-6.
  • Santiago-Rodriguez TM, Naidu M, Abeles SR, Boehm TK, Ly M, Pride DT. Transcriptome analysis of bacteriophage communities in periodontal health and disease. BMC Genomics. 2015;16(1):549. doi:10.1186/s12864-015-1781-0.
  • Callanan J, Stockdale SR, Shkoporov A, Draper LA, Ross RP, Hill C. Expansion of known ssRNA phage genomes: from tens to over a thousand. Sci Adv. 2020;6(6):eaay5981. doi:10.1126/sciadv.aay5981.
  • Owen SV, Canals R, Wenner N, Hammarlöf DL, Kröger C, Hinton JCD. A window into lysogeny: revealing temperate phage biology with transcriptomics. Microb Genom. 2020;6(2):e000330. doi:10.1099/mgen.0.000330.
  • Leskinen K, Blasdel BG, Lavigne R, Skurnik M. RNA-Sequencing Reveals the Progression of Phage-Host Interactions between φR1-37 and Yersinia enterocolitica. Viruses. 2016;8(4):111. doi:10.3390/v8040111.
  • Shkoporov AN, Khokhlova EV, Stephens N, Hueston C, Seymour S, Hryckowian AJ, Scholz D, Ross RP, Hill C. Long-term persistence of crAss-like phage crAss001 is associated with phase variation in Bacteroides intestinalis. BMC Biol. 2021;19(1):163. doi:10.1186/s12915-021-01084-3.
  • Mottawea W, Butcher J, Li J, Abujamel T, Manoogian J, Mack D, Stintzi A. The mucosal–luminal interface: an ideal sample to study the mucosa-associated microbiota and the intestinal microbial biogeography. Pediatr Res. 2019;85(6):895–903. doi:10.1038/s41390-019-0326-7.
  • Shkoporov AN, Stockdale SR, Lavelle A, Kondova I, Heuston C, Upadrasta A, Khokhlova EV, van der Kamp I, Ouwerling B, Draper LA, et al. Viral biogeography of the mammalian gut and parenchymal organs. Nature Microbiol. 2022;7(8):1301–1311. doi:10.1038/s41564-022-01178-w.
  • Silveira CB, Rohwer FL. Piggyback-the-Winner in host-associated microbial communities. npj Biofilms Microbio. 2016;2(1):16010. doi:10.1038/npjbiofilms.2016.10.
  • Galley JD, Yu Z, Kumar P, Dowd SE, Lyte M, Bailey MT. The structures of the colonic mucosa-associated and luminal microbial communities are distinct and differentially affected by a prolonged murine stressor. Gut Microbes. 2014;5(6):748–760. doi:10.4161/19490976.2014.972241.
  • Duerkop BA, Condit RC. Bacteriophages shift the focus of the mammalian microbiota. PLoS Pathog. 2018;14(10):e1007310. doi:10.1371/journal.ppat.1007310.
  • Zuo T, Lu X-J, Zhang Y, Cheung CP, Lam S, Zhang F, Tang W, Ching JYL, Zhao R, Chan PKS, et al. Gut mucosal virome alterations in ulcerative colitis. Gut. 2019;2019:1–11.
  • Adiliaghdam F, Amatullah H, Digumarthi S, Saunders TL, Rahman R-U, Wong LP, Sadreyev R, Droit L, Paquette J, Goyette P, et al. Human enteric viruses autonomously shape inflammatory bowel disease phenotype through divergent innate immunomodulation. Sci Immunol. 2022;7(70):eabn6660–eabn. doi:10.1126/sciimmunol.abn6660.
  • Jiang P, Lai S, Wu S, Zhao X-M, Chen W-H. Host DNA contents in fecal metagenomics as a biomarker for intestinal diseases and effective treatment. BMC Genomics. 2020;21(1):348. doi:10.1186/s12864-020-6749-z.
  • Schroeder KW, Tremaine WJ, Ilstrup DM. Coated Oral 5-Aminosalicylic Acid Therapy for Mildly to Moderately Active Ulcerative Colitis. NEJM. 1987;317(26):1625–1629. doi:10.1056/NEJM198712243172603.
  • Vancuren SJ, Hill JE. Update on cpnDB: a reference database of chaperonin sequences. Database. 2019;2019. doi:10.1093/database/baz033.
  • Shkoporov AN, Ryan FJ, Draper LA, Forde A, Stockdale SR, Daly KM, Whiteley AS, Murrell JC. Reproducible protocols for metagenomic analysis of human faecal phageomes. Microbiome. 2018;6(1):1–17. doi:10.1186/s40168-017-0383-2.
  • feargalr. Demovir: taxonomic classification of viruses at Order and Family level; 2019.
  • Nayfach S, Camargo AP, Schulz F, Eloe-Fadrosh E, Roux S, Kyrpides NC. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021;39(5):578–585. doi:10.1038/s41587-020-00774-7.
  • Bin Jang H, Bolduc B, Zablocki O, Kuhn JH, Roux S, Adriaenssens EM, Brister JR, Kropinski AM, Krupovic M, Lavigne R, et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol. 2019;37(6):632–639. doi:10.1038/s41587-019-0100-8.
  • Reyes A, Wu M, McNulty NP, Rohwer FL, Gordon JI. Gnotobiotic mouse model of phage–bacterial host dynamics in the human gut. Proc Natl Acad Sci. 2013;110(50):20236–20241. doi:10.1073/pnas.1319470110.
  • Lau MCY, Harris RL, Oh Y, Yi MJ, Behmard A, Onstott TC. Taxonomic and Functional Compositions Impacted by the Quality of Metatranscriptomic Assemblies. Front Microbiol. 2018;9. doi:10.3389/fmicb.2018.01235.
  • Clooney AG, Sutton TDS, Shkoporov AN, Holohan RK, Daly KM, O’Regan O, Ryan FJ, Draper LA, Plevy SE, Ross RP, et al. Whole-Virome Analysis Sheds Light on Viral Dark Matter in Inflammatory Bowel Disease. Cell Host Microbe. 2019;26(6):764–78 e5. doi:10.1016/j.chom.2019.10.009.
  • Kim K-H, Bae J-W. Amplification methods bias metagenomic libraries of uncultured single-stranded and double-stranded DNA viruses. Appl Environ Microbiol. 2011;77(21):7663–7668. doi:10.1128/AEM.00289-11.
  • Parras-Moltó M, Rodríguez-Galet A, Suárez-Rodríguez P, López-Bueno A. Evaluation of bias induced by viral enrichment and random amplification protocols in metagenomic surveys of saliva DNA viruses. Microbiome. 2018;6(1):119. doi:10.1186/s40168-018-0507-3.
  • Vaga S, Lee S, Ji B, Andreasson A, Talley NJ, Agréus L, Bidkhori G, Kovatcheva-Datchary P, Park J, Lee D, et al. Compositional and functional differences of the mucosal microbiota along the intestine of healthy individuals. Sci Rep. 2020;10(1):14977. doi:10.1038/s41598-020-71939-2.
  • Van den Abbeele P, Van de Wiele T, Verstraete W, Possemiers S. The host selects mucosal and luminal associations of coevolved gut microorganisms: a novel concept. FEMS Microbiol Rev. 2011;35(4):681–704. doi:10.1111/j.1574-6976.2011.00270.x.
  • Almeida GMF, Laanto E, Ashrafi R, Sundberg L-R, Martiny JBH. Bacteriophage Adherence to Mucus Mediates Preventive Protection against Pathogenic Bacteria. mBio. 2019;10(6):e01984–19. doi:10.1128/mBio.01984-19.
  • Barr JJ, Auro R, Furlan M, Whiteson KL, Erb ML, Pogliano J, Stotland A, Wolkowicz R, Cutting AS, Doran KS, et al. Bacteriophage adhering to mucus provide a non–host-derived immunity. Proc Natl Acad Sci. 2013;110(26):10771. doi:10.1073/pnas.1305923110.
  • CUBE UoV. VOGDB: virus Orthologous Groups; 2021.
  • Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res. 2017;45(D1):D491–D8. doi:10.1093/nar/gkw975.
  • Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar Gustavo A, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2020;49(D1):D412–D9. doi:10.1093/nar/gkaa913.
  • Kanehisa M, Goto S. KEGG: kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. doi:10.1093/nar/28.1.27.
  • Haft DH. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31(1):371–373. doi:10.1093/nar/gkg128.
  • Moraru C, Varsani A, Kropinski AM. VIRIDIC—A Novel Tool to Calculate the Intergenomic Similarities of Prokaryote-Infecting Viruses. Viruses. 2020;12(11):1268. doi:10.3390/v12111268.
  • Dutilh BE, Cassman N, McNair K, Sanchez SE, Silva GGZ, Boling L, Barr JJ, Speth DR, Seguritan V, Aziz RK, et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun. 2014;5(1). doi:10.1038/ncomms5498.
  • Gulyaeva A, Garmaeva S, Ruigrok RAAA, Wang D, Riksen NP, Netea MG, Wijmenga C, Weersma RK, Fu J, Vila AV, et al. Discovery, diversity, and functional associations of crAss-like phages in human gut metagenomes from four Dutch cohorts. Cell Rep. 2022;38(2):110204. doi:10.1016/j.celrep.2021.110204.
  • Siranosian BA, Tamburini FB, Sherlock G, Bhatt AS. Acquisition, transmission and strain diversity of human gut-colonizing crAss-like phages. Nat Commun. 2020;11:15.
  • Seemann T. Snippy: rapid haploid variant calling and core genome alignment (version 4.6.0); 2020.
  • Yutin N, Benler S, Shmakov SA, Wolf YI, Tolstoy I, Rayko M, Antipov D, Pevzner PA, Koonin EV. Analysis of metagenome-assembled viral genomes from the human gut reveals diverse putative CrAss-like phages with unique genomic features. Nat Commun. 2021;12(1):1044. doi:10.1038/s41467-021-21350-w.
  • Yutin N, Makarova KS, Gussow AB, Krupovic M, Segall A, Edwards RA, Koonin EV. Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nature Microbiol. 2018;3:38–46. doi:10.1038/s41564-017-0053-y.
  • Minot S, Bryson A. Rapid evolution of the human gut virome. Proc Natl Acad Sci. 2013;110:12450–12455.
  • Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus–host interactions resolved from publicly available microbial genomes. eLife. 2015;4:e08490. doi:10.7554/eLife.08490.
  • Lourenço M, Chaffringeon L, Lamy-Besnier Q, Pédron T, Campagne P, Eberl C, Bérard M, Stecher B, Debarbieux L, De Sordi L, et al. The Spatial Heterogeneity of the Gut Limits Predation and Fosters Coexistence of Bacteria and Bacteriophages. Cell Host Microbe. 2020;28(1):390–401.e5. doi:10.1016/j.chom.2020.06.002.
  • Fogg PCM, Rigden DJ, Saunders JR, McCarthy AJ, Allison HE. Characterization of the relationship between integrase, excisionase and antirepressor activities associated with a superinfecting Shiga toxin encoding bacteriophage. Nucleic Acids Res. 2011;39(6):2116–2129. doi:10.1093/nar/gkq923.
  • Balding C, SA B, Pickup RW, Saunders JR. Diversity of phage integrases in Enterobacteriaceae: development of markers for environmental analysis of temperate phages. Environ Microbiol. 2005;7(10):1558–1567. doi:10.1111/j.1462-2920.2005.00845.x.
  • Hannigan GD, Meisel JS, Tyldsley AS, Zheng Q, Hodkinson BP, SanMiguel AJ, Minot S, Bushman FD, and Grice EA. The Human Skin Double-Stranded DNA Virome: topographical and Temporal Diversity, Genetic Enrichment, and Dynamic Associations with the Host Microbiome. mBio. 2015;6(5):e01578–15. doi:10.1128/mBio.01578-15.
  • Hockenberry AJ, Wilke CO. BACPHLIP: predicting bacteriophage lifestyle from conserved protein domains. PeerJ. 2021;9:e11396–e. doi:10.7717/peerj.11396.
  • Guerin E, Shkoporov AN, Stockdale SR, Comas JC, Khokhlova EV, Clooney AG, Daly KM, Draper LA, Stephens N, Scholz D, et al. Isolation and characterisation of ΦcrAss002, a crAss-like phage from the human gut that infects Bacteroides xylanisolvens. Microbiome. 2021;9(1):89. doi:10.1186/s40168-021-01036-7.
  • Mäntynen S, Laanto E, Oksanen HM, Poranen MM, Díaz-Muñoz SL. Black box of phage-bacterium interactions: exploring alternative phage infection strategies. Open Biol. 2021;11(9):210188. doi:10.1098/rsob.210188.
  • Nami Y, Imeni N, Panahi B. Application of machine learning in bacteriophage research. BMC Microbiol. 2021;21(1):193. doi:10.1186/s12866-021-02256-5.
  • Song K. Classifying the Lifestyle of Metagenomically-Derived Phages Sequences Using Alignment-Free Methods. Front Microbiol. 2020;11. doi:10.3389/fmicb.2020.567769.
  • Collier AM, Lyytinen OL, Guo YR, Toh Y, Poranen MM, Tao YJ. Initiation of RNA Polymerization and Polymerase Encapsidation by a Small dsRNA Virus. PLoS Pathog. 2016;12(4):e1005523. doi:10.1371/journal.ppat.1005523.
  • Townsend EM, Kelly L, Muscatt G, Box JD, Hargraves N, Lilley D, Jameson E. The Human Gut Phageome: origins and Roles in the Human Gut Microbiome. Front Cell Infect Microbiol. 2021;11. doi:10.3389/fcimb.2021.643214
  • Jimenez-Rivera C, Haas D, Boland M, Barkey JL, Mack DR. Comparison of Two Common Outpatient Preparations for Colonoscopy in Children and Youth. Gastroenterol Res Pract. 2009;2009:518932. doi:10.1155/2009/518932.
  • Mottawea W, Chiang C-K, Mühlbauer M, Starr AE, Butcher J, Abujamel T, Deeke SA, Brandel A, Zhou H, Shokralla S, et al. Altered intestinal microbiota–host mitochondria crosstalk in new onset Crohn’s disease. Nat Commun. 2016;7(1):13419. doi:10.1038/ncomms13419.
  • Stahl M, Friis LM, Nothaft H, Liu X, Li J, Szymanski CM, Stintzi A. l -Fucose utilization provides Campylobacter jejuni with a competitive advantage. Proc Natl Acad Sci. 2011;108(17):7194–7199. doi:10.1073/pnas.1014125108.
  • Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen H-C, Kitts PA, Murphy TD, Pruitt KD, Thibaud-Nissen F, Albracht D, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27(5):849–864. doi:10.1101/gr.213611.116.
  • Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The Sequence Alignment/Map format and SAMtools. bioinformatics. 2009;25(16):2078–2079. doi:10.1093/bioinformatics/btp352.
  • Clarke EL, Taylor LJ, Zhao C, Connell A, Lee J-J, Fett B, Shah P, Lou Y, Ebeling C, Mason AL, et al. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome. 2019;7(1):1–13. doi:10.1186/s40168-018-0604-3.
  • Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36. doi:10.1186/gb-2013-14-4-r36.
  • Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28(24):3211–3217. doi:10.1093/bioinformatics/bts611.
  • Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 2010;11(1):119. doi:10.1186/1471-2105-11-119.
  • Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10):e1002195. doi:10.1371/journal.pcbi.1002195.
  • Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–1676. doi:10.1093/bioinformatics/btv033.
  • Roux S, Bolduc B. Stampede-ClusterGenomes. MAVERICKLab. 2017.https://bitbucket.org/MAVERICLab/stampede-clustergenomes/src/master/
  • von Meijenfeldt FAB, Arkhipova K, Cambuy DD, Coutinho FH, Dutilh BE. Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT. Genome Biol. 2019;20(1):217. doi:10.1186/s13059-019-1817-x.
  • Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30(14):2068–2069. doi:10.1093/bioinformatics/btu153.
  • Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–1028. doi:10.1038/nbt.3988.
  • Ziemann M. Accuracy, speed and error tolerance of short DNA sequence aligners. bioRxiv. 2016;053686.
  • Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–359. doi:10.1038/nmeth.1923.
  • Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2013;30(7):923–930. doi:10.1093/bioinformatics/btt656.
  • Galiez C, Siebert M, Enault F, Vincent J, Söding J. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics. 2017;33(19):3113–3114. doi:10.1093/bioinformatics/btx383.
  • McMurdie PJ, Holmes S. phyloseq: an R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PloS one. 2013;8(4):e61217. doi:10.1371/journal.pone.0061217.