
Phylobook: a tool for display, clade annotation and extraction of sequences from molecular phylogenies

Received 07 Jul 2023, Accepted 22 Mar 2024, Published online: 03 May 2024

Abstract

As the volume of sequence data from variable pathogens increases, analyzing, annotating and extracting specific taxa for study becomes more difficult. To meet these challenges for datasets with hundreds to thousands of taxa, ‘Phylobook’ was developed. Starting with a sequence alignment file, Phylobook generates and displays phylogenetic trees adjacent to highlighter plots showing the position of mutations, and allows the user to identify lineages and recombinants and to annotate and export selected subsets of sequences for downstream analysis. Accurate lineage assignment, which is difficult to automate, is aided by annotations created by different clustering methods. Phylobook provides web-based display combined with automated clustering and manual editing to allow for expert assessment and correction of lineage assignments and extraction for downstream analysis.

Tweetable abstract

Phylobook generates and displays phylogenetic trees adjacent to highlighter plots showing the position of mutations, and allows the user to identify lineages and recombinants and to annotate and export selected subsets of sequences for downstream analysis.

Method summary

Phylobook is a web-based tool for sharing phylogenetic trees and collaborating on lineage assignments. A unique feature is that lineage assignment is aided by two plots that are displayed adjacent to the phylogenetic tree. A mismatch plot highlights differences between individual sequences and the most abundant sequence in the tree and a match plot shows similarities between individual sequences and the currently assigned lineages.

Executive summary
  • Tracking emerging and spreading virus variant populations via analysis of highly accurate, long-read, deep sequence data is becoming increasingly important.

  • Phylobook eases the task of analyzing closely related sequences by displaying data as a phylogenetic tree adjacent to aligned plots that highlight phylogenetically important sites.

Materials & methods

  • Phylobook is a web-based tool written primarily in Python 3 using the Django web framework and made available in a Docker container via GitHub.

Results

  • We demonstrate the utility of Phylobook through two use cases: identification of lineages in an HIV-infected subject, and display and identification of lineages in SARS-CoV2 sequence data from Washington State.

Future directions

  • We anticipate that a great deal of future research will be devoted to both tracking known infectious agents and to understanding the rapidly expanding repertoire of novel agents. Phylobook empowers research in such areas by allowing researchers to easily share, visualize and annotate multiple sequence alignments and extract lineage information.

The technology for study of viral evolution has accelerated greatly in the past several decades with the development of increasingly inexpensive and deep sequencing technologies. During the same period, there has also been widespread dissemination of variable human pathogens, for example, human immunodeficiency virus (HIV), Zika, and SARS coronaviruses. Tracking emerging and spreading virus variant populations has been extremely important in the design of annual influenza vaccines, the design of HIV vaccines and surveilling the spread of SARS-CoV2 variants of concern during the recent pandemic. Monitoring sequence variation has also been important to the identification of the often-subtle impacts (i.e., sieve effects [1,2]) of vaccines and therapies on virus infections in treated virus populations, including each of the major HIV vaccine trials over the past 15 years [3-11]. Selection on viral genome sequences that is driven by treatment may occur throughout viral genomes, and co-varying sites may be linked across large distances within genes and across the genome. The assessment of potentially distantly linked but synergistic genetic changes is not possible with typical short-read gene sequencing technologies, and long-range PCR followed by Sanger sequencing is not practical for study of low-level, potentially emerging or waning variants that occur early in infection when vaccine effects are strongest, or upon treatment. As a result, highly accurate long-read deep sequencing technologies are starting to find widespread use for these and similar studies.

Analysis of deeply sampled, long-read derived phylogenies is complicated by several factors, including the high rate of viral evolution resulting in many distinct taxa, often only subtle impacts from treatment(s), the frequent transmission of multiple lineages, recombination between lineages and technical issues such as PCR and sequencing errors and the creation of PCR chimeras during sample processing. Phylobook eases the task of analysis of closely related sequences by several means: display of sequence data in two forms, as a phylogenetic tree juxtaposed with an alignment of sequences showing the positions of differences between sequences (i.e., a highlighter plot [12]); superimposition of the results of user-selected clustering algorithms to help guide clade selection; user-defined clade or variant annotation; the display of a ‘match plot’ showing sequence similarity between sequences in the tree and the consensus of assigned clades; and export of user-defined groups of sequences.

Materials & methods

Software design

Phylobook is written primarily in Python 3 using the Django web framework. Application permissions are managed through the Django administration module, where users can be organized within groups; when groups are used, user permissions are inherited from the group. User credentials and permissions are stored in a local PostgreSQL database. Three authentication models are configurable when installing Phylobook: python3-saml based Single Sign On (SSO); local authentication using the standard Django authentication system; or a combination of SSO and local authentication. The login method is configurable in an environment file.

Software installation & availability

Phylobook is installed as a server within a Docker container on machines running macOS, Linux or Windows. Instructions for installation are at https://github.com/MullinsLab/phylobook. A detailed set of instructions for installation and usage is contained in the Phylobook User Manual. The Docker container supplies the web server, the Postgres server and associated database, the configuration files, and the associated Python and template files needed. The environment file is then edited to designate a directory that will contain all the data for the projects within Phylobook (the project path). Configuring Phylobook as an internet-accessible server requires some knowledge of web server host port mapping and SSL certificate management. An SSO authentication configuration will also require a modest amount of additional editing of the environment file by someone who is familiar with SAML.

After installation and configuration, an admin panel within Phylobook allows an administrator to create projects, users, groups and the associated permissions. Projects are created by defining additional directories under the project path and by populating those directories with the files detailed in Table 1. The bulk of the data used by Phylobook is contained within flat files in these directories. These files can be created by any number of software tools, but it is recommended that users implement the data processing pipeline described in the following section. Within the administrative panel, projects can be grouped into related collections to simplify the listing of projects on the landing page.

Table 1. Files required for each dataset within a Phylobook project.

Data processing

Phylobook was designed to support projects containing multiple subject/sample datasets per project. Within each dataset it is assumed there are multiple sequences. In our typical use case, deep sequencing provides multiple sequences per subject at one or more time points and there are multiple subjects within a project (see Use Case 1). However, other use cases are also possible within Phylobook. For example, collections of SARS-CoV2 sequences from given populations can be stored and displayed within Phylobook to allow rapid visualization and selection of sequences from different lineages (see Use Case 2).

Following input of a FASTA-formatted sequence alignment file, the Phylobook pipeline generates a phylogenetic tree and highlighter plot for input into a Phylobook project as described in Table 1. The process used by our group to generate the required input files is detailed in Figure 1, and the Phylobook pipeline software used to preprocess sequence data for display in Phylobook is available at https://github.com/MullinsLab/phylobook_pipeline. The Phylobook pipeline is available as a standalone tool or can be installed as a Docker container. Detailed instructions for installing and using the pipeline as a Docker container are provided in the Phylobook pipeline readme file. The process in Figure 1 is run for each sequence dataset. In brief, the sequence alignment is used to create a maximum likelihood phylogenetic tree using PhyML (https://github.com/stephaneguindon/phyml). When large numbers of closely related taxa are to be examined, it may be useful to collapse identical sequences into one, with the number of sequences represented appended to the sequence name and the collapsed sequences output in rank order of abundance in a FASTA file (script available at https://github.com/MullinsLab/sequence_collapsing). This information is used by Phylobook to provide a representation of the relative abundance of a given sequence within the sampled population. The phylogenetic tree is rendered as an image using FigTree (https://github.com/rambaut/figtree). We have written an in-house tool to replicate the web-based Highlighter tool (https://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/highlighter_top.html), and this is used to generate an image that highlights the variation between sequences within the alignment. Our version of the highlighter tool is available at https://github.com/MullinsLab/Highlighter. Highlighter automatically uses the first sequence in the FASTA file containing the alignment as a reference sequence; in other words, variation in other sequences is shown relative to the first sequence. Thus, ordering the sequences in the alignment file by rank order of abundance ensures that sequence differences are shown relative to the most common sequence. The outputs from the processes in Figure 1 are fed into Phylobook for display.

It should be noted that the Phylobook pipeline is not required for use of the Phylobook server. Indeed, users may prefer alternate approaches to generating the same output types. For example, one could use the tool of their choice to do the multiple sequence alignments, use an alternate tool to do the tree building and then run the tree files and alignment files through FigTree and Highlighter (using the web-based tools if desired). After that, one would upload the desired files to the Phylobook project directory. While this is admittedly more work than running a pipeline, it provides complete flexibility in algorithm choice for the various steps that we have automated in the pipeline. We recommend that users start with the pipeline we have designed and adjust it to meet their specific preferences and scientific goals.
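The collapsing and reference-ordering steps described above can be illustrated with a short Python sketch. It is not the code in the sequence_collapsing or Highlighter repositories; the function names, the choice to keep the first name seen for each distinct sequence, and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def collapse_identical(records):
    """Collapse identical aligned sequences, most abundant first.

    The first name seen for each distinct sequence is kept (an assumption)
    and the copy count is appended as '_<count>'.
    """
    counts = Counter(seq for _, seq in records)
    first_name = {}
    for name, seq in records:
        first_name.setdefault(seq, name)
    # most_common() sorts by count, descending; ties keep first-seen order.
    return [(f"{first_name[seq]}_{n}", seq) for seq, n in counts.most_common()]


def mismatches_to_first(records):
    """Return {name: [(1-based position, base), ...]} relative to record 0."""
    ref = records[0][1]
    return {
        name: [(i + 1, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b]
        for name, seq in records[1:]
    }


# Toy alignment (hypothetical data): three copies of ACGT and two singletons.
alignment = ">s1\nACGT\n>s2\nACGA\n>s3\nACGT\n>s4\nACGT\n>s5\nTCGA\n"
collapsed = collapse_identical(read_fasta(alignment))
differences = mismatches_to_first(collapsed)
```

Because the most abundant sequence ends up first, `differences` lists every other sequence's mismatches relative to it, which is the information a highlighter plot draws.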

Figure 1. Data processing steps for documents to be imported into Phylobook.

Processes are shown in black boxes, files are shown in green boxes with a clipped corner. Optional processes are linked with dashed lines. Processes performed within the Phylobook pipeline are inside the red dashed box.


Within Phylobook, one can manually define lineages by labeling and grouping sequences. Sequences associated with each lineage designation can then be exported. Optionally, a clustering or other algorithm can be used to pre-assign sequences to tentative lineage designations which can then be edited. This speeds up and increases the accuracy of lineage assignments. A simple clustering tool for this purpose is available at https://github.com/MullinsLab/ClusteringForPhylobook.
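As an illustration of how a clustering step might pre-assign tentative lineages, the sketch below runs a simple alternating k-medoids procedure on Hamming distances between aligned sequences. The interface and algorithm of ClusteringForPhylobook are not described in the text, so everything here (function names, the greedy farthest-point initialisation, the toy sequences) is an assumption rather than a description of that tool.

```python
def hamming(a, b):
    """Number of differing columns between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(seqs, k, max_iter=100):
    """Cluster aligned sequences by alternating assignment and medoid update.

    Returns (medoid_indices, labels). Initial medoids are chosen greedily
    (farthest-point), which is deterministic but a simplification.
    """
    n = len(seqs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of sequences")
    dist = [[hamming(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

    medoids = [0]
    while len(medoids) < k:
        # Next medoid: the sequence farthest from its nearest current medoid.
        nxt = max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        medoids.append(nxt)

    labels = []
    for _ in range(max_iter):
        # Assignment step: each sequence joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # Update step: the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Toy data (hypothetical): two well-separated groups of three sequences.
toy = ["AAAA", "AAAT", "AAAA", "GGGG", "GGGC", "GGGG"]
toy_medoids, toy_labels = k_medoids(toy, k=2)
```

The resulting labels are only tentative lineage designations; as the text notes, they are meant to be reviewed and edited by hand.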

Results

After installation and populating the required directories with the appropriate files for each project, pointing a web browser to the Phylobook web server will display a list of available projects (Figure 2A), and clicking on a project will reveal a view similar to that shown in Figure 2B. In this view, all the trees within a given project are displayed in a minimal thumbnail view, each within a rectangular box. A scale bar is present at the top of the Phylobook project and, if a color range is selected and applied using the Update button, atop each Phylobook entry. This allows user-defined scaling that places a small box to the left of each taxon name, colored according to the number of collapsed sequences represented by that taxon name.

Figure 2. Initial Phylobook displays.

(A) The landing page. This contains a list of projects organized by categories defined in the Admin tool. (B) The main project page within Phylobook. Each tree and highlighter plot is shown in a thumbnail view within the rectangular boxes. Scrolling up and down reveals all the trees within a project. Clicking on the ‘Full’ button expands the view for that tree (see Figure 3).

Clicking on the ‘Full’ button expands the tree for a given sample (Figure 3A). Pre-processing of the image files aligns the highlighter plot to the individual sequences within the phylogenetic tree. Also present at the top of each Phylobook tree in minimum view is a button labeled ‘show annotation tools’ (upper right corner of boxes in Figure 2B). This displays tools that can be used to annotate the tree; the tools are shown in Figure 3B. The annotation tools displayed are context sensitive, as not every type of annotation is available for a given dataset. For example, the ‘Annotate sequence boxes by cluster’ tool only appears when clustering data is available.

Figure 3. Expanded Phylobook display.

(A) Each Phylobook entry is composed of a phylogenetic tree on the left and a highlighter plot on the right. The colored boxes to the left of sequence names provide a visual cue to the number of sequences collapsed into the sequence shown. The minimum and maximum cutoffs and color range are set in the text boxes and slider, respectively. The exact number is found as a suffix on each sequence name. This can be set for all Phylobook entries within the Project by selecting ‘Update All’ or ‘Remove All’. Different settings can be set for a given entry by selecting ‘Show color range’ within individual entry windows, or for the entire project by selecting ‘Save All’. Each entry can be viewed as a thumbnail (Min) or large image (Full), and the magnification further refined by clicking the (+) and (-) magnification icons. (B) A variety of annotation tools are available by clicking the ‘Hide/Show annotation tools’ button. In this figure, annotations have been added to highlight the relative abundance of collapsed sequences in the tree. Colored squares to the left of each sequence name indicate the count of duplicates.

Sequence abundance annotation

When identical sequences are collapsed prior to tree building, the pre-processing pipeline appends an underscore followed by a number to the sequence name; the number is the count of identical sequences that have been collapsed into the representative sequence. A slider labeled ‘Mark sequences by count of duplicates’ can be adjusted to produce square symbols in front of each sequence name that are color coded by the sequence count. This allows the user to rapidly identify which sequences are most abundant in the tree (Figure 3B).
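A minimal Python sketch of reading the count suffix and mapping it onto a color scale follows. It is illustrative only: the function names, the default count of 1 for names without a suffix, the linear binning between the two cutoffs, and the example sequence names are assumptions, not Phylobook's actual implementation.

```python
import re

# Trailing '_<digits>' as appended by the collapsing step.
_COUNT_SUFFIX = re.compile(r"_(\d+)$")


def duplicate_count(name):
    """Return the copy count encoded as a trailing '_<number>', or 1 if absent.

    Caveat: a name that ends in some other numeric field would be misread.
    """
    m = _COUNT_SUFFIX.search(name)
    return int(m.group(1)) if m else 1


def abundance_bin(count, minimum, maximum, n_colors):
    """Map a count onto a color index 0..n_colors-1 between two cutoffs.

    Counts below `minimum` return None (no square drawn); counts above
    `maximum` are clamped to the last color. Linear binning is an assumption.
    """
    if count < minimum:
        return None
    if maximum <= minimum:
        return n_colors - 1
    clamped = min(count, maximum)
    frac = (clamped - minimum) / (maximum - minimum)
    return min(int(frac * n_colors), n_colors - 1)


# Hypothetical sequence names, with and without a count suffix.
count_a = duplicate_count("V703_0123_env_17")
count_b = duplicate_count("reference")
```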

Lineage identification & editing

During data preprocessing, one or more clustering algorithms can be used to semi-automate the lineage assignment process. When clustering data is available, an optional ‘Cluster’ pull-down menu appears with the annotation tools. This allows the user to select a clustering method, and the clusters are depicted with colored triangles to the right of each taxon name. Selecting the apply button then draws a colored box (one color for each cluster ID) around the associated taxon names, as shown in Figure 4. These constitute initial lineage assignments. Once initial lineage assignments exist, a ‘match plot’ appears to the right of the highlighter plot. A match plot shows matches between sequences in the tree and reference sequences, a concept that we adapted from the LANL web-based Highlighter tool (https://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/highlighter_top.html). In this case, we automatically set the reference sequences to the consensus sequences of the assigned lineages and color matches by the color of the assigned lineage (up to the first 5 lineages). Matches to more than one lineage are colored gray, and an option (‘Show/Hide Multiple Matches’) allows one to turn off the display of sequence matches to more than one lineage. The match plot allows one to visually identify recombinants and reassign them to a new lineage grouping that represents recombinants. Lineage assignments can be further edited manually as needed. Lineages can also be selected manually by right-clicking on a sequence name and dragging across the full set of names to include within a lineage. When the mouse button is released, a dialog box is displayed in which a colored box can be selected to indicate the lineage. As lineage assignments are updated and saved, the match plot automatically regenerates (with a slight time delay).
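The logic behind a match plot can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Phylobook's code: the consensus is taken as the column-wise majority base (ties resolved by first occurrence), the label 'multiple' stands in for gray, and the function names and toy alignment are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Column-wise majority base of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def match_rows(alignment, lineage_of):
    """Label every column of every sequence by the lineage consensus it matches.

    alignment:  {name: aligned sequence}
    lineage_of: {name: lineage label} for sequences already assigned
    Returns {name: [label, ...]}; each label is a lineage, 'multiple' when
    more than one consensus matches, or None when none does.
    """
    groups = {}
    for name, lineage in lineage_of.items():
        groups.setdefault(lineage, []).append(alignment[name])
    cons = {lineage: consensus(seqs) for lineage, seqs in groups.items()}

    rows = {}
    for name, seq in alignment.items():
        row = []
        for i, base in enumerate(seq):
            hits = [lineage for lineage, c in cons.items() if c[i] == base]
            row.append(hits[0] if len(hits) == 1 else ("multiple" if hits else None))
        rows[name] = row
    return rows


def looks_recombinant(row):
    """True when a row matches two or more different lineages uniquely."""
    return len({label for label in row if label not in (None, "multiple")}) > 1


# Toy data (hypothetical): two lineages and one unassigned mosaic sequence.
aln = {"r1": "GAAAAA", "r2": "GAAAAA", "b1": "GCCCCC", "b2": "GCCCCC", "x1": "GAAACC"}
assigned = {"r1": "red", "r2": "red", "b1": "blue", "b2": "blue"}
rows = match_rows(aln, assigned)
```

In this toy case `x1` matches the red consensus in its first half and the blue consensus in its second half, which is the pattern the text describes for recombinants.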

Figure 4. Lineage assignment tools and the match plot.

(A) Shows the result of using k-medoids clustering to assign tentative lineage designations to clades within a sample HIV dataset. Colored boxes around each sequence name indicate lineage membership. In this example, there are four clear lineages indicated by the majority of sequences in boxes colored red, blue, black and orange. In addition, sequences in green and some of the sequences in red, blue, and black are likely recombinants. Note that once lineage assignments have been made, a ‘match plot’ appears to the right of the highlighter plot. This shows how each sequence matches the consensus sequence of each of the assigned lineages. Matches to just one lineage are shown in the color of the lineage; matches to multiple lineages are shown as gray. Multiple matches can optionally be hidden/displayed with the ‘Show/Hide Multiple Matches’ button. (B) Shows the result of manually editing the initial lineage assignments in A to assign all the recombinants to the yellow ‘lineage’. After editing, the match plot automatically updates, and we have turned off the display of multi-matches for clarity. The match plot shows clear recombinants as sequences with matches to more than one assigned lineage in different regions of the sequence.

Use of the match plot is demonstrated in Figure 4A & B. In Figure 4A, there appear to be four clear lineages represented by the majority of the sequences in the red, blue, black and orange boxes. In addition, there appear to be some obvious recombinants (between the blue and black lineages and in a few other locations). Figure 4B shows the same dataset with the recombinants now marked as yellow (and thus removed from the red, blue, black or orange lineages). For clarity, the annotation tools are hidden in Figure 4B, as are multiple matches in the match plot. Recombinants are easily visible in the match plot as sequences that show matches to more than one lineage in different portions of the sequence.

Once lineage assignments are made, sequences associated with each lineage can be assigned a lineage name by clicking the ‘Assign Lineage Names/Extract to File’ button. This brings up the dialog shown in Figure 5A, which allows one to assign names; when that is completed, the dialog in Figure 5B appears, allowing one to download all lineages in a collection of appropriately named FASTA files. In our example, lineages are simply named Lineage 1 to Lineage 13, but with modest changes in the code these names can be adjusted to an internal standard for a given lab or user. For example, we always use yellow as the designation for recombinants and name the lineage as such. If one has a project with a large collection of trees, the download of sequences can be deferred until all trees are processed through to named lineages, and the FASTA files associated with the named lineages for the entire project can then be downloaded in one zip file.
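The export step can be pictured as writing one FASTA file per named lineage into a zip archive. The Python sketch below uses only the standard library; the function name, the `<tree>/<tree>_<lineage>.fasta` layout and the example names are assumptions for illustration and may differ from the files Phylobook actually produces.

```python
import io
import zipfile


def lineages_to_zip(trees):
    """Build an in-memory zip: one folder per tree, one FASTA file per lineage.

    trees: {tree_name: {lineage_name: [(seq_name, sequence), ...]}}
    The file naming scheme is an illustrative assumption.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for tree, lineages in trees.items():
            for lineage, records in lineages.items():
                fasta = "".join(f">{name}\n{seq}\n" for name, seq in records)
                safe = lineage.replace(" ", "_")  # avoid spaces in file names
                zf.writestr(f"{tree}/{tree}_{safe}.fasta", fasta)
    return buf.getvalue()


# Hypothetical project with one tree and two named lineages.
archive = lineages_to_zip(
    {
        "subject01": {
            "Lineage 1": [("s1_3", "ACGT")],
            "Recombinants": [("s9_1", "ACGA")],
        }
    }
)
```

Building the archive in memory keeps the sketch self-contained; a web application would typically stream the same bytes back as a download.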

Figure 5. Dialogs for assigning lineage names and extracting sequences.

(A) The left panel shows the assignment of lineage names to given lineages. With modest editing of the software, lineage names can be adapted to specific needs of the user. (B) After lineage names have been assigned, sequences associated with each lineage can be extracted in FASTA format. FASTA files are exported in a zip file with enclosed folders for each tree and names for each lineage/project.


Annotation based on fields in the sequence names

In some instances, it is useful to annotate the tree based on some other property (time of collection, tissue source, etc.). Phylobook provides a system to set the color of the sequence name in the tree based on properties encoded within the sequence names. In brief, sequence names can contain additional information in fields delimited with underscores. Phylobook parses the sequence names within a given tree to identify such fields and the number of distinct values within each field. When such fields exist in the sequence names, the tool ‘Color sequence names by field’ appears when annotation tools are shown. The user can then select any field for which there are 10 or fewer values and color the sequences a different color for each value. For example, in the tree shown in Figure 6, the dates of collection of the samples are encoded in the center of each sequence name in the format _YYYY_MM_, where YYYY is the year and MM is the month. In Figure 6, the sequences are colored black for samples collected in 2020, blue for samples collected in 2021 and orange for samples collected in 2022.
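The field-detection step can be sketched as follows. This is an illustrative Python example rather than Phylobook's parser: the function names, the rule that a field needs at least two distinct values to be worth coloring, and the example sequence names are assumptions; only the underscore delimiter and the limit of 10 values come from the text.

```python
def colorable_fields(names, max_values=10, sep="_"):
    """Return {field_index: sorted distinct values} for fields usable for coloring.

    A field qualifies when it has between 2 and `max_values` distinct values
    (the lower bound is an assumption). Only positions present in every
    name are considered.
    """
    split = [n.split(sep) for n in names]
    width = min(len(parts) for parts in split)
    fields = {}
    for i in range(width):
        values = sorted({parts[i] for parts in split})
        if 1 < len(values) <= max_values:
            fields[i] = values
    return fields


def color_by_field(names, field, palette, sep="_"):
    """Assign one palette color per distinct value of the chosen field."""
    values = sorted({n.split(sep)[field] for n in names})
    if len(palette) < len(values):
        raise ValueError("palette has fewer colors than field values")
    color_of = dict(zip(values, palette))
    return {n: color_of[n.split(sep)[field]] for n in names}


# Hypothetical names in the form <id>_<YYYY>_<MM>_<count>.
seq_names = ["WA1_2020_03_5", "WA2_2021_07_1", "WA3_2022_01_2", "WA4_2021_11_1"]
fields = colorable_fields(seq_names)
colors = color_by_field(seq_names, 1, ["black", "blue", "orange"])
```

With the year field (index 1) selected, the three years map to black, blue and orange in sorted order, mirroring the coloring described for the SARS-CoV2 example.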

Figure 6. Annotation of the tree by features encoded within the sequence name.

The tree shown contains spike protein sequences from SARS-CoV2 samples collected in the state of Washington over time. The dates of collection of the samples are encoded in the center of each sequence name in the format _YYYY_MM_ where YYYY is the year and MM is the month. In this example, sequence names are color coded by the year. Sequences were downloaded from the NCBI SARS-CoV-2 Data Hub at https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049. The dataset used to generate this figure is available at the GitHub repository https://github.com/MullinsLab/phylobook/tree/master/sample_data


Comments & other annotations

For each dataset, there is a field in the upper left that allows a user to add comments to the Phylobook entry (see comment field in ). In addition, a colored palette of dots can be selected by first right-clicking on any blank area of the tree image to display the dialog box and a selected dot can then be added anywhere on the tree image. These are used to highlight a clade or sequence of particular interest and are particularly useful for labeling specific branch points in the tree.

Saving annotations

When any information is added to the entry, a red border is activated to highlight the Phylobook entry, along with a ‘Save’ button. A ‘Save All’ button is present at the top of the Phylobook project to save all edits in the session simultaneously. Notes and changes made by individual users track with the associated username and time of edit/comment. Thus, multiple users can review the same dataset and work on lineage assignments collaboratively.

Use case 1: identification of lineages in an HIV infected subject

Analyses of HIV infections are often complicated by the frequent observation of multiple lineages, non-functional hypermutated sequences, and the presence of sequences that appear to be recombinants between lineages.

Figure 4 shows a sample dataset. The data shown correspond to 3 kb sequences derived from a single person who was sampled early in their HIV infection. K-medoid clustering with a k of 5 was used for the initial lineage assignments. On visual inspection, there appear to be four lineages: one represented by the sequences in the upper part of the tree (red boxes) and three others (blue, black and orange) in the lower part of the tree (Figure 4A). In addition, there appear to be multiple sequences in the middle of the tree that are recombinants formed between the lineages (green boxes and some mislabeled as red, blue or black). As discussed above, Phylobook allows the user to identify, label and export these groups for subsequent analysis.

Use case 2: display & identification of lineages in SARS-CoV2 sequence data from Washington state

SARS-CoV2 sequences obtained in the state of Washington between March 2020 and July 2022 are shown in Figure 7. Only complete genome sequences with no more than one ambiguous (N) residue were retained. This produced a set of 15,811 sequences. The data were then ordered by collection date and sub-sampled to obtain the first ten sequences in each month; that is, the first ten sequences in each of the months 3/2020, 4/2020...7/2022 were selected from the downloaded data. In total, 290 full-length sequences were collected.
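The sub-sampling rule and the resulting total can be checked with a small Python sketch. March 2020 through July 2022 spans 29 calendar months, and 29 × 10 = 290, which agrees with the number reported above. The function names and toy records are assumptions for illustration; the authors' actual sub-sampling script is not shown in the text.

```python
from collections import defaultdict
from datetime import date


def first_n_per_month(records, n=10):
    """Keep the first `n` records of each calendar month, by collection date.

    records: iterable of (name, collection_date) with datetime.date values.
    Returns {(year, month): [names...]}; sorting is stable, so records with
    the same date keep their input order.
    """
    kept = defaultdict(list)
    for name, day in sorted(records, key=lambda r: r[1]):
        bucket = kept[(day.year, day.month)]
        if len(bucket) < n:
            bucket.append(name)
    return dict(kept)


def months_inclusive(start, end):
    """Number of calendar months from `start` to `end`, counting both ends."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


n_months = months_inclusive(date(2020, 3, 1), date(2022, 7, 1))
expected_total = n_months * 10

# Toy records (hypothetical) to show the selection rule with n=2.
toy_records = [
    ("a", date(2020, 3, 5)),
    ("b", date(2020, 3, 1)),
    ("c", date(2020, 3, 9)),
    ("d", date(2020, 4, 2)),
]
toy_kept = first_n_per_month(toy_records, n=2)
```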

All sequences were mapped to the original Wuhan reference sequence and the spike protein encoding region was extracted. For each sequence, the name was modified to include the month from which it was extracted (e.g., sequence_name_yyyy_mm). Nucleotide sequences were collapsed to retain one sequence from each identical set of sequences and the date was replaced with the median date. Nucleotide sequences were then translated to amino acid sequences and run through the Phylobook pipeline. After collapsing, there were 140 unique sequences from WA state plus the reference Wuhan sequence (NC_045512). This results in a dataset small enough to be displayed on one printed page within Phylobook that still reasonably samples the SARS-CoV2 sequences in WA state during that timeframe.

The emergence of different sequence variants over time is clear in the dataset (Figure 7). For example, the clade in the upper portion of the tree (red boxes) represents the earliest circulating variants of SARS-CoV2 (March 2020 and later), which are most similar to the Wuhan reference; the reference is included in this clade, adjacent to the red dot placed with Phylobook. The clade whose sequence names are enclosed by dark blue boxes corresponds to the delta variant, which peaked in mid-2021 in Washington state. Alpha (orange boxes), gamma (light blue boxes) and omicron (lime green boxes) variants are shown with the corresponding date ranges of the sequences.

Figure 7. Display and analysis of SARS-CoV2 data in Phylobook.

Shown are spike protein sequences from SARS-CoV2 samples obtained over time in Washington state plus the reference Wuhan sequence (adjacent to the red dot to left of red boxed sequences). Clades were manually selected based on the tree and highlighter plot. Additional annotations (added in Powerpoint) show approximate WHO strain designations to the right of the tree.


In addition to the collaborative review of lineage assignments, Phylobook provides a convenient way to share sequence data and phylogenetic trees among multiple investigators. Access to each dataset is controlled through privileges assigned at the user and group level, so that specific users and groups can view and/or edit specific datasets. Support for single sign-on (SSO) login allows institutional usernames and passwords to be used when multiple users belong to the same organization, while server-managed usernames and passwords allow access by users outside the server host organization.
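
The idea of combining user-level and group-level privileges on a dataset can be illustrated with a small, self-contained sketch. This is not Phylobook's implementation: the User and Dataset classes, the privilege names "view" and "edit", and the rule that a user's effective privileges are the union of direct and group grants are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)


@dataclass
class Dataset:
    name: str
    # Each mapping goes from a user or group name to a set of privileges.
    user_privs: dict = field(default_factory=dict)
    group_privs: dict = field(default_factory=dict)


def privileges(user, dataset):
    """Union of privileges granted directly and via group membership."""
    granted = set(dataset.user_privs.get(user.name, set()))
    for group in user.groups:
        granted |= dataset.group_privs.get(group, set())
    return granted


def can(user, dataset, action):
    """True if `user` holds the privilege `action` on `dataset`."""
    return action in privileges(user, dataset)


if __name__ == "__main__":
    ds = Dataset(
        "example_dataset",
        user_privs={"alice": {"view", "edit"}},
        group_privs={"lab": {"view"}},
    )
    bob = User("bob", {"lab"})
    print(can(bob, ds, "view"), can(bob, ds, "edit"))
```

In this sketch a user with no direct or group grant on a dataset has no access to it, which mirrors the per-dataset control described above.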

Summary & discussion

Phylobook is a tool for the display of molecular phylogenies and linked highlighter plots, with facilities for annotating clades/lineages and extracting sequence data. Its utility was illustrated with two sample datasets, each from collections made over time: a nucleotide phylogeny from HIV data and an amino acid phylogeny from SARS-CoV2 spike protein data. The number of phylogenetic trees and linked highlighter plots that Phylobook can store is limited only by disk storage space on the server, and multiple collaborating users can edit and annotate lineages and phylogenetic trees.

Semi-automated lineage assignment followed by manual editing and collaborative review allows for robust lineage assignments. The ability to extract subsets of sequences allows downstream analyses to be performed separately on different lineages. We use Phylobook extensively to examine the effect of vaccines and treatment protocols on HIV infection.

Software & dataset availability

The Phylobook server software and the associated Phylobook pipeline are freely available to non-commercial users under an open-source license via GitHub at https://github.com/MullinsLab/phylobook and https://github.com/MullinsLab/phylobook_pipeline. Both can be installed as Docker containers, which provide cross-platform compatibility and remove much of the complexity of managing code dependencies and their versions.

Conclusion

We have presented ‘Phylobook’, a web-based tool for the display and sharing of phylogenetic trees, semi-automated lineage assignment and manual editing of the lineage assignments. Displaying the tree adjacent to a highlighter plot showing the positions of mismatches between individual sequences and the most common sequence allows users to more rapidly identify and correctly edit lineage designations. An additional plot showing mismatches between individual sequences and assigned lineages allows users to more easily identify likely recombinants and further refine the lineage designations. The complete system thus allows highly accurate lineage designations.

As a web tool with the ability to set permissions for which users can access given data sets, Phylobook allows for secure sharing of data. A comment field within each dataset provides a mechanism for users to collaborate and comment on the data and lineage assignments. The ability to export all the data sorted by lineages allows for downstream analysis on a lineage specific basis. We have found this feature invaluable in our work to identify selective pressure of vaccines or monoclonal antibodies on HIV acquisition and evolution post infection.

Future perspective

Phylobook is a web-based tool that enables sharing, visualization, annotation and curated lineage identification in multiple sequence alignments. We demonstrated its applicability to two use cases: tracking HIV viral evolution in infected subjects and surveying SARS-CoV2 sequences in the state of Washington. However, the software is broadly applicable beyond these two examples.

Over the past several years, the use of sequencing as a tool for global surveillance of pathogens in wastewater [13], animal populations [14] and the local environment [15] has exploded as a means to monitor, detect and potentially prevent future epidemics. In addition, the development of methods to better analyze the petabytes of sequence data currently in public databases has resulted in the recent discovery of >100,000 previously unknown RNA viruses [16]. These newly discovered viruses, combined with an ever-increasing dataset of publicly available sequences, provide a rich resource for future research to characterize the distribution and diversity of viruses as well as genes and gene segments from other entities on the planet. We anticipate that a great deal of future research will be devoted both to tracking known infectious agents and to understanding the rapidly expanding repertoire of novel agents. Phylobook empowers research in such areas by allowing researchers to easily share, visualize and annotate multiple sequence alignments and extract lineage information, and thus enables tracking of agent lineages within populations and across different sites.

Author contributions

JC Furlong and PD Darley are co-first authors. Phylobook programming was initiated by JC Furlong and greatly expanded upon by PD Darley. W Deng: programming of the pipeline. JC Furlong, PD Darley and W Deng all participated in the manuscript drafts and edits. JI Mullins: conception, feature selection and project review, manuscript edits. RE Bumgarner: programming, feature selection, project lead, manuscript preparation and edits. All authors reviewed and agreed with the final version of the manuscript.

Financial disclosure

This work was supported by grants from the US Public Health Service to the HIV Vaccine Trials Network (UM1 AI06818) and the Retrovirology and Molecular Data Science Core of the University of Washington/Fred Hutch Centers for AIDS Research (P30 AI027757). The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Writing disclosure

No writing assistance was utilized in the production of this manuscript.

Data sharing statement

Software and user manual have been deposited at https://github.com/MullinsLab/phylobook.

Competing interests disclosure

The authors have no competing interests or relevant affiliations with any organization or entity with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, stock ownership or options and expert testimony.


References

  • Gilbert P, Self S, Rao M, Naficy A, Clemens J. Sieve analysis: methods for assessing from vaccine trial data how vaccine efficacy varies with genotypic and phenotypic pathogen variation. J. Clin. Epidemiol. 54, 68–85 (2001).
  • Gilbert PB. Interpretability and robustness of sieve analysis models for assessing HIV strain variations in vaccine efficacy. Stat. Med. 20, 263–279 (2001).
  • Rolland M, Tovanabutra S, de Camp AC et al. Genetic impact of vaccination on breakthrough HIV-1 sequences from the STEP trial. Nat. Med. 17, 366–371 (2011).
  • Janes H, Frahm N, de Camp A et al. MRKAd5 HIV-1 Gag/Pol/Nef vaccine-induced T-cell responses inadequately predict distance of breakthrough HIV-1 sequences to the vaccine or viral load. PLOS ONE 7, e43396 (2012).
  • Rolland M, Edlefsen PT, Larsen BB et al. Increased HIV-1 vaccine efficacy against viruses with genetic signatures in Env V2. Nature 490, 417–420 (2012).
  • Dommaraju K, Kijak G, Carlson JM et al. CD8 and CD4 epitope predictions in RV144: no strong evidence of a T-cell driven sieve effect in HIV-1 breakthrough sequences from trial participants. PLOS ONE 9, e111334 (2014).
  • Edlefsen PT, Rolland M, Hertz T et al. Comprehensive sieve analysis of breakthrough HIV-1 sequences in the RV144 vaccine efficacy trial. PLOS Comput. Biol. 11, e1003973 (2015).
  • Janes H, Herbeck JT, Tovanabutra S et al. HIV-1 infections with multiple founders are associated with higher viral loads than infections with single founders. Nat. Med. 21, 1139–1141 (2015).
  • Hertz T, Logan MG, Rolland M et al. A study of vaccine-induced immune pressure on breakthrough infections in the Phambili Phase IIb HIV-1 vaccine efficacy trial. Vaccine 34, 5792–5801 (2016).
  • de Camp A, Rolland M, Edlefsen PT et al. Sieve analysis of breakthrough HIV-1 sequences in HVTN 505 identifies vaccine pressure targeting the CD4 binding site of Env-gp120. PLOS ONE 12, e0185959 (2017).
  • Lewitus E, Sanders-Buell E, Bose M et al. RV144 vaccine imprinting constrained HIV-1 evolution following breakthrough infection. Virus Evol. 7, veab057 (2021).
  • Keele BF, Giorgi EE, Salazar-Gonzalez JF et al. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc. Natl Acad. Sci. USA 105, 7552–7557 (2008).