MBE Advance Access originally published online on January 29, 2008
Molecular Biology and Evolution 2008 25(4):762-777; doi:10.1093/molbev/msn023
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Reticulate Representation of Evolutionary and Functional Relationships between Phage Genomes
Service de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles, Bruxelles, Belgium
E-mail: gipsi{at}scmbb.ulb.ac.be.
| Abstract |
|---|
|
|
|---|
Bacteriophage genomes show pervasive mosaicism, indicating the importance of horizontal gene exchange in their evolution. Phage genomes represent unique combinations of modules, each of them with a different phylogenetic history. The traditional classification, based on a variety of criteria such as nucleic acid type (single/double-stranded DNA/RNA), morphology, and host range, appeared inconsistent with sequence analyses. With the genomic era, an ever increasing number of sequenced phages cannot be classified, in part due to a lack of morphological information and in part to the intrinsic incapability of tree-based methods to efficiently deal with mosaicism. This problem led some virologists to call for a moratorium on the creation of additional taxa in the order Caudovirales, in order to let virologists discuss classification schemes that might better suit phage evolution. In this context, we propose a framework for a reticulate classification of phages based on gene content. Starting from gene families, we built a weighted graph, where nodes represent phages and edges represent phage–phage similarities in terms of shared genes. We then apply various measures of graph topology to analyze the resulting graph. Most double-stranded DNA phages are found in a single component. The values of the clustering coefficient and closeness distinguish temperate from virulent phages, whereas chimeric phages are characterized by a high betweenness coefficient. We apply a 2-step clustering method to this graph to generate a reticulate classification of phages: Each phage is associated with a membership vector, which quantitatively characterizes its membership to the set of clusters. Furthermore, we cluster genes based on their "phylogenetic profiles" to define "evolutionary cohesive modules." In virulent phages, evolutionary modules span several functional categories, whereas in temperate phages they correspond better to functional modules. Moreover, despite the fact that modules only cover a fraction of all phage genes, phage groups can be distinguished by their different combination of modules, serving the bases for a higher level reticulate classification. These 2 classification schemes provide an automatic and dynamic way of representing the relationships within the phage population and can be extended to include newly sequenced phage genomes, as well as other types of genetic elements.
Key Words: bacteriophage classification phage evolution network graph functional module aclame
| Introduction |
|---|
|
|
|---|
Bacteriophages (phages) are the most abundant biological entities in the biosphere (Chibani-Chennoufi, Bruttin, et al. 2004
Early pairwise genome comparisons by DNA–DNA heteroduplex analysis revealed that some dsDNA phages are genetic mosaics, where regions with sequence similarity alternate with unrelated regions in a patchwise fashion (Westmoreland et al. 1969
). This observation was further substantiated once phage genome nucleotide sequences became available for comparative analysis. These showed that similar morphology does not imply genetic similarity or vice versa, raising serious debate on the validity of the ICTV taxonomy (Hendrix et al. 2000
; Brussow and Hendrix 2002
; Lawrence et al. 2002
; Nelson 2004
). Hendrix et al. (1999)
proposed that all dsDNA phages access a common gene pool, although horizontal genetic exchange is more intense among phages with a similar host range. As the sequencing of phage genomes progressed, mosaicism appeared to be so pervasive that it became impossible to reach consensus within the Prokaryote Virus Subcommittee of the ICTV as to whether the traditional hierarchical classification should be replaced with a new system, and if so, what form that system may take (Mayo and Ball 2006
).
Several proposals for virus classification have been made based on sequence information. Because phages lack a common locus, Rohwer and Edwards (2002)
discarded classical phylogenetic analysis and built a tree based on pairwise dissimilarities between the complete bacteriophage proteomes. A second system classified phages based on the structural head module (Proux et al. 2002
). Pride et al. (2006)
proposed a phylogeny based on tetranucleotide usage patterns. However, being hierarchical, none of those approaches account for the combinatorial structure of the phage population, a hallmark of phage genome structure.
Lawrence et al. (2002)
acknowledged that reticulate relationships cannot be modeled by classical taxonomies. They proposed a classification where, at a given taxonomic level, phages could belong to several groups defined as "modi." Phages within a modus would share a particular module or phenotypic character (e.g., the tail). One phage could belong to several modi (e.g., 1 for the head genes, 1 for the tail genes, and 1 for replication genes, etc.), reflecting their mosaic nature. This system of reticulate classification was illustrated with examples, but was not, and has not yet been implemented.
The new methodological framework presented here aims at filling this gap. We propose to represent relationships across the phage population as a weighted graph where nodes represent phages and edges represent phage–phage similarities in terms of gene (protein) content. We studied the structure of the phage population using graph theory tools and defined overlapping groups of phages by clustering the graph. Genes with similar phylogenetic profiles grouped into putative functional modules, some of which correlate with the phage clusters. This approach provides an accurate description of the global phage population and is useful for exploring particular regions of the phage sequence landscape.
| Methods |
|---|
|
|
|---|
Genome Sequences
In this study, we used the set of G = 306 bacteriophage genomes downloaded from National Center for Biotechnology Information (NCBI) Web site in February 2006 (http://www.ncbi.nlm.nih.gov/genomes/static/phg.html).
This set consists of 250 dsDNA, 36 single-stranded (ss) DNA, 12 dsRNA, and 8 ssRNA phages. We had previously manually classified the dsDNA phages of this data set according to their temperate or virulent lifestyle (list available at http://aclame.ulb.ac.be/Classification/Phages/life_style.html) (Lima-Mendez et al. 2007
). From 250 dsDNA phages, 72 could be classified as virulent, 156 as temperate, and 22 could not be classified.
Composition of Phages in Protein Families
We used BlastP (Altschul et al. 1997
) to detect pairwise similarities between all the phage proteins. Protein families were derived from the whole set of pairwise similarities using Markov clustering (MCL) algorithm with the same parameters as in our previous work (Leplae et al. 2004
). This procedure allowed us to classify 19,537 phage proteins into n = 8,576 families.
The composition of the phages in protein families was represented as a 306 x 8,576 matrix (M), where rows represent bacteriophage genomes and columns protein families, and the element Mi,j equals 1 if phage i encodes at least 1 protein belonging to family j and 0 otherwise.
Network of Similarities between Phages
We built a graph representing the similarity relationships between phages derived from the number of shared protein families. Given 2 phages A and B, containing a and b protein families, respectively, we estimated the probability to observe at least c protein families in common using the hypergeometric formula:
|
| (1) |
In total, we performed all T = (306 x 305)/2 = 46,665 pairwise comparisons between the 306 genomes. To estimate the expected number of false positives, we calculated the E value by multiplying the P value by the total number of comparisons. This E value is in turn converted into a significance index significance (sig) value by taking the minus logarithm in base 10:
|
| (2) |
We represented the phage population as an undirected weighted graph where nodes represent phages and edges the similarities between them. We refer to this graph as S277.
Permutation Test
As a negative control, we produced 100 random matrices by permuting the elements within the rows of M and calculated the similarity between the shuffled genomes.
Network Topology
A "component" is a maximal connected subgraph (Diestel 1997
). Two nodes are in the same connected component if, and only if, a path exists between them. In a graphical view of a graph, different components are seen separated by empty space. For each node in the network, we calculated the "degree," the "clustering coefficient," the "closeness," and the "betweenness" as described below.
The degree of a node is defined as the number of links the node has (see [Barabasi and Oltvai 2004
] for summary of graph theory measures).
The clustering coefficient CCv of node v is the number of links between its neighbors divided by the number of links that could possibly exist between them. In a nondirected graph without self-loops, the number of possible links between N nodes is N(N–1)/2. Thus, for a node v having N neighbors (Watts and Strogatz 1998
):
![]() | (3) |
The betweenness of node v measures the frequency at which it is found in the shortest paths connecting pairs of other nodes. It was defined by Freeman (1977)
as:
![]() | (4) |
Nodes with high betweenness hold different parts of the network together. We calculated the betweenness for all nodes in the phage network (S277) to identify phages acting as bridges between unrelated phages.
The closeness of a node v is calculated as the inverse of the average of the distance to all other nodes (Sabidussi 1966
):
![]() | (5) |
Extracting Clusters of Phages from the Network
We applied the MCL algorithm (van Dongen 2000
) to partition the network into nonoverlapping clusters. The main parameter of MCL algorithm is called "inflation," which influences clusters granularity (number and size). We tested all inflation values between the 2 allowed extremes 1.2 and 5 by steps of 0.2 and analyzed each clustering result as described in the next section.
Evaluating Clustering Results
The clustering coefficient measures the "cliquishness" around a node; hence, its average over the nodes of a cluster can be used as a measure of the cluster homogeneity. We computed an "intracluster clustering coefficient" (ICCC) considering only the edges within clusters. In principle, a clustering result that maximizes the ICCC produces more homogeneous clusters. For each clustering result, we calculated the average ICCC over all nodes of the graph. We excluded clusters with less than 3 members and set to 0, the clustering coefficient for nodes with a single edge to avoid the trivial solution where the optimal cluster is the one with most nodes assigned to singleton or duet clusters. The optimal clustering of the graph was defined as the one with the highest ICCC. For comparison, we generated 1,000 random graphs according to the Erdos–Renyi (ER) model (Erdos and Renyi 1959
), randomly assigned weights to their edges (using the S277 weight distribution), and used the same procedure as for S277.
Assigning Phages to Multiple Clusters
The "membership" of a node g to cluster c was calculated as the proportion of edge weights (the sum of the weight of all its edges) linking it to nodes of cluster c:
![]() | (6) |
The matrix B = {Bg,c} contains the membership values for every node–cluster pair. We refer to the matrix B as the "membership matrix."
Comparing with the ICTV Classification
The phage taxonomy can be extracted from 2 resources, the ICTV database (ICTVdb) at http://www.ncbi.nlm.nih.gov/ICTVdb/, the portal of the official classification, and the NCBI taxonomy database at http://www.ncbi.nlm.nih.gov/Taxonomy/, which contains information for more phages. The NCBI taxonomy covers essentially all phages that have been sequenced at the cost of relying on the information provided by the authors submitting the sequences. On the other hand, experts are responsible for the information stored in the ICTVdb, which, as a result, has a limited coverage of sequenced phages (Fauquet and Fargette 2005
).
For the reasons outlined above, we chose to retrieve the taxonomy from both databases. Out of the 277 phages in S277, 85 were classified in the ICTVdb. From the NCBI taxonomy database, the phage family could be retrieved for 265 and the genus for 144 phages. For the purpose of comparison and following the procedure described above, we built 3 new graphs, S85, S265, and S144, keeping only the nodes representing the phages classified in ICTVdb and NCBI taxonomy at family and genus level, respectively. Each of the 3 graphs was processed as described for S277; thus, we could directly map the current taxonomical classes (taken from ICTVdb or NCBI taxonomy database) into the clusters obtained in this study.
For the comparison between 2 classifications, we calculated the correspondence matrix Q = {Qc,k} between our clusters (indexed by c) and the ICTV clusters (indexed by k) as the matrix product between the transposed membership matrix BT = {Bc,g} and the g-by-k matrix describing the ICTV cluster membership for phages g into the k clusters K = {Bg,k}:
|
| (7) |
We adapted the statistics defined by Brohee and van Helden (2006)
to account for nonbinary membership and calculated the "recall" Rc,k (known also as sensitivity or coverage) as the proportion of members of ICTV class k found in cluster c, relative to the total number of members of class k assigned to all clusters:
|
| (8) |
The cluster-wise recall Rk is the maximal proportion of members from ICTV class k found in the same cluster:
|
| (9) |
The clustering-wise recall is the weighted average of the cluster-wise recall over all clusters:
![]() | (10) |
Analogously, we defined "precision" Pc,k (known also as positive predictive value) as the proportion of members of cluster c belonging to the ICTV class k, relative to the total number of members of the cluster c assigned to any ICTV class:
![]() | (11) |
The "cluster-wise precision" Pc is the maximal proportion of members of cluster c found in the same ICTV class:
|
| (12) |
The clustering-wise precision is the weighted average of the cluster-wise precision over all clusters:
![]() | (13) |
Permutation Tests
As negative control, we permuted 1,000 times the row labels of the membership matrix, while keeping intact the membership matrix. We then followed the procedure described above (eqs. 7–13) to compare the permuted matrices with the ICTV classification.
Shared Gene Content
As supportive information in the comparison between the classifications, we calculated the overlap between the gene content of a pair of phages and related this information to the ICTV classes where the corresponding phages are assigned. The overlap was defined as the size of the intersection divided by the size of the union of the corresponding protein family sets, a measure known as Jaccard similarity (Gordon 1999
):
|
| (14) |
The median of the distribution of shared gene content between phages belonging to different classes provides a general measure of the relationships between distantly related phages.
Defining Evolutionarily Cohesive Modules
In order to identify evolutionarily related genes, we analyzed the phylogenetic profiles of all the gene families found in phages (Pellegrini et al. 1999
). For each of the 1,486 gene families represented in at least 3 bacteriophage genomes, we constructed a binary vector of presence/absence among the bacteriophage genomes and compared all pairs of such profiles using the hypergeometric formula (eq. 1). The exclusion of families with 2 members was because, statistically, proteins present in only 2 genomes are unlikely to be part of a module. The P values were transformed into significance index (eq. 2). Similar profiles were defined as those having a sig value above a defined threshold. The highest the sig, the more similar the profiles are; we used 3 threshold values, 10, 5, and 1. Genes found in similar profiles were joined into a network with the edge weight set to the sig value and clustered using the MCL algorithm with the inflation value set at 5, which is the highest of the range generally used (http://micans.org/mcl/man/mcl.html#options).
The association of the modules to the clusters of phages obtained with the MCL algorithm was measured using the hypergeometric formula (eq. 1), where the overlap c between a cluster and a module is the number of phages found in a cluster that harbor a given module, a is the number of phages in the cluster, and b is the number of phages having the module. A phage with 70% of module occurrence (measured as protein families represented) was considered to have the corresponding module.
Computation
Graph topology and statistical analyses were performed using Regulatory Sequence Analysis Tools (van Helden 2003
) (http://rsat.ulb.ac.be/rsat/), R statistical package (http://www.r-project.org/), and in-house perl scripts. Graph visualization was done using the Cytoscape software (http://www.cytoscape.org/) and its GenePro plugin (Vlasblom et al. 2006
).
| Results |
|---|
|
|
|---|
Snapshot of the Bacteriophage Sequence Landscape
To build the network, we downloaded 306 completely sequenced phage genomes available in February 2006 at NCBI (http://www.ncbi.nlm.nih.gov/genomes/static/phg.html). We classified the phage proteins into families (supplementary table 1S, Supplementary Material online) as described for the A CLAssification of Mobile genetic Elements database (Leplae et al. 2004
1, (E value
0.1). As a negative control, we performed the same comparison between pairs of rows from 100 randomized matrices (see Methods). This produced only 1 pair with a sig value above this threshold, confirming the stringency of our significance threshold.
The network (fig. 1) consists of 277 nodes (phages) and 3,277 edges representing the significant relationships between phages. Twenty-nine phages were not similar to any other and were not further considered. Figure 1 shows the network (called S277 after the number of nodes). A force-directed layout algorithm (Fruchterman and Reingold 1991
) places the nodes based on their connections, grouping the most highly connected regions of the graph. Inspection of the network reveals immediately the clustering structure of the global phage population according to the type of genetic material. Out of 17 network components (see Methods for definition), 9 consist of dsDNA phages, 5 of ssDNA phages, 2 of dsRNA phages, and 1 corresponds to the ssRNA phages.
|
The overlap between the gene pools of phages with different genome type is limited and hence not statistically significant. The distribution of dsDNA phages among the components is far from balanced. There are 8 small components containing from 2 to 9 phages. One hundred and 99 dsDNA phages are found in the largest component. Yet, within the largest component, some groups appear separated from the bulk. This mass of very interconnected phages observed in the largest component is mainly composed of temperate phages. Indeed, another feature in this network is that phylogenetically related phages are found in quasi-cliques; hence, no strict hubs are observed.
Graph Topological Measures Capture Relevant Properties of the Phage Population Structure
The clustering coefficient measures the extent to which the neighbors of a given node are interlinked. We used this coefficient as an indicator of cohesiveness around a phage's neighborhood. Virulent phages have higher clustering coefficients than temperate phages (median temperate = 0.72; median virulent = 1; P value < 2.683 x 10–07 Mann-Whitney U-test). To avoid a possible bias due to the overrepresentation of temperate phages in the network, we built 1,000 subnetworks composed of the 70 virulent phages and a random selection of 70 temperate phages. The clustering coefficient of virulent phages remains higher in those subnetworks (median P value over 1,000 comparisons = 4.364 x 10–05 Mann U-test). The differences between temperate and virulent clustering coefficient distributions may indicate that mosaicism is less crucial in the diversity of virulent phages than it is for temperate ones, as suggested from analysis of T4-like (Chibani-Chennoufi, Canchaya, et al. 2004
; Filee et al. 2006
) and dairy phages (Chopin et al. 2001
).
The closeness of a node is the inverse of its average distance to all other nodes in the graph (see Methods). The higher the closeness, the more central is the node. We compared the distribution of closeness for temperate and virulent phages in the largest component of the graph and found that temperate phages are more central than virulent ones (median temperate = 0.41; median virulent = 0.27; P value < 1.524 x 10–12 Mann U-test). In this component, there are 53 virulent and 135 temperate phages. We also computed the closeness of the nodes of the main component of 1,000 balanced subnetworks built from 53 randomly chosen temperate and the 53 virulent phages of S277 main component, and the results confirmed that the observed difference between temperate and virulent phages is not due only to the overrepresentation of temperate phages (median P value over 1,000 comparisons = 1.66 x 10–7 Mann U-test). Temperate phages form essentially a single group in the center of the network; no path is observed between all virulent phages bypassing temperate ones. Thus, the closeness reflects the tendency of temperate phages to act as module donors/recipients between the virulent phages and the rest of the population (Lawrence et al. 2002
).
The betweenness measures the extent to which a node is located in the shortest paths between all the other nodes. Nodes with high betweenness values are bridges between otherwise disconnected regions of the network. It is a centrality measure that gives an idea of which nodes are holding the network together. To identify phages combining genes from different phage groups, we calculated the betweenness for all nodes in the phage population graph and in 1,000 random graphs generated according to the ER model (Erdos and Renyi 1959
). In the ER graph, every pair of nodes are joined with equal probability; thus, the network formed is homogeneous in nature and suitable as a negative control for estimating the significance of phages with high betweenness values, a feature linked to the network community structure (Girvan and Newman 2002
). The P value was estimated as the frequency of node betweenness above a given threshold in the ER graphs. This P value represents the probability of having nodes with betweenness greater than the threshold by chance alone. The distribution of betweenness of the phage network is represented in figure 2 (inset), together with the distribution over the 1,000 ER graphs. The vertical line marks the betweenness value of 400, for which we obtained a P value of 6 x 10–4. The 28 phages with betweenness above this threshold are ranked in decreasing order of betweenness values in table 1 and appear mapped onto the network depicted in the figure. Some of the phages with high betweenness values have been reported as founding members of novel groups. This is the case for T5—rank 1—(Wang et al. 2005
), phiKMV—rank 2—(Ceyssens et al. 2006
), and LP65—rank 9—(Chibani-Chennoufi, Dilmann, et al. 2004
). Other phages have been described as genetic mosaics; ST64B—rank 4—(Mmolawa et al. 2003
), XP10—rank 6—(Yuzenkova et al. 2003
), P27—rank 11—(Recktenwald and Schmidt 2002
), HK022 and HK97—rank 12 and 14, respectively—(Juhala et al. 2000
), SfV—rank 24—(Allison et al. 2002
), etc. The common theme among these phages is that they are shortcuts between otherwise unrelated phages. ST64B, SfV, and P27 bridge Mu-like phages with lambda-like phages. T5 bridges T4-like phages with lambda-like phages. PhiKMV and XP10 bridge T7-like phages with the rest of the Siphoviridae.
|
|
The Network Breakdown
Having seen that the network topology disclosed the relationships underlying the phage population, we exploited the network representation to retrieve reticulate phage groups.
We first applied the MCL algorithm (van Dongen 2000
) to the graph. The main parameter of this algorithm is the inflation factor that modulates cluster granularity. Figure 3A shows the number of clusters as a function of the inflation value for the real network (black) and for 1,000 random networks (gray) generated according to the ER model (Erdos and Renyi 1959
) and randomly weighted with the weight distribution of the S277 network. To choose the optimal inflation factor, we explored values ranging from 1.2 to 5 by steps of 0.2 and estimated the clusters homogeneity by computing the average ICCC (see Methods). Figure 3B shows the evolution of the ICCC as the inflation value increases. The peak at inflation value of 2 suggests that this clustering solution is the best trade-off between homogeneity and cluster size. Hence, we selected 2 as the inflation value, which produced 48 clusters.
|
Once the graph was partitioned, we reassessed the membership of the phages to the different clusters. From the S277 network, we calculated the total weight of the connections of a given node to the nodes within a particular cluster. The membership of the node to that cluster is equal to this value divided by the sum of the weight of all the connections of the node. Each phage was associated with a vector describing its membership to each cluster (see Methods). The membership matrix is available as supplementary table 2S (Supplementary Material online). In the network in figure 4, each node is a pie chart where each wedge represents the membership of the node to one cluster.
|
Comparison of the Clusters with the ICTV Taxonomy
To compare the result of our approach with the traditional phage classification, we measured the overlap between our phage clusters and the phage classes described in the ICTV taxonomy (Fauquet et al. 2005
Classically, the comparison between 2 classifications is based on a contingency table, where each column represents a class of the first classification (e.g., ICTV taxonomy), each row a class of the second classification (e.g., our clustering result), and a cell indicates the number of elements at the intersection between the corresponding classes. In our reticulate classification though elements do not necessarily belong to a single class; instead, the relationship of each element to each class is estimated with a membership coefficient ranging from 0 to 1. Hence, we calculate the overlap between 1 ICTV class and 1 cluster as the sum of memberships of the phages of the cluster classified in that ICTV class. This overlap can thus not be strictly interpreted in terms of number of phages classified into the same cluster and ICTV class but as membership units.
Three statistics were used to evaluate the overlap between the classifications, recall, precision, and "accuracy (Acc)." The "class-wise recall" (R) represents the coverage of an ICTV class by its best-matching cluster; a value of 1 corresponds to the case where all phages of a given ICTV class are found in the same cluster. Reciprocally, the cluster-wise precision (P) measures how well a given cluster corresponds to its best-matching ICTV class; a value of 1 indicates that all phages in the cluster belong to the same ICTV class. From the class- and cluster-wise statistics, we computed the clustering-wise statistics as the weighted means over all classes/clusters of the class/cluster-wise values. The 2 measures are combined in the Acc, the arithmetic mean of R and P (see Methods for equations a detailed description of the matching statistics).
To estimate the random expectation, we performed 1,000 permutation tests by shuffling the labels of the rows of the membership matrix, that is, phage identifiers. This permutation test preserves the number of clusters and their respective sizes but regroups phages in a random way. These permuted clusters were then compared with the ICTV classes, and the same statistics (R, P, Acc) were computed as described above.
The results are summarized in table 2. The clustering-wise recall was approximately 0.4 for families and 0.7 for genera, both values being twice as high as those obtained for the negative control. Such high values were never encountered in any of the 1,000 permutation tests. The P value can thus be roughly estimated as smaller than 10–3, which indicates that the overlap between our reticulate classification and the official taxonomy is highly significant.
|
Clearly, our classification shows a better correspondence to genera than to families. This is not surprising because we mapped 13 families and 30 genera from the ICTV classification onto 48 clusters, the ICTV classes were inevitably split in our classification so that achieving a recall of 100% would have been impossible.
Table 2 also shows that the precision was approximately 0.92 for the genera and 0.95 for the families. This value indicates that on average 8% of significant evolutionary links join phages that belong to distinct ICTV genera and 5% to distinct ICTV families. The precision is about 2-fold higher for the data than for the negative controls (S144mean = 0.43 and S85meangenera = 0.46), indicating that the high precision value is not an artifact of the cluster size distribution.
The main conflicts between the morphology-based and the gene content–based classifications concern the tailed phages. For each pair of tailed phages, we can measure the shared gene fraction as the shared protein families divided by the families found in either of the 2 phages (see eq. 14, Methods). Despite the shared gene content is higher for phages belonging in the same ICTV class (median intrafamily = 0.026; median interfamily = 0.009; median intragenus = 0.106; and median intergenera = 0.008); there are phages within an ICTV class sharing less genes than pairs of phages from different classes, whether families or genera (data not shown; http://aclame.ulb.ac.be/Classification/Phages/).
Note that all the phages of the data set are not assigned to ICTV classes. We can speculate that a good fraction of the phages that remain unclassified are those for which a reticulate system is more needed; thus, the real benefit of the reticulate classification will be higher than what we can estimate by analyzing the subset of phages found in the ICTV classification.
The Reticulate Evolutionary Relationships Are Apparent in the Proposed Classification
Further inspection of the membership to different clusters brings more insight into the added value of our approach. The subset of phages distributed mainly between clusters 0 and 1 defines a continuous gradient between these 2 clusters (fig. 5A). At one extreme, 27 Staphylococcus aureus phages belong mainly to cluster 0. At the other extreme, phages Sfi19, Sfi21, and DT1 from Streptococcus thermophilus belong mainly to cluster 1. Other phages infecting this host (7201, Sfi11, and O1205) and the Lactobacillus and Streptococcus phages scatter between the 2 clusters.
|
The well-documented mosaicism of lambda-related phages is easily visible. They are found in clusters 4 and 9 with memberships plotted in figure 5B. The shiga toxin–encoding Siphoviridae Stx1, Stx2 I and II, 933W, and VT2-Sa have membershipcluster9 >0.8 and membershipcluster4 <0.2, forming a subgroup. Cluster 4 appears rich in Podoviridae phages such as Acyrthosiphon pisum endosymbiont bacteriophage APSE-1, Sf6, ST104T, ST64T, P22, and HK620 with membershipcluster4
0.8 and membershipcluster9 <0.2. Interestingly, phage lambda belongs almost equally to both clusters (membershipcluster9 = 0.38 and membershipcluster4 = 0.46). The chimera phages P27, SfV, and ST64B are classified with Siphoviridae in clusters 0 and 1 and with the Myoviridae phage Mu in cluster 32 (fig. 6A). Noteworthy, all 3 have a low membership to clusters 4 and 9, despite their reported lambda genetic structure (Allison et al. 2002
|
Phage N15, which bears features from lambda-related phages and from linear plasmids (Rybchin and Svarchevsky 1999
Evolutionary Cohesive Modules
The modular theory of bacteriophage evolution states that the product of phage evolution is a family of interchangeable genetic elements called modules carrying a biological function (Susskind and Botstein 1978
; Botstein 1980
). Thus, phage genomes are often dissected—not automatically—into modules corresponding to the functional categories: replication, lysis/lysogeny, DNA packaging, and head and tail morphogenesis, etc (Brussow and Desiere 2001
). Phylogenetic profiles have been used to predict functional links between genes on the assumption that genes involved in the same function and thus required together co-occur in genomes (Pellegrini et al. 1999
). We derived the phylogenetic profiles (see Methods) for each phage gene and identified modules as clusters of proteins with similar phylogenetic profiles (see Methods). We observed that the evolutionary modularity of phage is rather limited. Using strict parameters (sig
10), we discovered 26 modules covering 144 protein families containing 2–27 proteins. Decreasing the sig value threshold, we detected more modules; sig
5 produced 62 modules spanning 385 protein families and sig
1 allowed identifying 93 modules covering 754 protein families. Throughout this work, we use the term "module" to refer to those defined with sig threshold of 10. Modules defined at lower sig thresholds are appended with the corresponding sig values. All modules are available as supplementary table 3S (Supplementary Material online). Replication and recombination, head and tail structural and assembly functions are prominent components of the modules beside numerous protein families with unknown function.
Figure 7 displays the profile of the modules across the main phage clusters (the complete matrix of occurrences of modules in phages is represented in supplementary figure 1S (Supplementary Material online). Important differences exist between temperate and virulent phages. Virulent phages tend to have evolutionary modules that span over different functional categories. For example, module 0 contains head, tail, and DNA replication proteins of T4-like phages. Likewise, the T7 group of phage features module 3, which is composed of tail proteins, head protein, terminase, portal and RNA polymerase (see supplementary table 3S [Supplementary Material online] for the functional annotation).
|
Modules in temperate phages better corresponded to functional modules. Most temperate phages exhibit module 12, consisting in the integrase, transcriptional repressors (CI and Cro are grouped in the same protein family), a DNA-binding protein and the replication initiation protein. Allegedly lambdoid phages (clusters 4 and 9, fig. 7) have only one additional module: module 2. Module 2 is composed of the transcription antitermination protein N, proteins CII and CIII, which control lysogenization and several Nin and the Rz proteins, the function of which has not been elucidated yet.
In contrast to the lambdoid phages, the group of temperate phages infecting Gram(+) bacteria displays several modules in a combinatory fashion (clusters 0 and 1 in fig. 7). Interestingly, only phages from S. aureus have module 1. We do not have clues into the function of this module, but it displays a fragmentary profile across the phages, suggesting that it comprises more than 1 functional module. Because many pathogenicity determinants are phage encoded (Brussow et al. 2004
; Ogura et al. 2007
), it is important to further investigate the function of these proteins. This group (cluster 0/1) has 2 modules with DNA-packaging functions, modules 4 and 6, and 2 modules with head morphogenesis proteins, modules 7 and 11. Only the combinations module 4/module 7 and module 6/module 11 are observed. Noteworthy, within the subgroup of phages that have module 1 both combinations are found.
Modules 9 and 10 are present in phages of the Myoviridae family and correspond to proteins participating in head and tail morphogenesis, respectively (clusters 7 and 37, fig. 7). Module 9, composed of head proteins, is present in all P2-like phages except VHML. Module 10, composed of tail proteins, is strongly linked to but not exclusive of P2-like phages. That module is found in the Mu-like BcepMu but is absent from the P2-like phages HP1, HP2, and K139. BcepMu and VHML are clearly chimeric phages, featuring DNA packaging and capsid proteins related to Mu and lambda, respectively.
A Module-Based Network?
One interesting alternative in defining the phage network is to establish the links between phages based on their modules (fig. 8). We computed the "intersection network" from the edges in common between the 2 networks. The intersection network has 76% of the edges of the protein-based network (S277). At first glance, there are 2 important differences with the S277 network. First, T7-like phages are not connected to the main component in the module-based network. Second, in S277 T4 phages are connected to the bulk of the network through a path containing T5, Staphylococcus phage Twort, Listeria phage P100, Staphylococcus phage K, phage G1, and Lactobacillus plantarum phage LP65. In the module-based network, T4-like phages are in the path between those phages and the network bulk, thus inverting the local topology. The explanation for this inversion is straightforward. The small module of 2 genes (module 25) that links T4 phages to the bulk in the module-based network is not statistically significant when building the S277 network. Reciprocally, the genes shared by phages P100 and LP65 with phages in the network bulk do not form a module but their quantities are significant when building the S277 network.
|
| Discussion |
|---|
|
|
|---|
Pairwise comparisons between phage genomes have provided a high-resolution picture of phage relationships and revealed that mosaicism is the hallmark of phage genome structure (Hendrix 2003
Consistent with the consensus (Lawrence et al. 2002
; Fauquet et al. 2005
) that phages with different genome types—ssDNA, dsDNA, ssRNA, and dsRNA—are essentially different, they are found in separate components in the network. Graph theoretical measures capture relevant properties of the phage population and are suitable to analyze phages in their genetic neighborhood. The near-one clustering coefficient of virulent phages is consistent with the conservation of a core genome among virulent phages of the same group (Scholl et al. 2004
; Filee et al. 2006
). Virulent phages appear at the periphery of the phage sequence space centered on temperate phages. This reflects that module exchange occurs within the host, where temperate phages may reside for longer periods and function as "banks" for modules/genes exchange across the whole phage population (Lawrence et al. 2002
). Chimeric variants bridging temperate and virulent phages (Yuzenkova et al. 2003
; Wang et al. 2005
; Ceyssens et al. 2006
) have high betweenness. Such values possibly arise as an artifact of the sparse sampling of the sequence space. Yet, the betweenness could prove useful to fill in those gaps with dedicated sequencing projects.
The automatic classification system proposed here involves the definition of clusters with similar phages followed by the reassignment to multiple clusters in order to generate a reticulate classification of the phages. In order to identify clusters of highly connected phages (quasi-cliques), we selected parameters that maximized the ICCC. Thus, phages within the same MCL cluster are likely descendant from a unique module combination. The weight of the intracluster connections represents 79% of the total weight of the connections of the network. This number can be taken as a rough estimate of the contribution of vertical evolution in this network. However, phages from different MCL clusters may be also related through vertical evolution but they might have diverged so much that sequence similarities are no longer recognizable or only some modules may have been vertically inherited, whereas others have been replaced through horizontal gene transfer.
The number of MCL clusters is larger than the number of ICTV genera (Fauquet et al. 2005
). From the operational point of view, producing many clusters allowed more combinations to classify the mosaic genomes in the multiple assignment step. Given the importance of recombination in phage evolution, it is reasonable to observe an inflation of the number of phage groups, when considering that separate clusters can correspond to distinct combinations of gene modules. The overlap between the clusters of our reticulate classification thus enables to reflect the traces of the recombination events.
The distribution of the phages among the clusters is described with the membership matrix, where row vectors correspond to the membership of the individual phages. It can be argued that wrong relationships may be inferred from the membership matrix. For example, phage HK97 and phage Mu have no direct relationship although, in the multiple assignments step, they are assigned to cluster 32, due to their links to phages ST64B, P27, and SfV (fig. 6A). However, this is not a real problem because keeping track of the original MCL assignment ensures that a false direct link between Mu and HK97 would be dismissed. On the other hand, the inclusion of both phages, Mu and HK97, within the same cluster reflects that ancestors of these phages—and probably these phages too—shared the same gene pool as previously suggested (Hendrix and Casjens 2005
).
We derived a second phage network, based on modules shared between phages. Contrary to the S277 network in which the identity of the shared protein families is not encoded, this module-based network allows for a detailed exploration of the nature of the functions shared between the connected phages. This higher functionality of the module-based network runs at the expense of links corresponding to genes that are not part of modules, which do contribute to the S277 network. Noteworthy, if the protein families part of modules are filtered out of the S277 network (significant links due only to the genes part of modules are lost), the main component of the network still harbors temperate phages from Gram(+) and Gram(–) bacteria (data not shown), indicating that those are not linked because of the integrase and repressors alone.
The taxonomy outlined by Lawrence et al. (2002)
provided a few selected examples of phages decomposed in reticulate groups—called modi. The correspondence of those modi with the modules detected here is depicted in table 3, where phages are described by a combination of modules according to each system. The HK97 head modus is split between the packaging module (module 6) and the head structural proteins (module 11), in agreement with reports about cellular organisms, where evolutionary modules recover only fragments of functional modules (Glazko and Mushegian 2004
; Snel and Huynen 2004
). Nonorthologous gene displacement may occur (Koonin et al. 1996
), disrupting the modularity. This is the case for the 2 protease families from modules 6 and 11.
|
The modus named as "linear episome–mediated temperate phage," which describes the prophage status of bacteriophage N15, corresponds to modules 56_sig1 and 87_sig1 that, respectively, contain the protelomerase and the plasmid-related replication and partitioning proteins, both detected using sig threshold of 1.
The 2 modi present in phage PhiX174 are each composed of a single protein, the external scaffolding and the inhibitor of the MraY protein, respectively. The method of phylogenetic profiles (Pellegrini et al. 1999
) cannot detect functional modules composed of a single protein. We detected these 2 proteins in a single module defined with sig
5 (module 9_sig5) and harboring 8 out of the 11 proteins of the phage. Likely, this situation reflects a low level of recombination and divergence among PhiX174 and its relatives so that co-occurring genes correspond to more than single functional module. The same conclusion holds for modules 0 and 3, present in T4- and T7-like phages, respectively, and possibly to module 1, present in a set of 21 Staphylococcus phages.
Despite the limited evolutionary modularity, some modules behave as signature of phage groups. Moreover, the module sharing also allows defining a network that is easier to interpret and may hold the most relevant biological information. The existence of genes that remain tightly linked across several genomes in spite of the pervasive recombination suggests constraints against module disruption. The constraints are not equal for all phages. Lambdoid phages have retained a module comprising proteins involved in transcription regulation, whereas they display poor modularity in phage morphogenesis and DNA packaging. On the other hand, virulent phages have conserved many genes as a group, implying constraints to evolutionary successful new gene combinations. These results argue against the validity of a single module to classify all phages because module conservation is strongly dependent on the phage biology and lifestyle.
The acquisition of more genomic data would—no doubt—reshape the clusters and modules as new evolutionary links will be unveiled. Questions regarding the structure of the phage population, in particular whether there is a defined boundary between temperate and virulent phages, may be answered in the future with this kind of systematic analysis. The global view depicted in this work suggests that the phages acting as bridges between less related phage groups may be the sole representatives of phage families located in the path between such groups. When more members of the family are known, the picture could converge toward the kind of relationship observed between clusters 0 and 1 and clusters 4 and 9 (fig. 5).
| Conclusions |
|---|
|
|
|---|
Our work addresses the long-standing claim for a classification capable of assigning phages to multiple groups featured by a series of marker modules. We proposed here 2 strategies toward classification. Both achieve the reticulate classification by describing the phages as vectors; cluster membership in one case and module occurrence in the second. The 2 approaches can be combined to further explore the evolutionary links trying to discriminate the contribution of vertical and horizontal evolution.
Graph theory measures captured the genetic differences between phages due to their lifestyle. These measures could be further exploited as predictors of phage lifestyle. The automatic detection of chimeric phages may prove useful in the ongoing assessment of phage diversity.
These methodologies could be extended to the classification of other mosaic genetic entities existing in prokaryotes such as plasmids, conjugative transposons, and other genomic islands. It may also provide a way to reinvestigate the extent to which eukaryotic dsDNA viruses undergo recombination (Iyer et al. 2006
), a matter of crucial importance to assess the safety of defective viruses as vectors for gene therapy or the development of live vaccines.
| Supplementary Material |
|---|
|
|
|---|
Supplementary figure 1S and tables 1S–3S are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|














