The limits of nuclear-encoded SSU rDNA for resolving the diatom phylogeny

Pages 277-290 | Received 26 Sep 2008, Accepted 25 Nov 2008, Published online: 19 Aug 2009

Abstract

A recent reclassification of diatoms based on phylogenies recovered using the nuclear-encoded small subunit ribosomal RNA (SSU rRNA) gene contains three major classes, Coscinodiscophyceae, Mediophyceae and the Bacillariophyceae (the CMB hypothesis). We evaluated this with a sequence alignment of 1336 protist and heterokont algae SSU rRNAs, which includes 673 diatoms. Sequences were aligned to maintain structural elements conserved within this dataset. Parsimony analysis rejected the CMB hypothesis, albeit weakly. Morphological data are also incongruent with this recent CMB hypothesis of three diatom clades. We also re-analysed a recently published dataset that purports to support the CMB hypothesis. Our re-analysis found that the original analysis had not converged on the true bipartition posterior probability distribution, and rejected the CMB hypothesis. Thus we conclude that a reclassification of the evolutionary relationships of the diatoms according to the CMB hypothesis is premature.

Introduction

Analyses of molecular data (mainly nuclear-encoded small subunit ribosomal DNA; henceforth SSU) have generally reinforced the traditional view (Simonsen, Citation1979; Round et al., Citation1990) that centric diatoms broadly grade into pennates through many nodes (Medlin et al., Citation1993, Citation1996a, b, Citation2000; Ehara et al., Citation2000; Medlin & Kaczmarska, Citation2004; Sorhannus, Citation2004, Citation2007; Alverson et al., Citation2006; Choi et al., Citation2008; see Alverson & Theriot, Citation2005 for review). However, Medlin & Kaczmarska (Citation2004) recently proposed that centric diatoms were composed of only two clades rather than many. They retained the name Coscinodiscophyceae for the so-called ‘radial centrics’ and applied the name Mediophyceae for the so-called ‘bipolar’ or ‘multipolar centrics’. They also suggested a number of morphological characters as diagnostic for these groups. We refer to this as the CMB hypothesis (for the three major clades discovered: Coscinodiscophyceae, Mediophyceae and Bacillariophyceae).

The CMB phylogenetic hypothesis has not been universally embraced. For example, Adl et al. (Citation2005) treated both Coscinodiscophyceae and Mediophyceae as paraphyletic taxa without discussion. Williams & Kociolek (Citation2007) challenged the robustness of the CMB phylogeny based on the fact that many different SSU analyses return different trees. In contrast, Sims et al. (Citation2006) recovered the CMB hypothesis with high bipartition posterior probability (BPP) support. Medlin et al. (Citation2008) recovered the CMB hypothesis with high BPP support using a secondary structure alignment but noted that several aspects of the tree were unusual (e.g. the placement of Attheya).

In fact, the topology of the diatom SSU tree (and support values for incongruent groups) has changed from study to study. For example, the elongate Toxarium has been placed well within the centric grade amidst multipolar diatoms (very distant from the pennate diatoms) using maximum likelihood (ML) analysis on 38 diatoms (Kooistra et al., Citation2003), as sister to all pennates in a Bayesian analysis of 51 diatom sequences (Chepurnov et al., Citation2008), poorly resolved in a maximum parsimony (MP) analysis of 181 diatom sequences (Alverson et al., Citation2006), and once again well within the multipolar diatoms in a Bayesian analysis of 54 diatom SSU sequences (Medlin et al., Citation2008). As underscored by this brief comparison, the many different inferences of diatom phylogeny have utilized different alignment strategies and different optimality criteria, have employed those criteria in different ways and have used different taxa. Any or all of these factors may have led to the novel results of Medlin & Kaczmarska (Citation2004) and Sims et al. (Citation2006), but this cannot be directly studied because the Medlin & Kaczmarska (Citation2004) and Sims et al. (Citation2006) datasets which produced the CMB hypothesis were not publicly available. However, the Medlin et al. (Citation2008) dataset is available and we re-analyse it below. To test the effects of ingroup and outgroup sampling, we created our own large alignment of stramenopile SSU sequences, aligned according to secondary structure (Gutell et al., Citation1985, Citation1992, Citation2002) and used it to test the CMB hypothesis and its robustness. Specifically we address the effect (or lack thereof) of adding distantly related outgroups on inferences of the diatom SSU tree.

Materials and methods

Multiple sequence alignment

We included all 1549 SSU stramenopile sequences available in GenBank as of September 1, 2007. The SSU rDNA sequences were aligned manually with the alignment editor ‘AE2’ (developed by T. Macke, Scripps Research Institute, San Diego, USA; Larsen et al. Citation1993), which was developed for Sun Microsystems’ (Santa Clara, USA) workstations running the Solaris operating system. The manual alignment process involves first positionally aligning homologous nucleotides (i.e. those that map to the same locations and tertiary structure models) into columns in the alignment, maximizing their sequence and structure similarity. For regions with high similarity between sequences, the nucleotide sequence is sufficient to align sequences with confidence. For more variable regions in closely related sequences or when aligning more distantly related sequences, however, a high-quality alignment can be produced only when additional information (here, secondary and/or tertiary structure data) is included.

The underlying SSU rRNA secondary structure model was initially predicted with covariation analysis (Gutell et al., Citation1985, Citation1992). Approximately 98% of the predicted model base pairs were present in the high-resolution crystal structure from the 30S ribosomal subunit (Gutell et al., Citation2002). This model (based on the bacterium Escherichia coli) has been extended to the eukaryotic SSU rRNA (Cannone et al., Citation2002), using covariation analysis to assess eukaryote-specific features. The additional constraints of the eukaryotic model were used to refine the alignment of the stramenopile sequences iteratively until positional homology was established for the entire data matrix.

The initial SSU alignment contained 1549 sequences, with a final length of 3786 columns. Medlin & Kaczmarska (Citation2004) filtered out sequences less than 50% complete and we followed this convention, resulting in a final dataset of 1336 stramenopile sequences of which 673 are diatoms and seven are bolidophytes, which are considered the immediate sister group to diatoms, according to both SSU and chloroplast-encoded rbcL (Daugbjerg & Andersen, Citation1997; Goertzen & Theriot, Citation2003; Andersen, Citation2004). The remaining taxa are more distantly related stramenopiles. The final alignment is available at TreeBASE (http://www.treebase.org/treebase/intro.html) or from the authors. Forty secondary structure model diagrams representing the major diatom lineages are available at http://www.rna.ccbb.utexas.edu/SIM/4D/Diatom_nSSU/. We analysed the data in two datasets: diatoms plus bolidophytes only (DiatBo) and diatoms plus all stramenopiles (DiatStram).
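The 50% completeness filter described above can be sketched in a few lines. This is a minimal illustration only: the text does not state how completeness was measured, so the definition used here (ungapped length relative to the longest ungapped sequence in the alignment) and the names `ungapped_length` and `filter_by_completeness` are assumptions.

```python
# Symbols treated as gaps or missing data (an assumption; conventions vary by editor).
GAP_CHARS = set("-.?~")


def ungapped_length(aligned_seq):
    """Count characters in an aligned row that are not gap or missing-data symbols."""
    return sum(1 for ch in aligned_seq if ch not in GAP_CHARS)


def filter_by_completeness(alignment, min_fraction=0.5, full_length=None):
    """Keep sequences whose ungapped length is at least min_fraction of full_length.

    alignment: dict mapping sequence name -> aligned sequence string.
    full_length: reference length; defaults to the longest ungapped sequence
    in the alignment (an assumed definition, not one given in the paper).
    """
    lengths = {name: ungapped_length(seq) for name, seq in alignment.items()}
    if full_length is None:
        full_length = max(lengths.values())
    return {
        name: alignment[name]
        for name, n in lengths.items()
        if n >= min_fraction * full_length
    }


# Toy alignment: "c" has 1 of 8 positions filled and is dropped.
toy = {"a": "ACGTACGT", "b": "AC--AC--", "c": "A-------"}
kept = filter_by_completeness(toy)
```

Passing an explicit `full_length` (for example, the length of a complete reference SSU sequence) would make the threshold independent of which sequences happen to be in the alignment.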

Other datasets

We obtained the Nexus files used in Medlin & Kaczmarska (Citation2004) from Dr Medlin. One dataset had 126 sequences and the other had 281 sequences, and we refer to them as the MK126 and MK281 datasets. Both had the same 123 diatom sequences and differed only in that the former used bolidophytes only as the outgroup and the latter sampled broadly across eukaryotes for the outgroups. We also used the Nexus file used to produce the tree of Medlin et al. (Citation2008), obtained from http://www3.interscience.wiley.com.ezproxy.lib.utexas.edu/journal/121395867/suppinfo. That file had 54 sequences, all diatoms with no outgroup, and we refer to that as the M54 dataset.

Fig. 1. Strict consensus of 65 unique equally most parsimonious trees calculated from the stramenopile-outgroup and diatom-ingroup analysis (DiatStram dataset). Only relationships among diatoms are shown.

Phylogenetic analysis

All datasets were subjected to parsimony analysis using the TNT program (Goloboff et al., Citation2003). The full suite of TNT options (sectorial search, ratchet, drift and tree fusion) was used. There is no standard recommendation for use of these algorithms and there are few comparative studies of these algorithms. Within the context of the ratchet, Nixon (Citation1999) argued that for large datasets, it may be better to limit length of searches on individual islands of trees and search more islands. The notion is that exploring a greater range of islands containing optimal trees is more likely to cover the entire diversity of optimal trees in a shorter period of time than exhaustively searching one island. Thus, we took the approach used by Goertzen & Theriot (Citation2003) and Alverson et al. (Citation2006) when employing these newer algorithms. We increased the number of all cycles, rounds and repetitions for sectorial, drift, fusion and ratchet searches 10-fold beyond default values, and used between 100 and 1000 random taxon additions for each run. We saved the resultant trees from each run separately, and then repeated the procedure with a new randomly selected seed number. After each run, we checked that no shorter tree was found, combined trees from all previous runs and then calculated the number of nodes collapsed in their strict consensus. If no shorter trees were found and if no additional nodes were collapsed, we concluded that we had sampled the complete representative set of MP trees, as additional trees would be redundant and unlikely to further erode the resolution of the strict consensus (Nixon, Citation1999; Goertzen & Theriot, Citation2003).
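The stopping rule described above (pool the trees from successive runs, discard duplicates, and check whether the strict consensus loses any further nodes) can be illustrated with a small sketch. It is not the TNT workflow itself: trees are represented here as nested tuples of leaf names, all input trees are assumed to be equally short, and the function names are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted tree.

    The tree is a nested tuple of leaf-name strings, e.g. (("A", "B"), "C").
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade carries no information
    return frozenset(found)


def pool_runs(runs):
    """Merge trees from several runs, keeping one copy of each topology."""
    unique = set()
    for run in runs:
        unique.update(clades(t) for t in run)
    return unique


def collapsed_nodes(unique_trees):
    """Nodes lost in the strict consensus relative to the best-resolved input tree."""
    strict = frozenset.intersection(*unique_trees)
    return max(len(t) for t in unique_trees) - len(strict)


def search_has_stabilised(previous_runs, new_run):
    """True if pooling new_run collapses no further nodes in the strict consensus."""
    before = collapsed_nodes(pool_runs(previous_runs))
    after = collapsed_nodes(pool_runs(previous_runs + [new_run]))
    return after == before


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
t3 = (("A", "B"), ("C", ("D", "E")))
run1 = [t1, t2]
redundant_run = [t2, t1]   # only duplicates of trees already found
informative_run = [t3]     # conflicts with the clade {A, B, C}
```

In this toy case `search_has_stabilised([run1], redundant_run)` is true, whereas adding `informative_run` collapses one more node, so the search would continue. A full implementation would also compare tree lengths between runs, which this sketch omits.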

We assessed the parsimony penalty required by constraining each of the Coscinodiscophyceae, Mediophyceae and Bacillariophyceae to monophyly under searches as above. We assessed support for the unconstrained MP trees using nonparametric bootstrap (BS) analysis in TNT with the standard sampling with replacement strategy. We used the new technology search with 10 taxon additions and sectorial search, ratchet, drift and tree fusing for each of the 1000 pseudoreplicates of the BS analysis.

The DiatBo, MK281, MK126 and M54 datasets were subjected to Bayesian analyses. All Bayesian analyses were run with the GTR + G + I model (nucmodel = 4by4, nst = 6, rates = invgamma). These were the settings used by Sims et al. (Citation2006) and also corresponded to the best model for each dataset as selected by MrModelTest (Nylander, Citation2004). All initial runs for all datasets were done at 1 000 000 Markov chain Monte Carlo (MCMC) generations, equal to or greater than the number of generations run by Medlin & Kaczmarska (Citation2004), Sims et al. (Citation2006) and Medlin et al. (Citation2008). Where these papers did not specify other settings for the Bayesian analysis, default settings were used. To test reproducibility of the results, we ran three separate analyses, each with two runs for a total of six independent runs of 1 000 000 MCMC generations each. We also ran one analysis of the DiatBo dataset with two runs (four chains, three heated, one cold) for 10 million generations, saving every 10 000th tree. Finally, we ran the M54 dataset for 50 million generations, saving every 10 000th tree. We assessed whether independent runs in all analyses had sampled the same posterior distribution by comparing independent run (split) posterior probabilities with the compare command in the AWTY program (Wilgenbusch et al., Citation2004). We followed the burn-in periods of Medlin & Kaczmarska (Citation2004) and Medlin et al. (Citation2008) for their datasets when we ran 1 000 000 generations on M54, MK126 and MK281. We used a burn-in of 90% for our DiatBo dataset 1 000 000 and 10 000 000 generation analyses to approximate Sims et al. (Citation2006).
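The comparison of split posterior probabilities between independent runs can be sketched as follows. This is an illustration of the idea only, not the AWTY implementation: each sampled tree is assumed to be given as a set of bipartition labels, and the function names are hypothetical. For scale, saving every 10 000th tree over 10 000 000 generations yields roughly 1000 samples per run, of which a 90% burn-in retains about 100.

```python
from collections import Counter


def discard_burnin(samples, burnin_fraction):
    """Drop the first burnin_fraction of an ordered list of sampled trees."""
    start = int(len(samples) * burnin_fraction)
    return samples[start:]


def split_frequencies(samples):
    """Estimate each bipartition's posterior probability as the share of
    sampled trees that contain it.

    Each sampled tree is a set of bipartitions (any hashable labels).
    """
    counts = Counter(split for tree in samples for split in set(tree))
    return {split: n / len(samples) for split, n in counts.items()}


def compare_runs(run_a, run_b, burnin_fraction=0.9):
    """Pair the post-burn-in split frequencies of two runs.

    A split never sampled in one run gets frequency 0.0 there, so it would
    plot on an axis of a run-versus-run scatter plot.
    """
    fa = split_frequencies(discard_burnin(run_a, burnin_fraction))
    fb = split_frequencies(discard_burnin(run_b, burnin_fraction))
    return {s: (fa.get(s, 0.0), fb.get(s, 0.0)) for s in fa.keys() | fb.keys()}


# Toy runs of four samples each; "AB", "CD", "EF" stand in for bipartitions.
run_a = [{"AB"}, {"AB"}, {"AB", "CD"}, {"AB", "CD"}]
run_b = [{"AB"}, {"AB"}, {"AB"}, {"AB", "EF"}]
pairs = compare_runs(run_a, run_b, burnin_fraction=0.5)
```

If two runs have sampled the same posterior distribution, the paired values should fall close to the diagonal; pairs such as `(1.0, 0.0)` indicate a well-supported split in one run that is absent from the other.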

Morphology

We coded the characters of symmetry, presence or absence of mucilaginous matrix, auxospore shape/growth, properizonium and perizonium presence/absence for 34 taxa using Medlin & Kaczmarska's criteria (2004: 258, table 2), and treated all multistate characters as unordered. Since no outgroup or ontogenetic information was provided, the only rooting option was to consider the Coscinodiscophyceae as the outgroup to the remaining diatoms and test for monophyly of the Mediophyceae and Bacillariophyceae. However it is possible to determine if the Coscinodiscophyceae formed a convex group (possibly monophyletic depending on the placement of the root within the unrooted network). Winclada running NONA was used for parsimony analysis, with 10 000 replications, holding 100 starting trees per repetition, and all other parameters set to defaults.

Results

Parsimony analysis

For the DiatStram dataset, 11 runs totalling 2898 random taxon addition repetitions were required to converge on the representative set of MP trees (length (L): 39822; consistency index (c.i.): 0.12; retention index (r.i.): 0.84). We found 22 unique MP trees on the first run. Their strict consensus collapsed 441 nodes. Eight more runs produced 54 more MP trees for a total of 76 trees. However, 11 of these were duplicates and there were only 65 unique MP trees. Their strict consensus collapsed 450 nodes, only nine more than collapsed in the first single run. That we found redundant trees and a reduced yield in topological diversity suggests that we have found the true diversity of all MP trees that could be obtained from the DiatStram dataset (Fig. 1).

For the DiatBo dataset, three runs of 500 random addition sequences seemed to converge on the representative set of MP trees. The strict consensus of the first 139 trees of L: 14094 (c.i.: 0.19, r.i.: 0.84) collapsed 287 nodes, that of the 338 trees of the first and second runs collapsed 288 nodes, and that of the 554 trees of all three runs combined collapsed 288 nodes. In each of the runs, an MP tree was found within the first 18 random additions, indicating that TNT was finding at least one tree of optimal topology very early in the analysis. In addition, 51 of the 554 total trees were identical to trees previously found, indicating that there was some redundancy in the coverage of tree space. Thus, we believe Fig. 2 well represents the strict consensus of all equally MP cladograms that might be found in the DiatBo dataset.

Fig. 2. Strict consensus of 503 unique equally most parsimonious trees calculated from the diatom plus bolidophyte (DiatBo) dataset. Only relationships among diatoms are shown.

Unconstrained searches in both analyses resulted in non-monophyly for the classes Coscinodiscophyceae and Mediophyceae, and monophyly for the class Bacillariophyceae. In both, the Coscinodiscophyceae was positively paraphyletic (i.e. fully resolved as a ladder-like grade with no polytomies) with the Melosirales sister to a non-monophyletic Mediophyceae plus a monophyletic Bacillariophyceae. The Mediophyceae was positively paraphyletic in the DiatStram analysis, with Chaetoceros and a few other taxa forming a clade sister to the pennates. In the DiatBo analyses the Mediophyceae formed an unresolved polytomy.

Monophyly of the Coscinodiscophyceae and Mediophyceae (i.e. the CMB hypothesis) required little penalty for either large dataset: the CMB hypothesis was only seven steps longer than the unconstrained MP trees for the DiatStram dataset and 10 steps longer for the DiatBo dataset. Arrangements of terminal taxa were similar for both datasets, and only the tree for the DiatBo dataset is shown (Fig. 3).

Fig. 3. Strict consensus of 147 unique equally most parsimonious trees calculated from the Bolidomonas plus diatom (DiatBo) dataset with Coscinodiscophyceae and Mediophyceae constrained to monophyly. Only relationships among diatoms are shown.

Given the relatively low penalty incurred for transforming any optimal tree into the CMB hypothesis, it is not surprising that the BS values along the backbone of the tree were generally quite low. The Bacillariophyceae clade and the Mediophyceae plus Bacillariophyceae clade were the only two backbone nodes to receive BS support values of ≥90% for either dataset.

Parsimony analysis of M54, MK126 and MK281 datasets yielded similar results (trees not shown). The MP tree or trees rejected the CMB hypothesis, and bootstrap values along the backbone were typically less than 50%. The CMB constraint trees were not much longer than the MP trees: four steps longer for the M54 dataset (4530 vs. 4526), seven steps longer for the MK126 dataset (5633 vs. 5626), and 23 steps longer for the MK281 dataset (19 325 vs. 19 302).

Bayesian analysis

Analyses of 1 000 000 generations had clearly not converged on the same posterior distributions among independent runs in analyses of either the DiatBo or M54 datasets. Plots of bipartition posterior probability values between the first pair of runs for each of the two datasets showed many points (i.e. bipartitions) falling directly along the abscissa and ordinate, indicating that some clades found in one analysis (even at BPP values >0.8) were not found in all others (Fig. 4). Furthermore, convergence was not reached with 10 000 000 MCMC generations for the DiatBo dataset or within 20 000 000 MCMC generations for the M54 dataset (Fig. 5). For the DiatBo dataset, topological differences between runs were not minor. In the 10 000 000 generation analysis of DiatBo, Toxarium, Lampriscus, Biddulphiopsis and the Cymatosirales (Toxarium and allies) grouped with Lithodesmiales plus Thalassiosirales (BPP = 0.90) in one run, whereas they (Toxarium and allies) were sister to pennates (BPP = 0.5) in another run. Several species of Pinnularia, a raphid pennate genus in traditional classifications and in our MP analyses, were placed at the base of the diatom tree as sister to Leptocylindrus with BPP values of 0.88 and 0.98 in each of the two runs. While several of the 1 000 000 generation runs recovered a monophyletic Bacillariophyceae, the fact that we did not recover the pennates in any of the 10 000 000 generation analyses clearly indicates that even our longest Bayesian runs were far short of convergence.

Fig. 4. Bipartition posterior probability plots of two runs (split runs) from the 1 000 000 Markov Chain Monte Carlo (MCMC) generation Bayesian analysis of our diatom plus bolidophyte dataset (DiatBo: upper plot) and of the Medlin et al. (Citation2008) dataset (M54: lower plot). 90% burn-in used for each.

Fig. 5. Bipartition posterior probability plots of two runs (split runs) from the 10 000 000 Markov Chain Monte Carlo (MCMC) generation Bayesian analysis of our diatom plus bolidophyte dataset (DiatBo: upper plot) and the 20 000 000 generation Medlin et al. (Citation2008) dataset (M54: lower plot). 90% burn-in used for each.

We analysed aspects of performance of the M54 dataset to obtain a gross estimate of how difficult it might be to reach convergence in a Bayesian analysis of several hundred diatom sequences. The standard deviation of bipartitions between independent runs for the M54 dataset dropped to near zero at about 22 million generations, and thereafter oscillated at ∼0.1 until the analysis was terminated at 50 000 000 generations (Fig. 6). While this might suggest that convergence had been reached by 22 million generations, plotting the sampled trees for the last 28 million generations shows clusters of points off a straight line (Fig. 7). Discarding trees from the first 45 million generations resulted in a BPP plot approximating a straight line. The majority rule consensus tree returned a convex Coscinodiscophyceae and monophyletic Bacillariophyceae, but the Mediophyceae were positively paraphyletic (Fig. 8). Attheya septentrionalis was grouped with the pennates at a BPP of 0.95. This is the placement of Attheya obtained from MP analysis. In fact, incongruence between the Bayesian and MP trees for dataset M54 is restricted to areas where BPP values are below 0.70 (not shown).
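One common single-number summary of agreement between runs is the mean, across bipartitions, of the standard deviation of each bipartition's frequency among runs. The sketch below shows that calculation under stated assumptions: it uses the population standard deviation, ignores splits that are rare in every run (the `min_freq` cut-off is an assumed convention), and is not claimed to reproduce exactly the quantity plotted in the paper. The function name is hypothetical.

```python
from statistics import mean, pstdev


def average_split_sd(freqs_by_run, min_freq=0.1):
    """Mean over splits of the standard deviation of split frequency among runs.

    freqs_by_run: list of dicts, one per run, mapping split -> frequency.
    Splits with frequency below min_freq in every run are ignored
    (an assumed convention, not taken from the paper).
    """
    splits = set().union(*freqs_by_run)
    sds = []
    for s in splits:
        values = [f.get(s, 0.0) for f in freqs_by_run]
        if max(values) >= min_freq:
            sds.append(pstdev(values))
    return mean(sds) if sds else 0.0


# Two toy runs: they agree on "AB" but disagree on "CD" and "EF".
run_a_freqs = {"AB": 1.0, "CD": 1.0}
run_b_freqs = {"AB": 1.0, "EF": 0.5}
asd = average_split_sd([run_a_freqs, run_b_freqs])
```

A low value is necessary but not sufficient evidence of convergence, which is consistent with the observation above that the statistic dipped at about 22 million generations while the paired BPP plot still showed points off the diagonal.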

Fig. 6. Standard deviation of likelihood scores among independent runs (split runs) vs number of generations for Bayesian analysis of our diatom plus bolidophyte (DiatBo) dataset (upper plot) and of the Medlin et al. (Citation2008) dataset (M54: lower plot). The line in the upper plot represents a power function estimate of split standard deviations out to 50 000 000 Markov Chain Monte Carlo (MCMC) generations.

Fig. 7. Bipartition probability plot of two runs from the M54 dataset of Medlin et al. (Citation2008). The upper plot discarded the first 22 million Markov Chain Monte Carlo (MCMC) generations (or 44% burn-in, based on initial minimum at ca. 22 million MCMC generations from Fig. 6). The lower plot discarded the first 45 million MCMC generations (or 90% burn-in).

Fig. 8. Majority rule consensus tree (calculated without an outgroup and arbitrarily rooted in the middle of the Coscinodiscophyceae) derived from 50 000 000 Markov Chain Monte Carlo generation Bayesian analysis of the M54 dataset from Medlin et al. (Citation2008) with 90% burn-in. Numbers or symbols below nodes are bipartition posterior probability (BPP) values. Asterisks indicate BPP values ≥95%.

As judged by the still rapidly dropping split standard deviations (not shown), Bayesian analyses of the intermediate-sized MK126 and MK281 datasets had not converged on the same posterior probability distribution at 1 000 000 MCMC generations, again underscoring the difficulty of completing a meaningful Bayesian analysis on even 100 diatom SSU sequences in so few MCMC generations. Our analysis of the M54 dataset indicates that it might take as many as 50–100 million generations or more to reach convergence on datasets with 600 or more diatom SSU sequences.

Morphological tree

Seven trees of length nine were found. Only the pennates formed a convex group (neither the Coscinodiscophyceae nor the Mediophyceae were monophyletic, regardless of how the tree was rooted). In the strict consensus, the Thalassiosirales were excluded from the remaining Mediophyceae (Fig. 9) because they share all the characteristics of the Coscinodiscophyceae (radial symmetry, globular/isometric auxospore shape/growth, no perizonium or properizonium), and have none of the features peculiar to other Mediophyceae or the Bacillariophyceae.

Fig. 9. Unrooted tree of diatom genera as determined by a parsimony analysis of the morphology matrix of Table 2 in Medlin & Kaczmarska (2004). Strict consensus of eight trees.

Discussion

Our results weakly reject the hypothesis that the Coscinodiscophyceae, Mediophyceae and Bacillariophyceae are each monophyletic (the CMB hypothesis). Only the Bacillariophyceae (pennate diatoms) were monophyletic, whether we included only closely related outgroups (bolidophytes only) or distantly related outgroups (bolidophytes and all other stramenopiles). However, there is little parsimony penalty to constrain trees to the CMB hypothesis for all datasets.

Given that greatly different topologies can be obtained from SSU datasets with little penalty, it is not surprising that estimates of the diatom phylogeny based on SSU sequences vary widely between studies using different taxa, alignments, and optimality criteria. For example, the few studies hinting at the possibility of a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae or vice versa used relatively few diatom SSU sequences. Very early in the use of SSU data in diatom systematics, using 11 diatoms including three Coscinodiscophyceae and one member of the Mediophyceae, Medlin et al. (Citation1993) returned a monophyletic Coscinodiscophyceae. Using 29 diatom SSU sequences they later (Medlin et al., Citation1996a, b) returned a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae. Kooistra & Medlin (Citation1996) analysed that same dataset, experimenting with various approaches to deal with the potential long-branch problem introduced by ‘aberrantly evolving’ diatoms; each approach returned a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae, although relationships within the mediophytes were dependent upon the method used. Kooistra et al. (Citation2003) used 38 diatom SSU sequences, only two of which were Coscinodiscophyceae, both on long branches, returning a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae. Using 51 diatom SSU sequences, Chepurnov et al. (Citation2008) also returned a monophyletic Coscinodiscophyceae and paraphyletic Mediophyceae. However, they only ran 4 000 000 MCMC generations, so it is unclear if they had reached convergence of topology and posterior probabilities.

In contrast, Cavalier-Smith & Chao (Citation2006), focusing not on diatoms but on a wide range of protists including diatoms, used a wide range of outgroups but only 32 diatom SSU sequences in a distance (neighbor-joining) analysis. While they found moderate (70%) BS support for monophyly of the Mediophyceae, they also found a paraphyletic Coscinodiscophyceae, with the internode excluding Melosirales from other Coscinodiscophyceae receiving slightly higher support than that found for a monophyletic Mediophyceae (BS = 72%). In perhaps the most extreme case of taxon sampling effects using eleven diatom exemplars when studying relationships among alveolates and stramenopiles, Van de Peer et al. (Citation1996) returned monophyly for the centric diatoms as a whole.

It is not simply monophyly (or not) of the centrics, the Coscinodiscophyceae or Mediophyceae that has proved unstable in different SSU analyses. Three studies, which each included more than 100 diatom sequences, offer the opportunity to compare trees calculated under a single optimality criterion (Bayesian inference). These revealed that taxon sampling differences alone may account for very different tree topologies. Based on 123 diatoms (Medlin & Kaczmarska, Citation2004) the Lithodesmiales grouped with the Thalassiosirales (BPP = 1.0), with 181 diatom SSU sequences (Alverson et al., Citation2006) they grouped with the Hemiaulales to the exclusion of the Thalassiosirales (BPP = 1.0) and with an unknown number of diatom sequences (Sims et al., Citation2006) the Lithodesmiales grouped with the Biddulphiales, Triceratiales and Toxarium to the exclusion of the Thalassiosirales (BPP = 1.0). The unpublished dataset of Sims et al. (Citation2006) has been characterized as including more than 800 ingroup sequences (Medlin et al., Citation2008). It should be noted that alignment methods have varied greatly among the many studies using SSU sequences and these could be a possible source of variation that has yet to be fully explored. This variation has the potential to change diatom SSU tree topology radically (Medlin et al., Citation2008).

Among the many trees generated using SSU, the most radical and controversial trees (Williams & Kociolek, Citation2007) are those that support the CMB hypothesis: the MP tree of 8600+ SSU sequences (123 diatoms) by Medlin & Kaczmarska (Citation2004), the Bayesian tree (800+ diatom sequences with bolidophyte outgroups) by Sims et al. (Citation2006), and the Bayesian tree (54 diatom SSU sequences with no outgroup) by Medlin et al. (Citation2008).

Medlin & Kaczmarska (Citation2004) claimed that their MP tree (8600+ sequences, but only 123 diatoms) was more accurate than their Bayesian tree (same 123 diatoms but only three bolidophyte SSU outgroup sequences) because including distantly related outgroups increased the number of parsimony informative characters. However, while increased taxon sampling within the scope of the problem (i.e. within diatoms) may increase accuracy, increased taxon sampling outside the scope of the problem (adding distant outgroups) will likely decrease accuracy of phylogenetic inference (Hillis, Citation1998; Pollock et al., Citation2002; Hillis et al., Citation2003; Hedtke et al., Citation2006; Verbruggen & Theriot, Citation2008). Medlin & Kaczmarska (Citation2004) cited Bollback (Citation2002) as support for their position, but that paper studied the effects of adding characters, not taxa (ingroup or outgroup), and only in the context of model-based methods, specifically accuracy of model selection for phylogenetic analysis, and is therefore not pertinent. Thus, contrary to the claims of Medlin & Kaczmarska (Citation2004), one could hypothesize that recovery of the CMB tree under parsimony is an artefact of increased error caused by addition of distantly related outgroup taxa. In the light of the literature on taxon sampling, a more substantive claim was made by Sims et al. (Citation2006), who suggested that increased ingroup sampling led to recovery of the CMB hypothesis, this time with high BPP support values.

However, we suggest that recovery of the CMB hypothesis in Medlin & Kaczmarska (Citation2004), Sims et al. (Citation2006) and Medlin et al. (Citation2008) is probably a result of insufficient tree search effort. Medlin & Kaczmarska (Citation2004) used the MP search in ARB, whose most effective heuristic search algorithm employs a combination of Nearest Neighbor Interchange and Kernighan-Lin optimization, which together are less effective than the commonly used Tree–Bisection–Reconnection algorithm, and certainly not as effective as other methods, such as the parsimony ratchet (Nixon, Citation1999). Given the large number of near-optimal trees that support the CMB hypothesis in our dataset, it is likely that a suboptimal search might find any one of these suboptimal trees. Similarly, the Bayesian inference of Sims et al. (Citation2006) was probably also confounded by insufficient search of tree space. They only ran 1 000 000 MCMC generations. Our analysis of our DiatBo dataset (673 diatoms plus seven bolidophytes) had not reached convergence at 10 000 000 generations. Our analysis of the M54 dataset, presumably the same alignment but with far fewer taxa than used by Sims et al. (Citation2006), seems to have required at least 45 million generations for the burn-in alone. The tree in Medlin et al. (Citation2008) supporting the CMB hypothesis is clearly an artefact of running far too few MCMC generations, and even if the tree topology is correct, monophyly of the Coscinodiscophyceae is an artefact of arbitrary rooting in the absence of an outgroup.

Thus, our results strongly suggest that the choice of optimality criterion has less influence on trees derived from SSU data than does the proper application of that choice. All methods, alignments and taxon sampling schemes we reviewed or re-analysed returned weak rejection of the CMB hypothesis.

Both Medlin & Kaczmarska (2004) and Sims et al. (2006) argued that morphological data were congruent with their SSU trees. However, the characters discussed are either irrelevant to testing the CMB hypothesis, or ambiguous about it (e.g. spermatozoid structure [both the Coscinodiscophyceae and Mediophyceae have merogenous and hologenous spermatozoids]; pyrenoid structure [one type is apparently symplesiomorphically shared by the Coscinodiscophyceae and Mediophyceae, while pyrenoid structure in the Thalassiosirales is autapomorphic for the order]). Using the Medlin & Kaczmarska (2004) morphological character matrix, our tree excluded the Thalassiosirales from the Mediophyceae on the basis of auxospore characteristics. Nevertheless, it was claimed that the particular pattern of auxospore formation under discussion was retained in the Thalassiosirales (Medlin & Kaczmarska, 2004: 267). To make this argument under parsimony, the Thalassiosirales would have to be the sister group to all remaining Mediophyceae, a relationship not recovered in either Medlin & Kaczmarska (2004) or Sims et al. (2006).

Complicated scenarios are invoked ad hoc to explain the distribution of the four different Golgi body arrangements. Of the two widely distributed arrangements, the so-called Type 1 (sensu Medlin & Kaczmarska, 2004) arrangement was attributed to most of the Coscinodiscophyceae, and the Type 2 arrangement was attributed to the Aulacoseirales (Coscinodiscophyceae), Mediophyceae, and Bacillariophyceae. If Type 1 is apomorphic and Type 2 is not, then there is no evidence from the Golgi arrangement that the Aulacoseirales belong to the Coscinodiscophyceae. If Type 2 is apomorphic, regardless of the interpretation of Type 1, the Golgi character is congruent with our SSU trees and rejects the CMB hypothesis by placing the Aulacoseirales with the Mediophyceae and Bacillariophyceae. Nevertheless, Medlin & Kaczmarska (2004) argued away this incongruence, explaining the distribution of Golgi body types in terms of ancestral polymorphisms, implicitly invoking unobserved character conditions in unobserved ancestral species for as far back as the common ancestor to red algae and diatoms (Medlin & Kaczmarska, 2004: 265): ‘However, G-ER-M units are known from the oomycetes and the red algae, whereas an association of the Golgi around the nucleus is also known in the Labyrinthuloides. Thus, it would appear that both features are present in ancestors of the diatoms and the potential host cells of their plastids. It can be argued that the two traits then segregated themselves in the two separate lineages as they evolved.’

Conclusion

Medlin & Kaczmarska (2004) and Sims et al. (2006) proposed monophyly of each of the Coscinodiscophyceae, Mediophyceae, and Bacillariophyceae. Since the unavailability of the datasets (the only ones to support the CMB hypothesis apart from that of Medlin et al. (2008)) precluded direct reproduction of their results, we assembled datasets of similar size and characteristics. Our results suggest that the CMB hypothesis is rejected by SSU data, albeit very weakly. Similarly, our re-analysis of morphological evidence proposed by Medlin & Kaczmarska (2004) also weakly rejects the CMB hypothesis. Medlin & Kaczmarska (2004) very likely recovered a suboptimal MP tree for their 8600+ sequence dataset and Sims et al. (2006) very likely failed to converge on the true posterior distribution of trees in their Bayesian analysis. Conversely, if Medlin & Kaczmarska (2004) did recover the MP tree or trees and if the Sims et al. (2006) analysis did reach convergence for their dataset, then our results demonstrate that the likelihood of their having done so is highly dependent on taxon sampling and/or sequence alignment. We have demonstrated that the Medlin et al. (2008) tree supporting the CMB hypothesis is an artefact. We therefore conclude that the CMB hypothesis is far from robust, regardless of how one interprets the variation between studies.

In summary, pursuit of a well-supported phylogeny of diatoms seems to be limited as much by the number of characters per taxon as by the number of taxa for which data exist. There is a small but growing rbcL dataset which rejects the CMB hypothesis (Choi et al., 2008). Very limited coxI data support the CMB hypothesis, but analyses so far only include four species (Ehara et al., 2000). While SSU data are a useful addition to the difficult problem of inferring diatom phylogeny, continued use of SSU alone, as Patterson (1994: 185) wrote in a similar context, might simply be an ineffective attempt to ‘… wring truth from recalcitrant data.’

Acknowledgements

ECT was supported by NSF EF 0629410 and the Jane and Roland Blumberg Centennial Professorship in Molecular Evolution. AJA was supported by an NIH Ruth L. Kirschstein NRSA Postdoctoral Fellowship (1F32GM080079-01A1). Both also acknowledge the Tony Institute. RRG and JJC were supported by NIH GM067317.

References

  • Adl , SM , Simpson , AGB , Farmer , MA , Andersen , RA , Anderson , OR , Barta , JR , Bowser , SS , Brugerolle , GUY , Fensome , RA Fredericq , S . 2005 . A new higher level classification of eukaryotes with emphasis on the taxonomy of protists . J. Eukaryotic Microbiol. , 52 : 399 – 451 .
  • Alverson , AJ , Cannone , JJ , Gutell , RR and Theriot , EC . 2006 . The evolution of elongate shape in diatoms . J. Phycol. , 42 : 655 – 668 .
  • Alverson , AJ and Theriot , EC . 2005 . Comments on recent progress toward reconstructing the diatom phylogeny . J. Nanosci. Nanotechnol. , 5 : 57 – 62 .
  • Andersen , RA . 2004 . Biology and systematics of heterokont and haptophyte algae . Am. J. Bot. , 91 : 1508 – 1522 .
  • Bollback , JP . 2002 . Bayesian model adequacy and choice in phylogenetics . Mol. Biol. Evol. , 19 : 1171 – 1180 .
  • Cannone , JJ , Subramanian , S , Schnare , MN , Collett , JR , D'souza , LM , Du , Y , Feng , B , Lin , N , Madabusi , LV Muller , KM . 2002 . The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs . BMC Bioinformatics , 3 : 2
  • Cavalier-Smith , T and Chao , E . 2006 . Phylogeny and megasystematics of phagotrophic heterokonts (Kingdom Chromista) . J. Mol. Evol. , 62 : 388 – 420 .
  • Chepurnov , VA , Mann , DG , Von Dassow , P , Vanormelingen , P , Gillard , J , Inzé , D , Sabbe , K and Vyverman , W . 2008 . In search of new tractable diatoms for experimental biology . BioEssays , 30 : 692 – 702 .
  • Choi , H-G , Joo , HM , Jung , W , Hong , SS , Kang , J-S and Kang , S-H . 2008 . Morphology and phylogenetic relationships of some psychrophilic polar diatoms (Bacillariophyta) . Nov. Hedwig. Beih. , 133 : 7 – 30 .
  • Daugbjerg , N and Andersen , RA . 1997 . A molecular phylogeny of the heterokont algae based on analyses of chloroplast-encoded rbcL sequence data . J. Phycol. , 33 : 1031 – 1041 .
  • Ehara , M , Inagaki , Y , Watanabe , KI and Ohama , T . 2000 . Phylogenetic analysis of diatom coxI genes and implications of a fluctuating GC content on mitochondrial genetic code evolution . Curr. Genet. , 37 : 29 – 33 .
  • Goertzen , LR and Theriot , EC . 2003 . Effect of taxon sampling, character weighting, and combined data on the interpretation of relationships among the heterokont algae . J. Phycol. , 39 : 423 – 439 .
  • Goloboff , P , Farris , J and Nixon , K . 2003 . T.N.T.: Tree analysis using new technology . Available at: www.cladistics.com
  • Gutell , RR , Lee , JC and Cannone , JJ . 2002 . The accuracy of ribosomal RNA comparative structure models . Curr. Opin. Structural Biol. , 12 : 301 – 310 .
  • Gutell , RR , Power , A , Hertz , GZ , Putz , EJ and Stormo , GD . 1992 . Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods . Nucl. Acids Res. , 20 : 5785 – 5795 .
  • Gutell , RR , Weiser , B , Woese , CR and Noller , HF . 1985 . Comparative anatomy of 16S-like ribosomal RNA . Prog. Nucl. Acid Res. Mol. Biol. , 32 : 155 – 216 .
  • Hedtke , S , Townsend , T and Hillis , D . 2006 . Resolution of phylogenetic conflict in large data sets by increased taxon sampling . System. Biol. , 55 : 522 – 529 .
  • Hillis , DM . 1998 . Taxonomic sampling, phylogenetic accuracy and investigator bias . System. Biol. , 47 : 3 – 8 .
  • Hillis , DM , Pollock , DD , Mcguire , JA and Zwickl , DJ . 2003 . Is sparse taxon sampling a problem for phylogenetic inference? . System. Biol. , 52 : 124 – 126 .
  • Kooistra , W , De Stefano , M , Mann , DG , Salma , N and Medlin , LK . 2003 . Phylogenetic position of Toxarium, a pennate-like lineage within centric diatoms (Bacillariophyceae) . J. Phycol. , 39 : 185 – 197 .
  • Kooistra , WHCF and Medlin , LK . 1996 . Evolution of the diatoms (Bacillariophyta): IV. A reconstruction of their age from small subunit rRNA coding regions and the fossil record . Mol. Phylogenet. Evol. , 6 : 391 – 407 .
  • Larsen , N. , Olsen , G.J. , Maidak , B.L. , Mccaughey , M.J. , Overbeek , R. , Macke , T.J. , Marsh , T.L. and Woese , C.R. 1993 . The Ribosomal Database Project . Nucl. Acids Res. , 21 : 3021 – 3023 .
  • Medlin , LK and Kaczmarska , I . 2004 . Evolution of the diatoms V: Morphological and cytological support for the major clades and a taxonomic revision . Phycologia , 43 : 245 – 270 .
  • Medlin , LK , Kooistra , WHCF , Gersonde , R and Wellbrock , U . 1996a . Evolution of the diatoms (Bacillariophyta): II. Nuclear-encoded small-subunit rRNA sequence comparisons confirm a paraphyletic origin for the centric diatoms . Mol. Biol. Evol. , 13 : 67 – 75 .
  • Medlin , LK , Kooistra , WHCF , Gersonde , R and Wellbrock , U . 1996b . Evolution of the diatoms (Bacillariophyta): III. Molecular evidence for the origin of the Thalassiosirales . Nov. Hedwig. Beih. , 112 : 221 – 234 .
  • Medlin , LK , Kooistra , WHCF and Schmid , A-MM . 2000 . “ A review of the evolution of the diatoms–a total approach using molecules, morphology and geology ” . In The Origin and Early Evolution of the Diatoms: Fossil, Molecular and Biogeographical Approaches , Edited by: Witkowski , A and Sieminska , J . 13 – 36 . Kraków, Poland : Szafer Institute of Botany, Polish Academy of Sciences .
  • Medlin , LK , Sato , S , Mann , DG and Kooistra , WHCF . 2008 . Molecular evidence confirms sister relationship of Ardissonea, Climacosphenia and Toxarium within the bipolar centric diatoms (Bacillariophyta, Mediophyceae), and cladistic analyses confirm that extremely elongate shape has arisen twice in the diatoms . J. Phycol. , 44 : 1340 – 1348 .
  • Medlin , LK , Williams , DM and Sims , PA . 1993 . The evolution of the diatoms (Bacillariophyta). I. Origin of the group and assessment of the monophyly of its major divisions . Eur. J. Phycol. , 28 : 261 – 275 .
  • Nixon , K . 1999 . The parsimony ratchet, a new method for rapid parsimony analysis . Cladistics , 15 : 407 – 414 .
  • Nylander , JAA . 2004 . MrModeltest v2. Program distributed by the author. , Evolutionary Biology Centre, Uppsala University .
  • Patterson , C . 1994 . “ Null or minimal models ” . In Models in Phylogeny Reconstruction , Edited by: Scotland , R , Siebert , DJ and Williams , DM . 173 – 192 . Oxford, UK : Oxford University Press .
  • Pollock , DD , Zwickl , DJ , Mcguire , JA and Hillis , DM . 2002 . Increased taxon sampling is advantageous for phylogenetic inference . System. Biol. , 51 : 664 – 671 .
  • Round , FE , Crawford , RM and Mann , DG . 1990 . The Diatoms: Biology & Morphology of the Genera , Cambridge, UK : Cambridge University Press .
  • Simonsen , R . 1979 . The diatom system: Ideas on phylogeny . Bacillaria , 2 : 9 – 71 .
  • Sims , PA , Mann , DG and Medlin , LK . 2006 . Evolution of the diatoms: Insights from fossil, biological and molecular data . Phycologia , 45 : 361 – 402 .
  • Sorhannus , U . 2004 . Diatom phylogenetics inferred based on direct optimization of nuclear-encoded SSU rRNA sequences . Cladistics , 20 : 487 – 497 .
  • Sorhannus , U . 2007 . A nuclear-encoded small-subunit ribosomal RNA timescale for diatom evolution . Mar. Micropaleontol. , 65 : 1 – 12 .
  • Van De Peer , Y , Van Der Auwera , G and De Wachter , R . 1996 . The evolution of stramenopiles and alveolates as derived by “substitution rate calibration” of small ribosomal subunit RNA . J. Mol. Evol. , 42 : 201 – 210 .
  • Verbruggen , H and Theriot , EC . 2008 . Building trees of algae: Some advances in phylogenetic and evolutionary analysis . Eur. J. Phycol. , 43 : 229 – 252 .
  • Wilgenbusch , JC , Warren , DL and Swofford , DL . 2004 . AWTY: A system for graphical exploration of MCMC convergence in Bayesian phylogenetic inference . Available at: http://ceb.csit.fsu.edu/awty
  • Williams , DM and Kociolek , JP . 2007 . Pursuit of a natural classification of diatoms: History, monophyly and the rejection of paraphyletic taxa . Eur. J. Phycol. , 42 : 313 – 319 .
