Tree topology tests - Supplementary material

3.7 Supplementary material

4.6.4 Tree topology tests

To better assess the phylogenetic position of Rhizaria, we conducted topology compari-sons using the approximately unbiased (AU) test [Shimodaira 2002]. For each tested tree, site likelihoods were calculated using CODEML and the AU test was performed using CONSEL [Shimodaira and Hasegawa 2001] with default scaling and replicate values. To test the monophyly of the new assemblage SAR, we first compared our tree (Figure 1-ch.4) to the best possible tree in which Rhizaria were forced to be outside SAR, given topologi-cal constraints corresponding to a trichotomy of unikonts, stramenopiles + alveolates, and the rest of the groups represented as a multifurcation (Supplementary Figure S3-ch.4).

Secondly, we evaluated the placement of Rhizaria within the SAR clade by testing the three possible branching patterns between Rhizaria, stramenopiles, and alveolates.

4.7 Supplementary material

Supplementary Figure S1-ch.4. Best RAxML tree of eukaryotes. Numbers at nodes represent the result of the bootstrap analysis; black dots mean values of 100% (hundred bootstrap replicates were done). Nodes with support under 65% were collapsed.

Supplementary Figure S2-ch.4. MrBayes tree. Numbers at nodes represent the bayesian posterior probabilities.

Supplementary Figure S3-ch.4. Best TREEFINDER tree in which Rhizaria were forced not to belong to SAR.

Supplementary Table S1-ch.4. Abbreviated and complete protein names

Abbreviated name Protein name

actin Actin

14-3-3 14-3-3 protein

acaa1 Acetyl-Coenzyme A acyltransferase 1 akr1 Aldo-keto reductase family 1 aldhy Antiquitin ant1 Adenine nucleotide translocator arf3 ADP-ribosylation factor 3 arpc1 Adaptor-related protein complex 1 atp6 ATPase, H+ transporting, lysosomal calm3 Calmodulin calr Calreticulin

capz Capping protein

cct6A Chaperonin containing TCP1 cct-A T complex protein 1 alpha subunit cct-B T complex protein 1 beta subunit cct-D T complex protein 1 delta subunit cct-E T complex protein 1 epsilon subunit cct-N T complex protein 1 eta subunit cct-T T complex protein 1 theta subunit ctsl1 Cathepsin L1 preproprotein

drg1 Developmentally regulated GTP binding protein ef2 Elongation factor EF2

fh Fumarate hydratase precursor fibri Fibrillarin gdi2 GDP dissociation inhibitor 2

gnb Guanine nucleotide-binding protein, beta-3 subunit

gnb2l Guanine nucleotide binding protein (G protein), beta polypeptide 2-like 1 gnbpa Guanine nucleotide binding protein, alpha activating polypeptide O grc5 60S ribosomal protein L10 QM protein

h3 H3 histone

h4 Histone H4

hla-B HLA-B associated transcript 1

hmt1 HMT1 hnRNP methyltransferase-like 2 hsp70-C Heat shock 70kDa protein

hsp70-mt Heat shock 70kDa protein, mitochondrial form hsp90 Heat shock 90kDa protein 1

if2b Eukaryotic translation initiation factor 2b if2g Eukaryotic translation initiation factor 2g if6 Eukaryotic translation initiation factor 6

ino1 D-myo-inositol-3-phosphate synthase l12e-A 40S ribosomal Protein S12

l12e-D 60S ribosomal Protein L7a

mcm-A minichromosome family maintenance protein 5 metap2 Methionyl aminopeptidase 2

metk S-adenosyl-methionine synthetase

ndf1 NADH dehydrogenase (ubiquinone) flavoprotein 1

nop NOP5/NOP58 protein

nsf1-G Vacuolar protein sorting factor 4b

nsf1-J 26S proteasome AAA-ATPase regulatory subunit 6 nsf1-K 26S proteasome AAA-ATPase regulatory subunit 6a nsf1-l proteasome 26S ATPase subunit 2

nsf1-L 26S proteasome AAA-ATPase regulatory subunit 6b

oplah 5-oxoprolinase orf2 putative 28 kDa protein

osgep O-sialoglycoprotein endopeptidase pace2-A XPA binding protein

phb Prohibitin 2

pmca2 Ca2+-ATPase pp2A-b Protein phosphatase 2

psma-B 20S proteasome alpha 1a chain psma-C 20S proteasome alpha 1b chain psma-E 20S proteasome alpha 1c chain psma-F 20S proteasome alpha 3 chain psma-I 20S proteasome alpha 1e chain psmb-K 20S proteasome beta 7 chain psmb-L 20S proteasome beta 6 chain psmb-M 20S proteasome beta 4 chain psmc6 Proteasome 26S ATPase subunit 6c psmd 26S proteasome-associated pad1 homolog

pyk Pyruvate kinase

rac Small GTP binding protein Rac1 rad51-A DNA repair protein RAD51 ran Ras-related nuclear protein rap1 RAS-related protein RAP1B

rf1 Eukaryotic peptide chain release factor subunit 1 rpl1 60S ribosomal Protein 1

rpl11b 60S ribosomal Protein 11b rpl12b 60S ribosomal Protein 12b rpl15a 60S ribosomal Protein 13 rpl16b 60S ribosomal Protein 16b rpl17 60S ribosomal Protein 17 rpl18 60S ribosomal Protein 18 rpl19a 60S ribosomal Protein 19a rpl2 60S ribosomal Protein 2 rpl20 60S ribosomal Protein 20 rpl27 60S ribosomal Protein 21 rpl3 60S ribosomal Protein 3 rpl30 60S ribosomal Protein 30 rpl32 60S ribosomal Protein 32 rpl4B 60S ribosomal Protein 4b rpl5 60S ribosomal Protein 5 rpl7-A 60S ribosomal Protein 7a rpl9 60S ribosomal Protein 9

rpp0 60S acidic ribosomal protein P0 L10E rps1 40S ribosomal Protein 1

rps10 40S ribosomal Protein 10 rps13a 40S ribosomal Protein 13a rps14 Ribosomal protein S14 rps15 40S ribosomal Protein 15 rps16 40S ribosomal Protein 16 rps18 40S ribosomal Protein 18 rps2 40S ribosomal Protein 2 rps22a 40S ribosomal Protein 22a rps23 40S ribosomal Protein 23 rps3 40S ribosomal Protein 3 rps4 40S ribosomal Protein 4 rps8 40S ribosomal Protein 8 rps9 Ribosomal protein S9

sap40 40S ribosomal protein SA 40kDa laminin receptor 1 stk38 Serine/threonine kinase 38

suca Succinyl-CoA ligase alpha chain mitochondrial precursor?

tif Translation initiation factor

trs TARS protein

tubulin-A Alpha-tubulin tubulin-B Beta-tubulin tubulin-G Gamma-tubulin

ubc Ubiquitin-conjugating enzyme

vata Vacuolar ATP synthase catalytic subunit A vatb Vacuolar ATP synthase catalytic subunit B wd WDR45-like

Supplementary Table S2-ch.4. OTU (Operational Taxonomic Unit) names, number of characters, and percentage of characters included in the final alignment

Taxon Characters

Acanthamoeba castellanii 15112 50.52

Alexandrium 7004 23.41 Alexandrium fundyense (2%)

Alexandrium tamarense (43%)

Arabidopsis thaliana 29383 98.24

Bigelowiella natans 12789 42.76

Blastocystis hominis 11656 38.97

Chlamydomonas 19955 66.72 Chlamydomonas incerta (36%)

Chlamydomonas reinhardtii (69%)

Chlamydomonas sp (1%)

Cryptococcus neoformans 26795 89.59

Cryptosporidium 24814 82.96 Cryptosporidium hominis

(16%)

Cryptosporidium parvum (77%)

Cyanidioschyzon merolae 24066 80.46

Cyanophora paradoxa 13424 44.88

Dictyostelium discoideum 29206 97.65

Drosophila 29131 97.4 Drosophila melanogaster

(92%)

Drosophila pseudoobscura (4%)

Drosophila yakuba (1%)

Eimeria tenella 8077 27

Emiliania huxleyi 9523 31.84

Euglena gracilis 14840 49.61

Galdieria sulphuraria 9140 30.56

Glaucocystis nostochi-nearum

10887 36.4

Guillardia theta 10825 36.19

Gymnophrys cometa 4152 13.88

Hartmannella vermi-formis

11981 40

Histiona aroides 6350 21.23

Homo sapiens 29811 99.67

Isochrysis galbana 12531 41.89

Jakoba 13908 46.5 Jakoba bahamensis (39%)

Jakoba libera (41%)

Karlodinium micrum 9119 30.49

Leishmania 23497 78.56 Leishmania amazonensis (2%)

Leishmania tarentolae (2%)

Malawimonas 13391 44.77 Malawimonas californiana

(43%)

Malawimonas jakobiformis (51%)

Mastigamoeba balamuthi 15012 50.19

Mus musculus 29470 98.53

Neurospora crassa 27930 93.38

Oryza sativa 29192 97.6

Oxyrrhis marina 10169 34

Paramecium 24742 82.72 Paramecium caudatum (2%)

Paramecium tetraurelia (79%)

Pavlova lutheri 10738 35.9

Phaeodactylum tricornum 21096 70.53

Physcomitrella patens 18075 60.43

Phytophthora 25938 86.72 Phytophthora infestans (51%)

Phytophthora palmivora (1%) Phytophthora infestans (51%) Phytophthora palmivora (1%) Phytophthora parasitica (5%) Phytophthora ramorum (7%) Phytophthora sojae (72%)

Plasmodium 25454 85.1 Plasmodium berghei (4%)

Plasmodium yoelii (71%) Plasmodium falciparum (81%) Plasmodium chabaudi (2%)

Porphyra 14825 49.56 Porphyra purpurea (1%)

Porphyra yezoensis (62%)

Quinqueloculina sp 6370 21.29

Reclinomonas americana 14886 49.77

Reticulomyxa filosa 9411 31.47

Schizosaccharomyces pombe

27021 90.34

Tetrahymena 26980 90.2 Tetrahymena pyriformis (8%)

Tetrahymena thermophila (92%)

Thalassiosira pseu-donana

27578 92.2

Theileria 21660 72.42 Theileria annulata (62%)

Theileria parva (59%)

Toxoplasma gondii 19843 66.35

Trypanosoma 25063 83.8 Trypanosoma brucei (76%)

Trypanosoma cruzi (66%)

Ustilago maydis 26469 88.5

Supplementary Table S3-ch.4. Percentage of missing data per species and per genes.

Not shown in this manuscript because of its very large size, but is available at:

http://www.plosone.org/article/fetchFirstRepresentation.action?uri=info:doi/10.1371/journal.pone.0000790.s006 or upon request.

Chapter 5: Phylogenomics reveals a new ‘Megagroup’ including most photosynthetic eukaryotes

F. Burki, Shalchian-Tabrizi K & Pawlowski J

Published in: Biology Letters, 44 : 366-369, 2008

5.1 Project description

Following the same general approach as described in chapter 4, we increased the taxon-sampling and added a few genes to our concatenated alignment to improve further the resolution of the tree. To our great satisfaction, this work led to better resolved deep nodes within the tree of eukaryotes.

5.2 Abstract

Advances in molecular phylogeny of eukaryotes have suggested a tree composed of a small number of supergroups. Phylogenomics recently established the relationships between some of these large assemblages, yet the deepest nodes are still unresolved. Here, we inves-tigate early evolution among the major eukaryotic supergroups using the broadest multigene dataset to date (65 species, 135 genes). Our analyses provide strong support for the clustering of plants, chromalveolates, rhizarians, haptophytes and cryptomonads, thus linking nearly all photosynthetic lineages and raising the question of a possible unique ori-gin of plastids. At its deepest level, the tree of eukaryotes receives now strong support for two monophyletic megagroups comprising most of the eukaryotic diversity.

5.3 Introduction

Resolving the global tree of eukaryotes is one of the most important goals in evolution-ary biology. Molecular phylogenies, morphology, and biochemical characteristics have al-lowed the division of the majority of eukaryotic diversity into five or six putative super-groups (reviewed in [Keeling et al. 2005; Lane and Archibald 2008]); these comprise the opisthokonts and Amoebozoa (united as ‘unikonts’, [Cavalier-Smith 2002]), Plantae (or Ar-chaeplastida), Excavata, Chromalveolata, and Rhizaria (often considered as members of the so-called ‘bikonts’, [Stechmann and Cavalier-Smith 2003a]). Recent phylogenomic recon-structions based on large sequence datasets have been used to infer the relationships be-tween some of these large assemblages, notably Rhizaria have been shown to share a com-mon origin with members of the chromalveolates [Burki et al. 2007; Hackett et al. 2007;

Rodriguez-Ezpeleta et al. 2007a]. However, the order of divergence among the deepest nodes remains uncertain, particularly the relationships between plants, chromalveolates, and other photosynthetic lineages (haptophytes, cryptomonads). In order to investigate early evolution among eukaryotic supergroups, we have assembled the broadest dataset to date (65 species, 135 genes representing 31,921 amino acids) and show that the eukaryotes can be divided into two highly supported monophyletic megagroups and a few less diversi-fied lineages related to the excavates.

Figure 1-ch.5. Bayesian unrooted phylogeny of eukaryotes, with a basal trichotomy representing uncertainties in the relationships between the three groups. Tree obtained from the consensus between two independent Markov chains, ran under the CAT model implemented in Phylobayes. The species colour code corresponds to the type of plastid pigments, as following: purple: chlorophyll a; green:

chlorophyll a+b; red: chlorophyll a+c. The stars represent primary, secondary, or tertiary endosym-biosis. Underlined numbers at nodes represent posterior probabilities (PP) of the analysis performed with the constant sites removed / analysis performed with all sites; other numbers represent the result of the ML bootstrap analysis (BS) - Node 1 below the line: ML analysis of the full-length alignment // ML analysis with category 7 removed / ML analysis with catergories 6 + 7 removed. Black dots correspond to 1.0 PP and 100% BS; black squares correspond to 1.0 PP and the specified values of

5.4 Results

We first performed a bayesian analysis on a species-rich dataset, using the powerful CAT model which has been developed to overcome systematic errors due to homoplasy [Lartillot and Philippe 2004; Lartillot et al. 2007] (Figure 1-ch.5). The tree obtained is in agreement with previously published studies; it strongly supports monophyletic groupings of unikonts (Amoebozoa, fungi, and animals), excavates, plants, stramenopiles + alveolates + Rhizaria (SAR), and haptophytes + cryptomonads (HC). This latter group appears as sister to plants, with 1.0 bayesian posterior probabilities (PP) when the constant sites were removed and 0.92 PP with the full length alignment. Remarkably, the plants + HC clade forms a strongly supported monophyletic megagroup with the SAR assemblage (1.0 PP, node 1), revealing an ancient split in the eukaryote evolution and resolving almost entirely the relationships within most ‘bikont’ supergroups.

This new megagroup received relatively low support (73% bootstrap support, BS) in the maximum likelihood (ML) analysis of the complete dataset (Figure 1-ch.5). However, because we are investigating relationships deriving from very ancient splits in the eukary-otic tree, it is likely that multiple substitutions occurred at several sites in our alignment, decreasing the true phylogenetic signal and rendering standard site-homogeneous models based on empirical matrices of amino-acid replacement (such as WAG) less accurate. To test this further, we investigated the effect of the exclusion of the fastest evolving-sites, which are more likely to be saturated and thus be the cause of model violations [Rodriguez-Ezpeleta et al. 2007b]. Not surprisingly, the removal of the noisiest positions led to a drastic increase in the statistical support for the new megagroup (94% and 97% BS when categories 7 and 6 + 7 were removed, respectively - Figure 1-ch.5).

5.5 Discussion

At its deepest level, the tree of eukaryotes presented here displays only three stems, i.e. the two highly supported megagroups, enclosing the vast majority of eukaryotic species, and the excavates. If the monophyly of excavates is further confirmed and a strong support is recovered for their possible sister position to the new megagroup, we may well be able to provide independent evidence (based on phylogenetic reconstructions) for the concept of the two primary clades of eukaryotes - unikonts and bikonts [Stechmann and Cavalier-Smith 2003b; Richards and Cavalier-Cavalier-Smith 2005]. This model, however, would need to be modified as the widely used dihydrofolate reductase-thymidylate synthase (DHFR-TS) gene fusion is questionable for several reasons (see discussion in [Kim et al. 2006]). Of

course this does not rule out the possibility that some protists, such as Telonemia or the centrohelid heliozoans that have not been placed with confidence yet [Shalchian-Tabrizi et al. 2006a; Sakaguchi et al. 2007], might represent additional independent lineages. But gen-erally, we believe that most eukaryotes fall into one of these megagroups.

As we are getting closer towards a fully resolved phylogeny for the eukaryotes, an ob-vious question of crucial importance is the position for the root. We chose, however, to show an unrooted tree as the absence of compelling information leaves the rooting of the eukaryotic tree an open question. Over the past few years independent data proposed a root lying either between unikonts and bikonts [Stechmann and Cavalier-Smith 2003b] or within excavates, e.g., basal to jakobids [Rodriguez-Ezpeleta et al. 2007a] or on the branch leading to diplomonads/parabasalids [Arisue et al. 2005]. In the absence of evidence for rooting the eukaryotes within the plants + HC + SAR megagroup, the plausible rooting scenarios together with our tree consistently suggest that this assemblage is holophyletic.

Our results bring convincing support for the clustering of almost all photosynthetic groups in a unique clade (with the notable exception of the second-hand green plastids in Euglenozoa, belonging to the excavates), and sustain a single primary endosymbiotic event as also suggested by gene-based models of the import machinery [Mcfadden and van Dooren 2004]. The strongest scenario to date for the evolution of primary plastid-containing species is that a unique endosymbiosis involving a cyanobacterium took place in the last common ancestor of Plantae (see [Bhattacharya et al. 2007]). The trees presented here al-low the possibility that the primary plastid was established even before in one of the an-cestors of the new megagroup, and was subsequently lost and independently replaced by plastids of secondary origin in several lineages (HC, Rhizaria, alveolates and stramenopiles), corroborating the hypothesis of an early chloroplast acquisition in eukaryotes based on the phylogeny of the 6-phosphogluconate dehydrogenase gene [Andersson and Roger 2002] (see also [Nozaki 2005] for a more general discussion). We speculate that the high observable diversity of plastids within the new megagroup can be traced back to its last common an-cestor, and is the consequence of an increased capability of all its members to accept and keep plastids or plastid-bearing cells.

5.6 Materials & methods

Our multigene dataset was assembled according to a custom pipeline, as followed: a) construction of databases made of all existing sequences for species specifically selected for

http://www.ncbi.nlm.nih.gov/ and http://amoebidia.bcm.umontreal.ca/pepdb/searches/

welcome.php ; b) BLAST searches against these databases using as queries the single-gene sequences composing our previously described multiple alignments [Burki et al. 2007]; c) retrieving (with a stringent e-value cutoff at 10-50) and addition of the new homologous copies to the existing single-genes alignments; d) automatic alignments using Mafft [Katoh et al. 2002], followed by manual inspection to extract unambiguously aligned positions; e) testing the orthology, in particular possible lateral or endosymbiotic gene transfer, for each of the selected genes by performing single-gene maximum likelihood (ML) reconstructions using Treefinder (WAG, 4 gamma categories; [Jobb et al. 2004]); the final concatenation of all single-gene alignments was done using Scafos [Roure et al. 2007]. Because of the limited data for certain groups and to maximize the number of genes by taxonomic assemblage, some lineages were represented by different closely related species always belonging to the same genus (supplementary material, this chapter). Potential interesting species with full genome available, such as the excavates Giardia and Trichomonas, or the red algae Cya-nidioschyzon, have been discarded from our taxon sampling because of their extreme rate of sequence evolution or their demonstrated tendency to led to systematic errors in phylo-genies [Rodriguez-Ezpeleta et al. 2007b].

The concatenated alignment was analyzed using both bayesian (BI) and maximum likelihood (ML) frameworks, with Phylobayes v.2.3 [Lartillot and Philippe 2004] and RAxML-VI-HPC v.2.2.3 [Stamatakis 2006], respectively. Phylobayes was run using the site-heterogeneous mixture CAT model and two independent Markov chains with a total length of 10,000 cycles, discarding the first 4000 points as burn-in and calculating the pos-terior consensus on the remaining 6000 trees. The convergence between the two chains was checked and always led to the exact same tree, except for uncertainties in the order of divergence between the glaucophytes, the red algae, and HC. In order to reduce mixing problems of the chains, the constant sites were removed from the alignment in a subse-quent analysis. The convergence was in this case much quicker, after only 5000 cycles (burn-in of 1000), and HC was unambiguously positioned as sister to the Plantae. RAxML was used in combination with the WAG amino acid replacement matrix and stationary amino acid frequencies estimated from the dataset. The best ML tree was determined with the PROTMIX implementation, in a multiple inferences using 20 randomized maximum parsimony (MP) starting tree. Statistical support was evaluated with 100 bootstrap repli-cates. Two independent runs were performed on each replicate, using a different starting tree (maximum parsimony - MP, and the best ML tree), in order to prevent the analysis to get trapped in a local maximum. The tree with the best log likelihood was selected for each replicate, and the 100 resulting trees were used to calculate the bootstraps

propor-tions. To save computational burden, the PROTMIX solution was chosen with 25 distinct rate categories. To minimize potential systematic errors associated with saturation and ho-moplasy, the fast-evolving sites were identified using PAML [Yang 1997], given the 20 to-pologies obtained in the ML analysis. Sites were classified according to their mean site-wise rates and ML bootstrap values were computed from shorter concatenated alignments with sites corresponding to categories 7 and 6 + 7 removed.

5.7 Supplementary material

Supplementary Table S1-ch.5. OTU (Operational Taxonomic Unit) names and chimera construc-tions. Table listing the full species names used to build the concatenated alignment, as well as specify-ing the species used to construct chimera. The new species in our alignment, compared to our previ-ously published dataset [Burki et al. 2007], are marked in bold.

Taxon Species in chimera

Acanthamoeba castellanii Glaucocystis nostochinearum Gr a ci la r i a ch a ngi i

Na egler i a Naegleria_gruberi (90%) Phaeodactylum tricornutum Ph y co m y ces bla kes leea nus Schizosaccharomyces pombe So r gh um bi co lo r

Supplementary Table S2-ch.5. Percentage of missing data per species and per genes. Detailed list of the 135 genes used, precising the amount of missing data per species

Not shown in this manuscript because of its very large size, but is available at:

http://dx.doi.org/10.1098/rsbl.2008.0224 or upon request.

Chapter 6: Early evolution of eukaryotes: two enigmatic

heterotrophic groups are related to photosynthetic

chromalveolates

F. Burki, Inagaki Y, Brate J, Archibald JM, Keeling PJ, Cavalier-Smith T, Horak A, Sakaguchi M, Hashimoto T, Klaveness D, Jakobsen KS, Pawlowski J & Shalchian-Tabrizi K

Submitted to: Genome Biology and Evolution

6.1 Project description

We have seen in the previous chapters that phylogenomics is a powerful tool that al-lows one to infer ancient divergences in the eukaryotic tree. It is also a method of choice to reconstruct the phylogenetic position of species that have remained unplaced in the global phylogeny, despite years of effort using single or few genes (see for example {Minge, 2009, p08461} in the annexes section of this manuscript). In this chapter, we describe our recent analyses of two such orphan groups, telonemids and centrohelid heliozoans. In order to obtain sufficient amounts of genomic data to position these groups using our phyloge-nomic dataset, we sequenced by 454 pyrosequencing dedicated cDNA libraries for Te-lonema subtilis and Raphidiophrys contractilis.

Dans le document A phylogenomic contribution to the eukaryotic tree of life (Page 91-112)