A phylogenomic contribution to the eukaryotic tree of life

Thesis

Reference

A phylogenomic contribution to the eukaryotic tree of life

BURKI, Fabien

Abstract

In recent years, molecular phylogeny, that is, the study of evolutionary relationships among living organisms by comparing DNA or amino acid sequences, has profoundly changed our view of the eukaryotic tree. The scheme currently accepted by most researchers recognizes five major assemblages of organisms encompassing all eukaryotic species: the ‘supergroups' (unikonts, excavates, Plantae, chromalveolates and Rhizaria). While the existence of these supergroups is a matter of consensus, the phylogenetic relationships linking them are still very uncertain. Phylogenomics is a new tool in evolutionary biology that makes it possible to address important questions that had until now remained unanswered for lack of sufficient information. Thanks to the accumulation of genomic data, it is becoming possible to use an ever larger number of molecular markers (genes or proteins) for a growing diversity of organisms in order to reconstruct phylogenetic relationships at the scale of eukaryotes. [...]

BURKI, Fabien. A phylogenomic contribution to the eukaryotic tree of life. Doctoral thesis: Univ. Genève, 2009, no. Sc. 4077

URN : urn:nbn:ch:unige-18450

DOI : 10.13097/archive-ouverte/unige:1845


Department of Zoology and Animal Biology

Professor Jan Pawlowski

A phylogenomic contribution to the eukaryotic tree of life

THESIS

Presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science (Docteur ès sciences), specialization in biology

by

Fabien BURKI

of

Geneva (Switzerland)

Thesis No. 4077

GENEVA

ReproMail: print shop at Uni Mail

2009


I wish to express here my deepest gratitude to everyone who, from near or far, gave me their support and trust throughout this adventure.

To Professor Jan Pawlowski first of all… Jan, what can I say in particular, when so many thoughts come to mind? We met 8 years ago, when I came knocking at your door for a diploma project. 8 years! After a period of uncertainty when I no longer quite knew which path to follow, you once again placed your trust in me by accepting me as a doctoral student, four and a half years ago.

Since then, it has been nothing but happiness and discovery. Thank you for the freedom you gave me. Thank you for your availability, your attentiveness, your advice.

To Professor Kamran Shalchian-Tabrizi. Thank you so much for all your help, our countless discussions over Skype, and the aquavit. You’re more than just a colleague!

To Professor Louisette Zaninetti, for the wonderful memory of my diploma years, which brought me back to knock at the lab's door.

To Juan, for your dedication, your availability and your explanations, always so clear that in fact… one understands them, even when they are about phylogeny…

To Jackie, for your knowledge, your contribution, and… our discussions. I hope with all my heart that our paths will cross again.

To José, thank you for your constant availability, your enthusiasm and your help.

To Loic… no, I will not say it here. But you alone know what I am talking about. Simply, thank you for these 2 years; you are forever in my heart… yeah, well…

To Béa, even if this corridor was like an insurmountable barrier… May your boundless enthusiasm never fade.

To Thierry: hey, mate, being Vaudois doesn't mean you can't be on the list. Are you actually Vaudois? Where is Les Brenets again?

To the diploma students Michael and Cyril, with whom the lab has already secured a bright future.

To Ignacio Bolivar, for your invaluable help during the first faltering steps of this work.

To the former members: Cédric, Ben, Xav, Patrick, Fred, Sasha and Yurika. Because you rocked part of my thesis with your gentle madness. Your departures left a big void.


Asmund Skjaeveland, Marianne Minge, Dag Klaveness, Kjetill Jakobsen from Oslo; John Archibald from Halifax; Patrick Keeling, Ales Horak from Vancouver; Yuji Inagaki, Tetsuo Hashimoto, Miako Sakaguchi from Tsukuba; Hervé Philippe from Montréal; Thomas Cavalier-Smith from Oxford; Colomban de Vargas, Ian Probert from Roscoff; Sergey Nikolaev from Geneva.

To the computers, a non-human but indispensable part of this kind of work, and to all the people who keep these clusters running. Vital-IT at the Swiss Institute of Bioinformatics and the Bioportal at the University of Oslo made it practically possible to complete this thesis in less than 20 years. I particularly wish to thank Bruno Nyffeler, Jacques Rougemont and Volker Flegel at Vital-IT, as well as Asmund Skjaeveland and Pal Enger at the Bioportal, for their invaluable help in countless debugging sessions. I would also like to thank Lorenza Bordoli, Vassilios Ioannidis and Laurent Falquet for their valuable advice through the twists and turns of an apprentice bioinformatician's training.

To the Canadian consortium Protist EST Program (PEP), which has made several EST projects publicly available, allowing us to greatly improve the taxon sampling of our alignments.

I am infinitely grateful to the Swiss National Science Foundation, the State of Geneva, the Department of Zoology and Animal Biology and the Fondation Ernst & Lucie Schmidheiny for their financial support.

I am very grateful to John Archibald, Kjetill Jakobsen and Michel Milinkovitch for agreeing to evaluate my PhD.

To those I am forgetting… few, I hope.

Finally, an enormous, immense, gigantic THANK YOU to my family: my mother, my father, my brother. It is above all thanks to your unconditional and lifelong encouragement that I made it.

And… Thanks, Grazie, Merci to you, Babs, for… everything. Thank you for being you, thank you for being there. Thank you for putting up with the difficult moments, when the stress became too much. Thank you for your support, your encouragement. Thank you, thank you, thank you. The adventure continues, together.


In recent years, molecular phylogeny, that is, the study of evolutionary relationships among living organisms by comparing DNA or amino acid sequences, has profoundly changed our view of the eukaryotic tree. The scheme currently accepted by most researchers recognizes five major assemblages of organisms encompassing all eukaryotic species: the ‘supergroups’ (unikonts, excavates, Plantae, chromalveolates and Rhizaria). While the existence of these supergroups is a matter of consensus, the phylogenetic relationships linking them are still very uncertain. Phylogenomics is a new tool in evolutionary biology that makes it possible to address important questions that had until now remained unanswered for lack of sufficient information. Thanks to the accumulation of genomic data, it is becoming possible to use an ever larger number of molecular markers (genes or proteins) for a growing diversity of organisms in order to reconstruct phylogenetic relationships at the scale of eukaryotes. Recent publications reporting analyses of very large multigene alignments (more than 120 concatenated genes) encompassing all the supergroups have shown that it is now possible to reach ever further back in time and to resolve evolutionary relationships for which only a very large amount of data contains sufficient signal.

Fully embracing this new genomic approach to phylogeny, we first obtained EST (Expressed Sequence Tag) libraries for 3 species belonging to the supergroup Rhizaria: two foraminifers, Reticulomyxa filosa and Quinqueloculina sp., and one cercozoan, Gymnophrys cometa (recently renamed Limnofila borokensis). In total, more than 4500 sequences representing genes expressed at the time of RNA extraction were analyzed, which constitutes to date the largest dataset available for this major assemblage of eukaryotes. Having emerged in the late 1990s, Rhizaria are a group so far defined solely on the basis of molecular sequences. Although recognized as one of the five eukaryotic supergroups, they had, until our study, remained outside the discussion fueled by phylogenomics because of insufficient data. Our initial objective was simple: to use multigene alignments to 1) confirm the monophyly of Rhizaria and, above all, 2) position this supergroup in the eukaryotic tree, in other words to investigate the phylogenetic relationships among the various putative supergroups.

We therefore began by obtaining about 2000 ESTs for R. filosa (chapter 2), which allowed us, together with the only rhizarian library publicly available at the time (Bigelowiella natans), to build an alignment comprising 85 proteins (about 13'000 amino acids) and 37 species (including two Rhizaria) (chapter 3). Rhizaria thus made their entry into the field of phylogenomics. This study confirmed, using dozens of genes for the first time, that this group did indeed have a common ancestor. It could not, however, convincingly establish its position in the eukaryotic tree, as our alignment did not contain enough phylogenetic signal.

We then continued our sampling to increase our data, sequencing more than 2500 ESTs for two new rhizarian species (Quinqueloculina [...]

[...] of data: 123 genes representing about 30'000 amino acids for 50 eukaryotic species evenly distributed across the tree (chapter 4). This time, very interestingly, the analysis of this multigene alignment indicates that two of the supposed supergroups (Rhizaria and chromalveolates) are in fact closely related. In this study we propose a new tree containing only four major assemblages of eukaryotes, and we group Rhizaria and members of the chromalveolates under the new name ‘SAR’ (Stramenopiles, Alveolates, Rhizaria). The monophyly of ‘SAR’ implies that the largest known biodiversity of protists shares a common ancestor. This new relationship also has important consequences, notably for our understanding of the evolution of photosynthetic organisms through the various endosymbioses.

To estimate the impact of taxonomic sampling on the resolution of our phylogenetic trees, we then extended our alignment by including more species (65 species) (chapter 5). The analysis of this alignment made it possible to reach even further back in time by revealing the phylogenetic relationship between ‘SAR’ and the plants (Plantae). This relationship is intriguing, notably because it evolutionarily links together all photosynthetic species.

We also used the phylogenomic approach to try to resolve the phylogenetic position of species for which analyses based on only a few genes had remained inconclusive (chapter 6). Specifically, we investigated the origin of two enigmatic groups of eukaryotes, the telonemids and centrohelids, which are among the last without a real position within the eukaryotic tree, and we suggest that they belong to the same clade that also contains the haptophytes and cryptomonads. This group in fact corresponds to a new major assemblage of eukaryotes which, because of the species it comprises and its evolutionary position, becomes central to understanding the distribution of plastids resulting from secondary endosymbiosis with a red alga. We also took part in the study of another species without a clear phylogenetic position, Breviata anathema, which is nevertheless crucial to our understanding of the early phases of eukaryote evolution (see annexes).

Finally, we are currently working on the molecular dating of the eukaryotic tree. Knowing when the main groups of eukaryotes diverged from one another is fundamental to better understanding these major evolutionary steps. Until now, such studies have suffered on the one hand from the lack of molecular data for enough species and on the other hand from the lack of precise fossil calibration points to provide a framework for temporal inference. We propose here to address both problems by massively sequencing three species with a good microfossil record and by using these new calibrations in a molecular dating analysis based on our multigene alignment. The motivations for this project are presented in chapter 7.


Listed here, in alphabetical order, are the affiliations of all authors who participated (or are participating) in this work outside the Department of Zoology and Animal Biology at the University of Geneva:

John M. Archibald: Dalhousie University, Department of Biochemistry and Molecular Biology, Halifax, Nova Scotia, B3H 1X5, Canada

Jon Brate: University of Oslo, Department of Biology, N-0316 Oslo, Norway

Thomas Cavalier-Smith: University of Oxford, Department of Zoology, South Parks Road, OX1 3PS, UK

Colomban de Vargas: Station Biologique, Equipe Evolution du Plancton et PaléoOcéans, 29682 Roscoff, France

Tetsuo Hashimoto: University of Tsukuba, Center for Computational Sciences, Institute for Biological Sciences, Ibaraki 305-8577, Tsukuba, Japan

Ales Horak: University of British Columbia, Botany Department, Vancouver, BC, V6S 1T4, Canada

Yuji Inagaki: University of Tsukuba, Center for Computational Sciences, Institute for Biological Sciences, Ibaraki 305-8577, Tsukuba, Japan

Kjetill S. Jakobsen: University of Oslo, Department of Biology, N-0316 Oslo, Norway

Patrick J. Keeling: University of British Columbia, Botany Department, Vancouver, BC, V6S 1T4, Canada

Dag Klaveness: University of Oslo, Department of Biology, N-0316 Oslo, Norway

Marianne A. Minge: University of Oslo, Department of Biology, N-0316 Oslo, Norway

Sergey I. Nikolaev: University of Geneva, Department of Genetic Medicine and Development, 1 rue Michel-Servet, 1211 Geneva, Switzerland

Hervé Philippe: Université de Montréal, Centre Robert Cedergren, Département de Biochimie, Montréal, Québec H3T1J4, Canada

Ian Probert: Station Biologique, Equipe Evolution du Plancton et PaléoOcéans, 29682 Roscoff, France

Miako Sakaguchi: University of Tsukuba, Center for Computational Sciences, Institute for Biological Sciences, Ibaraki 305-8577, Tsukuba, Japan

Kamran Shalchian-Tabrizi: University of Oslo, Department of Biology, N-0316 Oslo, Norway

Asmund Skjaeveland: University of Oslo, Department of Biology, N-0316 Oslo, Norway


Foreword

Abstract

Chapter 1: Introduction
1.1 Motivation
1.2 The tree of eukaryotes
1.2.1 World of Kingdoms
1.2.2 Molecular r-evolution: the SSU rRNA
1.2.3 Time for deconstruction
1.2.4 Groundwork for reconstructing
1.2.5 Where is the root?
1.2.6 Plastid evolution
1.3 Phylogenomics
1.3.1 A definition
1.3.2 How does phylogenomics work?
1.3.3 Stochastic and systematic errors
1.3.4 So: can phylogenomics answer important questions?
1.3.5 The case of EGT in the context of phylogenomics

Chapter 2: Analysis of expressed sequence tags from a naked foraminiferan Reticulomyxa filosa
2.1 Project description
2.2 Abstract
2.3 Introduction
2.4 Result & discussion
2.4.1 Sequencing & clustering
2.4.2 Comparisons with databases
2.4.3 Functional annotation
2.5 Materials & methods
2.5.1 Cells and culture conditions
2.5.2 cDNA construction and ESTs sequencing
2.5.3 Sequence processing and analysis

Chapter 3: [...] multigene phylogeny of unicellular Bikonts
3.1 Project description
3.2 Abstract
3.3 Introduction
3.4 Results
3.4.1 Sequences and alignments
3.4.2 Phylogenetic position of Rhizaria
3.5 Discussion
3.6 Materials & methods
3.6.1 Construction of the alignment
3.6.2 Phylogenetic analyses
3.6.3 Testing phylogenies
3.7 Supplementary material

Chapter 4: Phylogenomics reshuffles the eukaryotic supergroups
4.1 Project description
4.2 Abstract
4.3 Introduction
4.4 Results
4.4.1 Single-gene analyses and concatenation
4.4.2 Phylogenetic position of Rhizaria
4.5 Discussion
4.6 Materials & methods
4.6.1 Sampling, culture and constructions of cDNA libraries
4.6.2 Construction of the alignments
4.6.3 Phylogenomic analyses
4.6.4 Tree topology tests
4.7 Supplementary material

Chapter 5: Phylogenomics reveals a new ‘Megagroup’ including most photosynthetic eukaryotes
5.1 Project description
5.2 Abstract
5.3 Introduction
5.4 Results
5.5 Discussion

Chapter 6: [...] related to photosynthetic chromalveolates
6.1 Project description
6.2 Summary
6.3 Results and Discussion
6.3.1 Evolutionary origin of Telonemia and Centroheliozoa
6.3.2 An emerging group of great diversity, and the expansion of chromalveolates
6.3.3 Implications for plastid evolution
6.4 Experimental procedures
6.4.1 Cultures
6.4.2 cDNA library construction and 454 pyrosequencing
6.4.3 Contig assembly and sequence alignment
6.4.4 Phylogenetic analyses
6.5 Supplementary material
6.5.1 Analyses after removing both, or one of T. subtilis or R. contractilis
6.5.2 Topology comparisons based on the supermatrices
6.5.3 Supplementary Table and Figures

Chapter 7: General conclusions and perspectives
7.1 Achievements
7.2 Origin and spread of chlorophyll-c containing plastids, and the early photosynthetic eukaryotes
7.3 A molecular time-scale for eukaryote evolution: combining phylogenomics and the continuous microfossil record
7.4 Other perspectives

Chapter 8: Literature cited

Chapter 9: Annexes
9.1 Other projects in which I have been involved during my PhD
9.2 Journal-formatted copy of the published chapters
9.3 Articles related to our work


Foreword

This manuscript describes the research I started in September 2004 as a PhD student in molecular phylogenetics of eukaryotes. At that time, the evolutionary tree of eukaryotes was in a period of reconstruction after the weaknesses of rDNA-based phylogenies had been demonstrated. Four and a half years later much progress has been made, as we shall see, and a new picture of the evolutionary history of eukaryotes is emerging.

The following chapters are arranged chronologically, and I have tried to account for the gradual changes permitted by the successive release of new data. Of course the very last, most complete trees that we obtained only a couple of weeks ago could almost say it all, summarizing by themselves the main results of this thesis. But they would not accurately detail the journey I wish to relate in this manuscript, a journey through the recent modifications of the eukaryotic tree.

Chapters 2, 3, 4 and 5 correspond to the work we published during the course of this PhD. They are preceded by a general introduction and followed by an as yet unpublished chapter that explains our most recent results. Finally I conclude the main part of this manuscript with some important comments that have been raised by our research, and present the motivations for an ongoing project that addresses the timing of eukaryote evolution.

The central chapters all start with a brief ‘project description’ that was added here for the sake of clarity: its purpose is to situate the reader in the context of the time the study was performed.

Because this manuscript is largely a collection of papers, I would like to warn the reader: it inevitably contains redundancy, particularly in the different introduction and discussion sections.

To obtain a unity in the format of the different chapters, I chose to present the manuscript version of the published articles. If the reader prefers the journal-formatted version, the original papers can be found in the annexes.


Abstract

Resolving the global eukaryotic tree of life remains one of the most important and challenging tasks facing biologists. A phylogeny supporting the evolutionary relationships among all eukaryotic lineages would provide a fundamental framework for broad comparative genomics, as more and more completed genomes are being released for an ever broader taxonomic sampling. For example, an important question directly related to the structure of the tree of eukaryotes concerns the origin and spread of photosynthesis. Indeed, the process of endosymbiosis, which eventually led to plastids, has been responsible for some of the most significant events in evolution, but to fully tackle this question a robust tree is the first requirement, as it provides the support on which different hypotheses can be tested.

In the last decade, molecular phylogenetic trees have gradually assigned most eukaryotes to one of five or six putative very large assemblages, the so-called "supergroups". These comprise the ‘Opisthokonts’ and ‘Amoebozoa’ (often united in the ‘Unikonts’), ‘Archaeplastida’ or ‘Plantae’, ‘Excavata’, ‘Chromalveolata’, and ‘Rhizaria’. The strength of the evidence supporting these supergroups has been subject to much debate and, importantly, the relationships between them are yet to be confirmed. These uncertainties are largely due to the limited amounts of data available until recently for most parts of eukaryotic diversity. In particular, only a small fraction of the unicellular eukaryotes has been subject to molecular studies, leading to important imbalances in phylogenies and preventing researchers from reliably inferring deep evolutionary relationships.

However, several sequencing efforts within the last couple of years have permitted a radical change in the inference of phylogenetic trees at high taxonomic levels (i.e., among the supergroups). Reconstructing the evolutionary history among the eukaryotic groups is no longer seen as a job for a few genes (with all the uncertainties related to this lack of data) or for many genes with poor taxon sampling. Instead, huge phylogenomic datasets, involving the analysis of more than 100 genes, can now be used to reconstruct the evolutionary steps that led to the current diversity of eukaryotes.


[...] and started based on a simple observation: phylogenomic studies at the eukaryotic level were all lacking Rhizaria, which was an important problem considering that Rhizaria represent one fifth of the recognized supergroups and include a large diversity of very different taxonomic groups of protists. The reason for this was that genomic data were essentially absent from databases. Thus we generated Expressed Sequence Tags (ESTs) for several rhizarian species. We first obtained ESTs for a species belonging to the important group of foraminifers (chapter 2). We then reported a large-scale analysis of eukaryote phylogeny including data for 2 rhizarian species, meaning that for the first time representatives of every supergroup were analyzed together using a phylogenomic approach (chapter 3). Our results, based on a dataset of 37 species and 85 genes, confirmed the putative monophyly of Rhizaria. This was of interest as this supergroup is still defined only by molecular characters. At the same time this project shed light on the great difficulties one would face when trying to infer the evolutionary relationships between the major groups of unicellular bikonts, even when more than 10'000 amino acid sites are involved. Overall our trees were poorly resolved within the bikont part and we concluded that what was needed was longer alignments and, perhaps more importantly, a better taxonomic sampling.

Following our previous conclusions, we attempted to build a much larger phylogenomic dataset in order to investigate the phylogenetic relationships between all the eukaryotic supergroups (chapter 4). This was carried out by generating new genomic data and surveying public databases to construct both a longer and taxonomically broader alignment. Our new dataset contained 49 species and 123 genes. Very interestingly, this matrix contained enough phylogenetic signal to confidently resolve several ancient divergence points in the evolutionary history of eukaryotes. Of particular significance, it supported a very robust relationship between Rhizaria and two main clades of the supergroup chromalveolates (stramenopiles and alveolates): the ‘SAR’ grouping. We showed the existence of consistent affinities between assemblages that were thought to belong to different supergroups, thus not sharing a recent common ancestor. These new relationships had important consequences for our understanding of the evolutionary history of eukaryotes. Notably, Rhizaria became a new player that cannot be ignored when addressing questions related to the putative single red algal origin of the chlorophyll-c containing plastids among the chromalveolates.

To further test our alignment in order to investigate even earlier evolution among eukaryotes, we significantly updated our matrix with several publicly available species to [...]

[...] the latest phylogenetic methods, allowed us to obtain a tree in which, at its deepest level, only three stems were displayed, i.e. two highly supported megagroups, enclosing the vast majority of eukaryotic species, and the excavates that were of uncertain position. Our results brought convincing support for the clustering of almost all photosynthetic groups in a unique mega-clade. We speculated that the observable diversity of plastids within the new megagroup could be traced back to its last common ancestor, and is the consequence of an increased capability of all its members to accept and keep plastids or plastid-bearing cells.

Phylogenomics is helpful for inferring ancient relationships between the eukaryotic supergroups; it can also be used to address the evolutionary origin of lineages that have proven challenging thus far (referred to as ‘orphan’ lineages). We undertook massively parallel 454 sequencing of two such groups of uncertain phylogenetic position, Telonemia and Centroheliozoa (chapter 6). These groups were of great interest because they both include only heterotrophic organisms, yet based on weak hints they have been suggested to be related to photosynthetic members of the chromalveolates. Our analyses of 72 species and 127 genes brought the first reliable support for the placement of telonemids and centrohelids into an expanded chromalveolate megagrouping, also containing Rhizaria, most likely closely related to haptophytes and cryptomonads. Thus, these two lineages are from now on of key importance in further investigations to understand the distribution of red algal-derived plastids. We also participated in a phylogenomic study of another orphan group, the breviate amoebae, that is of crucial significance for our understanding of early transitions in eukaryote evolution (see annexes).

Finally, we conclude this manuscript with general comments on our work, and give some possible future directions (chapter 7). In particular, we present our motivations for an ongoing project for inferring a molecular time-scale for major events of eukaryote evolution. In order to reduce artifacts in molecular dating due to small amounts of data or lack of reliable calibrated nodes, we massively sequenced one diatom and 2 coccolithophorids, which will allow us to combine phylogenomics and micropaleontology (by means of the well-documented continuous microfossil record).


Chapter 1: Introduction


1.1 Motivation

Is it an important achievement to resolve the evolutionary tree of life? The answer is yes, without any possible doubt. Beyond the pure interest that drives the research of most scientists –the world, us humans, would be very different without our comprehension of fundamental and not necessarily critical topics– having a fully resolved phylogenetic tree including all organisms is the needed framework for studies aimed at understanding the acquisition and evolution of countless characters. A tree is the reference for selecting key species that have the potential to answer important evolutionary questions using, for example, comparative genomics. Evolutionary studies help to place comparisons in perspective so that one can understand how, when, and sometimes why some similarities and differences in genomes arose [Eisen and Fraser 2003]. Once strongly biased towards organisms relevant for human well-being (of economic or medical importance), the taxonomic distribution of species for which extensive genomic data sets are available has now increased dramatically, and will continue to expand thanks to new sequencing technologies and lower costs. We have entered a very exciting period where we may begin to revisit questions of general interest, such as the origin of multicellularity or photosynthesis, and find precise answers by digging into the mass of data that continues to accumulate. The tree of life has not been immune to change with these new data either: it has undergone a profound reshuffling revealing a number of relationships between major lineages that were previously unknown and formerly inaccessible.

In line with these new possibilities, the aim of our project was to explore the potential of phylogenomics –i.e., the use of large data sets in molecular phylogenetics– in resolving ancient evolutionary relationships within the eukaryotic tree of life. Our initial focus was on one eukaryotic supergroup in particular, Rhizaria, as it was mainly ignored in phylogenomic studies (chapters 2, 3 and 4). We later became interested in other important questions involving deep evolution of eukaryotes such as the phylogenetic position of the plants and the other photosynthetic species (chapter 5), the placement of “orphan” species (chapter 6) or the molecular dating of the tree of eukaryotes (chapter 7). In this manuscript, I will discuss the recent topological changes of the tree in the light of our results. But first, I shall start with several general considerations that are necessary to place this work in context.


1.2 The tree of eukaryotes

1.2.1 World of Kingdoms

Inferring the evolutionary relationships among living beings is not a recent matter, and scientists have not waited for the advent of the molecular or the “all genomic” eras to propose trees depicting how species are related to one another. Among the abundant work that preceded the widespread adoption of molecular phylogenies, one can cite three landmark studies, based mainly on morphology and modes of nutrition, whose trees featured in general biology textbooks for decades: the 3 kingdoms of Haeckel in 1866, Figure 1-ch.1 (Plantae, Protista and Animalia) [Haeckel 1866], the 4 kingdoms of Copeland in 1938, Figure 2-ch.1 (Monera, Protista, Plantae, and Animalia) [Copeland 1938], and the 5 kingdoms of Whittaker in 1969, Figure 3-ch.1 (Monera, Protista, Plantae, Fungi, and Animalia) [Whittaker 1969]. Ultimately these systems were quite similar: the authors essentially modified the boundaries delimiting the kingdoms, going from three to five as knowledge accumulated. They all represented schemes in which the evolutionary processes led to several transitions from a basal pool of apparently simple, largely unicellular organisms –the protists– to more elaborate multicellular organisms. Although these proposals succeeded in roughly recognizing several major assemblages (e.g., Fungi, Animals), they notably failed to resolve their relationships and, importantly, to account for the fundamentally paraphyletic and complex nature of the protist lines. Despite the fact that they were very important pieces of work and are still extremely useful when trying to embrace the huge diversity of life, certainly because they represent a “natural”, very pedagogical way of classifying organisms (Whittaker's system, for example, is still taught in most high school biology lessons), it is not my intention here to explore further their strengths and weaknesses.

1.2.2 Molecular r-evolution: the SSU rRNA

Instead I would like to consider in more detail another kind of phylogeny that has replaced the classical phenotype-based approach described above: the phylogenies inferred by comparing sequences of DNA or amino acids, i.e. molecular systematics, which have revolutionized our understanding of evolutionary relationships. In 1965, Zuckerkandl and Pauling [Zuckerkandl and Pauling 1965] argued that collating informative molecules would permit the evaluation of evolutionary relatedness. They were obviously right, and since that time molecular phylogeny has been regarded as the tool of choice for reconstructing evolutionary histories; this is particularly true for protists where the interpretation of morpho-


Figure 1-ch.1. Haeckel’s three kingdoms.

From [Haeckel 1866].

Figure 2-ch.1. Copeland’s four kingdoms.

From [Copeland 1938].

Figure 3-ch.1. Whittaker’s five kingdoms.

From [Whittaker 1969].


The first molecular works aimed at determining the evolutionary relationships among eukaryotes date back to the mid-eighties and principally depended on the small subunit ribosomal RNA (SSU rRNA) [Sogin et al. 1986; Friedman et al. 1987; Sogin et al. 1989; Woese et al. 1990; Sogin 1991], although the large subunit (LSU rRNA), to a lesser extent, also contributed to phylogenetic reconstruction (e.g., [Perasso et al. 1989]). These pioneering studies were all characterized by a handful of deeply diverging protist lineages (e.g., Giardia, Trichomonas, Microsporidia), progressively emerging from the distant prokaryotic root, and followed by a densely branched “crown” nesting most eukaryotic diversity (Figure 4-ch.1). From these early molecular analyses, evolutionists drew the following principal features:

1) As a result of the huge genetic diversity in SSU rRNA, the deep eukaryotic branches seemed to exceed the depth of branching within the entire prokaryotic world [Sogin et al. 1986].

2) Consequently, eukaryotes became distinct very early in the history of life, and were thought to be likely as old as the eubacteria and archaebacteria [Sogin et al. 1989].

3) The lowermost lineages of the eukaryotic tree were usually simple, most of them living parasitically within animal hosts and, importantly, lacking organelles, in particular the mitochondrion [Friedman et al. 1987]. This notion of primitive amitochondrial eukaryotes (the “Archezoa” hypothesis [Cavalier-Smith 1989]) was in fact being discussed prior to the publication of the molecular phylogenies that wrongly supported its validity. It postulated that mitochondria-lacking eukaryotes had diverged before the acquisition of mitochondria through endosymbiosis and had evolved under anaerobic conditions. So when the first SSU rRNA trees including such organisms came out, specifically showing a deeper branching than any previously known eukaryotic sequences [Friedman et al. 1987; Sogin 1989; Sogin et al. 1989], the general consensus converged towards the postulate that these amitochondrial lineages were genuinely primitive, relicts of an ancient world devoid of oxygen (Figure 4-ch.1).

4) The apical part of the SSU rRNA tree, the so-called crown, contained major clades that branched near a common point, as if their divergence occurred nearly simultaneously [Sogin 1991]. Here were included, among others, the animals, fungi, plants, and diverse protist lineages that now form the alveolate grouping. Because the branching pattern among these groups could not be resolved, it was suggested that they originated in a massive radiation [Knoll 1992]. This lack of a clear order of divergence


eira et al. 2000], leading some to propose the “big-bang” hypothesis [Philippe et al. 2000a], which postulated that most eukaryotic phyla emerged in a relatively short period of time, so that not enough phylogenetic signal could accumulate in the sequences.

Figure 4-ch.1. A typical SSU rRNA tree of eukaryotes, as it was being published in the mid-nineties.

Plastid-bearing lineages are indicated in colors approximating their respective pigmentation. From [Embley and Martin 2006].

In the nineties, as more and more species were being sequenced, intermediate groups appeared in between the Archezoa members and the eukaryotic crown. Like the amitochondrial species, these newcomers were characterized by a high rate of evolution producing long branches in phylogenetic reconstructions. A classic example is the Foraminifera, whose SSU and LSU rRNA both placed them at an intermediate position in the tree [Pawlowski et al. 1994; Pawlowski et al. 1996].

Interestingly enough, this view of the eukaryotic tree relied almost entirely on a single molecular marker (the SSU rRNA, although as mentioned above a few others started to be used), and the gene trees were very much interpreted as the organismal tree. Unfortunately this marker proved to be highly mutationally saturated at the eukaryotic level, with very variable evolutionary rates between species [Philippe and Laurent 1998]. Because this characteristic was shown to make analyses prone to the Long Branch Attraction (LBA) artifact, in which two distant species with fast-evolving sequences are erroneously clustered together [Felsenstein 1978], the SSU rRNA topology became highly suspicious [Embley and Hirt 1998; Philippe and Laurent 1998]. Furthermore, two other important requirements for accurate phylogenetic inference were not met: the availability of a well-sampled diversity of species and the use of appropriate tree reconstruction methods [Hendy and Penny 1989; Lecointre et al. 1993; Huelsenbeck 1997; Brinkmann et al. 2005]. Indeed the taxon sampling at the beginning of molecular systematics was rather sparse, often with a single representative per major lineage, which increased the sensitivity to LBA by leaving the basal long branches unbroken. Likewise, simplistic approaches for inferring phylogenies (distance-based or parsimony methods), together with the use of unrealistically simplified models of evolution, were a serious brake on the resolution of the eukaryotic tree.
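The LBA artifact described above can be illustrated with a small calculation. The sketch below is not taken from the studies cited here: it assumes a two-state symmetric substitution model on a four-taxon tree ((A,B),(C,D)), and the branch values (probability that the character state differs between the two ends of a branch) are purely illustrative. For four taxa, parsimony with unlimited data selects the topology whose supporting site pattern is the most frequent, so it is enough to compare the expected frequencies of the three informative patterns.

```python
from itertools import product

def pattern_probs(p_a, p_b, p_c, p_d, p_int):
    """Expected frequencies of the three parsimony-informative site patterns
    on the true tree ((A,B),(C,D)) under a two-state symmetric model.
    Each argument is the probability that the state differs between the
    two ends of that branch (illustrative values, not estimates)."""
    def t(p, start, end):
        # Transition probability along one branch.
        return 1.0 - p if start == end else p

    probs = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    for a, b, c, d in product((0, 1), repeat=4):
        if a == b and c == d and a != c:
            key = "AB|CD"          # pattern supporting the true tree
        elif a == c and b == d and a != b:
            key = "AC|BD"          # pattern grouping A with C
        elif a == d and b == c and a != b:
            key = "AD|BC"          # pattern grouping A with D
        else:
            continue               # uninformative for parsimony
        # Sum over the unobserved states x, y of the two internal nodes.
        for x, y in product((0, 1), repeat=2):
            probs[key] += (0.5 * t(p_a, x, a) * t(p_b, x, b)
                           * t(p_int, x, y) * t(p_c, y, c) * t(p_d, y, d))
    return probs

def parsimony_choice(probs):
    """Topology favoured by parsimony when data are unlimited."""
    return max(probs, key=probs.get)

# Two long, unrelated branches (A and C) and short branches elsewhere.
long_branches = pattern_probs(0.4, 0.05, 0.4, 0.05, 0.05)
# All branches equally short.
balanced = pattern_probs(0.1, 0.1, 0.1, 0.1, 0.1)

print(parsimony_choice(long_branches))  # AC|BD: the long branches attract
print(parsimony_choice(balanced))       # AB|CD: the true tree is recovered
```

With these assumed values, adding more sites only strengthens the wrong grouping of A with C, which is why sparse taxon sampling that leaves long branches unbroken is a concern.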

So the situation in the late nineties was a tree of eukaryotes very much based on a single molecular marker, with recognized shortcomings in its ability to resolve phylogenetic relationships at deep taxonomic levels.

1.2.3 Time for deconstruction

Important discrepancies between the SSU rRNA trees and a growing number of protein-coding gene phylogenies started to bring alternative hypotheses for evolutionary relationships within eukaryotes. Besides the diversification of molecular markers, much better (probabilistic) methods and models of evolution as well as broader taxonomic samplings became available. These new topologies, although often incongruent with one another [Philippe et al. 2000b], essentially consisted in moving species that diverged much earlier in SSU rRNA trees to (or close to) the crown. One can mention the studies of Microsporidia (α- and β-tubulins, [Keeling and Doolittle 1996] or TBP, [Fast et al. 1999]), the slime molds (elongation factor-1α, [Baldauf and Doolittle 1997]), or an important work investigating eukaryotes as a whole by combining four proteins (α- and β-tubulins, actin, and elongation factor-1α) [Baldauf et al. 2000]. At the same time relationships between some major groups were recovered with reasonable statistical support, in some instances also by combining data gained from rare genomic changes such as insertion/deletion characters (indels). For example the specific associations between animals and fungi [Baldauf and Palmer 1993], green plants and red algae [Moreira et al. 2000] or a supertaxon grouping the alveolates and stramenopiles [Baldauf et al. 2000] began to appear.

Strikingly, in much the same way that earlier phylogenetic trees had supported the Archezoa hypothesis, genes derived from the mitochondrial symbiont were progressively discovered in species that apparently lacked mitochondria as these lineages were relocating within the eukaryotic crown (thus leaving the “primitive” part of the tree) [Clark and Roger 1995;


mitosomes [Tovar et al. 1999], which are most likely remnants of mitochondria. Altogether, the most parsimonious explanation is that mitochondria were ancestrally present in all eukaryotes, but have been secondarily lost repeatedly or have degenerated into small organelles in some lineages, hence totally invalidating the Archezoa hypothesis [Keeling 1998].

An important consequence of this relocation of the no longer primitive eukaryotes within the crown was that the transition between the prokaryotic outgroup and the extant species was in fact not a progressive transformation involving intermediate forms. The concept of the eukaryotic crown itself had become obsolete as all living eukaryotes actually belong to it, which implied a great reduction of the evolutionary distances between the former “basal” and “apical” lineages. Furthermore it invalidated the suggestion that eukaryotes were extremely ancient and reduced their evolutionary diversity below that of prokaryotes.

1.2.4 Groundwork for reconstructing

These profound modifications of the structure of the eukaryotic tree led to the concept that most, if not all, diversity can be assigned to one of several major assemblages: the “supergroups” (Figure 5-ch.1). Reassembling the evolutionary history of eukaryotes was obviously not the result of a single study, but very much a matter of uniting several types of data into one comprehensive picture. Despite what has been mentioned above, single-gene trees continue to be a valuable source of information when they are combined with an appropriate knowledge of potential artifacts, because they are generally built with taxon-rich alignments. Thus, by correctly interpreting several individual trees one might be able to discern general tendencies in phylogenies. However, as more and more genomic data accumulated, it became possible to assemble larger datasets that in principle contain more phylogenetic signal, which opened great possibilities for addressing further evolutionary questions. Finally, discrete molecular characters such as indels, gene fusions, or gene order have also been useful in reevaluating the eukaryotic tree as they are independent of phylogenetic reconstruction, although caution is required here as well because these markers are not free of misleading errors [Bapteste and Philippe 2002].

When this working hypothesis was first summarized in a paper, eight supergroups were recognized [Baldauf 2003]. This review was notably relevant because it accounted for the “true diversity of life”, i.e. the discovery of non-cultured minute organisms, nano- or pico- in size, that were scattered across the tree. Soon after, and regularly since then, reviews have been published updating the tree of eukaryotes with the latest minor modifications, essentially representing the same scheme for eukaryote evolution [Simpson and Roger 2004; Adl et al. 2005; Keeling et al. 2005; Lane and Archibald 2008]. These trees, unrooted, all display a basal polytomy with five or six branches representing the supergroups that emerge from a common point, the order of divergence among these groups being very much uncertain (Figure 5-ch.1). Importantly, the supergroups hypothesis represents a consensus for the tree of eukaryotes, the most accurate we have so far, but by no means an unshakable scheme. The existence (that is, the monophyletic origin) of most of the supergroups is still highly arguable. Generally, parts of these hypothesized major assemblages have been reasonably shown to have a common origin, but we currently lack evidence for the supergroups as a whole, including all postulated lineages (this is less true for the opisthokonts, commonly robustly supported, see [Parfrey et al. 2006] for a broad discussion).

Figure 5-ch.1. One of the numerous schemes for the current view of eukaryotic evolution, representing the six hypothesized supergroups of eukaryotes. From [Lane and Archibald 2008].


Below I briefly introduce these six supergroups:

Opisthokonts: This supergroup contains animals and fungi [Cavalier-Smith and Chao 1995], which are thought to have evolved independently from unicellular lineages belonging to the paraphyletic assemblage Choanozoa [Lang et al. 2002; Cavalier-Smith and Chao 2003c; Steenkamp et al. 2006; Ruiz-Trillo et al. 2008; Shalchian-Tabrizi et al. 2008], also included in it. It is putatively united by the presence of a single posterior flagellum in several representatives [Cavalier-Smith and Chao 1995], as well as much molecular-based evidence (e.g., single genes [Baldauf and Palmer 1993; Wainright et al. 1993], 4 genes [Baldauf et al. 2000], 143 genes [Rodriguez-Ezpeleta et al. 2005], amino acid insertion/deletion [Baldauf and Palmer 1993]). It is currently the most reliable supergroup, but some continue to argue for a close evolutionary relationship between animals and green plants instead [Stiller 2007].

Amoebozoa: This supergroup includes mostly amoeboid protists (that is, cells with pseudopodia) such as the classical amoebae with lobose pseudopodia, but also slime moulds and some amitochondrial lineages. Evidence that it is a monophyletic group, not very strong at the moment, has emerged only recently and is based on single and multigene phylogenies [Baldauf et al. 2000; Bapteste et al. 2002; Fahrni et al. 2003; Smirnov et al. 2005], as well as a gene fusion in the mitochondrial genome of the two species that were investigated [Lonergan and Gray 1996].

Opisthokonts and Amoebozoa are often united in a larger supergroup, Unikonts [Cavalier-Smith 2002], that is supported by several rare genomic changes (see section 1.2.5) [Stechmann and Cavalier-Smith 2002; Stechmann and Cavalier-Smith 2003b; Richards and Cavalier-Smith 2005] as well as several single (e.g., [Baldauf and Palmer 1993]) and a growing number of multigene phylogenies (e.g., [Rodriguez-Ezpeleta et al. 2007a]).

Plantae (or Archaeplastida): This supergroup comprises the three main lineages of primary photosynthetic organisms: glaucophytes, green plants, and red algae. It thus corresponds to the group in which plastids with two membranes first evolved, by primary endosymbiosis with a cyanobacterium. Its monophyly has been generally accepted because a single origin of the primary plastid is the most parsimonious explanation [Palmer 2003; Keeling 2004; Mcfadden and van Dooren 2004; Reyes-Prieto et al. 2007; Archibald 2009] (although see [Prechtl et al. 2004; Nowack et al. 2008] for examples of more recent independent primary endosymbioses), but other views, in particular an earlier divergence of the red algae, are still strongly debated [Nozaki et al. 2003; Nozaki 2005; Nozaki et al. 2007; Stiller 2007; Maruyama et al. 2008]. The use of genomic data has recently recovered strong support for a monophyletic assemblage in several studies, based both on chloroplast [Martin et al. 2002; Chu et al. 2004; Hagopian et al. 2004; Rodriguez-Ezpeleta et al. 2005] and nuclear genes [Rodriguez-Ezpeleta et al. 2005; Rodriguez-Ezpeleta et al. 2007a].

Chromalveolata: This supergroup is undoubtedly the most debated, because of the lack of clear and simple evidence supporting it and its central role in the understanding of eukaryote evolution. It encompasses at present four diverse groups, mixing phototrophy and heterotrophy: stramenopiles (heterokonts), cryptomonads, haptophytes (altogether the chromists [Cavalier-Smith 1998a]), and alveolates. This grouping results from the proposition that the number of plastids originating by secondary endosymbiosis (i.e. involving two eukaryotes) should be limited in evolution because of the real complexity of establishing a protein targeting system in a nascent plastid [Cavalier-Smith 1999]. Specifically, the chromalveolate hypothesis postulates that a single secondary endosymbiosis with a red alga took place in the ancestor of all chromalveolates, giving rise to an orthologous plastid in all its descendants. The consequence of this is that the host lineages must be related, a condition that is generally not met for the group as a whole even with big alignments (haptophytes and cryptomonads often branch elsewhere in the tree, or are not supported as sister to the rest of the chromalveolates) [Harper et al. 2005; Patron et al. 2007], but see [Hackett et al. 2007]. On the other hand, plastid data often recover a common origin for the photosynthetic members of this supergroup [Yoon et al. 2002; Khan et al. 2007], but this does not rule out the possibility that the red plastids were acquired independently via serial endosymbioses to reach the current situation. In favor of the hypothesis are also two specific gene duplications that undeniably cluster the plastid-targeted proteins of the chromalveolates [Harper and Keeling 2003; Patron et al. 2004], but the relationships of the cytosolic versions are much more ambiguous.

Rhizaria: This supergroup is the most recently recognized assemblage and is presently only defined based on molecular data, commonly including organisms bearing “root-like reticulose or filose pseudopodia” [Cavalier-Smith 2002; Cavalier-Smith 2003]. In addition to typically amoeboid taxa, Rhizaria also include a large diversity of free-living flagellates, amoeboflagellates, and parasitic protists. The first hint of this grouping was a clade formed by the euglyphid testate amoebae and the photosynthetic chlorarachniophytes in SSU rRNA phylogeny [Bhattacharya et al. 1995]. This clade was later enlarged to also include zooflagellate species and the plasmodiophorid plant parasites [Cavalier-Smith and Chao 1996-1997], leading to the creation of the phylum


pected result was later confirmed by the discovery of a one- or two-amino-acid insertion in the polyubiquitin polymers [Archibald et al. 2003a; Bass et al. 2005], and analyses of the large subunit of RNA polymerase gene [Longet et al. 2003] and SSU rDNA [Berney and Pawlowski 2003]. The taxonomic composition of Cercozoa was progressively expanded by including various zooflagellates [Atkins et al. 2000; Kuhn et al. 2000], gromiids [Burki et al. 2002], testate amoebae [Wylezich et al. 2002], filose and reticulate protists [Nikolaev et al. 2003], and radiolarians [Polet et al. 2004]. Strong support for Rhizaria, composed of all previously included taxonomic groups, plus Desmothoracida and Taxopodida, was recovered in a combined analysis of actin and SSU rDNA genes [Nikolaev et al. 2004]. The rhizarian supergroup is growing continuously by new inclusions such as the marine flagellate ebriids [Hoppenrath and Leander 2006], the amoeboid Corallomyxa [Tekle et al. 2007], the parasitic plasmodial Paradinium [Skovgaard and Daugbjerg 2008] and the soil flagellate Sainouron [Cavalier-Smith et al. 2008].

Excavata: This supergroup is composed of diverse heterotrophic protists, many of which are anaerobic and/or parasitic, characterized by a distinctive feeding groove and two flagella in most, but not all, of these organisms [Simpson 2003]. It is tentatively assembled into one monophyletic entity by a combination of molecular and morphological data [Simpson 2003], but to date robust evidence is still lacking, although a recent phylogenomic study recovered moderate support for this supergroup [Hampl et al. 2009].

The last four supergroups described above (Plantae, Chromalveolata, Rhizaria, and Excavata) are often known as the Bikont assemblages [Cavalier-Smith 2002].

1.2.5 Where is the root?

The answer to this central question in our understanding of eukaryotic evolution is… next question please! Indeed we do not really know at present where the root lies, and all bets are off. The most common way of rooting a phylogenetic tree is the use of an external group (outgroup) that positions the root of the ingroup lineages and gives a direction to the evolution. The natural outgroup for the eukaryotic tree consists of prokaryotes, usually belonging to the archaebacteria [Woese et al. 1990; Woese 2002; Pace 2006], even more precisely to the Crenarchaeota line as recently shown [Cox et al. 2008]. Unfortunately this approach has proven to be unsuccessful with the current models of evolution, because the very high genetic distances between eukaryotes and their outgroup result in artifactual placements of the fastest evolving eukaryotes at the base of the tree [Philippe and Germot 2000; Brinkmann et al. 2005].


An alternative method for rooting a phylogenetic tree relies on complex genetic changes, which are expected to be rare. Today, perhaps the most commonly cited position for the eukaryotic root is between unikonts and bikonts, as deduced from several rare changes. The first is a gene fusion between TS and DHFR, present in most tested bikonts, whereas the two genes are separate in unikonts and bacteria [Philippe et al. 2000b; Stechmann and Cavalier-Smith 2002; Stechmann and Cavalier-Smith 2003b]. Following parsimony logic, if this fusion occurred only once and no reversal took place afterwards, the TS-DHFR fusion would be a derived character for the bikonts, implying a root outside this “monophyletic” group. However this scenario is questionable, notably because the taxon sampling for which the presence of the TS-DHFR fusion has been tested so far is scarce [Embley and Martin 2006], some of the putative basal eukaryotic lineages (such as diplomonads and parabasalids [Arisue et al. 2005]) lack these genes, and fusion-bearing species seem to be phylogenetically related to unikonts [Kim et al. 2006]. Yet other characters are in agreement with the unikonts-bikonts basal bifurcation, for example a shared derived duplication of the phosphofructokinase gene in unikonts [Stechmann and Cavalier-Smith 2003b], or a particular type of myosin (type II) specific, again, to unikonts [Richards and Cavalier-Smith 2005]. To add even more uncertainty, a puzzling character suggests instead a root within excavates, basal to jakobids. Arguing against the unikonts-bikonts split, the jakobid mitochondrial DNA encodes a bacterial-like RNA polymerase that is different from that of all other eukaryotes studied to date [Lang et al. 1997]. Hence, the position of the root of the eukaryotic tree remains an open question, and more rare genomic characters associated with better phylogenies are required.

1.2.6 Plastid evolution

Our work, as we describe in the following chapters, turned out to be intimately related to the fundamental question of plastid evolution. Plants and algae acquired photosynthesis through primary endosymbiosis, in which a free living prokaryote (related to modern-day cyanobacteria) was engulfed, retained and integrated by a heterotrophic eukaryote. All members of the supergroup Plantae possess plastids of primary origin, bounded by 2 membranes of cyanobacterial type [Gould et al. 2008; Archibald 2009]. These plastids have subsequently spread across the tree of eukaryotes by secondary endosymbioses (also tertiary and serial endosymbioses), involving the uptake of either green or red algal endosymbionts by secondary eukaryotic hosts and resulting in plastids with 3 or 4 membranes [Gould et al. 2008; Archibald 2009]. Lineages harboring secondary plastids are found in three different supergroups: Excavata (green plastids), Rhizaria (green plastids), and Chromalveolata (red


It is generally accepted that primary endosymbiosis occurred only once in evolutionary history [Reyes-Prieto et al. 2007]. On the other hand, how many times secondary endosymbiosis took place is uncertain because it has not yet been possible to infer robust phylogenetic relationships among all secondary photosynthetic species. It is largely recognized that 2 independent events explain the green plastids of euglenids (belonging to Excavata) and chlorarachniophytes (belonging to Rhizaria) [Rogers et al. 2007]. However, the number of secondary endosymbioses that led to the red plastid distribution we observe today is much more debated. One hypothesis is that all secondary red plastids derive from a single endosymbiosis: this is known as the chromalveolate hypothesis (see section 1.2.4) [Cavalier-Smith 1999]. But because heterotrophic species are present in most photosynthetic clades, photosynthesis (or perhaps even plastids) must have been lost in multiple lineages to explain the current patchy distribution. Thus, at face value, multiple independent acquisitions of secondary red plastids remain a valid alternative [Sanchez-Puerta and Delwiche 2008; Archibald 2009; Bodyl et al. 2009].

1.3 Phylogenomics

1.3.1 A definition

Phylogenomics is a ten-year-old discipline of evolutionary biology that has followed the increasing availability of complete genomes or genomic data (mainly ESTs). It was formalized when Eisen [Eisen 1998] coined the term phylogenomics to describe an approach combining comparative genomics and evolutionary information. Later on he proposed a short and general definition that describes very well what phylogenomics is about: “intersection of evolution and genomics” [Eisen and Fraser 2003]. The main idea behind it is that studying genomes alone, without an evolutionary perspective, greatly narrows down the potential of such research.

Since these initial definitions, the scope of phylogenomics has extended and now includes two main fields:

1) The prediction of molecular function by homology, through the inference of evolutionary processes underlying the appearance of protein families, integrating experimental data in these computer-based analyses (reviewed in [Sjölander 2004]).

2) The inference of species relationships using genomes and genomic data.

I will explain here in more detail this second aspect, which represents in my opinion the true phylogenomic approach (i.e., phylogeny + genomics) and was the approach I used to tackle the different questions asked in this work. It is hard to set a precise limit above which one can call a dataset “a phylogenomic dataset”, but for the sake of simplicity I will consider studies analyzing more than 50 genes to be phylogenomic in nature. I will describe the caution required when employing phylogenomics, and how it can help in reconstructing the phylogeny of species.

1.3.2 How does phylogenomics work?

Generally, two possibilities have been explored for inferring phylogenies using phylogenomics [Delsuc et al. 2005] (Figure 6-ch.1):

Figure 6-ch.1. Methods of phylogenomics inference. From [Delsuc et al. 2005].

Sequence-based methods: As in any phylogenetic reconstruction, a crucial first step must be carefully performed here as well in order to ensure that a “vertical transmission” of the characters is respected: the homology assessment. This is not an easy task because genes of interest for inferring evolutionary history are often poorly sampled, so deciding between orthology and paralogy is not always possible (it can be impossible to differentiate between independent gene losses and unsampled genes in some species, for example). Similarly, horizontally (or laterally) transferred genes (HGT) can be difficult to pinpoint with a


alignments is usually datasets of at least a thousand ESTs, finding a sufficient number of genes where orthology can be deduced is nevertheless doable. Moreover, for the very reason that phylogenomics deals with huge amounts of data, it is fair to assume that even if some undetected paralogy or HGT remains after careful checks, it is unlikely to dominate the genuine phylogenetic signal [Lake and Rivera 2004].

The method of choice for creating a set of orthologous genes is the inference of phylogenetic trees for every single-gene alignment, thus requiring the individual genes to be aligned and unambiguous positions to be selected. This is much more precise than a simpler and faster selection of species/sequences based solely on BLAST results [Altschul et al. 1990], which is known to be unreliable. Indeed the BLAST algorithm does not take evolutionary information into account, so that genes appearing to be the most similar based on BLAST hits are often not each other's closest relatives in terms of phylogeny, leading to false positive insertions of species [Koski and Golding 2001].

After this first and important step, two options are conceivable: the supermatrix or the supertree approach [Delsuc et al. 2005]. The supermatrix approach corresponds to the concatenation of all selected single genes into one super-alignment, which is then submitted to classical phylogenetic reconstruction methods (the most reliable at present being the probabilistic methods, Maximum Likelihood [Felsenstein 1981] and Bayesian [Huelsenbeck et al. 2001]). The strategy that is generally applied is to consider each concatenated sequence as one “gene” and to ignore the evolutionary specificities of each individual gene [Philippe et al. 2004; Philippe et al. 2005; Rodriguez-Ezpeleta et al. 2005; Rokas et al. 2005; Wiens 2005; Delsuc et al. 2006; Patron et al. 2007; Rodriguez-Ezpeleta et al. 2007a]. It presumes that the different discordant histories, if any, contained in each gene will be averaged away by the combined analysis of numerous characters. Another strategy that has also been tested, to a lesser extent, is to allow a different set of parameters for each gene in order to describe different tempos and modes of evolution more adequately [Bapteste et al. 2002; Philippe et al. 2004; Patron et al. 2007], but results were generally not significantly different from the “cruder” approach above, questioning its real utility.
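The concatenation step of the supermatrix approach can be sketched as follows. This is only a minimal illustration, not the pipeline used in the studies cited above: it assumes that each single-gene alignment is held as a Python dict mapping a taxon name to its aligned sequence, that taxa absent from a gene are padded with “?”, and the taxon names and sequences are invented for the example.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate single-gene alignments into one supermatrix.

    gene_alignments: list of dicts, one per gene, mapping taxon -> aligned
    sequence (all sequences of a gene must have the same length).
    Taxa that lack a gene are padded with the `missing` symbol.
    """
    # Union of all taxa seen in at least one gene.
    taxa = sorted({taxon for gene in gene_alignments for taxon in gene})
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        (length,) = lengths
        for taxon in taxa:
            parts[taxon].append(gene.get(taxon, missing * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

# Hypothetical toy data: two genes, the second one unsampled in Giardia.
genes = [
    {"Homo": "MKV", "Giardia": "MRV", "Arabidopsis": "MKI"},
    {"Homo": "LLGA", "Arabidopsis": "LIGA"},
]
supermatrix = concatenate(genes)
for taxon, seq in supermatrix.items():
    print(taxon, seq)
```

In this toy case the Giardia row becomes `MRV????`: the concatenated sequence is treated as a single “gene”, and the gene boundaries are no longer visible to the tree reconstruction method unless they are recorded separately.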

Because every single gene that makes up the concatenation is in principle subject to its own selection of species, depending on sequence availability, missing entries are the rule and they are generally not equally distributed (some species have many missing positions, others have nearly none). Potentially they could drastically lower the resolution of a tree, or induce artifacts due to model violations. Missing data occur even when complete genomes are available because genes can be independently lost, duplicated, or horizontally transferred. This feature of phylogenomic alignments could be a serious drawback, making this discipline a nice approach in theory but practically impossible to perform. Fortunately empirical and simulation studies have shown that the percentage of missing data can actually be high, up to 90%, and yet the overall signal remains [Wiens 2003; Driskell et al. 2004; Philippe et al. 2004]. This is especially true in a phylogenomic context as the number of sites present in a large concatenated alignment remains high, even for species with a lot of missing data. Furthermore, it seems that adding even incomplete taxa is beneficial and improves phylogenetic accuracy by breaking long branches [Wiens 2005]. However this concern has in my opinion not been investigated thoroughly enough, and precise issues such as the influence of the distribution of missing data (evolutionarily close or not to species with no missing data) still need to be specifically addressed. Otherwise there is a risk that the supposedly weak influence of missing data is taken at face value, so that many highly incomplete species would not be treated with the necessary caution.
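As a practical complement, the amount of missing data can be inspected taxon by taxon before any tree is inferred. The sketch below is only an assumption about how one might do this: it takes a concatenated alignment stored as a dict (taxon name to sequence), counts “?” as the missing-data symbol, and flags taxa above a cut-off; the 50% threshold and the toy sequences are arbitrary illustrations, not recommendations from the literature cited here.

```python
def missing_fraction(supermatrix, missing_symbols="?"):
    """Fraction of missing positions for each taxon of a concatenated alignment."""
    report = {}
    for taxon, seq in supermatrix.items():
        n_missing = sum(seq.count(symbol) for symbol in missing_symbols)
        report[taxon] = n_missing / len(seq) if seq else 0.0
    return report

def flag_incomplete(report, threshold=0.5):
    """Taxa whose missing fraction exceeds an (arbitrary) threshold."""
    return sorted(taxon for taxon, frac in report.items() if frac > threshold)

# Hypothetical toy alignment: Giardia lacks the second gene entirely.
toy = {
    "Homo": "MKVLLGA",
    "Giardia": "MRV????",
    "Arabidopsis": "MKILIGA",
}
report = missing_fraction(toy)
for taxon, frac in sorted(report.items()):
    print(f"{taxon}: {frac:.0%} missing")
print("above threshold:", flag_incomplete(report))
```

Such a table says nothing about how the missing positions are distributed relative to closely related, well-sampled species, which is precisely the open issue raised above; it only makes highly incomplete taxa visible so that they can be treated with caution.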

The second sequence-based approach is the supertree, which differs from the supermatrix in that it combines the trees generated individually from the single genes, and not the single genes themselves. In practice this method has barely been employed [Philip et al. 2005; Fitzpatrick et al. 2006], and comparative efficiency studies of supermatrix and supertree, especially in a phylogenomic context, are needed to shed light on the benefits and disadvantages of both. Until then phylogenomics will likely be based almost entirely on the supermatrix approach, owing to the much greater experience accumulated with it.

Whole-genome feature methods: These methods, which are relatively new, do not directly rely on multiple-sequence alignment and generally cannot be applied to incomplete genomic sampling. They hold great promise for the near future as valuable independent and complementary means of testing phylogenetic trees, once complete genomes are available for a larger diversity of organisms and their implementation has improved. Because they are based on entire genomes, one can assume that this kind of data truly reflects organismal evolution, or at least approximates it better than single-gene or even multiple-gene phylogenies. Moreover, the events under investigation here, such as alterations of the gene order or gene content in a species, are thought to be extremely rare [Rokas and Holland 2000] and thus not sensitive to homoplasy.

Looking at gene order or gene content amounts to considering each chromosome as a linear (or circular, for example in the case of mitochondrial or chloroplast genomes) ordering of genes [Moret and Warnow 2005], from which evolutionary relationships are putatively inferred. These methods use the number of shared orthologous genes between genomes as a similarity measure [Korbel et al. 2002]. In its most simplistic application the
