
Towards comprehensive surveys of environmental eukaryotic diversity using high-throughput sequencing


LEJZEROWICZ, Franck

Abstract

The rapid, exhaustive and accurate description of the diversity of biological communities remains a major challenge in ecology. High-throughput sequencing of ribosomal DNA extracted from the environment and amplified by PCR (metabarcoding) is promising because it can identify many species in many samples, but it remains affected by numerous technical and biological biases. My article-based thesis relies on a biotic model, eukaryotes (notably foraminifera), and an abiotic model, the marine benthic environment (notably abyssal), in order (i) to document certain biases and characterize the ecological signals in metabarcoding data and (ii) to apply metabarcoding to reconstruct past ecosystems and to monitor modern ones. I identify the molecular diversity of living organisms by targeting environmental RNA, that of non-fossilizing organisms by targeting short fragments of ancient DNA, and that of potential pollution bioindicators in comparison with the traditional approach. The great diversity of approaches to [...]

LEJZEROWICZ, Franck. Towards comprehensive surveys of environmental eukaryotic diversity using high-throughput sequencing. Doctoral thesis: Univ. Genève, 2015, no. Sc. 4851

URN : urn:nbn:ch:unige-788371

DOI : 10.13097/archive-ouverte/unige:78837

Available at:

http://archive-ouverte.unige.ch/unige:78837


University of Geneva

Faculty of Science

Department of Genetics and Evolution

Professor Jan Pawlowski

Towards comprehensive surveys of environmental eukaryotic diversity using high-throughput sequencing

THESIS

Presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in Biology

by

Franck LEJZEROWICZ

from

Metz (France)

Thesis no. 4851

Geneva

ReproMail: print shop at Uni Mail

2015


Acknowledgements

Thank you to all those who made it possible for me to begin and complete this work of five and a half years, of which I would never have believed myself capable. This document represents multiple collaborative efforts, which opened up to me a world of imagination and perspectives I had never hoped for.

I am particularly indebted to Jan Pawlowski for allowing me to contribute to the many projects that awakened in me a definite enthusiasm and an unfailing motivation. I could never have continued without all the freedom you gave me and the trust you placed in me. I have learned an enormous amount, and I still need more.

From the abyss of my being, thank you.

Massive thank-sequencing to you, Philippe. I would never have taken an interest in science without the years and kilometres of wild conversations, already rich in ideas.

Thank you for putting up, with humour, with my nervous breakdowns, our supernumerary burn-outs and all the rest; I look forward to setting up a new project with you, just another bullet-list.

A big thank-you to all those who put up with me during these years and who helped me carry out this relentless work with professionalism and understanding: Jackie, Maria, Emanuela, José, Béatrice, Ilham, Laure, Slim, Loïc, Roberto, Ivan, Sofia, Joana, Joanna, Florian, Sasha, Alexandra, Cyril, and all those I am forgetting... I wish you all the best and hope to keep learning from all of you.

Special thanks to Roland Marmeisse for opening the doors of research to me, to Juan Montoya for those of the daily struggle that is a scientist's work, to Daniel Ariztegui for those of his lab space, to Richard Christen and Luca Capello for those of computing, to Raphaël Morard who passed the foraminifera fever on to me, to Ludovic Orlando for his valuable advice at the start of my thesis, and to Colomban de Vargas and the whole BioMarks team for welcoming me into the big pool of metabarcoding.

Thank you to the whole Neuchâtel team: Edward Mitchell, Enrique Lara & Co. It is always a pleasure to feel your passion for research, even more devout than mine.

Thank you to Caroline Betto-Colliard for her help, especially with organizing our CUSO activity.

With all my gratitude, I thank the University of Geneva as a whole, notably Denis Duboule and the Department of Genetics and Evolution (most particularly Valérie & Corinne), as well as the Swiss National Science Foundation and the Canton of Geneva for their continuous and generous financial support.

Many thanks and warm regards to Angelika Brandt for organizing very successful deep-sea expeditions and granting us repeated opportunities to approach the wonderful deep-sea realm, along with Marina Malyutina, Laura Würzberg, Enrico Schwabe, Nils Brenke, Kirill Minin and many others.

My profound regards to Tomas Cedhagen — collaborating on RV Polarstern with you was an unforgettable experience.

I am very grateful to Andrew Gooday, Ramiro Logares and Thorsten Stoeck for agreeing to review this work.


Thank you to those who allowed me to forge friendships in Geneva: Melissa, Delphine, Anjalie, Asli, Carlos, Chloé, Lucie, Pedro, Alexis, Yohan & Stéphane, Ðorđe, Javier, Miquel's colleagues, the lads from the Datcha who supplied me with smoke flares for the last night of writing, and all the staff of the TGIF, a secret society whose inner workings I will never have understood...

Thank you most particularly to Seb, the friend one runs into everywhere and always hopes to see. Thanks also to Laurent, Olivier, Denis Charmot, Raoul de Bonneville, Charles, Olive, Ivan, Steph, Michel, Sam, Sophie, Liona and all the little punks, Julie...

To the friends from Kambé: Alexis & Kwi-hee, Jaehyun and Yohan (but you are being thanked twice; I can't believe it, you are in on every good plan! Let's say this one is for little Yohan). When do we rehearse? Never? Okay! No worries, there's a gig tomorrow.

To the friends from the phantom band (I was thinking of Brainless Kids Excretions): Rodrigo and Miquel, I think you really are the best, and I was waiting for you, damn it!

To the friends from Botox 4 Daisy: Sev, K-line & Mr.E, with you I learn and experiment in a whole other dimension, which will not be discussed in this thesis ;)

Thank you to both sides of Geneva's cultural and alternative scene for surprising me so many times. Magnificent! Thank you to l'Usine, which needs intellectual and benevolent support to weather the storm of these times of cultural compartmentalization and spiritual debasement. Overall, thank you to everyone who showed me attention.

Thank you to all my friends: Julie & Jeff, Crismaldi and family, Pierre, Fab, Thibaut, Quentin, Jordan, Virgile, Gui, Lisa, Selma, Claudia & Mat, Matt Bidou, Leslie, Chloé and Lucile, Samy, Rafiki, Nico, Charles, Biber, Bojan, Afgang, Fanny, Stephanie; a good meal, good wine (or not) and above all good music, that's all!

My whole family is in my thoughts as I look back on my path. Rémi and Cédric, I dedicate this tome to you. Mum & Dad, even though you know I would always have made a good choice: thank you for giving me the choice to study. Thank you for the years of support and love.


Summary (from the French résumé)

Our understanding of environmental diversity and the description of complex biological communities have been improved by molecular methods. The promising approach known as metabarcoding makes it possible to identify species in the environment by sequencing a conserved DNA region, here ribosomal DNA (rDNA).

High-throughput sequencing (HTS) has shown that the eukaryotic component of environmental communities still remains to be described and understood, but that the study of its diversity is rich in interpretations. However, the large numbers of sequences and samples processed by HTS are affected by numerous biases. This has given rise to a new generation of questions about the relevance of metabarcoding: Which dimensions of environmental DNA are conveyed by HTS, and what information do they deliver? How can an environmental signal be isolated from HTS data, and which one? This thesis is part of the current research dynamic in metabarcoding, and responds notably to the urgent need for ecosystem management.

The major objectives of this thesis are (i) to identify and document the biases and data dimensions that characterize the analysis of environmental diversity by HTS in order (ii) to apply metabarcoding to the reconstruction of past environments and to the monitoring of modern ecosystems. Our work is based on foraminifera because the laboratory is expert in their molecular diversity, but also because they are good palaeontological and ecotoxicological indicators and therefore represent an ideal model. We pay particular attention to the abyssal benthic environment because it (i) hosts a great diversity of foraminifera, (ii) is unexplored but soon to be subject to anthropogenic disturbance and (iii) has characteristics relevant to the study of preservation biases (accumulation of extracellular DNA) and sampling biases (habitat microheterogeneity) of environmental DNA.

Most of the work presented is published. I have arranged it along a logical rather than chronological thread. My general introduction is a review of the advances and challenges relating to the generation and analysis of metabarcoding data (Chapter 1). In it I introduce important aspects related to taxonomic resolution and specific to environmental DNA, as well as unpublished results supporting the idea that the examination of enrichment and analysis methods should take precedence over that of sequencing technologies. I also present the unpublished results of a meta-analysis of all Sanger sequences generated from deep-sea sediments, to take stock of the status quo and illustrate the transition to HTS.

The first part of this work covers the fundamental aspects of our research. We test the value of RNA molecules for capturing the diversity of active benthic foraminifera, by Sanger sequencing (Chapter 2). We show that taxonomic identification based on the entire rDNA is also possible with the hypervariable region targeted by HTS. The next chapter corresponds to our first HTS study, in which only this region could be sequenced (Chapter 3). We use an internal control to filter a data set that remains one of the largest ever generated by HTS, with more than 78 million sequences for 31 abyssal sediments from around the world. The re-analysis of this data set shows that metabarcoding can access, in the abyss, the biogeographic signal of cryptic species of planktonic foraminifera (Chapter 4). Then, I document the representativeness of HTS data at different sampling scales to show that biological replicates are relevant for the study of deep sediments (Chapter 5). Finally, I introduce the phenomenon of mistagging, which causes cross-contamination between multiplexed samples. We detail this phenomenon and propose a new filter (Chapter 6). This chapter concludes the fundamental part of our research by highlighting the problem of contamination, to which any metabarcoding project is particularly sensitive.

The second part of my thesis covers the applied aspects of our work. First, we test the analysis of ancient DNA preserved in marine sediment: metabarcoding gives access to the palaeontological record represented by the diversity of taxa that do not fossilize (Chapter 7). However, the information conveyed by ancient DNA does not necessarily corroborate that of the classical fossil record (Chapter 8). Next, we try metabarcoding for the assessment of environmental quality. We analyse the response of benthic assemblages to the presence of fish farms, based either on foraminifera, using a de novo approach that consists of analysing their diversity and distribution along environmental gradients (Chapter 9), or on metazoans, by comparing the values of biotic indices calculated from HTS or from specimen counts (Chapter 10).

To conclude, I describe the universal aspects of metabarcoding data, which I discuss in light of the results of this thesis and of the recent literature (Chapter 11). The rapid progress of sequencing and bioinformatic analysis techniques calls for experiments to refine our understanding of sequence sampling biases and to obtain quantitative data.


Co-authors' affiliations

Science is participatory by nature, and I would ideally list every researcher who has enhanced our understanding of Life. Listed here are only those with whom I had the chance to collaborate.

Christian Baal: University of Vienna, Department of Palaeontology, Vienna, Austria

Dipankar Bachar: Centre National de la Recherche Scientifique, UMR 6543 and Université de Nice-Sophia-Antipolis, Unité Mixte de Recherche 6543, Centre de Biochimie, Faculté des Sciences, Nice, France

Loïc Baerlocher: FASTERIS SA, 1228 Plan-les-Ouates, Switzerland

Kenneth D. Black: SAMS, Scottish Marine Institute, Oban, Argyll, UK

Astrid Bracher: Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung, Bremerhaven, Germany

Angelika Brandt: Zoological Institute and Zoological Museum, Biocentre Grindel, Hamburg, Germany

Tomas Cedhagen: Department of Biological Sciences, Aquatic Biology, Aarhus University, Aarhus, Denmark

Wee Cheah: Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung, Bremerhaven, Germany

Richard Christen: Centre National de la Recherche Scientifique, UMR 6543 and Université de Nice-Sophia-Antipolis, Unité Mixte de Recherche 6543, Centre de Biochimie, Faculté des Sciences, Nice, France

Kate F. Darling: School of GeoSciences, University of Edinburgh, Edinburgh, UK and School of Geography and GeoSciences, University of St Andrews, UK

Johan Decelle: Station Biologique de Roscoff, CNRS UMR7144 EPPO, UPMC Université Paris 6, Roscoff, France

Colomban de Vargas: Station Biologique de Roscoff, CNRS, UMR 7144, Roscoff, France and Sorbonne Universités, Université Pierre et Marie Curie (UPMC) Université Paris 06, UMR 7144, Station Biologique de Roscoff, Roscoff, France

Jonathan Drew: Environmental Technologies, Coastal and Freshwater Group, Cawthron Institute, New Zealand

Philippe Esling: IRCAM, UMR 9912, Université Pierre et Marie Curie, Paris, France


Laurent Farinelli: FASTERIS SA, 1228 Plan-les-Ouates, Switzerland

Emmanuelle Geslin: Université d’Angers, UMR6112 CNRS LPG-BIAF – Bio-Indicateurs Actuels et Fossiles, Angers, France

Christian Göcke: Forschungsinstitut und Naturmuseum Senckenberg, Frankfurt am Main, Germany

Dorte Janussen: Forschungsinstitut und Naturmuseum Senckenberg, Frankfurt am Main, Germany

Frans J. Jorissen: Université d’Angers, UMR6112 CNRS LPG-BIAF – Bio-Indicateurs Actuels et Fossiles, Angers, France

Nigel Keeley: Environmental Technologies, Coastal and Freshwater Group, Cawthron Institute, New Zealand; Institute of Marine Research, Bergen, Norway

Daniel Kersken: Institute of Biological Sciences, University of Rostock, Rostock, Germany

Michal Kučera: MARUM, Center for Marine Environmental Sciences, University of Bremen, Bremen, Germany

Dewi Langlet: Université d’Angers, UMR6112 CNRS LPG-BIAF – Bio-Indicateurs Actuels et Fossiles, Angers, France

Magdalena Łącka: Institute of Oceanology Polish Academy of Sciences, Sopot, Poland

Béatrice Lecroq: Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland

Wojciech Majewski: Institute of Paleobiology, Polish Academy of Sciences, Warsaw, Poland

Pedro Martinez Arbizu: DZMB, Senckenberg am Meer, Wilhelmshaven, Germany

Édouard Metzger: Université d’Angers, UMR6112 CNRS LPG-BIAF – Bio-Indicateurs Actuels et Fossiles, Angers, France

Raphael Morard: MARUM, Center for Marine Environmental Sciences, University of Bremen, Bremen, Germany

Stefan Mulitza: MARUM, Center for Marine Environmental Sciences, University of Bremen, Bremen, Germany

Joanna Pawłowska: Institute of Oceanology Polish Academy of Sciences, Sopot, Poland


Cyril Obadia: Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland

Ludovic Orlando: Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark

Magne Østerås: FASTERIS SA, 1228 Plan-les-Ouates, Switzerland

Mikkel Pedersen: Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark

Loïc Pillet: ADMM UMR 7144, CNRS, Station Biologique de Roscoff, Roscoff, France

Xavier Pochon: Environmental Technologies, Coastal and Freshwater Group, Cawthron Institute, New Zealand; Institute of Marine Science, University of Auckland, New Zealand

Bettina Riedel: University of Vienna, Department of Limnology and Oceanography, Vienna, Austria

Meike A. Seefeldt: Ruhr-University Bochum, Bochum, Germany

Enrico Schwabe: Bavaria State Collection of Zoology, München, Germany

Michael Stachowitsch: University of Vienna, Department of Limnology and Oceanography, Vienna, Austria

Witold Szczuciński: Institute of Geology, Adam Mickiewicz University in Poznań, Poznań, Poland

Gritta Veit-Köhler: Senckenberg am Meer, DZMB – German Centre for Marine Biodiversity Research, Wilhelmshaven, Germany

Thomas A. Wilding: Ecology Department, SAMS, Scottish Marine Institute, Oban, Argyll, UK

Suzie A. Wood: Environmental Technologies, Coastal and Freshwater Group, Cawthron Institute, New Zealand; Environmental Research Institute, University of Waikato, New Zealand

Marek Zajączkowski: Institute of Oceanology Polish Academy of Sciences, Sopot, Poland

Martin Zuschin: University of Vienna, Department of Palaeontology, Vienna, Austria


Abstract

Our understanding of environmental diversity and the description of complex biological communities have greatly improved in the light of molecular methods. The promising approach referred to as metabarcoding allows the identification of species in the environment through the sequencing of a conserved DNA region, usually the ribosomal RNA gene (rDNA). High-throughput sequencing (HTS) revealed that the eukaryotic component of environmental communities remains poorly described and documented, but that the study of its diversity may yield rich ecological interpretations. However, improvements in sequencing depth and sample multiplexing capacity came with a variety of biases. With these biases arose a new generation of questions relevant to any metabarcoding study: Which dimensions of environmental DNA are transmitted through HTS, and what information do they convey? What sequence signal can be isolated from HTS data, and how does it reflect environmental patterns? This PhD thesis fits within the current metabarcoding research dynamic, notably the use of metabarcoding as a tool for ecosystem management.

The main objectives were (i) to identify and document the biases and dimensions of molecular data that characterize environmental HTS diversity analyses in order to (ii) apply metabarcoding to the reconstruction of past environments and to the biomonitoring of modern ecosystems. Our work focuses on Foraminifera not only because the laboratory has developed extensive knowledge of the molecular diversity of this protistan phylum, but also because it includes good palaeontological and ecotoxicological indicators, and hence represents an ideal model taxon. We also focus on the deep-sea benthic environment because it (i) hosts particularly diverse foraminiferal assemblages, (ii) represents an unexplored environment threatened by anthropogenic disturbance, for the monitoring of which metabarcoding shows promise, and (iii) features specific settings relevant to the investigation of biases related to the preservation (accumulation of extracellular DNA) and sampling (habitat micropatchiness) of environmental DNA.

Most of the present work is published. I endeavored to arrange the articles along a logical rather than a chronological thread. My general introduction is a review of the advances and challenges that pertain to HTS metabarcoding data generation and analysis (Chapter 1). I introduce aspects related to the taxonomic resolution of environmental rDNA markers and specific to environmental DNA material. I incorporate unpublished results that support giving priority to the examination of PCR and data processing methods over that of the various HTS platforms. I also present unpublished results of a re-analysis of all Sanger sequences generated from deep-sea sediments to establish a status quo and illustrate the transition from Sanger sequencing to HTS.


The first part covers the fundamental aspects of our research. We tested the use of RNA molecules to capture the diversity of active benthic foraminiferans, by Sanger sequencing (Chapter 2). We show that taxonomic assignments based on the full rDNA marker can be achieved with the hypervariable region targeted in subsequent HTS projects. The next chapter corresponds to our first HTS study, where only this region could be sequenced (Chapter 3). We highlight the use of internal controls for the filtering of this data set, composed of over 78 million sequences for 31 worldwide deep-sea samples. The re-analysis of this data set shows that HTS metabarcoding can retrieve in the abyss the fine biogeographic signal of planktonic foraminifera (Chapter 4). Then, I document the representativeness of HTS sequence data at the various sampling scales of a nested design to demonstrate the relevance of deep-sea sediment replicates (Chapter 5). Finally, I introduce the mistagging phenomenon responsible for cross-contamination events among multiplexed samples. We detail this phenomenon and propose a new filtering solution (Chapter 6). This chapter concludes the fundamental part of our research by underlining the underrated contamination issue, to which any HTS metabarcoding project is highly vulnerable.

The second part covers the applied aspects of our work. First, we test the analysis of ancient DNA preserved in subsurface marine sediments: metabarcoding is a promising way to access the paleontological record of the diversity represented by non-fossilizing taxa (Chapter 7). Yet, further research is needed, as the information conveyed by ancient DNA material does not necessarily corroborate that of the classical microfossil record (Chapter 8). Then, we test metabarcoding for the assessment of environmental quality. We analyze the response of benthic assemblages to the presence of fish farms based either on foraminiferans and de novo analyses of their diversity and distribution along environmental gradients (Chapter 9), or on metazoans by comparing biotic indices derived from HTS or from the picking of specimens (Chapter 10).

To conclude, I discuss the achievements of this thesis within a thorough review of the universal aspects of HTS metabarcoding data analysis and diversity patterns, which I present based on the results of the most recent literature (Chapter 11). The rapid progression of sequencing technologies and computational methods prompts further experiments in order to fully comprehend sequence sampling biases and ensure the recovery of quantitative data.


Table of Contents

1 Introduction

1.1 Background and motivation

1.2 Environmental diversity of eukaryotes

1.2.1 Ecological importance

1.2.2 The molecular revolution

1.2.2.1 The ribosomal RNA gene marker

1.2.2.2 Reference databases

1.2.3 The Foraminifera model

1.2.4 The deep-sea benthic environment

1.3 Metabarcoding

1.3.1 Approach description and relevance

1.3.2 Sequencing technologies

1.3.3 Biases and limitations

1.3.3.1 The problem of chimeras

1.3.3.2 The threat of contamination

1.3.3.3 Analysis pitfalls

2 Identifying active foraminifera: metatranscriptomics

2.1 Project description

2.2 Abstract

2.3 Introduction

2.4 Material and methods

2.4.1 Sampling sites and specimen sorting

2.4.2 RNA and DNA extractions, cDNA synthesis

2.4.3 PCR amplification, cloning and sequencing

2.4.4 Sequence analysis

2.5 Results

2.5.1 Sequence data

2.5.2 Sequence identification

2.5.3 DNA versus cDNA datasets

2.5.4 Bathymetric distribution

2.6 Discussion

2.7 Supplementary Information

3 Ultra-deep sequencing in deep-sea sediments

3.1 Project description

3.2 Abstract

3.3 Introduction


3.4 Material and methods

3.4.1 Sampling

3.4.2 DNA Extraction, Amplification, and Massive Sequencing

3.4.3 Reads Filtering

3.4.4 Reads Identification

3.4.5 OTU Delineation

3.4.6 Phylogenetic Reconstruction

3.5 Results

3.5.1 Short Tags Analysis

3.5.2 Foraminiferal Richness

3.5.3 Geographic Ranges

3.6 Discussion

3.6.1 DNA Microbarcodes Offer High Taxonomic Resolution

3.6.2 Cosmopolitan Taxa Are Widespread in the Deep Sea

3.6.3 Early Lineages Dominate Deep-Sea Foraminiferal Assemblage

3.6.4 Ultra-Deep Sequencing Offers a Powerful Tool for Exploring Deep-Sea Richness

3.7 Supplementary Information

4 Pelagic diversity from abyssal sediments

4.1 Project description

4.2 Abstract

4.3 Introduction

4.4 Material and methods

4.4.1 Data downloading

4.4.2 Data extraction and reads identification

4.4.3 Statistics

4.5 Results

4.6 Discussion

4.6.1 Towards molecular Paleoceanography

5 Patchiness of deep-sea benthic Foraminifera

5.1 Project description

5.2 Abstract

5.3 Introduction

5.4 Material and methods

5.4.1 Sampling sites and experimental planning

5.4.2 DNA extraction and PCR amplification

5.4.3 Library preparation and high-throughput sequencing

5.4.4 Sequence analyses

5.5 Results and discussion

5.5.1 High-throughput sequencing data set

5.5.2 Taxonomic composition and diversity

5.5.3 Replicate heterogeneity and common sequences

5.5.4 Nested samples representativeness

5.6 Conclusions

5.7 Supplementary Information


6 Accurate multiplexing and filtering for metabarcoding

6.1 Project description

6.2 Abstract

6.3 Introduction

6.4 Material and methods

6.4.1 Cloned sequence samples

6.4.2 Tagged primer design

6.4.3 PCR amplification, library preparation and sequencing

6.4.4 Multiplexing design experiments

6.4.4.1 Detection protocols

6.4.4.2 Saturated design and Latin Square Design

6.4.4.3 Mock community replicates

6.4.5 Computational analysis

6.4.5.1 Quality filtering and paired-end reads assembly

6.4.5.2 Clone sequence assignment

6.4.5.3 Mistagging-based filter

6.5 Results

6.5.1 Single-tagging mayhem

6.5.2 Double-tagging saturation

6.5.3 Latin Square Design

6.5.4 Mock community replicates and mistagging-based filtering

6.6 Discussion

6.7 Supplementary Information

7 Ancient DNA in deep-sea sediments

7.1 Project description

7.2 Abstract

7.3 Introduction

7.4 Material and methods

7.5 Results

7.5.1 Sediment type and age

7.5.2 Microfossils

7.5.3 Molecular data

7.6 Discussion

7.7 Supplementary Information

8 Ancient DNA in the Svalbard shelf

8.1 Project description

8.2 Abstract

8.3 Introduction

8.4 Material and methods

8.4.1 Study area

8.4.2 Sampling

8.4.3 Dating

8.4.4 Sorting and identification of foraminifera

8.4.5 DNA extraction, PCR amplification and cloning


8.4.6 High-throughput sequencing

8.4.7 Data analysis

8.5 Results

8.5.1 Sediment type and age

8.5.2 Micropalaeontological study

8.5.3 Molecular data

8.6 Discussion

8.7 Conclusions

8.8 Supplementary Information

9 Environmental monitoring on benthic foraminifera communities

9.1 Project description

9.2 Abstract

9.3 Introduction

9.4 Material and methods

9.4.1 Sampling

9.4.2 Environmental DNA and RNA extractions, cDNA synthesis

9.4.3 PCR amplification and high-throughput sequencing

9.4.4 Sequence analyses

9.4.5 Fixed foraminiferal assemblage

9.4.6 Redox

9.5 Results

9.5.1 NGS data statistics

9.5.2 Taxonomic assignment

9.5.3 Comparison of foraminiferan communities in molecular and morphological data

9.5.4 Foraminiferal community comparisons within and among DNA and RNA data sets

9.5.5 OTUs richness along environmental gradients

9.5.6 Statistical correlation between species abundance and environmental gradients

9.6 Discussion

9.6.1 High-throughput accuracy

9.6.2 Foraminifera as bioindicators

9.6.3 Advantages and limitations of NGS metabarcoding

9.7 Conclusions and perspectives

9.8 Supplementary Information

10 Metazoa sequencing and morphology for biomonitoring

10.1 Project description

10.2 Abstract

10.3 Introduction

10.4 Material and methods

10.4.1 Sampling

10.4.2 Molecular analyses


10.4.3 Bioinformatics . . . 244 10.4.4 Biotic indices . . . 245 10.5 Results . . . 245 10.5.1 Morphotaxonomic analyses . . . 245 10.5.2 HTS data statistics . . . 246 10.5.3 Taxonomic composition . . . 247 10.5.4 Metazoan OTUs distribution . . . 248 10.5.5 Biotic indices . . . 251 10.6 Discussion . . . 252 10.7 Supplementary Information . . . 257 11

General discussion and perspectives

273 11.1 HTS metabarcoding data dimensions and conveyed information . . . 273 11.1.1 Investigating active species and ecologically-relevant data . . . 273 11.1.2 Possible implications of specific environmental DNA features . . . 275 11.1.2.1 Extracellular DNA molecules . . . 275 11.1.2.2 The spectrum of DNA damage . . . 276 11.1.3 PCR controls on diversity and ecological signals . . . 277 11.1.3.1 PCR implementation choices . . . 277 11.1.3.2 From amplicons to operational units . . . 278 11.1.3.3 Rarity is patterned by relative abundance skews . . . 286 11.2 Future developments and challenges . . . 289 11.2.1 Sequencing technologies . . . 289 11.2.2 Sequence sampling . . . 290 11.2.3 Quantitative analysis . . . 291 11.2.3.1 rDNA copy number variation . . . 291 11.2.3.2 rDNA sequence polymorphisms . . . 292 11.2.3.3 On the Origin of rDNA . . . 292 11.3 Concluding remarks . . . 293

12 Other collaborations

12.1 Review: Next-Generation Environmental Diversity Surveys of Foraminifera: Preparing the Future

12.2 Accurate assessment of the impact of salmon farming on benthic sediment enrichment using foraminiferal metabarcoding

12.3 Palaeoceanographic changes in Hornsund Fjord (Spitsbergen, Svalbard) over the last millennium: new insights from ancient DNA

12.4 Foraminiferal survival after long-term in situ experimentally induced anoxia

12.5 Algal pigments in Southern Ocean abyssal foraminiferans indicate pelagobenthic coupling

12.6 The infauna of three widely distributed sponge species (Hexactinellida and Demospongiae) from the deep Ekström Shelf in the Weddell Sea, Antarctica

Appendix

List of Figures

Chapter 1

1.1 The eukaryotic tree and taxon sampling bias

1.2 nMDS of the aligned sequence distances for ciliates

1.3 Taxonomic skews of reference databases

1.4 Foraminiferal SSU rDNA hypervariables regions

1.5 Entropy profile of foraminiferal SSU rDNA sequence alignment

1.6 Foraminiferal reference sequence variability and HTS sequence assignment

1.7 Deep-sea Sanger sequence diversity

1.8 The HTS metabarcoding workflow

1.9 Comparison of Ion Torrent PGM and Illumina MiSeq sequence qualities

Chapter 2

2.1 Comparison of foraminiferal assemblages inferred from DNA and cDNA datasets

2.2 Taxonomic composition of metagenetic samples along bathymetric transects

2.3 Rose-bengal stained specimen found sequenced

2.4 Shared OTUs between three bathymetric range classes

Chapter 3

3.1 Taxonomic composition of deep-sea foraminifera communities

3.2 Phylogenetic diversity of assigned OTUs

3.3 Cosmopolitanism patterns in deep-sea foraminiferal OTUs

3.4 Rarefaction curves

3.5 Taxonomic resolution of the 37f hypervariable signatures

Chapter 4

4.1 Geographic location of the analyzed samples worldwide

4.2 Relative proportions of planktonic Foraminifera genotypes

4.3 MNDS of the samples similarities at three taxonomic resolution levels

4.4 Macro ecological pattern of spatial diversity known as the mid-domain effect

Chapter 5

5.1 Sample location across the Southern Ocean and sample scheme

5.2 Read sequence data processing and quality filtering statistics

5.3 Taxonomic composition of nested samples across core replicates

5.4 Abundance-ranked monothalamous Foraminifera for each station

5.5 Distribution of OTUs and reads among and within sampling scales

5.6 Distribution of the sequence reads in the nested samples per taxon


Chapter 6

6.1 Multiplexing designs and library preparation

6.2 Pairwise target clone sequence distances

6.3 Single-tagging mayhem

6.4 Double-tagging saturation

6.5 Latin Square Design and de-saturation

6.6 Re-sequencing of the same 10 PCR products on SAD or LSD

6.7 Number of reads per ISU closely-assigned to a clone

6.8 Mistagging-based filtering and intersection sets of PCR replicates

6.9 Conserved abundances between mock communities templates and ISUs

6.10 Taxonomic specificity of the reverse foraminiferal primer s15

6.11 Experimental designs and molecular workflow

6.12 Clone-to-sample heat maps for per taxon and run

6.13 Single tagging mayhem (replicate)

6.14 Primer-to-primer network for each single-tagging library

6.15 Example of mistagging pattern at the ISU level

6.16 Distribution of ISUs in each possible intersection of PCR replicate samples

6.17 Distributions of ISUs and reads per category of number of replicates

Chapter 7

7.1 Sediment composition: photomicrographs

7.2 Vertical changes in mean grain size in the ancient DNA cores

7.3 Foraminifera pictures from analyzed cores

7.4 Radiolaria pictures from analyzed cores

7.5 PCR amplifications of different size fragments in deep-sea subsurface sediment samples

7.6 Diversity across cores

7.7 Foraminiferans and radiolarians data based on micropalaeontology and ancient DNA

7.8 OTU richness and sequence abundance of taxa found in microfossil record

Chapter 8

8.1 Sampling location and bathymetry in the Hornsund fjord

8.2 Age–depth model of the studied core

8.3 Foraminiferal taxonomic composition: micropalaeontological vs. aDNA sequencing

8.4 Venn diagram of the taxonomic composition: cloning vs. HTS

8.5 Venn diagram of the foraminiferal diversity: morphology vs. DNA

8.6 Number of individuals and sequences for selected species

8.7 Taxonomic composition of the monothalamous foraminiferans

Chapter 9

9.1 OTUs distribution and taxonomic assignments: DNA vs. RNA

9.2 Comparisons of foraminifera communities across samples and type of molecule

9.3 Relationship between foraminiferal OTU richness and environmental gradients

9.4 NMDS plot of OTU sequence abundance relationships with environmental gradients

9.5 Sampling locations

9.6 Forward reads quality

9.7 Reverse reads quality

9.8 Abundance distribution vs. distance to cage the most sequenced species

9.9 Morphological vs. sequence counts for DNA and RNA

9.10 Shannon diversity vs. environmental gradients

Chapter 10

10.1 Morphological species and molecular OTU richness

10.2 Taxonomic composition

10.3 Sequence abundances of bioindicator taxa per class of distance to the cage

10.4 Infaunal Trophic Index (ITI) and AZTI’s Marine Biotic Index (AMBI)

10.5 Relationship between biotic index values derived morphological and molecular data

10.6 Sequencing quality: undetermined positions

10.7 Sequence reads filtering

10.8 Molecular vs. morphotaxonomy detection at each assigned taxonomic level

10.9 Non-metric multidimensional scaling plots

10.10 Proportions of taxa associated with an ecological groups

10.11 Raw sequencing read distribution and high-level taxonomy

10.12 Primer specificity of V4 to Metazoa

10.13 Presence of reference sequences for morphological species

10.14 Distribution of reads across taxa OTUs

Chapter 11

11.1 Distribution of foraminiferal HTS sequences across OTUs and samples


List of Tables

Chapter 1

1.1 Re-analyzed Sanger sequences dataset

Chapter 2

2.1 Samples location

2.2 Number of identification taxa per sample

2.3 Descriptive statistics of analyzed SSU rDNA and SSU rRNA datasets

Chapter 3

3.1 Sampling sites and collecting methods

3.2 Number of OTUs and reads present in one or more than one zone

3.3 Sequence data filtering statistics

3.4 Taxonomic distribution of OTUs at the order level

3.5 Cosmopolitan OTUs data

Chapter 4

Chapter 5

5.1 Samples coordinates

5.2 Sequences of the tagged primers used

5.3 Tagged primer combination used to label samples

5.4 HTS data filtering

5.5 Number of reads per taxon

5.6 Taxa detection in core replicates

5.7 Read abundance of partially occurring taxa

5.8 Representativeness of the replicate samples

5.9 Representativeness of the cores

5.10 Representativeness of the casts

Chapter 6

6.1 Tag to sample compositions of the 7 multiplexing libraries

6.2 Mock communities composition

6.3 Tagged primers

6.4 Read filtering statistics

6.5 Matches with less than 23 differences to one of the foraminiferal clone sequences

6.6 Global mistagging statistics

6.7 Number of (true and fake) samples found

6.8 Biases removal by replicate intersection filtering

6.9 Venn replicates statistics of various filtering

6.10 Summary of biases removal

Chapter 7

7.1 Sampling coordinates

7.2 Sediment samples ages

7.3 PCR primers

7.4 PCR primer sets, amplification conditions and massive sequencing

7.5 Granulometric data for each sediment layer

7.6 Foraminiferal microfossil counts

7.7 Radiolarian microfossil counts

Chapter 8

8.1 AMS 14C BP dates and calibrated dates of shell samples

8.2 Relative abundance (%) of dominant fossil taxa in the analysed samples

8.3 Species list for the micropalaeontologic record

8.4 OTUs list for the molecular record

Chapter 9

9.1 NGS reads filtering statistics

9.2 Reads and specimens abundances of the ten most abundant OTUs

Chapter 10

10.1 Sampling sites and metadata

10.2 Tag-to-sample information

10.3 Morphological inventory

10.4 Sequencing data filtering statistics

10.5 Pairwise Wilcoxon Mann-Whitney tests

10.6 Pairwise Wilcoxon Mann-Whitney tests among DNA or among RNA samples

Chapter 11


Chapter 1 Introduction

“Science, my lad, is made up of mistakes, but they are mistakes which it is useful to make, because they lead little by little to the truth.”

— Jules Verne

1.1 Background and motivation

Life is represented by an astonishing diversity of organisms interacting in a multitude of ecosystems that sustain the survival of our own species. Biodiversity is key to maintaining ecosystem services, as in the marine environment (Worm et al., 2006), where its magnitude is essentially unknown (Appeltans et al., 2012). Even though the conservation of biodiversity has become a priority, the study of biological communities in the environment remains challenging, even in the simple terms of their taxonomic composition and species richness. In addition to compiling biodiversity censuses, understanding the biogeography and distribution of species has always constituted another important challenge in order to better inform conservation and biomonitoring programs (Velasco et al., 2015). Hence, the efficiency of management practices depends not only on our ability to describe the diversity of communities in discrete environments, but also on our ability to analyze their response to past and present environmental changes.

The study of environmental diversity has undergone major developments in becoming a multidisciplinary field of research driven by the rapid advances of high-throughput DNA sequencing (HTS) technologies. Environmental genetic data can provide direct evidence on the taxonomic diversity and functioning of ecosystems, encouraging the development and testing of innovative molecular tools. The metabarcoding approach has proved particularly promising for exhaustive environmental diversity screening. It relies on the sequencing of a specific gene marker enriched from environmental DNA extracts, which usually corresponds to the ribosomal RNA gene (rDNA) used for species identification purposes (Taberlet et al., 2012a). It has been employed to investigate the diversity of microbial communities, and has revealed that a large proportion of both the prokaryotic and eukaryotic kingdoms corresponds to novel taxa and lineages. Beyond its undeniable power to discover biodiversity and thus to address the existence of a rare biosphere, HTS metabarcoding also has the capacity to survey numerous environmental samples simultaneously, thus allowing the design of a new generation of experiments for the testing of ecological hypotheses. However, the kinds of information that are conveyed by environmental DNA and that can be used for comprehensive environmental diversity surveys are not yet fully explored, and the various biases that affect the generation and analysis of HTS data are not fully understood. Hence, improving the methods and concepts associated with molecular approaches is not only justified, but mandatory in order to accompany the paradigm shifts that HTS is bringing about. This thesis reports the collaborative work of molecular biologists, taxonomists, informaticians, ecologists as well as geologists and paleontologists with whom I have explored the potential and reliability of the metabarcoding approach.

Our main research goals are to identify, document and, if possible, solve several technical limitations that hamper the accurate interpretation of interesting environmental diversity patterns. The generic terms "interesting" and "patterns" are chosen here on purpose because, even though HTS metabarcoding data essentially conveys taxonomic information relevant to the question "What is there?", there are many possible metabarcoding applications. Objectives and research questions should be clearly formulated in order to clarify which aspects of HTS metabarcoding data need to be addressed and analyzed in order to reveal meaningful diversity patterns. For instance, in our first application of HTS, we tested the ability of early HTS techniques to identify foraminiferal species in deep-sea sediment samples in order to decipher patterns of species dominance (Chapter 3). Hence, we relied on sequence counts and occurrence across samples. Our subsequent research required robust and well-planned experimental designs in order to describe patterns of sequence sampling (Chapter 5), of cross-contamination (Chapter 6), or of diversity response to environmental gradients (Chapters 9 and 10).

This thesis is a contribution towards more comprehensive surveys of environmental eukaryotic diversity, through the implementation of pioneering HTS experiments and analyses. I document and discuss the useful signals and associated biases that characterize the metabarcoding approach, and that need to be considered in applying it as a tool to address contemporary issues related to the sustainability of natural environments.


1.2 Environmental diversity of eukaryotes

The diversity of eukaryotes is often reduced to that of animals because animals comprise emblematic models in biology (e.g. Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans). However, the eukaryotic kingdom is composed of highly diversified and mostly microbial lineages, whose extant diversity is largely undersampled (Del Campo et al., 2014). From an environmental perspective, the undersampling issue is aggravated by the fact that the phylogenetic diversity is necessarily exceeded by the diversity of ecological species entities that interact and fill the niche space of various environments. This is especially the case for the supergroup Rhizaria (Fig. 1.1), to which the model group of this thesis, the Foraminifera, belongs, and within which highly specialized ecological species have evolved, associated with metabolic capabilities that remain to be characterized (Burki et al., 2013). The discovery of a new meiofaunal-sized species exhibiting equally surprising metabolic capabilities in an extreme, anoxic environment suggests that even the Metazoa remain undersampled (Danovaro et al., 2010b).

1.2.1 Ecological importance

Filling the eukaryotic diversity knowledge gap is beneficial not only for the reconstruction of evolutionary histories (He et al., 2014; Katz and Grant, 2015), but also for biomonitoring purposes. Indeed, current ecotoxicological biomarkers and biotic indices depend on autecological information (Haddad et al., 2008), or at least on the isolation of species for sensitivity experiments (Lelong et al., 2012), in order to ascertain the impact of pollution on community structure and composition (Vasseur and Cossu-Leguille, 2003). Unfortunately, the ecological optima of microbial eukaryotes and meiofaunal-sized metazoans are poorly known, whereas these small eukaryotes are readily identified by HTS metabarcoding. Hence, we explored the potential of HTS metabarcoding for biomonitoring either in the absence of autecological data using de novo analyses (Chapter 9) or by incorporating the scarce, yet available autecological knowledge (Chapter 10).

Microbes in the environment are often interpreted as metabolic units responsible for key biogeochemical processes (Falkowski et al., 2008), but the integration of these processes into ecological hypotheses is often limited to the bacterial component of microbial communities, e.g. to explain carbon cycling in soils (Schimel and Schaeffer, 2012) or carbon fixation in oceans (Hügler and Sievert, 2011). However, several microbial eukaryotic lineages are directly responsible for major ecosystem processes, such as primary production in oceans (Field et al., 1998), while entire lineages of microbial eukaryotes correspond to heterotrophic (e.g. Cercozoa, Fungi) or parasitic (e.g. Apicomplexa, Oomycota) taxa (Boudouresque, 2015). Hence, eukaryotic cells in the environment could possibly be interpreted as ecological units determining population dynamics and thus regulating the magnitude of these processes. It is noteworthy that the discipline of microbial ecology started with the experimental demonstration of the competitive exclusion principle based on protists (Gause, 1934). Despite this early start, in my opinion, the potential of molecular techniques for experimental (Charvet et al., 2014) or field microbial ecology (Foulon et al., 2008) has not yet been fully explored.

Figure 1.1: The eukaryotic tree and taxon sampling bias. This tree summarizes the diversity of eukaryotic lineages and the major taxa from which full genome sequences are available (thick branch: many genomes; dashed branch: no genome). From Burki and Keeling (2014).

1.2.2 The molecular revolution

Sequencing the DNA extracted from environmental samples to recover taxonomic information profoundly transformed our perception of eukaryotic diversity. A quarter of a century has elapsed since the first environmental DNA (eDNA) diversity surveys, which were based on the Sanger sequencing of PCR-amplified and cloned fragments of prokaryotic rDNA (Giovannoni et al., 1990; Ward et al., 1990), and about fifteen years since their first applications to eukaryotes (Diez et al., 2001; Moon-van der Staay et al., 2001). These pioneering studies revealed the breadth of the uncultured diversity and magnified species richness estimates of microbial eukaryotes in contrasting environments. Phylogenetic analyses based on the long eDNA sequences generated by the Sanger technology (> 1 kb) led to the identification of novel putative phyla such as the picobiliphytes (Not et al., 2007) or the rappemonads (Kim et al., 2011).

The sequencing of closely related phylotypes in various environments not only consolidated the eukaryotic tree, but also allowed the emergence of global distribution patterns. This has been perfectly exemplified by marine Stramenopiles (MAST), a group that comprises a clade biogeographically restricted to polar regions (MAST-4; Massana et al., 2006). Refined examinations of this clade indicated that, although it may not be the most diverse group, it may represent up to 10% of the heterotrophic eukaryotic cells in samples, and hence play an important role in trophic processes (Logares et al., 2012). At this point, I would like to indicate that I have already reviewed the state of the art in assessing eukaryotic diversity using molecular methods in my contribution to the only chapter dealing with eukaryotes in a recent textbook on the fundamentals and applications of environmental microbiology (Boudouresque, 2015).

Nevertheless, it is noteworthy that early molecular explorations of environmental sequence diversity resulted in the acquisition of valuable data that still constitute a strong basis for current methodological developments. Indeed, reference sequence databases include entries corresponding to phylotypes only known from environmental Sanger sequencing surveys, but no entries corresponding to undetermined HTS metabarcoding sequences. Moreover, the sequencing of numerous gene marker positions facilitates cross-study meta-analyses (which is one of the reasons why we always sequence the same fragment for foraminifera), and allows the design of robust probes for Fluorescence In Situ Hybridization (FISH) experiments in order to highlight the morphology of environmental phylotypes (Massana et al., 2006). Yet, interesting methodological pipelines are being assayed to uncover the morphotypes behind HTS metabarcoding sequences, which increases the detection sensitivity and confirms that such sequences are genuine (Gimmler and Stoeck, 2015).

Nowadays, HTS data is routinely included in environmental diversity surveys, at least as far as microbial communities are concerned. Hence, introducing the general outcomes of the molecular revolution is relevant to the contents of this thesis, where I expand on the results of the most recent HTS metabarcoding applications.

1.2.2.1 The ribosomal RNA gene marker

The ribosomal RNA gene (rDNA) is the main marker for the taxonomic assignment and diversity analysis of environmental sequencing data. The rDNA is present in all living organisms and thus conveys a universal evolutionary signal. It is used as a surrogate for the species entities defined by the phylogenetic species concept developed on the basis of analyses of the small subunit of the rDNA (SSU rDNA) (Woese et al., 1990). The rDNA provided the basis for a classification scheme relying on the taxonomic nomenclature of the Linnean system, including phylogenetically erected lineages in which HTS metabarcoding sequences could be placed. Prokaryotes are now classified according to a phylogenomic species concept built from metagenomic data. Indeed, complex metagenomic data have led to the discovery of new phyla providing support for deep branching relationships (Brown et al., 2015a), and to the extraction of rDNA sequence tags allowing the accurate description of environmental communities (Logares et al., 2014).

The situation is more controversial for microbial eukaryotes because insightful variation at the morphological or ecological levels is not necessarily corroborated by the variation of the rDNA sequence (Grattepanche et al., 2014). Indeed, it seems that the evolutionary signal conveyed by SSU rDNA sequences is scrambled by uneven rates of evolution and mutational saturation (Philippe and Laurent, 1998), and for phylogenetic purposes it has been largely replaced by the signal contained in arrays of protein-coding genes (Katz and Grant, 2015; Sierra et al., 2013). Even though the taxonomic assignment of environmental sequences and the phylogenetic analyses of rDNA sequences are different procedures, the former may rely on methods and knowledge developed for the latter. For instance, I apply phylogeny-based methods for the assignment of metazoan sequences in the study investigating metabarcoding for environmental monitoring using faunal biotic indices (Chapter 10).

The rDNA encodes the rRNA that during the maturation of the ribosome machinery folds to form stem-loop secondary structures. The primary structures of these stem-loops correspond to hypervariable sequence regions that are usually named by the letter V associated with a number indicating the rank of the stem-loop with respect to the full rRNA secondary structure. Usually, one or a few rDNA hypervariable regions are targeted by PCR amplification in order to describe the diversity of environmental communities. Various regions have been tested for eukaryotes [e.g. V1-V3 (Pochon et al., 2013), V2 (Geisen et al., 2015) or V9 (Amaral-Zettler et al., 2009; Pawlowski et al., 2011a; Brannock and Halanych, 2015)]. The V4 region has been recommended because it offers the highest taxonomic coverage (Pawlowski et al., 2012) and because it is present in most reference sequences (Hu et al., 2015). Moreover, the V4 region has been shown to better reflect the variation entailed within the full-length SSU rDNA (Dunthorn et al., 2012) as well as the results of shotgun sequencing metagenomics (Tremblay et al., 2015). Hence, we employed this region for our HTS metabarcoding survey of Metazoa (Chapter 10).

HTS metabarcoding essentially delivers taxonomic information, which allows not only the assignment of environmental sequences but also the delineation of environmental species, often referred to as Operational Taxonomic Units (OTUs) or Molecular Operational Taxonomic Units (MOTUs) (Blaxter et al., 2005). Intense research efforts have been devoted to reconciling or refining the morphological species classification by taking advantage of the taxonomic resolution of markers such as the rDNA or the COI genes (Puillandre et al., 2012a,b), independently for various taxa (e.g. brachiopods: Bitner and Cohen, 2015; sponges: Dohrmann et al., 2012). Indeed, before the metabarcoding application of a candidate marker region, it is wise to evaluate its taxonomic signal at various taxonomic levels, as different hypervariable regions offer different resolutions across taxa (Hadziavdic et al., 2014; Pernice et al., 2013). The taxonomic resolution of rDNA markers is being described in great detail for foraminiferans (Göker et al., 2010; Morard et al., 2015; Pawlowski and Lecroq, 2010), haptophytes (Egge et al., 2015b), ciliates (Dunthorn et al., 2012; Stoeck et al., 2014), diatoms (Luddington et al., 2012), marine stramenopiles (Massana et al., 2014), radiolarians (Decelle et al., 2014) or even at a coarse taxonomic level for all major eukaryotic lineages (Pernice et al., 2013). Extensive evaluations performed independently for each lineage are highly relevant because environmental sequences could first be roughly sorted into unambiguous, non-overlapping bins thanks to higher-level taxonomic sequence signatures (Lejzerowicz et al., 2014; Pawlowski et al., 2014a), sequence base composition (or k-mer frequencies) (Cole et al., 2007) or sets of diagnostic positions (Sarkar et al., 2008). Indeed, the distances derived from sequence alignments might confound taxonomic classification and lump together sequences from different higher-level taxa, as could be predicted for ciliates (Fig. 1.2).


Figure 1.2: Non-metric multidimensional scaling of the aligned sequence distances for ciliates. This figure demonstrates that a distance computed from the alignment of sequences belonging to completely different families could be smaller than that computed within a family, supporting the taxonomic binning of sequences prior to the computation of pairwise sequence alignments for clustering. Unpublished results.

Moreover, for each taxon bin, an optimal set of parameters could be defined in order to optimize the sequence alignments undertaken for precise, species-level assignments as well as for the clustering of sequences into OTUs. Indeed, the treatment of gaps in pairwise sequence alignments is generally set to some default behavior by different clustering algorithms (e.g. consecutive gaps counted as one gap in Swarm (Mahé et al., 2015), or each gap counted separately as in mothur (Schloss and Westcott, 2011)). This is highly relevant as some groups such as the ciliates and foraminiferans comprise naturally polymorphic species that only diverge in the length of homopolymer stretches, for which treating multiple contiguous gaps as one mutational event would be appropriate (Grattepanche et al., 2014). For ciliates, the primary structure of the SSU rDNA V4 region is valuable (Dunthorn et al., 2012), but its secondary structure also carries useful information at the genus level (Wang et al., 2015). In fact, ongoing advances towards the characterization of the evolution of ribosomal RNA folding and structure hold great promise, and may even allow the resurgence of SSU rDNA-based phylogenies. Indeed, recent releases of phylogenetic software now incorporate secondary structure information as discrete characters for inference (Stamatakis, 2014). Documenting how evolution shaped the primary (and if applicable the secondary) structure of diverse rDNA markers with respect to phylogenetic systematics and ecological preferences observed in the environment represents a long-term common thread that will form a strong basis for future descriptions of HTS marker taxonomic and ecological resolutions.
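The two gap-counting conventions mentioned in this paragraph can be illustrated with a minimal sketch. The function `count_differences` and its `collapse_gaps` parameter are hypothetical names introduced here for illustration only; the code does not reproduce the implementation of Swarm, mothur or any other tool. It assumes that the two sequences have already been aligned and padded with `-` to the same length.

```python
def count_differences(aligned_a, aligned_b, collapse_gaps=False):
    """Count differences between two aligned sequences of equal length.

    Each mismatched base counts as one difference. If collapse_gaps is
    False, every gap column counts as one difference; if True, a run of
    consecutive gap columns in the same sequence counts only once.
    (Hypothetical helper, not taken from any clustering software.)
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    differences = 0
    previous_gap_side = None  # which sequence carried a gap in the previous column
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a == "-" and base_b == "-":
            continue  # gap in both sequences: uninformative column
        if base_a == "-" or base_b == "-":
            gap_side = "a" if base_a == "-" else "b"
            # Count the gap unless it extends a run that was already counted.
            if not (collapse_gaps and gap_side == previous_gap_side):
                differences += 1
            previous_gap_side = gap_side
        else:
            previous_gap_side = None
            if base_a != base_b:
                differences += 1
    return differences


# Two variants differing only in the length of a homopolymer stretch.
long_variant = "ACGTTTTTACG"
short_variant = "ACGTT---ACG"
print(count_differences(long_variant, short_variant))                      # 3
print(count_differences(long_variant, short_variant, collapse_gaps=True))  # 1
```

Under the first convention the two variants are three differences apart, whereas under the second they differ by a single event, which may decide whether they fall into the same OTU at a given clustering threshold.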

1.2.2.2 Reference databases

Diversity analyses of environmental rDNA sequence data de facto benefit from extensive reference database resources, because the rDNA is the only marker sequenced for most species (Del Campo et al., 2014). In contrast to the mitochondrial COI marker, for which a standardized database system has been established (Ratnasingham and Hebert, 2007, 2013), the accumulation of rDNA sequences in public sequence repositories has been rather anarchic. Notable efforts to curate and homogenize taxonomic annotations led to the establishment of trustworthy rDNA databases such as SILVA (Yilmaz et al., 2014) and PR2 (Guillou et al., 2013). However, a major problem remains: databases are heavily skewed against microbial eukaryotes (Del Campo et al., 2014) (Fig. 1.3). Solutions are emerging from the endeavors of specialists to construct taxon-specific databases, such as the PFR2 database for planktonic foraminifera (Morard et al., 2015), PhytoREF for phototrophic organisms (Decelle et al., 2015) or the foramBarcoding Project for benthic foraminifera (http://forambarcoding.unige.ch).

The incompleteness of reference databases is a well-recognized limitation for taxonomic assignments. In some instances, an eDNA sequence could be only distantly related to its closest reference and thus assigned an incorrect taxonomic position, highlighting the need to fill database gaps using methods targeting poorly represented lineages (Lynch et al., 2012). The use of specific primers to retrieve rare sequences that regularly appear in HTS data is promising but cumbersome; as few as 1 out of 3 amplifications may be successful (Neufeld et al., 2008). In other instances, an environmental sequence may be closely related to the reference sequences of two different taxa, and hence could not be classified with confidence at the species level (Liu et al., 2008). In this case, an unclassified sequence could be assigned to a higher taxonomic level based on a taxonomic consensus of its closest matches (Lecroq et al., 2011; Lejzerowicz et al., 2013b; Pawlowski et al., 2014b), on phylogenetic placement methods (Dunthorn et al., 2014) or on computational methods allowing a trade-off between sequence quality and user-defined penalties (Clemente et al., 2011). Otherwise, the marker has to be changed or increased in length (He et al., 2013; Huse et al., 2008; Pernice et al., 2013), as suggested in the case of HTS sequences matching multiple full-length rDNA sequences from different species (Lie et al., 2014).
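The idea of a taxonomic consensus of the closest matches can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the cited studies: the function name `consensus_taxonomy` is hypothetical, each closest match is assumed to come with a lineage listed from the highest to the lowest rank, and the example lineages are simplified for the purpose of the demonstration.

```python
def consensus_taxonomy(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages.

    Each lineage is a list of taxon names ordered from the highest to the
    lowest rank. The result is the deepest assignment on which all the
    closest matches agree. (Hypothetical helper for illustration.)
    """
    if not lineages:
        return []
    consensus = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            consensus.append(ranks[0])
        else:
            break  # first disagreement: stop at the level above
    return consensus


# Simplified, illustrative lineages of two equally close reference matches.
closest_matches = [
    ["Eukaryota", "Rhizaria", "Foraminifera", "Rotaliida", "Elphidium"],
    ["Eukaryota", "Rhizaria", "Foraminifera", "Rotaliida", "Ammonia"],
]
print(consensus_taxonomy(closest_matches))
# ['Eukaryota', 'Rhizaria', 'Foraminifera', 'Rotaliida']
```

In this example the sequence cannot be assigned to a genus with confidence, so it is reported at the order level, which is the deepest rank shared by its closest matches.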

[Figure 1.3. Legend: Metazoa, Fungi, Embryophyta, Protists. Panel titles: Genomes, Described species, Total 18S rDNA, Environmental 18S rDNA. Totals printed on the panels: 2 001 573; 22 475; 1 165; 1 758.]

Figure 1.3: Taxonomic skews of reference databases. The proportions of metazoans, fungi, and land plants versus all the other eukaryotes in different databases: CBOL ProWG: Consortium for the Barcode of Life Protist Working Group (A), all 18S rDNA sequences (grouped at 97% similarity) in GenBank (B), only the environmental 18S rDNA (grouped at 97% similarity) in GenBank (C). Complete or draft genomes in GOLD: Genomes OnLine Database (D). The numbers indicate the total number of entries in each database. Adapted from Del Campo et al. (2014).
