Assessment of the deep-sea foraminiferal richness by massive sequencing with

CHAPTER 3: “HIDDEN” RICHNESS REVEALED BY MOLECULAR TOOLS

3.3 Assessment of the deep-sea foraminiferal richness by massive sequencing with

Béatrice Lecroq¹, Loïc Baerlocher², Laurent Farinelli², Magne Osteras², José Fahrni¹ and Jan Pawlowski¹

¹Department of Zoology and Animal Biology, University of Geneva, Switzerland.

²FASTERIS SA, 1228 Plan-les-Ouates, Switzerland.

Manuscript in preparation

Abstract

The development of massive sequencing methods has opened new perspectives for the assessment of species richness in environmental samples. This approach is based on the principle that species can be identified by very short DNA fragments. In this preliminary study, (1) we investigated the variable regions of the SSU rDNA in search of the most efficient barcode for the identification of foraminiferal species; and (2) we used the chosen barcode to assess the diversity of foraminifera in four sediment samples from the Arctic Ocean. Based on our large database of partial SSU rDNA sequences, we chose a foraminifera-specific expansion segment of helix 37 (region I) as having the highest barcoding potential. This region was amplified by PCR targeting foraminifera, and the 36 bp fragment situated at its 5’ end was massively sequenced with a Solexa analyzer. Some of the phylotypes resulting from these analyses were identified to the species level, confirming the efficiency of that fragment. It also appeared that the greatest part of the sequence data obtained corresponds to undetermined monothalamous taxa. We argue that most of them could be unknown rather than unidentifiable and would reflect a bias in our database, which is artificially enriched in large, well-studied species.

Introduction

It has become obvious over the last decades, mainly through the example of prokaryotes (Wintzingerode et al., 1997), that cultured organisms do not accurately reflect natural diversity. A new field of microbial molecular ecology has emerged to investigate environmental diversity. Environmental extractions have rapidly revealed numerous undescribed taxa, not only within Bacteria and Archaea but also among small eukaryotes (Dawson and Pace, 2002; Edgcomb et al., 2002; Epstein and García, 2008; López-García and Moreira, 2008; López-García et al., 2001; Moon-van der Staay et al., 2001).

Although this kind of analysis resolved the bias of selective laboratory culturing, it remained limited by different steps of the molecular process itself (mainly PCR amplification and cloning). Recently, new methods of massive sequencing that reduce cloning limitations were applied to environmental samples, focusing on microbial diversity (Huber et al., 2007; Miller et al., in press; Sogin et al., 2006). The uncovered genetic diversity of microbes was far beyond all previous predictions, leading to the discovery that the majority of phylotypes are represented by only a few sequences (rare biosphere). In order to capture the greatest possible part of the environmental richness (including rare taxa), technologies shifted toward an increasing number of short sequences rather than the analysis of fewer, longer sequences.

Therefore, to develop a massive sequencing approach for environmental surveys, it is crucial to find a relevant, short genetic marker that is sufficiently universal and informative for the taxonomic group of interest. For foraminifera, ribosomal DNA appears to be a good region to start this investigation.

The ribosomal RNA genes are widely used in molecular systematics of foraminifera. They are commonly applied to resolve the phylogenetic relationships between and within major taxonomic groups (de Vargas et al. 1997, Pawlowski et al. 2002, 2003, reviewed in Pawlowski 2009), to place new foraminiferal species (Pawlowski et al. 2002, Cedhagen et al. 2009) and to examine the intraspecific variations in particular morphospecies or genera, in relation to their geographic distribution and morphology (de Vargas et al. 1999, Tsuchiya et al. 2000, Holzmann and Pawlowski 2000, Hayward et al. 2004, Darling et al. 2004, Pawlowski et al. 2008). There are almost 3000 sequences of ribosomal RNA genes of foraminifera deposited in GenBank (2917 in April 2009). Most of these sequences correspond to a 3’ fragment of the SSU rDNA, which is most commonly used in molecular phylogenetic studies of foraminifera. The other sequences correspond to the ITS region, which is used to examine the intraspecific variations (de Vargas et al. 2001, Pawlowski et al. 2007, Tsuchiya et al. 2009) and to a 5’ fragment of the LSU gene (covering the variable regions D1 and D2), which was specifically used to analyse the genetic variations in the genus Ammonia (Holzmann et al. 1996, Holzmann and Pawlowski 2000, Hayward et al. 2004).

Foraminiferal ribosomal genes are highly divergent compared to those of other eukaryotes. They possess numerous substitutions in the most conserved regions of the SSU, resulting in a strong acceleration of their stem lineage (Pawlowski and Berney 2003). They are also much longer than typical eukaryotic genes. The length of foraminiferal SSU rRNA genes ranges from 2300 to over 4000 bp, while in other eukaryotes the length of these genes averages 2000 bp. Most of the additional length is due to a series of expansion zones and insertions (Pawlowski 2000), which are especially prominent in the SSU rDNA but are also found in the LSU rDNA. Some of these variable regions are specific to foraminifera; the others are expansions of typical eukaryotic variable domains. They are present in the RNA, as suggested by reverse transcriptase (RT) sequencing of the rRNA (Pawlowski et al. 1994, 1996) and RT-PCR experiments (Habura et al. 2004). The position of the variable regions in the 3’ fragment of the SSU was established based on the predicted secondary structure in the case of rotaliid foraminifera (Ertan et al. 2004). Their variations in different taxonomic groups were analysed by Habura et al. (2004) and Grimm et al. (2007).

The rates of substitution in variable regions of foraminiferal rRNA genes are unusually high. They vary between the taxonomic groups (Pawlowski et al. 1997), with the most rapid evolutionary rates observed in some planktonic species (de Vargas and Pawlowski 1998).

Changes in variable regions of the SSU rDNA were used to distinguish geographically separated populations of subpolar planktonic morphospecies and to define their genotypes (Darling et al. 2000). Analysis of these regions also revealed an unexpectedly high diversity of monothalamous foraminifera (Pawlowski et al. 2002). The variable regions D1 of the LSU rDNA were used to revise the taxonomy of the genus Ammonia (Holzmann and Pawlowski 2000).

All these studies show that the variable regions of the SSU and LSU rDNA in foraminifera have the potential to become good barcodes for species identification in this group.

However, in most of the cited studies, the analysed rDNA fragment contained several variable regions, and the taxonomic resolution of each of them is unknown. Nothing is known about the minimum length of rDNA fragment necessary for species distinction in foraminifera and other groups of eukaryotes. Examining this issue is crucial for the application of massive sequencing technologies to species identification and for the interpretation of the short sequences produced by these methods. Therefore, before starting our massive sequencing experiments, we first examined six variable regions of the SSU one by one and evaluated their level of taxonomic resolution for selected species and genera.

Once we had selected the most promising barcoding region, we applied it to examine the foraminiferal richness in four sediment samples from the Arctic Ocean. The total environmental DNA was extracted and selectively amplified to target only foraminiferal species. An extremely short fragment of the PCR product (designated as a relevant barcode) was then massively sequenced with a Solexa analyser to assess the foraminiferal richness in our samples.

Material and Methods

Identification of barcode for foraminiferal taxa

An alignment of the SSU rDNA 3’ fragment, including about 1000 sequences representing all major taxonomic groups of foraminifera and large databases for selected genera and species, was analysed. The alignment was carefully checked using Seaview (Galtier et al. 1996), with special attention paid to the variable regions. The sequences of each region were cut out of the global alignment and analysed separately. The foraminifera-specific regions were identified based on an alignment of 300 eukaryotes (Berney and Pawlowski 2006). For each region, the secondary structure model was predicted using the MFold program as implemented at http://mfold.bioinfo.rpi.edu/. The limits of each region were refined based on secondary structure models previously published for foraminifera (Ertan et al., 2004; Grimm et al., 2007; Habura et al., 2004) and other eukaryotes (Wuyts et al. 2002). The divergence between sequences was calculated using the DNA distance matrix program implemented in BioEdit 7.0.5.2.
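To make the notion of sequence divergence concrete, the following is a minimal sketch of an uncorrected pairwise distance (the proportion of differing sites) between two aligned sequences. It is an illustration only: the function name `p_distance` and the handling of gaps and ambiguous bases are assumptions, and the distance program in BioEdit may apply a different substitution model.

```python
def p_distance(seq_a, seq_b, gap_chars="-."):
    """Proportion of differing sites between two aligned sequences.

    Assumption (not from the source): sites where either sequence has a
    gap or an ambiguous base "N" are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars or a == "N" or b == "N":
            continue  # site not comparable
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared
```

Applied to every pair of sequences in one variable region, such a function would yield the kind of divergence matrix used to compare regions.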

Sediment sampling and DNA extraction

Four surface sediment samples (SFA2-5), whose depths and coordinates are presented in Table 3.3.1 and Fig. 3.3.1, were collected during the RV Polarstern cruise ARK XXII/2 (Arctic Ocean, 2007). For each sample, 5 mL of the top centimetre of sediment was collected and immediately frozen. The PowerMax™ Soil kit (Mo Bio) was used to extract DNA from each 5 mL sediment sample according to the manufacturer's protocol (except for step 4, where the vortexing time was extended to 40 minutes). Extraction products were then stored at -20°C for further analyses. Additionally, DNA of the cultured foraminiferan Reticulomyxa filosa was extracted and processed in the same way as the environmental samples. Reticulomyxa filosa DNA was used as a reference (SFA6) for comparison with the other samples and to improve our interpretation of the sequencing results.

Table 3.3.1. Depth, coordinates and sampling method (BC: box corer, MC: multicorer)

                 SFA 2       SFA 3       SFA 4       SFA 5
Depth            3538 m      3519 m      221 m       4443 m
Coordinates      83°42'58N   83°41'54N   80°59'40N   87°4'14N
                 60°38'27E   60°40'55E   34°0'13E    104°39'57E
Collected with   BC          MC          MC          BC

Figure 3.3.1. Map of the sampling

DNA amplification and massive sequencing

In the first step, partial SSU rDNA (about 400 bp) was amplified by PCR (15 cycles, annealing temperature of 50°C) with a set of foraminifera-specific primers containing Solexa adaptors at each end. In the second step, PCR products were reamplified (for 10 additional cycles) and attached to the surface of the Solexa flow cell channels by the adaptors. The last step consisted of solid-phase bridge amplification and sequencing by incorporation of labelled reversible terminators (for the detailed method see Appendix F, Fig. F1).

Reads analysis and phylotypes identification

The millions of 36 bp sequences or “reads” obtained for each sample were analysed by grouping the reads into phylotypes and then identifying the phylotypes.

Grouping the reads into phylotypes:

All datasets were screened to select only sequences meeting the required quality. This reduced the number of sequences for all samples but provided datasets with a lower error rate. A certain minimum average base quality value was required over the whole read, and a maximum of one base of lower quality was allowed. Reads with more than 30 identical bases were discarded, as well as reads containing more than one “N”.
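The filtering rules above can be sketched as a single predicate. This is a hedged illustration, not the pipeline actually used: the text does not give the quality thresholds, so `min_mean_q` and `low_q` are placeholder values, and "more than 30 identical bases" is read here as more than 30 occurrences of the same nucleotide in the read.

```python
def passes_quality(read, quals, min_mean_q=20.0, low_q=10,
                   max_low_q=1, max_identical=30, max_n=1):
    """Return True if a read satisfies the quality rules described in the text.

    min_mean_q and low_q are hypothetical thresholds (the source only says
    "a certain minimum"); the other defaults follow the text.
    """
    if not read or len(read) != len(quals):
        return False
    # Minimum average base quality over the whole read
    if sum(quals) / len(quals) < min_mean_q:
        return False
    # At most one base of lower quality
    if sum(q < low_q for q in quals) > max_low_q:
        return False
    # At most one ambiguous base "N"
    if read.count("N") > max_n:
        return False
    # Discard low-complexity reads dominated by a single nucleotide
    if max(read.count(base) for base in "ACGT") > max_identical:
        return False
    return True
```

For example, a 36 bp read made of a single repeated nucleotide is rejected whatever its quality values, while a mixed read with uniformly high quality passes.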

The datasets were profiled for the copy number of each sequence, leading to a list of unique sequences and a number of reads for each of them. Sequences with fewer than two reads were discarded before the grouping step, as well as those with one “N” or with 30 or more identical bases. The profiled files containing the filtered unique sequences were screened from the most abundant sequences to the least abundant ones. Sequences with 1 to 3 mismatches were clustered into groups represented by “reference sequences”. Sequences were considered one by one (from that with the highest copy number to that with the lowest) and compared with the reference sequence of each group. Each sequence was attributed to the group whose reference sequence had the highest copy number and not more than 3 mismatches with the sequence considered. If no such group existed, a new group was created with the current sequence as its reference. If a sequence could be attributed to one group with 3 mismatches and to a second group of lower total reads with only one or two mismatches, it was nevertheless attributed to the second group.
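The grouping procedure can be sketched as a greedy clustering over unique sequences sorted by abundance. The code below is one possible reading of the description, under stated assumptions: all reads have the same length, quality filtering has already been applied, and a sequence goes to the group with the fewest mismatches, ties being resolved in favour of the group whose reference has the higher copy number. The names `mismatches` and `group_reads` are hypothetical.

```python
from collections import Counter


def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def group_reads(reads, min_copies=2, max_mismatches=3):
    """Greedy grouping of reads around abundant reference sequences."""
    counts = Counter(reads)
    # Unique sequences, most abundant first (alphabetical order breaks ties)
    uniques = sorted(
        ((seq, n) for seq, n in counts.items() if n >= min_copies),
        key=lambda item: (-item[1], item[0]),
    )
    groups = []  # ordered by decreasing copy number of the reference
    for seq, n in uniques:
        best = None
        for group in groups:
            d = mismatches(seq, group["reference"])
            # Strict "<" keeps the earlier (more abundant) group on ties
            if d <= max_mismatches and (best is None or d < best[0]):
                best = (d, group)
        if best is None:
            groups.append({"reference": seq, "members": {seq: n}})
        else:
            best[1]["members"][seq] = n
    return groups
```

Each returned group holds its reference sequence and the copy number of every member, so the total number of reads per group is the sum of the member counts.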

When two sequences are compared during the grouping step, indels can artificially increase the number of mismatches, since all the bases can be shifted. In such a case, sequences are wrongly attributed to two distinct groups even if they are closely related. For that reason, after the grouping step, all the reference sequences were aligned with ClustalX program (Larkin et al. 2007) to detect misalignment. Phylotypes were finally formed, based on this alignment, by referring to sequence divergence values. The threshold value for phylotypes distinction was fixed at 0.0834 (i.e. three mismatches for 36 bp).
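As a quick check of the threshold quoted above (plain arithmetic, not an additional result), the divergence $d(k)$ corresponding to $k$ mismatches over an ungapped 36 bp read is:

```latex
d(k) = \frac{k}{36}, \qquad
d(3) = \frac{3}{36} \approx 0.0833, \qquad
d(4) = \frac{4}{36} \approx 0.1111
```

A threshold of 0.0834 thus lies just above $d(3)$ and well below $d(4)$: pairs differing by up to three positions fall under the threshold, whereas pairs differing by four or more exceed it. This assumes that divergence is computed over all 36 positions without gaps.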

Identification of phylotypes:

Phylotypes were identified according to our foraminiferal database, which contains a total of 1248 partial SSU rDNA sequences and includes described and undescribed species as well as undetermined species from environmental samples or those squatting the tests of other foraminifera. This database was used to BLAST the phylotypes of each sample, except SFA6 (R. filosa). The relevance of the 10 best matches resulting from this local BLAST search was evaluated using the following criteria: a sequence matching any of our 36 bp phylotypes should share a similar region longer than 14 bp, and this region should start at one of the first 3 bases of the phylotype.
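The two relevance criteria can be written as a small filter over match coordinates. This is a sketch under assumptions: matches are described by 1-based, inclusive start and end positions on the 36 bp phylotype, and the tuple layout `(subject_id, query_start, query_end)` is hypothetical rather than a real BLAST output format.

```python
def is_relevant_hit(query_start, query_end, min_length=15, max_start=3):
    """Apply the two criteria from the text to one match.

    "Longer than 14 bp" is taken as a matched length of at least 15, and the
    match must begin at position 1, 2 or 3 of the phylotype.
    """
    length = query_end - query_start + 1
    return length >= min_length and 1 <= query_start <= max_start


def relevant_hits(hits, max_hits=10):
    """Keep the relevant matches among the best `max_hits` matches.

    `hits` is assumed to be sorted from best to worst and to contain
    (subject_id, query_start, query_end) tuples.
    """
    return [h for h in hits[:max_hits] if is_relevant_hit(h[1], h[2])]
```

A match covering positions 2 to 16 is kept (15 bp, starting within the first three bases), whereas one covering positions 4 to 36 is rejected because it starts too far into the read.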

In the case of conflict between several satisfying results at the species level, the sequences were compared at the generic level. In the case of a second conflict, the following groups were considered: monothalamous, textulariids, rotaliids, lageniids and miliolids. In the case of a third conflict between the results, the phylotype was defined as “undetermined”. In this way, phylotypes were assigned to the most precise taxon possible.
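The stepwise conflict resolution amounts to finding the most precise rank on which all relevant matches agree. The following sketch assumes that each match carries a species, a genus and a major group label; the dictionary keys and the function name `assign_taxon` are illustrative, not taken from the original pipeline.

```python
def assign_taxon(hits):
    """Return (rank, name) for the most precise rank shared by all matches.

    `hits` is a list of dicts with the hypothetical keys "species", "genus"
    and "group". If the matches disagree at every rank, or if there is no
    match at all, the phylotype is reported as undetermined.
    """
    if not hits:
        return ("undetermined", None)
    for rank in ("species", "genus", "group"):
        names = {hit[rank] for hit in hits}
        if len(names) == 1:  # no conflict at this rank
            return (rank, names.pop())
    return ("undetermined", None)
```

Two matches to different species of the same genus would therefore yield a genus-level assignment, and matches spread over several major groups would yield “undetermined”.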

Results and discussion

Identification of barcode for foraminiferal taxa

1) Characterization of variable regions

The 3’ fragment of the SSU rDNA analysed in the majority of foraminiferal studies comprises six variable regions (Fig. 3.3.2). Three of these regions are specific to foraminifera, while three others correspond to the typical eukaryotic variable segments. The nomenclature of these regions changes from author to author (Table 3.3.2).

Table 3.3.2. Names given to the different regions in the literature

Reference               I       II      III     IV      V       VI
Ertan et al. 2004       F1      F2      V5      V6      F3      V7
Habura et al. 2004      I       II      na      na      na      na
Grimm et al. 2007       37/e1   41/e1   V7      45/e1   46/e1   Tp49
Schweizer et al. 2008   F4      F5      V7      V8      F6      V9
This study              37/f    41/f    43/e    45/e    45/f    49/e

Figure 3.3.2. Schematized secondary structure of the SSU rRNA gene showing the analysed fragment (in red) and the position of the six variable regions (modified after Schweizer et al. 2008)

Figure 3.3.3. The predicted secondary structure of the foraminifera-specific expansion zone 37/f of the SSU rRNA for representatives of the major groups of foraminifera: monothalamids (R. filosa, M. hyalostriata), rotaliids (E. exigua), textulariids (Textularia sp.) and miliolids (Marginopora vertebralis).

Here, we named the variable regions according to the number of the helix concerned, followed by the letter “f” or “e” to indicate whether the region is found exclusively in foraminifera or is also present in other eukaryotes, respectively. The position of each region, its length and its primary and secondary structure are presented below.

Helix 37/f (region I): This expansion zone is situated in the prolongation of helix 37 (Fig. 3.3.2). It starts as a large loop followed by a long helix, interrupted by a small loop in some species (Fig. 3.3.3). At the 5’ end, it is delimited by GG, which remains unchanged in almost all sequences, except in the spirillinids and Ammodiscus, which possess GA and GT, respectively. Moreover, the first nucleotides remain relatively unchanged across the taxonomic groups. At the 3’ end, the region usually ends with T(C)AAATA, but this end is much more variable and may also be TAAAATA (Micrometula), TAAATTA (Cylindrogullmia) or CAAAGA (Capsammina). The region can be extremely short (13 nt) in some freshwater environmental sequences (EnvLeman27) but can exceptionally reach up to 123 nt in a monothalamid, “Notrhabdammina”.

Helix 41/f (region II): This region is an expansion of the loop that connects helices 39, 40 and 41. According to the predicted secondary structure, it is composed of several helices of varying length separated by small loops. This region was extensively discussed by Ertan et al. (2004) and Habura et al. (2004), who presented secondary structure models for several species. However, the borders of this region are not well defined. Based on a rotaliid sequence database, Ertan et al. (2004) proposed that the region starts with TCTATA at the 5’ end and terminates with GAAAGC at the 3’ end. Our alignment shows some variability in the preceding bases (TTT, which changes to TCT in Crithionina and TAT in Carterina), and we therefore include these bases in this region. At the 3’ end, 41/f terminates with GAAA in most foraminifera, although this sequence can be strongly modified in some species (Hippocrepinella hirudinea). The 41/f region is much longer than 37/f, with its length varying from 87 nt in Vellaria to 360 nt in Crithionina granum, and an average length of 150-200 nt in most species.

Helix 43/e (V7) (region III): This region corresponds to helix 43 and its extension - helices 43/e1-43/e4 in the secondary structure model of Wuyts et al. (2002). At the 5’ end, the region is delimited by a relatively stable motif CTTGTT, which changes to TTTATT in miliolids, and is usually followed by a motif GCC (GCT in miliolids), which forms the beginning of helix 43. At the 3’ end, the variable region includes the motif GGC at the end of helix 43, followed by a few variable bases, which form a loop preceding the helix 43, and terminates with the conserved motif AACTAGAG. In most foraminifera, helix 43 is followed by a loop and a short helix. The region is relatively short in most species, counting only 26 nt in Psammosphaera-like 3918 and 28 nt in Cribrothalammina alba, but can exceptionally reach up to 279 nt in Notodendrodes hyalinosphaira and 278 nt in Astrammina rara.

Helix 45/e (V8) (region IV): This region is situated between helices 45 and 46. It corresponds to the region 45/e1 in Wuyts et al. (2002) and is considered part of the eukaryotic variable region V8 (Grimm et al. 2007). In many eukaryotes, the region comprises two loops separated by a short stem GAG/CTT. In foraminifera, the expansion region 45/e1 starts after GAG, with a large loop followed by a stem of varying length. The region begins with the motif CATCTC and ends with the motif GGTAAAG. Its length varies from 30 nt in Androsina to 215 nt in Notodendrodes (1931) and 202 nt in Hemisphaerammina bradyi.

Helix 45/f (region V): This region is situated just after helix 45’ and is also considered part of the region V8 (Grimm et al. 2007). In most eukaryotes, helix 45 is followed by a large loop starting with a characteristic, more or less modified motif GTGATGGGG. In foraminifera, helix 45 differs from that of other eukaryotes and is followed by a relatively short expansion zone specific to this group. At the 5’ end, the flanking conserved sequence ATGATT corresponds to the complementary branch of helix 45 (45’). The region ends with the sequence GTCAATT, followed by a conserved motif CATGGTGGGG. The length of 45/f varies from 19 nt in Cyclorbiculina to 266 nt in an undetermined allogromiid 1847 and 248 nt in Leptammina spp.

Helix 49 (V9) (region VI): This region corresponds to helix 49, which is present in all eukaryotes and is designated as variable region V9 (Neefs et al. 1990). It is a very long stem situated at the 3’ end of the SSU rRNA, followed by a short helix 50 (Wuyts et al. 2002).

In the majority of foraminifera, helix 49 starts with the sequence CTCTTA and terminates with the complementary sequence AAAGAG. The first 30 bases of the helix are conserved in most foraminiferal species. For this reason, and because some of our sequences lack the terminal part of the SSU, we analysed here only the most variable part of helix 49, starting at TTTGAG and ending at CTTAAA.

Table 3.3.3. Sequences of the six variable regions and the flanking conserved zones.

#     Helix    Conserved region (5’)   Variable region       Conserved region (3’)
I     37/f     GGATTGACA               GGC ... TAAATA        TGCTAGTCC
II    41/f     TTAATTGCG               TTT ... GAAA          GCAACGAA
III   43/e     CTTGTT                  GCC ... GGCTNNN       AACTAGAGGG
IV    45/e1    CAGTGAG                 CATCTC ... GGTAAAG    CCTGCTTCGAA
V     45’/f    TAATGATT                TCCT ... AATT         CAAGGTGG
VI    49       GTGAG                   TTTGAG ... CTTAAA     CGAACAG
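Given the flanking conserved zones listed in Table 3.3.3, a variable region can in principle be located in a sequence by searching for its two flanks. The sketch below uses exact string matching, which is a simplifying assumption: as described in the text, the flanks vary in some taxa, so a real analysis would rely on the curated alignment rather than on this toy function. The name `extract_region` is hypothetical.

```python
def extract_region(sequence, left_flank, right_flank):
    """Return the segment lying between two conserved flanking motifs.

    Returns None if either flank is not found. RNA input is accepted by
    converting U to T; matching is exact and case-insensitive.
    """
    seq = sequence.upper().replace("U", "T")
    start = seq.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)  # the region begins right after the 5' flank
    end = seq.find(right_flank, start)
    if end == -1:
        return None
    return seq[start:end]


# Flanks of region I (37/f) as listed in Table 3.3.3
REGION_I_FLANKS = ("GGATTGACA", "TGCTAGTCC")
```

Calling `extract_region(seq, *REGION_I_FLANKS)` on a sequence containing both flanks returns the 37/f segment, from which the 5’ portion used as the barcode could then be taken.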

2) Taxonomic resolution of variable regions

In order to evaluate whether foraminiferal species can be distinguished using the different variable regions, we analysed, for each region, the divergence between and within sister morphospecies, the intraspecific polymorphism, and the ability to recognise phylotypes.

Sister species divergence:

We calculated the sequence divergence between sister morphospecies and within those complex morphospecies that show evidence of cryptic speciation usually related to geographic distribution (Table 3.3.4). We selected seven pairs of morphospecies among monothalamids and rotaliids, which belong to the same genus and are morphologically well defined. For comparison of cryptic species, we selected seven monothalamid morphospecies, for which representatives of different populations usually from geographically distant regions were sequenced. The most closely related populations were analysed.

Our comparison shows that the sequence divergence values strongly vary between the regions and the examined pairs of species. In general, the species or populations can be
