• Aucun résultat trouvé

Mock community replicates and mistagging-based filtering

6.1 Project description

6.5.4 Mock community replicates and mistagging-based filtering

We analyse the distribution of mistagging events in two libraries containing PCR replicates of mock community samples. In each library, the reads are evenly distributed among

sam-ples notwithstanding 25% of non-critical mistags (Table 6.7: Supplementary information).

In every PCR replicate of a sample are present all the ISUs perfectly matching the reference sequences of the clones that compose this sample (i.e. reference ISUs), along with other ISUs that have slight errors but could still be assigned to these clone sequences (i.e. error ISUs) as well as critical mistag ISUs (4.7% on average). Most critical mistag reads found in a given sample are assigned to clones ascribed to another sample of the same library (88.5% on av-erage) (Table 6.8: Supplementary information), but never to a clone incorporated as a rare

5 3

17 17

3 8

3

19 9

1

18 8

2 16

3 1

17 7

1

18 7

2

16 5

2

16 6

1 17

a b c d e f g h i j

SFA−123SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124 SFA−123 SFA−124

15−R − F1−B 15−T − F1−B 15−R − F1−C 15−S − F1−C 15−S − F1−D 15−T − F1−D 15−V − F1−D 15−U − F1−E 15−V − F1−E 15−U − F1−F

foram14 foram63 foram6 foram35 foram20 foram24 foram36 foram73 foram72 foram1

foram86 foram50 foram83 foram76 foram2 others

Figure 6.6: Comparison of the 10 PCR product samples sequenced on two separate runs.

Each of the 10 PCR products re-sequenced in either a LSD (SFA-123) or a Saturated Design (SFA-124) correspond to one sample and a clone (a: F1-B +15-R, b: F1-B + 15-T, c: F1-C + 15-R, d: F1-C + 15-S, e: F1-D + 15-S, f: F1-D + 15-T, g: F1-D + 15-V, h: F1-E + 15-U, i: F1-E + 15-V, j: F1-F + 15-U). The top row displays venn-euler diagrams of the assignments recovered in each sample (purple circle: SFA-123, green circle: SFA-124). The compositions of the re-sequenced sample in terms of relative read abundance are detailed in the vertical bars. For each sample, the correct clone used as template is not included in the bars. The read abundance of these clones are displayed in the pie charts (upper: SFA-123, below: SFA-124) relatively to all the other reads (black). The correct clones are boxed in the legend.

template. This observation supporting that mistagging events happen at the library level is clearly illustrated in SFA-125 where the cross-contaminated samples contain no more than five clones (Fig. 6.7). This is further confirmed by the global distribution of clones across the libraries of each run (Fig. 6.12).

●● foram62 foram7

foram75

foram27 foram7 foram15 foram15foram2

foram25foram3 foram19foram2

foram23 foram24

foram26 foram27

foram34 foram35foram4

foram43foram5 foram50

foram63foram7 foram76

foram83 foram86

SFA126 / random

log10(number of reads)

Figure 6.7: Box plots of the number of reads per ISU assigned to a clone with 1, 2, 3 or more than 4 differences. The results are shown separately for each library and mock community.

At the position of each clone name corresponds the group of ISU with the lowest number of difference(s) to this clone and the number of reads in the ISU perfectly matching this clone (blue dot). All the clones found in each mock community are displayed, including the expected clones (black) and the clones resulting from a critical mistagging event (red). The numbers of reads are displayed on a log10 scale.

Mistagging events are importing ISUs as cohorts into samples labelled either with ex-pected (critical mistags) or unexex-pected combinations (non-critical mistags), but with the relative read abundances of the sample from which they originate. In the case of critical mistags, we found that the read abundance ratios between a reference ISU and each of its error ISUs are conserved across samples, i.e. irrespectively of whether the ISU correspond to an expected or a mistag clone. This is exemplified across the two samples of SFA-126, where one error ISU is more abundant than the reference ISU of three clones (foram18, 70 and 78) (Fig. 6.7). In the case of non-critical mistags, we found that the most reads are labelled with unexpected combinations sharing at least one of the initial tagged primers (Fig. 6.15: Sup-plementary information). For each ISU, the correctly tagged copies become outliers standing above the ‘mistagging noise’ represented by the read abundance distribution of non-critical mistags. The rationale of our ISU-centred filtering approach is based on this observation and we evaluated its effect on the ability to recover the diversity of mock community samples using PCR replicates intersection sets.

Taking the ISUs at the intersection of PCR replicates discards numerous, but rare er-ror ISUs, while the mistagging-based filter removes few, but abundant critical-mistag ISUs that can be sequenced in all replicates. Neither of the approaches discards any expected clone, even among as few as 58 ISUs found in five replicates (Fig. 6.16: Supplementary information). In intersection sets of increasingly numerous PCR replicates, the ISU diver-sity decreases while the numbers of reads per ISU increase for both correctly labelled and mistagged ISU (Fig. 6.17: Supplementary information). Only two replicates intersection sets are sufficient to remove all unknown sequences and on average over all mock communities, 87.4% of error ISUs (35.5% of reads). This proportion climbs to 93.3% (41.9% of reads) and to up to 96.8% (47.2% of reads) when intersection sets of three and five replicates are considered, respectively (Table 6.9: Supplementary information).

Our mistagging-based filter removes from the replicates of each sample the critical-mistag ISUs more efficiently than the ISUs assigned to the correct clones (Fig. 6.8a). However, it only discards 69.7% of the artefactual ISUs (Table 6.10: Supplementary information). Up to 42% of the correct ISUs sequenced in all five replicates remain in the five replicates after filtering, while the majority of the critical-mistag ISUs present in three, four or even in all five replicates are discarded from most of them. Interestingly, the correct ISUs occurring in only two replicates tend to be preserved from filtering in both of them. In general, few correct ISUs are filtered from all the replicates in which they occur, including for the ISUs exclusive to one PCR replicate. Surprisingly, the critical-mistag ISUs sequenced in only one

replicate are not filtered, because they are too rare to form cohorts of non-critical mistag for the filter to operate.

Based on filtered data, the exact composition of four samples can be recovered from the ISUs present in at least three replicates (Fig. 6.8b). In fact, after the removal of critical-mistag ISUs, the exact composition can be retrieved in 55.7% of all possible replicate intersections, including in 41.6% of two-replicates intersections and in 66.7% of all the four-replicates. Fi-nally, the observed relative abundances of the ISUs assigned to the clones constituting a given sample reflect the theoretical relative abundances and even more faithfully when only the ISUs shared across replicates are considered (Fig. 6.9).

6.6 Discussion

We showed that multiplexing of amplicon libraries for studying the diversity in metage-nomic samples is prone to intractable cross-contamination events due to the mistagging phe-nomenon. Single-tagging as well as saturated double-tagging strategies are flawed by numer-ous and undetectable critical mistags. We showed that non-combinatorial designs minimize the occurrence of critical mistags by eliciting the formation of unexpected combinations, but offer a poor multiplexing capacity. The LSD represents the optimal trade-off between the error minimisation ability of non-combinatorial designs (7) and the multiplexing capacity of SAD (Galan et al., 2010). LSD enforces non-critical mistags in designs defined according to a number of possible combinations and a number of samples, as shown with the comparison to the SAD at identical saturation levels. For example, by relying on a LSD involving 30 forward and 30 reverse primers used 10 times each, one could multiplex 300 samples (or 100 samples in triplicates) at a saturation level of 33% only. This provides a perfect framework for our filtering method based on non-critical mistags information, but also a considerable gain of time and money. Moreover, limiting the number of deployed tagged primers reduces tag misidentification problems through the design of highly variable tag sequences (Quail et al., 2014), as well as the risk of cross-contamination during handling.

The magnitude of the mistagging phenomenon by far exceeds the expectations of previ-ous studies relying on tagged primer constructs (Quail et al., 2014; Kircher et al., 2012). It has been proposed that one of the major sources of mistagging is primer cross-contamination (Degnan and Ochman, 2012; Kircher et al., 2012). However, we obtained no positives in 60-cycles PCR tests involving only one of two tagged primers in the mix, and in an

addi-12 34 5 0

hhhhl

hllllhhmllHhmlRandomEven

3 38 filtering and intersection sets of PCR replicates for mock community samples (SFA-125 and SFA-126). For each mock community (in lines) is indi-cated on the left side (a) the effect of the mistagging-based filter using barplots and on the right side (b) the clone di-versity retrieved in each repli-cate intersection set using five-way Venn diagrams. (a) The barplots indicate the fraction of the ISUs present in a given number of replicates (horizon-tal bars) that remain in 1 (blue), 2 (orange), 3 (green), 4 (red) or 5 (violet) replicates af-ter filaf-tering. The correct ISUs assigned to expected clones (right barplots) are separated from critical mistag ISUs (left barplots). (b) The differ-ences to the expected num-ber of clones are represented in each replicate intersection set by five-way Venn diagrams both before (raw) and after mistagging-based filtering (fil-tered). The numbers associ-ated with each Venn diagram indicate the minimum (green number) and maximum (red number) difference in the num-ber of clones recovered in the intersections, including unas-signed and artefactual ISUs.

The colour variation within each diagram is defined by these values reported on the diversity variation legend (up-per right).

−0.20 0 0.2 0.4 0.6 0.8 1 1.2

Relative template abundance Relative template abundance

Relative reads abundanceRelative reads abundance

0.249

0.001 0.002 0.992

0.475

0.0010.047 0.002 0.032 0.161 0.805

Figure 6.9: Abundances comparisons between mock community sequence templates and the resulting ISUs. Abundances are displayed for each clone, but separately for each of the 4 mock communities of SFA-125, including (a) hhhhl containing 4 clones at high abundance and 1 clone at low abundance, (b)hhmll containing 2 clones at high abundance, 1 clone at medium abundance and 2 clones at low abundance, (c) hllll containing 1 clone at high abundance and 4 clones at low abundance and (d) Hhml containing 1 clone at very high abundance, 1 clone at high abundance, 1 clone at medium abundance and 1 clone at low abundance (Table 6.2: Supplementary information). The clones (blue dots) and ISUs (red crosses) are organized in five columns according to the number of replicates intersection where it is found simultaneously. In each case, the exact value of the template abundance is located at the “3 replicates” position.

tional library containing more than 300 000 reads, we found only 0.096% of them labelled with at least one of nine tagged primers left untouched out of 40 ordered primers (data not shown). Therefore, the impact of primer cross-contaminations seems negligible and only vis-ible because of the sequencing depth of HTS. To some extent, especially purified primers can alleviate this source of mistagging, but at high expense for numerous samples, and even

without removing them entirely (Quail et al., 2014).

Our study shows clearly that the mistagging events mainly occur during the PCR per-formed on the pool of labelled amplicons. This is demonstrated by the fact that the clones contaminating a sample originate from the other samples multiplexed within the same library.

PCR-free library-preparation methods are promising, but necessitate high amounts of input DNA. This could be achieved by multiplexing more samples or non-homologous material such as a species genome or transcriptome (Kozich et al., 2013). It has been demonstrated that the frequency of chimera formation is inversely related to the complexity of the sequence sample subject to PCR (Fonseca et al., 2012). Therefore, multiplexing non-homologous PCR products prior to library-preparation PCR enhances the sequence diversity and reduces the impact of chimera, which is probably responsible for the recombination of fragments ends where the tags are located. Hence, with more complex environmental DNA samples it could be predicted that chimera-driven mistags might be less prominent. However, their appear-ances might range in the same amounts as chimeras usually witnessed in environmental samples. This may explain what we observed in the LSD library, where both foraminiferal and eukaryotic PCR products were multiplexed. Alternatively, pooling high quantities of PCR products can increase the amount of input DNA. However, this second solution is risky because PCR products are stable laboratory contaminants that can be readily discovered in HTS conditions (Orsi et al., 2013). Although appealing indexing library-preparation methods are flourishing, it is wise to label PCR products during the first amplification in order to be able to trace potential contaminants.

This first PCR enriches a specific diversity from complex samples, but also creates biases responsible for the inflation of diversity estimates (Engelbrektson et al., 2010; Logares et al., 2014) and the introduction of artefactual variability across samples (Schmidt et al., 2013). To correct such biases, internal controls such as co-sequenced mock community samples can be employed (Kozich et al., 2013), but their suitability depends on their complexity (Lee et al., 2012). Instead, our filtering method does not require the addition of a supplementary sample and rather relies directly on the properties of the data itself. Moreover, it is particularly suited to HTS, as its statistical power increases with the amount of sequence data. Indeed, a higher amount of non-critical mistags provides a finer resolution in the detection and removal of critical mistags.

Theoretically, each species genome template should produce exactly one ISU, including the polymorphic copies of a gene. Our filter operates at such a resolution because it is

ISU-centred, i.e. it computes the read abundance distribution across samples of each ISU independently. Hence, it does not rely on a unique abundance threshold applied over all samples but computes a different threshold for each ISU in each sample, accounting for dif-ferential sample sequencing depths (Adams et al., 2013a; Harris et al., 2010). Moreover, our filter requires no tuning of subjective parameters resulting in different sets of arbitrary thresholds (Bokulich et al., 2013; Caporaso et al., 2011). Being completely parameter-free, our approach has the utmost advantage of allowing the establishment of synoptic models towards more comparable diversity analyses.

The robustness of our approach is greatly reinforced by the incorporation of PCR repli-cates. As shown by our study, the replicates labelled using non-combinatorial tag pairs are less prone to cross-contamination by identical mistags or to the accumulation of random er-rors. Indeed, the probability of such co-occurring events corresponds to the product of the probabilities associated with each replicate. The importance of technical replicates has been emphasized for the filtering of erroneous sequences (Robasky et al., 2014; Preheim et al., 2013). One approach is to focus on the union of replicates, assuming that the full sam-ple comsam-plexity is missed by individual PCRs (Gonzalez et al., 2012) and because arbitrary abundance-based filtering can lead to removing many rare genuine species (Egge et al., 2013;

Zhan et al., 2013). Another approach is to analyse the diversity at the intersection of repli-cates, assuming that genuine species are detected in every PCR. Even with as few as 17% of diversity shared among replicates (Zhou et al., 2011), this conservative assumption has been corroborated previously (Morgan et al., 2013) and by our own results, although we cannot exclude false positive artefacts due to the size of our mock community samples. The incorpo-ration of PCR replicates in a multiplexing design is not trivial under this assumption since (i) more replicates may result in more mistags and (ii) the same chimeras are likely to happen across replicates since the initial sequence diversity is similar in the replicates of a sample (Fonseca et al., 2012; Turnbaugh et al., 2010). A trade-off between the number of samples and the amount of replication must be considered to ensure that rare species sequences re-main unfiltered from replicates. Finally, it should be noted that the same amount of caution towards mistagging and applying alleviating measures should also be taken into account for non-environmental studies pooling together the tagged specimens (Shokralla et al., 2014).

In conclusion, we propose a few recommendations to increase the accuracy of HTS data sets based on multiplexed amplicon libraries:

(i) Proscribe single-tagging and saturated double-tagging designs.

(ii) Choose tagged primer combinations according to LSD to maximize the mistagging information.

(iii) Minimize the sample saturation to reduce the proportion of critical mistags.

(iv) Incorporate at least two PCR replicates to remove error ISUs.

(v) Label PCR replicates with tagged primers used only once to avoid inter-replicate mistags.

(vi) Useparameter-free,data-driven and ISU-centred filtering approach.

(vii) Avoid long primer constructs for multi-species samples.

Some of these recommendations can be easily implemented. We provide a LSD generator to assist the design of double-tagging strategies and a filter accounting for mistagging pat-terns and PCR replicates. Our approach allows accurate HTS data denoising and preserves both the relative abundance and the occurrence of rare, genuine sequences templates. We are confident that associating robust experimental planning with powerful sequence-data filtering is thecondicio sine qua non of comprehensive surveys requiring the deployment of numerous samples and replicates.

6.7 Supplementary Information

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

unclassifi ed rhodophytes prymnesiophytes parabasalids opisthokonts heterokonts green plants cryptophytes centrohelids alveolates Rhizaria Lobosa Katablepharidophyta Fornicata FORAM Euglenozoans ENVIR BACT

*

0 500 1000 1500 2000 2500

FORAM_xenophyophores FORAM_unclassifi ed_Foraminiferida FORAM_environmental_samples FORAM_Textulariida FORAM_Spirillinida FORAM_Saccaminidae FORAM_Rzehakinidae FORAM_Rotaliina FORAM_Reticulomyxidae FORAM_Miliolida FORAM_Lituolida FORAM_Lagenina FORAM_Ammodiscus FORAM_Allogromida

*

Candidate primer sequences (5' - 3')

HPS number

B A

HPS number

Figure 6.10: Taxonomic specificity of the reverse foraminiferal primer s15. For each 20-nucleotide long candidate sequence, we show the results of extensive BLASTn searches against the NBCI’s nt database (see online methods). The taxonomy retrieved for each HSP is displayed both at the phylum level (A) and at the foraminifera level (B). The s15 primer covering most of the foraminiferal diversity while avoiding most of the other phyla is indicated by a star.

Samples SFA-125 AdapterLibrary barcode

Mock communityTagPrimer Replicates

SFA-120

A ...... Z A ...... 5Foram Euk

0 0 ForamEuk SFA-121SFA-122SFA-123SFA-124

Foram SFA-126

A B C ...

A B C ...

A B

A BTagging PCR

+ * ......

Single Double

Double

Experimental design Sample Pooling Library PCR MiSeq sequencingLibrary Pooling Run 1Run 2

Foram

Euk ForamForam

+++++ *****

Single-sequences Mock communities... A B C ...

Clones library AA00 Z

5

A Z

AZA5 A 5 Foram Euk Figure6.11:Experimentaldesignsandmolecularworkflow.WeusealibraryofSanger-sequencedclonestoprovideeitherSingle-sequencesamples orMockcommunitiessamplescorrespondingtosingle-sequencecloneampliconspooledincontrolledratios.ThesesamplesarelabelledbyPCR amplificationeitherusingoneoutofthetwoprimerstagged(Single)orthetwoprimerstagged(Double).Wedeployedthesetaggedprimers accordingtoeachExperimentaldesign,representedbythevectorsandmatrices(rows:forwardprimers,columns:reverseprimers).Thesamples labelledbythedeployedcombinationsoftaggedprimersareindicatedbothforsingle-sequencesamples(coloredblocks)andmockcommunities (blacksymbols).AfterthetaggingPCR,thelabelledsamplesarepooledinequimolarratios(Samplepooling)andaTruSeqNanosequencing library(fromSFA-120toSFA-126)ispreparedforeachpool(LibraryPCR).Theresultinglibrariesarethendistributedintwomixesasindicated (Librarypooling)andsequenced(MiSeqsequencing).

Table 6.1: Tag-to-sample compositions of the 7 multiplex-ing libraries distributed in two MiSeq runs. Each line cor-responds to a sample. The columns describe the run (1 or 2), the library (SFA-120 to -126), the tagged primers usage (single: one primer, double: both primers), the type of pro-tocol (forward: forward primer, reverse: reverse primer, non-combinatoric: double-tagging, SAD: Saturation Design, LSD:

Latin Square Design, all types starting with mock correspond to the replicates protocols), the forward and reverse primer combinations, and the assigned tracer clones.

Run Library Tagging Type Forward primer Reverse primer Clone

1 SFA-120 single forward V9F-A sBnew-0 euk24

V9F-B sBnew-0 euk18

V9F-C sBnew-0 euk9

V9F-D sBnew-0 euk1

V9F-E sBnew-0 euk1

V9F-F sBnew-0 euk1

V9F-G sBnew-0 euk7

V9F-H sBnew-0 euk8

V9F-I sBnew-0 euk10

V9F-J sBnew-0 euk9

V9F-1 sBnew-0 euk12

V9F-2 sBnew-0 euk11

V9F-3 sBnew-0 euk16

V9F-4 sBnew-0 euk22

V9F-5 sBnew-0 euk19

s14F1-A s15-0 foram12

s14F1-B s15-0 foram8

s14F1-C s15-0 foram29

s14F1-D s15-0 foram4

s14F1-E s15-0 foram19

s14F1-F s15-0 foram30

s14F1-G s15-0 foram3

s14F1-H s15-0 foram15

s14F1-I s15-0 foram17

s14F1-J s15-0 foram9

s14F1-K s15-0 foram5

s14F1-L s15-0 foram22

s14F1-M s15-0 foram34

s14F1-N s15-0 foram10

s14F1-O s15-0 foram43

s14F1-P s15-0 foram16

s14F1-Q s15-0 foram11

s14F1-R s15-0 foram21

s14F1-S s15-0 foram39

s14F1-T s15-0 foram40

s14F1-U s15-0 foram41

s14F1-V s15-0 foram42

s14F1-W s15-0 foram87

s14F1-X s15-0 foram47

s14F1-Y s15-0 foram49

s14F1-Z s15-0 foram84

2 SFA-121 single reverse V9F-0 sBnew-A euk24

V9F-0 sBnew-B euk18

V9F-0 sBnew-C euk9

V9F-0 sBnew-D euk1

V9F-0 sBnew-E euk1

V9F-0 sBnew-F euk1

V9F-0 sBnew-G euk7

V9F-0 sBnew-H euk8

V9F-0 sBnew-I euk10

V9F-0 sBnew-J euk9

V9F-0 sBnew-1 euk12

V9F-0 sBnew-2 euk11

V9F-0 sBnew-3 euk16

V9F-0 sBnew-4 euk22

V9F-0 sBnew-5 euk19

s14F1-0 s15-A foram12

s14F1-0 s15-B foram8

s14F1-0 s15-C foram29

s14F1-0 s15-D foram4

s14F1-0 s15-E foram19

s14F1-0 s15-F foram30

s14F1-0 s15-G foram3

s14F1-0 s15-H foram15

s14F1-0 s15-I foram17

s14F1-0 s15-J foram9

s14F1-0 s15-K foram5

s14F1-0 s15-L foram22

s14F1-0 s15-M foram34

s14F1-0 s15-N foram10

s14F1-0 s15-O foram43

s14F1-0 s15-P foram16

s14F1-0 s15-Q foram11

s14F1-0 s15-R foram21

s14F1-0 s15-S foram39

s14F1-0 s15-T foram40

s14F1-0 s15-U foram41

s14F1-0 s15-V foram42

s14F1-0 s15-W foram87

s14F1-0 s15-X foram47

s14F1-0 s15-Y foram49

s14F1-0 s15-Z foram84 1 SFA-122 double non-combinatoric V9F-A sBnew-A euk2

V9F-B sBnew-B euk3

V9F-C sBnew-C euk6

V9F-D sBnew-D euk13

V9F-E sBnew-E euk14

V9F-F sBnew-F euk17

V9F-G sBnew-G euk20

V9F-H sBnew-H euk21

V9F-I sBnew-I euk23

V9F-J sBnew-J euk25

V9F-1 sBnew-1 euk26

V9F-2 sBnew-2 euk27

V9F-3 sBnew-3 euk28

V9F-4 sBnew-4 euk29

V9F-5 sBnew-5 euk8

s14F1-A s15-A foram44

s14F1-B s15-B foram60

s14F1-C s15-C foram51

s14F1-D s15-D foram67

s14F1-E s15-E foram58

s14F1-F s15-F foram56

s14F1-G s15-G foram71

s14F1-H s15-H foram48

s14F1-I s15-I foram61

s14F1-J s15-J foram75

s14F1-K s15-K foram38

s14F1-L s15-L foram53

s14F1-M s15-M foram25

s14F1-N s15-N foram68

s14F1-O s15-O foram74

s14F1-O s15-O foram74