
VTAM: A robust pipeline for validating metabarcoding data using internal controls


Academic year: 2021


HAL Id: hal-03144831

https://hal-amu.archives-ouvertes.fr/hal-03144831

Preprint submitted on 17 Feb 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License

Aitor Gonzalez, Vincent Dubut, Emmanuel Corse, Reda Mekdad, Thomas Dechatre, Emese Meglécz

To cite this version:

Aitor Gonzalez, Vincent Dubut, Emmanuel Corse, Reda Mekdad, Thomas Dechatre, et al. VTAM: A robust pipeline for validating metabarcoding data using internal controls. 2021. ⟨hal-03144831⟩

Aitor González¹, Vincent Dubut², Emmanuel Corse³,⁴, Reda Mekdad¹,², Thomas Dechatre¹,² and Emese Meglécz²

1 Aix Marseille Univ, INSERM, TAGC, Turing Center for Living Systems, 13288 Marseille, France

2 Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France

3 Centre Universitaire de Mayotte, Route Nationale 3, BP 53, 97660 Dembeni, Mayotte, France

4 MARBEC, CNRS, Ifremer, IRD, University of Montpellier, Montpellier, France

Corresponding authors: Aitor González (aitor.gonzalez@univ-amu.fr) and Emese Meglécz (emese.meglecz@imbe.fr)

Running title: VTAM metabarcoding pipeline


Abstract

1. Metabarcoding studies should be carefully designed to minimize false positive and false negative occurrences. The use of internal controls, replicates, and several overlapping markers is expected to improve the bioinformatics data analysis.

2. VTAM is a tool to perform all steps of data curation from raw FASTQ data to a taxonomically assigned ASV (Amplicon Sequence Variant or simply variant) table. It addresses all known technical error types and includes other features rarely present in existing pipelines for validating metabarcoding data: filtering parameters are obtained from internal control samples; cross-sample contamination and tag-jump are controlled; technical replicates are used to ensure repeatability; it handles data obtained from several overlapping markers.

3. Two datasets were analysed by VTAM and the results were compared to those obtained with a pipeline based on DADA2. The false positive occurrences in samples were considerably higher when curated by DADA2, which is likely due to the lack of control for tag-jump and cross-sample contamination.

4. VTAM is a robust tool to validate metabarcoding data and improve traceability, reproducibility, and comparability between runs and datasets.

Keywords: metabarcoding, mock sample, negative control, replicates, taxonomic assignation, false positives, false negatives

1 Introduction

Metabarcoding has become a powerful approach to study biodiversity from environmental samples (including gut content or faecal samples). Metabarcoding, however, is prone to some pitfalls, and consequently, every metabarcoding study should be designed in a from-benchtop-to-desktop way (from sampling to data analysis) to minimize the bias of each step on the outcome (Alberdi, Aizpurua, Gilbert, & Bohmann, 2018; Cristescu & Hebert, 2018; Zinger et al., 2019). Several papers have called for good practice in study design, data production and analyses to ensure repeatability and comparability between studies. Notably, the importance of mock community samples, negative controls, and replicates is frequently highlighted (Alberdi et al., 2018; Bakker, 2018; Cristescu & Hebert, 2018; O’Rourke, Bokulich, Jusino, MacManes, & Foster, 2020). However, their use in bioinformatics pipelines is often limited to the verification of expectations.

In this study, we present the bioinformatics pipeline, VTAM (Validation and Taxonomic Assignation of Metabarcoding data) that effectively integrates negative controls, mock communities and technical replicates to control experimental fluctuations (e.g. sequencing depth, PCR stochasticity) and validate metabarcoding data.

A recent study on the effect of different steps of data curation on spatial partitioning of biodiversity listed the following potential problems: sequencing and PCR errors, presence of highly spurious sequences, chimeras, internal or external contamination and dysfunctional PCRs (Calderón‐Sanou, Münkemüller, Boyer, Zinger, & Thuiller, 2020). They showed that exhaustive curation and ensuring repeatability by technical replicates are necessary, especially for biodiversity measurements. Ideally, a metabarcoding workflow should address all of these technical errors. Existing tools, however, are either highly flexible but too complex or they do not include the curation of all potential biases (Mahé, Rognes, Quince, de Vargas, & Dunthorn, 2014; Boyer et al., 2016; Callahan et al., 2016; Edgar, 2016b; Rognes, Flouri, Nichols, Quince, & Mahé, 2016; Bolyen et al., 2019). The filtering steps of VTAM aim to address these points and include several additional features that are unique or rarely found in existing pipelines: (i) the use of internal controls and (ii) replicates to optimize filtering parameter values; (iii) the integration of multiple overlapping markers and (iv) filtration to address cross-sample contamination, including tag-jumps. Finally, VTAM is a variant-based filtering pipeline (such as other denoising methods: Callahan et al., 2016; Edgar, 2016b) that deals with amplicon sequence variants (ASVs).

2 Features

2.1 Implementation

VTAM is based on the method described in Corse et al. 2017. It is a command-line application that runs on Linux, MacOS or Windows Subsystem for Linux (WSL). VTAM is implemented in Python 3, using a Conda environment to ensure repeatability and easy installation of VTAM and these third-party applications: WopMars (https://wopmars.readthedocs.io), NCBI BLAST, Vsearch (Rognes et al., 2016), Cutadapt (Martin, 2011). Data is stored in an SQLite database that ensures traceability.

2.2 Workflow

Table 1 summarizes the different commands and steps of VTAM, their purpose and the related error types.

2.2.1 Pre-processing (optional)

An example of the data structure is illustrated in Fig. 1.

Paired-end FASTQ files are merged, reads are trimmed and demultiplexed according to forward and reverse tag combinations.

93

2.2.2 Fi l t er i ng 94

Dem ul t i pl exed r eads a r e der epl i cat ed and A S Vs ar e st or ed i n a n S Q Li t e dat abase.

95

Al l occur r ences ar e ch ar act er i zed by t hei r r ead count . 96

Fi l t erLFN: el i m i nat es occur r ences l i kel y due t o Lo w Fr eque ncy N o i se . Occur r ences 97

ar e f i l t er ed out i f t hey have l ow r ead cou nt s ( i ) i n absol ut e t er m s ( Ni j k i s sm al l , 98

wher e Ni j k i s t he r ead count of var i ant i i n sa m pl e j and r epl i cat e k) , ( i i ) com par ed 99

t o t he t ot al num ber of r eads of t he sam pl e -r e pl i cat e ( Ni j k/ Nj k) or ( i i i ) com par ed t o 100

t he t ot al num ber of r e ads of t he var i ant ( Ni j k/ Ni) . 101
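The three low-frequency-noise criteria can be illustrated with a minimal sketch. This is not VTAM's own code: the function name `filter_lfn`, the data layout (a dictionary keyed by variant, sample and replicate) and the threshold values are assumptions chosen only for illustration.

```python
from collections import defaultdict


def filter_lfn(counts, min_reads=10, min_sample_replicate_ratio=0.001,
               min_variant_ratio=0.001):
    """Keep occurrences that pass all three low-frequency-noise criteria.

    counts maps (variant, sample, replicate) -> read count N_ijk.
    The three thresholds are hypothetical illustration values.
    """
    reads_per_sample_replicate = defaultdict(int)  # N_jk
    reads_per_variant = defaultdict(int)           # N_i
    for (variant, sample, replicate), n in counts.items():
        reads_per_sample_replicate[(sample, replicate)] += n
        reads_per_variant[variant] += n

    kept = {}
    for (variant, sample, replicate), n in counts.items():
        if n <= 0 or n < min_reads:
            continue  # (i) too few reads in absolute terms
        if n / reads_per_sample_replicate[(sample, replicate)] < min_sample_replicate_ratio:
            continue  # (ii) too rare within its sample-replicate
        if n / reads_per_variant[variant] < min_variant_ratio:
            continue  # (iii) too rare relative to the variant's total reads
        kept[(variant, sample, replicate)] = n
    return kept
```

Under these assumptions, criterion (iii) is the one that targets occurrences of a variant that is abundant elsewhere in the run, which is the pattern expected from tag-jump or cross-sample contamination.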

FilterMinReplicateNumber: Occurrences are retained only if the ASV is present in at least a user-defined number of replicates.
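A short sketch of the replicate-consistency rule follows. The function name, the data layout and the default of two replicates are assumptions for illustration, not VTAM's API.

```python
from collections import defaultdict


def filter_min_replicate_number(counts, min_replicates=2):
    """Keep occurrences of variants seen in at least min_replicates replicates.

    counts maps (variant, sample, replicate) -> read count.
    min_replicates=2 is a hypothetical default.
    """
    replicates_seen = defaultdict(set)
    for (variant, sample, replicate), n in counts.items():
        if n > 0:
            replicates_seen[(variant, sample)].add(replicate)

    return {
        key: n
        for key, n in counts.items()
        if n > 0 and len(replicates_seen[(key[0], key[1])]) >= min_replicates
    }
```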

FilterPCRerror: ASVs with one difference from another ASV of the same sample are filtered out if the proportion of their read counts is below a user-defined threshold value.
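The idea behind this filter can be sketched as follows. The sketch makes a simplifying assumption that is not stated in the text: "one difference" is taken to mean a single substitution between sequences of equal length (a single-base indel would need an alignment step). The function names and the ratio threshold are hypothetical.

```python
def differs_by_one_substitution(a, b):
    """True if a and b have equal length and exactly one mismatching position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def filter_pcr_error(sample_counts, max_ratio=0.1):
    """Drop variants that look like PCR errors of a more abundant variant.

    sample_counts maps sequence -> read count within ONE sample.
    max_ratio is a hypothetical threshold on count(minor) / count(major).
    """
    flagged = set()
    for minor, n_minor in sample_counts.items():
        for major, n_major in sample_counts.items():
            if (n_major > n_minor
                    and differs_by_one_substitution(minor, major)
                    and n_minor / n_major < max_ratio):
                flagged.add(minor)
                break
    return {seq: n for seq, n in sample_counts.items() if seq not in flagged}
```

Note that a variant one mismatch away from an abundant variant is kept when it is itself abundant (ratio at or above the threshold), so closely related genuine variants are not removed by distance alone.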

FilterChimera runs the uchime3_denovo chimera filtering implemented in vsearch.

FilterRenkonen removes whole replicates that are too different compared to other replicates in the same sample.
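As an illustration of the replicate comparison, the Renkonen distance between two replicates can be computed as one minus the sum, over variants, of the smaller of the two relative abundances. The decision rule below (flag a replicate that is farther than a cutoff from every other replicate of the sample) and the cutoff value are assumptions made for this sketch; the text does not specify the exact rule VTAM applies.

```python
def renkonen_distance(a, b):
    """Renkonen distance between two replicates given as {variant: read count}."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each replicate needs at least one read")
    overlap = sum(
        min(a.get(v, 0) / total_a, b.get(v, 0) / total_b)
        for v in set(a) | set(b)
    )
    return 1.0 - overlap


def aberrant_replicates(replicates, max_distance=0.5):
    """Names of replicates farther than max_distance from every other replicate.

    replicates maps replicate name -> {variant: read count} for ONE sample.
    Both the rule and max_distance are hypothetical.
    """
    flagged = []
    for name, profile in replicates.items():
        distances = [
            renkonen_distance(profile, other)
            for other_name, other in replicates.items()
            if other_name != name
        ]
        if distances and min(distances) > max_distance:
            flagged.append(name)
    return sorted(flagged)
```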

FilterIndel and FilterCodonStop are intended to detect pseudogenes and should only be used for coding markers. FilterIndel eliminates all variants with aberrant length, i.e. variants whose length modulo three differs from that of the majority of variants. FilterCodonStop eliminates all variants that have a STOP codon in all reading frames of the direct strand.
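The two pseudogene checks can be sketched in a few lines. The stop-codon set below is the standard genetic code and is an assumption: a mitochondrial marker such as COI would need the stop codons of the appropriate mitochondrial code. Function names are hypothetical and sequences are assumed to be uppercase DNA.

```python
from collections import Counter

# Assumption: standard genetic code; replace with the code suited to the marker.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_in_frame(seq, frame):
    """True if reading frame `frame` (0, 1 or 2) of the direct strand has a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS for i in range(frame, len(seq) - 2, 3))


def filter_codon_stop(variants):
    """Drop variants with a stop codon in all three direct-strand reading frames."""
    return [s for s in variants
            if not all(has_stop_in_frame(s, frame) for frame in range(3))]


def filter_indel(variants):
    """Keep variants whose length modulo 3 matches that of the majority."""
    if not variants:
        return []
    majority_mod = Counter(len(s) % 3 for s in variants).most_common(1)[0][0]
    return [s for s in variants if len(s) % 3 == majority_mod]
```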

The output of the filters is an ASV table with validated variants in lines, samples in columns and the sum of read counts over replicates in the cells.

2.2.3 Taxonomic assignation

Taxonomic assignation is based on the Lowest Taxonomic Group method described in detail in Supporting Information 1. The taxonomic reference database has a BLAST format with taxonomic identifiers so that custom databases or the complete NCBI nucleotide database can be used by VTAM. A custom taxonomic reference database of COI sequences mined from NCBI nucleotide and BOLD (https://www.boldsystems.org/) databases is available with the program.

2.2.4 Parameter optimization

Users should first identify expected and unexpected occurrences based on the first filtration with default parameters. The optimization step will guide users to choose parameter values that maximize the number of expected occurrences in the dataset and minimize the number of unexpected occurrences (false positives). Parameters are optimized for the three LFN filters and the FilterPCRerror. Optimized parameters can then be used to repeat the filtering steps.
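The trade-off described above can be illustrated with a toy threshold scan on control samples. This sketch is not VTAM's `optimize` command: it handles a single absolute read-count threshold, and the function names, the data layout and the tie-breaking rule are assumptions made only for illustration.

```python
def scan_threshold(counts, expected, candidates):
    """Tabulate retained expected and unexpected occurrences per threshold.

    counts maps an occurrence, e.g. (variant, sample), to its read count,
    restricted to control samples (mock and negative controls).
    expected is the set of occurrences known to be real.
    Returns a list of (threshold, kept_expected, kept_unexpected).
    """
    rows = []
    for threshold in candidates:
        kept = {occ for occ, n in counts.items() if n >= threshold}
        rows.append((threshold, len(kept & expected), len(kept - expected)))
    return rows


def best_threshold(rows):
    """Prefer most expected kept, then fewest unexpected, then the lowest threshold."""
    return max(rows, key=lambda row: (row[1], -row[2], -row[0]))[0]
```

With controls in which the weakest expected occurrence has only slightly more reads than the strongest unexpected one, the scan leaves a narrow but usable window of thresholds, which is why the composition of the mock sample matters.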

2.2.5 Pool runs/markers

A run is FASTQ data from a sequencing run and a marker is a region of a locus amplified by a primer pair. The pool command produces an ASV table with any number of run-marker combinations. When more than one overlapping marker is used, ASVs that are identical in their overlapping parts are pooled to the same line.

3 Benchmarking

VTAM was tested with two published metabarcoding datasets: a fish dataset obtained from fish faecal samples (Corse et al., 2017), and a bat dataset obtained from bat guano samples (Galan et al., 2018). Both datasets included negative controls, mock samples and three PCR replicates. A fragment of the COI gene was amplified using two overlapping markers in the fish dataset, and one in the bat dataset (see details in the original studies).

Both datasets were analysed by VTAM. The fish dataset was analysed separately for the two markers and the results of both markers were pooled together.

Both datasets were also analysed with the DADA2 denoising algorithm (Callahan et al., 2016), one of the most widely used methods for metabarcoding data curation. The output of DADA2 was filtered by LULU (Frøslev et al., 2017) to further eliminate probable false positive occurrences. Then the three replicates of each sample were pooled (as in VTAM), only accepting the occurrence if it was present in at least two replicates (Supporting information 2).

We compared the α-diversity and β-diversity obtained for the environmental samples to address the effect of the curation pipelines on diversity estimations. α-diversity was estimated using both ASV richness and cluster richness (clusters aggregate ASVs with < 3% divergence), and β-diversity was summarized using the Bray-Curtis pairwise dissimilarity index (Supporting information 3).
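For reference, the Bray-Curtis dissimilarity between two samples is one minus twice the sum of the smaller read count per variant, divided by the total read count of both samples. The sketch below is a plain-Python illustration with a hypothetical function name and data layout; the actual protocol used for the paper is the one given in Supporting information 3.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as {variant: read count}."""
    shared = sum(min(a.get(v, 0), b.get(v, 0)) for v in set(a) | set(b))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("at least one sample must contain reads")
    return 1.0 - 2.0 * shared / total
```

The index is 0 for two samples with identical read counts and 1 for samples that share no variant, so unfiltered contaminants shared across samples pull it downward.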

In the fish dataset, all expected variants in the mock samples were validated by both pipelines. However, in the bat dataset, two expected variants had very low read abundance (2-18 reads/replicate), which were in the range of the number of reads in the negative controls (ten out of the 19 negative controls had at least one read count greater than 18). Therefore, we ignored these two expected variants in the Bulk France mock sample, and we optimized the VTAM parameters to retain all other expected occurrences.

After filtering with VTAM, the number of false positives in the mock samples was markedly lower than with DADA2 (Table 2). Similarly, ASV and cluster richness were on average two times lower with VTAM than with DADA2 in environmental samples (Fig. 2A and B). In contrast, dissimilarities between samples were higher with VTAM (Fig. 2D). In both pipelines, most clusters contained a single ASV (Supporting information 3; Fig. 2C).

4 Discussion

Metabarcoding is known to be prone to two types of errors: false negatives and false positives. Based on controls (negative and mock samples), VTAM aims to find a compromise between these two error types by minimizing false positive occurrences while retaining expected variants in mock samples to avoid false negatives. Therefore, the mock samples should contain both well and weakly amplified taxa, where the abundance, i.e. the number of reads, of weakly amplified taxa is marginally higher than that found in negative samples. This should ensure finding filtering parameter values that simultaneously minimize false positives and false negatives. Additionally, in large-scale studies with more than one sequencing run, the use of identical mock samples in all runs can ensure comparability among runs if they consistently yield the same results.

The use of technical replicates is another important tool to limit false positives and false negatives (Alberdi et al., 2018; Corse et al., 2017). False positives can be strongly reduced by only accepting variants in a sample if they are present in at least a certain number of replicates. This strategy is strongly advised to reduce experimental stochasticity and validate ASV occurrences. Furthermore, removing replicates with radically different compositions (Renkonen filter) further reduces the effect of experimental stochasticity (De Barba et al., 2014). Additionally, false negatives can be further reduced by amplifying several markers (Corse et al., 2019). If the different markers overlap, VTAM can pool sequences that are identical in their overlapping regions. This integrates the results of different markers unambiguously.

While false positive occurrences due to sequencing and PCR errors are generally well detected by denoising pipelines such as DADA2, tag-jump and cross-sample contamination are rarely taken into account (but see Boyer et al., 2016; Edgar, 2016a). However, failing to filter out these artefacts is likely to inflate false positive occurrences and artificially increase inter-sample similarities. In fact, the DADA2-based pipeline produced ASV and cluster richness per sample that was on average twice as high as with VTAM and even higher for some samples (Fig. 2A, B). On the other hand, dissimilarities between samples were lower after DADA2 filtration. Additionally, the near 1:1 correlation between ASV and cluster richness in both pipelines indicated that most clusters contained just one ASV per sample. This supports the notion that diversity inflation in DADA2 resulted from unfiltered tag-jump contaminations rather than PCR or sequencing errors, as the latter would have produced more ASVs that belong to the same cluster. Our VTAM pipeline, therefore, appears more appropriate for comparing the diversity between samples and for investigating the biological responses to environmental change.

5 Conclusions

The VTAM metabarcoding pipeline aims to address known technical errors during data analysis (Table 1) to validate metabarcoding data. It is a complete pipeline from raw FASTQ data to curated ASV tables with taxonomic assignments.

The implementation of VTAM provides several advantages such as using a Conda environment to facilitate the installation, data storage in an SQLite database for traceability and the possibility to run one or several sequencing run-marker combinations using the same command. VTAM includes features rarely considered in most metabarcoding pipelines, and we believe it provides a useful tool for the analysis and validation of metabarcoding data for conducting robust analyses of biodiversity.

Acknowledgements

We thank Diane Zarzoso-Lacoste and Samanta Ortuno Miguel for valuable comments on the use of VTAM, Luc Giffon and Lionel Spinelli for the development of Wopmars and Kurt Villsen for English editing. Centre de Calcul Intensif d’Aix-Marseille is acknowledged for granting access to its high performance computing resources. This work is a contribution to the European project SEAMoBB, funded by ERA-Net Mar-TERA and managed by ANR (number ANR_17_MART-0001_01).

Authors’ contributions

EM, EC, VD conceived the ideas and designed the methodology. EM and AG conceived the software architecture and tested the VTAM. AG, TD and RM developed the VTAM software; AG contributed to the WopMars software development. EM, AG, VD and EC wrote the manuscript. All authors contributed critically to the draft and approved the final version of the manuscript.

Data Availability

VTAM is available at https://github.com/aitgon/vtam. A detailed user manual is found at https://vtam.readthedocs.io.

Empirical data used in this paper are available from the Dryad Digital Repository https://datadryad.org/stash/dataset/doi:10.5061/dryad.f40v5 and https://datadryad.org/stash/dataset/doi:10.5061/dryad.kv02g.

References

Alberdi, A., Aizpurua, O., Gilbert, M. T. P., & Bohmann, K. (2018). Scrutinizing key steps for reliable metabarcoding of environmental samples. Methods in Ecology and Evolution, 9(1), 134–147. doi:10.1111/2041-210X.12849

Bakker, M. G. (2018). A fungal mock community control for amplicon sequencing experiments. Molecular Ecology Resources, 18(3), 541–556. doi:10.1111/1755-0998.12760

Bolyen, E., Rideout, J. R., Dillon, M. R., Bokulich, N. A., Abnet, C. C., Al-Ghalith, G. A., … Caporaso, J. G. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852–857. doi:10.1038/s41587-019-0209-9

Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: a unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176–182. doi:10.1111/1755-0998.12428

Calderón‐Sanou, I., Münkemüller, T., Boyer, F., Zinger, L., & Thuiller, W. (2020). From environmental DNA sequences to ecological conclusions: How strong is the influence of methodological choices? Journal of Biogeography, 47(1), 193–206. doi:10.1111/jbi.13681

Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. doi:10.1038/nmeth.3869

Corse, E., Meglécz, E., Archambaud, G., Ardisson, M., Martin, J.-F., Tougard, C., … Dubut, V. (2017). A from-benchtop-to-desktop workflow for validating HTS data and for taxonomic identification in diet metabarcoding studies. Molecular Ecology Resources, 17(6), e146–e159. doi:10.1111/1755-0998.12703

Corse, E., Tougard, C., Archambaud‐Suard, G., Agnèse, J.-F., Mandeng, F. D. M., Bilong, C. F. B., … Dubut, V. (2019). One-locus-several-primers: A strategy to improve the taxonomic and haplotypic coverage in diet metabarcoding studies. Ecology and Evolution, 9(8), 4603–4620. doi:10.1002/ece3.5063

Cristescu, M. E., & Hebert, P. D. N. (2018). Uses and Misuses of Environmental DNA in Biodiversity Science and Conservation. Annual Review of Ecology, Evolution, and Systematics, 49(1), 209–230. doi:10.1146/annurev-ecolsys-110617-062306

De Barba, M., Miquel, C., Boyer, F., Mercier, C., Rioux, D., Coissac, E., & Taberlet, P. (2014). DNA metabarcoding multiplexing and validation of data accuracy for diet assessment: application to omnivorous diet. Molecular Ecology Resources, 14(2), 306–323. doi:10.1111/1755-0998.12188

Edgar, R. C. (2016a). UNCROSS: Filtering of high-frequency cross-talk in 16S amplicon reads. BioRxiv, 088666. doi:10.1101/088666

Edgar, R. C. (2016b). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. BioRxiv, 081257. doi:10.1101/081257

Frøslev, T. G., Kjøller, R., Bruun, H. H., Ejrnæs, R., Brunbjerg, A. K., Pietroni, C., & Hansen, A. J. (2017). Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. Nature Communications, 8(1), 1–11. doi:10.1038/s41467-017-01312-x

Galan, M., Pons, J.-B., Tournayre, O., Pierre, É., Leuchtmann, M., Pontier, D., & Charbonnel, N. (2018). Metabarcoding for the parallel identification of several hundred predators and their prey: Application to bat species diet analysis. Molecular Ecology Resources, 18(3), 474–489. doi:10.1111/1755-0998.12749

Mahé, F., Rognes, T., Quince, C., de Vargas, C., & Dunthorn, M. (2014). Swarm: robust and fast clustering method for amplicon-based studies. PeerJ, 2, e593. doi:10.7717/peerj.593

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal, 17(1), 10–12. doi:10.14806/ej.17.1.200

O’Rourke, D. R., Bokulich, N. A., Jusino, M. A., MacManes, M. D., & Foster, J. T. (2020). A total crapshoot? Evaluating bioinformatic decisions in animal diet metabarcoding analyses. Ecology and Evolution, 10(18), 9721–9739. doi:10.1002/ece3.6594

Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. doi:10.7717/peerj.2584

Zinger, L., Bonin, A., Alsos, I. G., Bálint, M., Bik, H., Boyer, F., … Taberlet, P. (2019). DNA metabarcoding—Need for robust experimental designs to draw sound ecological conclusions. Molecular Ecology, 28(8), 1857–1862. doi:10.1111/mec.15060

Figures and tables

Figure 1. An example of a data structure with one run, two markers and three replicates for each sample. S1-R1: Replicate 1 of Sample 1. Replicates are not essential but strongly recommended. Samples should include at least one mock sample and one negative control.

Figure 2. Diversity estimates from the fish and bat datasets, based on the VTAM and DADA2-based pipelines. A) ASV richness per sample. B) Cluster richness per sample. C) The correlation between ASV and cluster richness. P-value indicates a significant slope difference between the two pipelines. D) β-diversity was estimated using the Bray-Curtis dissimilarity index calculated for each pairwise sample comparison. Solid lines indicate linear regression lines, hatched lines are the 1:1 reference lines.

Table 1. List of VTAM commands and their roles.

| VTAM command | VTAM step (Name in Corse et al. 2017) | Role | Error type |
| merge | | Merges paired-end reads and quality filtering | Sequencing errors |
| sortreads | | Assigns reads to samples | Sequencing errors |
| filter | Dereplicate | Dereplicates | |
| filter | Delete singletons | Deletes singletons | Sequencing errors, highly spurious sequences |
| filter | LFN_variant filter (LFNtag) | Deletes low frequency errors | Tag jump, inter-sample contamination |
| filter | LFN_read_count filter (LFNneg) | Deletes low frequency errors | Sequencing error, light contamination |
| filter | LFN_sample_replicate filter (LFNpos) | Deletes low frequency errors | Sequencing error, light contamination |
| filter | FilterMinReplicateNumber | Ensures consistency between replicates | PCR heterogeneity |
| filter | FilterPCRerror (Obliclean) | Eliminates PCR errors (even if frequent) | PCR errors |
| filter | FilterChimera | Eliminates chimeras | Chimeras |
| filter | FilterRenkonen | Eliminates aberrant replicates | Dysfunctional PCRs |
| filter | FilterIndel (Pseudogene filter) | Eliminates pseudogenes | Pseudogenes, spurious sequences |
| filter | FilterCodonStop (Pseudogene filter) | Eliminates pseudogenes | Pseudogenes, spurious sequences |
| taxassign | (LTG) | Assigns variants to taxa | Highly spurious sequences |
| optimize | OptimizeLFNsampleReplicate | Finds the optimal parameter for the LFN-sample-replicate filter | |
| optimize | OptimizePCRerror | Finds the optimal parameter for FilterPCRerror | |
| optimize | OptimizeLFNreadCountAndLFNvariant | Finds the optimal value for LFN-read-count and LFN-variant filters | |
| pool | | Pools the results from different runs/markers | |

Table 2. Number of false positive occurrences compared to the total number of occurrences. In negative control and mock samples, the count of false positives is precise, since the sample composition is known.

| | VTAM Fish | DADA Fish | VTAM Bat | DADA Bat |
| Negative controls | 0/0 (0%) | 32/32 (100%) | 2/2 (100%) | 19/19 (100%) |
| Mock samples | 5/17 (29%) | 37/49 (75%) | 22/61 (36%) | 73/114 (65%) |

Supporting Information

SuppInfo1.pdf
Description of the taxonomic assignation and its custom database.

SuppInfo2.pdf
Commands, user input files, and the final ASV tables produced by VTAM and the DADA based pipeline for the fish and the bat datasets.

SuppInfo3.pdf
Diversity estimation protocol
