HAL Id: hal-03144831
https://hal-amu.archives-ouvertes.fr/hal-03144831
Preprint submitted on 17 Feb 2021
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License
Aitor Gonzalez, Vincent Dubut, Emmanuel Corse, Reda Mekdad, Thomas Dechatre, Emese Meglécz
To cite this version:
Aitor Gonzalez, Vincent Dubut, Emmanuel Corse, Reda Mekdad, Thomas Dechatre, et al. VTAM: A robust pipeline for validating metabarcoding data using internal controls. 2021. hal-03144831
VTAM: A robust pipeline for validating metabarcoding data using internal controls

Aitor González1, Vincent Dubut2, Emmanuel Corse3,4, Reda Mekdad1,2, Thomas Dechatre1,2 and Emese Meglécz2

1 Aix Marseille Univ, INSERM, TAGC, Turing Center for Living Systems, 13288 Marseille, France
2 Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France
3 Centre Universitaire de Mayotte, Route Nationale 3, BP 53, 97660 Dembeni, Mayotte, France
4 MARBEC, CNRS, Ifremer, IRD, University of Montpellier, Montpellier, France

Corresponding authors: Aitor González (aitor.gonzalez@univ-amu.fr) and Emese Meglécz (emese.meglecz@imbe.fr)
Running title: VTAM metabarcoding pipeline
Abstract
1. Metabarcoding studies should be carefully designed to minimize false positive and false negative occurrences. The use of internal controls, replicates, and several overlapping markers is expected to improve the bioinformatics data analysis.
2. VTAM is a tool to perform all steps of data curation from raw FASTQ data to a taxonomically assigned ASV (Amplicon Sequence Variant or simply variant) table. It addresses all known technical error types and includes other features rarely present in existing pipelines for validating metabarcoding data: filtering parameters are obtained from internal control samples; cross-sample contamination and tag-jump are controlled; technical replicates are used to ensure repeatability; it handles data obtained from several overlapping markers.
3. Two datasets were analysed by VTAM and the results were compared to those obtained with a pipeline based on DADA2. The false positive occurrences in samples were considerably higher when curated by DADA2, which is likely due to the lack of control for tag-jump and cross-sample contamination.
4. VTAM is a robust tool to validate metabarcoding data and improve traceability, reproducibility, and comparability between runs and datasets.

Keywords: metabarcoding, mock sample, negative control, replicates, taxonomic assignation, false positives, false negatives
1 Introduction
Metabarcoding has become a powerful approach to study biodiversity from environmental samples (including gut content or faecal samples). Metabarcoding, however, is prone to some pitfalls, and consequently, every metabarcoding study should be designed in a from-benchtop-to-desktop way (from sampling to data analysis) to minimize the bias of each step on the outcome (Alberdi, Aizpurua, Gilbert, & Bohmann, 2018; Cristescu & Hebert, 2018; Zinger et al., 2019). Several papers have called for good practice in study design, data production and analyses to ensure repeatability and comparability between studies. Notably, the importance of mock community samples, negative controls, and replicates is frequently highlighted (Alberdi et al., 2018; Bakker, 2018; Cristescu & Hebert, 2018; O’Rourke, Bokulich, Jusino, MacManes, & Foster, 2020). However, their use in bioinformatics pipelines is often limited to the verification of expectations.
In this study, we present the bioinformatics pipeline VTAM (Validation and Taxonomic Assignation of Metabarcoding data), which effectively integrates negative controls, mock communities and technical replicates to control experimental fluctuations (e.g. sequencing depth, PCR stochasticity) and validate metabarcoding data.
A recent study on the effect of different steps of data curation on spatial partitioning of biodiversity listed the following potential problems: sequencing and PCR errors, presence of highly spurious sequences, chimeras, internal or external contamination and dysfunctional PCRs (Calderón-Sanou, Münkemüller, Boyer, Zinger, & Thuiller, 2020). They showed that exhaustive curation and ensuring repeatability by technical replicates are necessary, especially for biodiversity measurements. Ideally, a metabarcoding workflow should address all of these technical errors. Existing tools, however, are either highly flexible but too complex or they do not include the curation of all potential biases (Mahé, Rognes, Quince, de Vargas, & Dunthorn, 2014; Boyer et al., 2016; Callahan et al., 2016; Edgar, 2016b; Rognes, Flouri, Nichols, Quince, & Mahé, 2016; Bolyen et al., 2019). The filtering steps of VTAM aim to address these points and include several additional features that are unique or rarely found in existing pipelines: (i) the use of internal controls and (ii) replicates to optimize filtering parameter values; (iii) the integration of multiple overlapping markers and (iv) filtration to address cross-sample contamination, including tag-jumps. Finally, VTAM is a variant-based filtering pipeline (such as other denoising methods: Callahan et al., 2016; Edgar, 2016b) that deals with amplicon sequence variants (ASVs).
2 Features
2.1 Implementation
VTAM is based on the method described in Corse et al. 2017. It is a command-line application that runs on Linux, MacOS or Windows Subsystem for Linux (WSL). VTAM is implemented in Python 3, using a Conda environment to ensure repeatability and easy installation of VTAM and these third-party applications: WopMars (https://wopmars.readthedocs.io), NCBI BLAST, Vsearch (Rognes et al., 2016), Cutadapt (Martin, 2011). Data is stored in an SQLite database that ensures traceability.
2.2 Workflow
Table 1 summarizes the different commands and steps of VTAM, their purpose and the related error types.
2.2.1 Pre-processing (optional)
An example of the data structure is illustrated in Fig. 1.
Paired-end FASTQ files are merged, reads are trimmed and demultiplexed according to forward and reverse tag combinations.
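The demultiplexing idea can be illustrated with a minimal Python sketch. It is not VTAM's implementation (Table 1 lists a dedicated sortreads command for this step). The function name, the fixed tag length, and the assumption that each merged read starts with the forward tag and ends with the reverse tag are all hypothetical simplifications.

```python
def demultiplex(reads, tag_pairs, tag_len):
    """Assign each read to a sample-replicate from its (forward, reverse) tag pair.

    reads: iterable of merged read sequences (str).
    tag_pairs: dict mapping (forward_tag, reverse_tag) -> sample-replicate label.
    tag_len: length of every tag (assumed constant in this sketch).
    Returns a dict label -> list of reads with both tags trimmed off.
    """
    assigned = {}
    for read in reads:
        if len(read) <= 2 * tag_len:
            continue  # too short to carry two tags and an insert
        key = (read[:tag_len], read[-tag_len:])
        label = tag_pairs.get(key)
        if label is None:
            continue  # unknown tag combination: the read is discarded
        assigned.setdefault(label, []).append(read[tag_len:-tag_len])
    return assigned
```

Reads carrying a tag combination that was not used in the experiment are simply dropped here; reads with a valid but wrong combination (tag-jumps) cannot be detected at this stage and are left to the filtering steps.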
2.2.2 Filtering
Demultiplexed reads are dereplicated and ASVs are stored in an SQLite database.
All occurrences are characterized by their read count.
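As an illustration of dereplication, the sketch below collapses identical reads within each sample-replicate into variants with a read count. It keeps the result in a plain dictionary rather than an SQLite database, and the function name and data layout are assumptions made for this example only.

```python
from collections import Counter


def dereplicate(reads_by_sample_replicate):
    """Collapse identical reads into variants with read counts.

    reads_by_sample_replicate: dict (sample, replicate) -> list of sequences.
    Returns a dict (variant, sample, replicate) -> read count.
    """
    occurrences = {}
    for (sample, replicate), reads in reads_by_sample_replicate.items():
        for variant, count in Counter(reads).items():
            occurrences[(variant, sample, replicate)] = count
    return occurrences
```

Each key of the returned dictionary corresponds to one occurrence, i.e. one variant observed in one sample-replicate.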
FilterLFN: eliminates occurrences likely due to Low Frequency Noise. Occurrences are filtered out if they have low read counts (i) in absolute terms (N_ijk is small, where N_ijk is the read count of variant i in sample j and replicate k), (ii) compared to the total number of reads of the sample-replicate (N_ijk/N_jk) or (iii) compared to the total number of reads of the variant (N_ijk/N_i).
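The three low-frequency criteria can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not VTAM's code: the function and threshold names are hypothetical, and an occurrence is kept only if it passes all three criteria, with N_jk and N_i computed from the unfiltered input.

```python
from collections import defaultdict


def filter_lfn(occurrences, min_reads, min_prop_sample_replicate, min_prop_variant):
    """Keep occurrences that pass all three low-frequency-noise criteria.

    occurrences: dict (variant i, sample j, replicate k) -> read count N_ijk.
    """
    n_jk = defaultdict(int)  # total reads per sample-replicate
    n_i = defaultdict(int)   # total reads per variant
    for (variant, sample, replicate), count in occurrences.items():
        n_jk[(sample, replicate)] += count
        n_i[variant] += count

    kept = {}
    for (variant, sample, replicate), count in occurrences.items():
        if count < min_reads:  # (i) absolute read count N_ijk
            continue
        if count / n_jk[(sample, replicate)] < min_prop_sample_replicate:  # (ii) N_ijk/N_jk
            continue
        if count / n_i[variant] < min_prop_variant:  # (iii) N_ijk/N_i
            continue
        kept[(variant, sample, replicate)] = count
    return kept
```

Criterion (iii) is the one that targets tag-jumps and cross-sample contamination: a variant that is abundant overall but appears with only a handful of reads in one sample-replicate has a small N_ijk/N_i there.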
FilterMinReplicateNumber: Occurrences are retained only if the ASV is present in at least a user-defined number of replicates.
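A minimal sketch of the replicate-consistency rule, assuming the same hypothetical occurrence dictionary keyed by (variant, sample, replicate); the default of two replicates is only an example value.

```python
from collections import defaultdict


def filter_min_replicate_number(occurrences, min_replicates=2):
    """Keep a variant in a sample only if it occurs in enough replicates.

    occurrences: dict (variant, sample, replicate) -> read count.
    """
    replicates_seen = defaultdict(set)
    for (variant, sample, replicate) in occurrences:
        replicates_seen[(variant, sample)].add(replicate)
    return {
        key: count
        for key, count in occurrences.items()
        if len(replicates_seen[(key[0], key[1])]) >= min_replicates
    }
```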
FilterPCRerror: ASVs with one difference from another ASV of the same sample are filtered out if the ratio of their read count to that of the other ASV is below a user-defined threshold value.
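The rule can be sketched as below. The sketch makes two simplifying assumptions that are not taken from the text: "one difference" is restricted to a single substitution between sequences of equal length, and read counts are given per sample. The function and parameter names are hypothetical.

```python
def one_substitution_apart(seq_a, seq_b):
    """True if two equal-length sequences differ at exactly one position."""
    if len(seq_a) != len(seq_b):
        return False
    return sum(a != b for a, b in zip(seq_a, seq_b)) == 1


def filter_pcr_error(sample_counts, ratio_threshold):
    """Drop variants that look like PCR errors of a more abundant variant.

    sample_counts: dict variant sequence -> read count (> 0) in one sample.
    A variant is removed if it is one substitution away from another variant
    and its read count divided by that variant's count is below the threshold.
    """
    kept = {}
    for variant, count in sample_counts.items():
        is_error = any(
            one_substitution_apart(variant, other)
            and count / other_count < ratio_threshold
            for other, other_count in sample_counts.items()
            if other != variant
        )
        if not is_error:
            kept[variant] = count
    return kept
```

Because the test is a ratio between two variants, an erroneous variant can be removed even when its own read count is high, provided its parent variant is far more abundant.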
FilterChimera runs the uchime3_denovo chimera filtering implemented in vsearch.
FilterRenkonen removes whole replicates that are too different compared to other replicates in the same sample.
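To make the replicate comparison concrete, the sketch below computes the Renkonen distance between two replicates, defined as one minus the sum over variants of the smaller of the two read proportions. The decision rule used to flag a replicate (distance above a cutoff to more than half of the other replicates) is an assumption of this example and may differ from the rule applied by VTAM.

```python
def renkonen_distance(counts_a, counts_b):
    """Renkonen distance: 1 - sum over variants of min(p_a, p_b).

    counts_a, counts_b: dict variant -> read count for two non-empty replicates.
    """
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    shared = set(counts_a) & set(counts_b)
    similarity = sum(
        min(counts_a[v] / total_a, counts_b[v] / total_b) for v in shared
    )
    return 1.0 - similarity


def aberrant_replicates(replicates, cutoff):
    """Return replicates whose distance exceeds cutoff to most other replicates.

    replicates: dict replicate label -> dict variant -> read count (one sample).
    The "more than half of the other replicates" rule is an assumption.
    """
    flagged = []
    for name, counts in replicates.items():
        others = [c for other, c in replicates.items() if other != name]
        too_far = sum(renkonen_distance(counts, c) > cutoff for c in others)
        if others and too_far > len(others) / 2:
            flagged.append(name)
    return flagged
```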
FilterIndel and FilterCodonStop are intended to detect pseudogenes and should only be used for coding markers. FilterIndel eliminates all variants with aberrant length, i.e. variants whose length modulo three differs from that of the majority of variants. FilterCodonStop eliminates all variants that have a STOP codon in all reading frames of the direct strand.
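Both pseudogene filters can be sketched in a few lines of Python. The function names are hypothetical, and the default set of stop codons is that of the standard genetic code, which is an assumption: mitochondrial markers such as COI use other genetic codes, so the set is exposed as a parameter.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code; an assumption here


def filter_indel(variants):
    """Keep variants whose length modulo 3 matches the most common value."""
    if not variants:
        return []
    majority_mod = Counter(len(v) % 3 for v in variants).most_common(1)[0][0]
    return [v for v in variants if len(v) % 3 == majority_mod]


def has_stop_in_frame(sequence, frame, stop_codons=STOP_CODONS):
    """True if a stop codon occurs in the given reading frame (0, 1 or 2)."""
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    return any(codon in stop_codons for codon in codons)


def filter_codon_stop(variants, stop_codons=STOP_CODONS):
    """Drop variants with a stop codon in all three frames of the direct strand."""
    return [
        v for v in variants
        if not all(has_stop_in_frame(v.upper(), f, stop_codons) for f in range(3))
    ]
```

A variant is kept as soon as one reading frame is free of stop codons, since the true reading frame of the amplicon is not assumed to be known.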
The output of the filters is an ASV table with validated variants in rows, samples in columns and the sum of read counts over replicates in the cells.
2.2.3 Taxonomic assignation
Taxonomic assignation is based on the Lowest Taxonomic Group method described in detail in Supporting Information 1. The taxonomic reference database has a BLAST format with taxonomic identifiers so that custom databases or the complete NCBI nucleotide database can be used by VTAM. A custom taxonomic reference database of COI sequences mined from NCBI nucleotide and BOLD (https://www.boldsystems.org/) databases is available with the program.
2.2.4 Parameter optimization
Users should first identify expected and unexpected occurrences based on the first filtration with default parameters. The optimization step will guide users to choose parameter values that maximize the number of expected occurrences in the dataset and minimize the number of unexpected occurrences (false positives). Parameters are optimized for the three LFN filters and the FilterPCRerror. Optimized parameters can then be used to repeat the filtering steps.
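One way to picture the optimization is to ask, for a single filter, which is the strictest threshold that still retains every expected occurrence of the mock samples. The sketch below does this for the N_ijk/N_jk criterion only. It is an illustration under assumptions (hypothetical function name and data layout), not the procedure implemented by the optimize command.

```python
def max_sample_replicate_threshold(occurrences, expected):
    """Largest N_ijk / N_jk cutoff that still keeps every expected occurrence.

    occurrences: dict (variant, sample, replicate) -> read count.
    expected: set of (variant, sample) pairs known to be in the mock samples.
    Returns None if no expected occurrence is present in the data.
    """
    totals = {}
    for (_, sample, replicate), count in occurrences.items():
        key = (sample, replicate)
        totals[key] = totals.get(key, 0) + count

    proportions = [
        count / totals[(sample, replicate)]
        for (variant, sample, replicate), count in occurrences.items()
        if (variant, sample) in expected
    ]
    return min(proportions) if proportions else None
```

Any cutoff at or below the returned value keeps all expected occurrences; choosing a value close to it removes as many unexpected low-frequency occurrences as possible without creating false negatives in the mock samples.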
2.2.5 Pool runs/markers
A run is FASTQ data from a sequencing run and a marker is a region of a locus amplified by a primer pair. The pool command produces an ASV table with any number of run-marker combinations. When more than one overlapping marker is used, ASVs that are identical in their overlapping parts are pooled into the same line.
3 Benchmarking
VTAM was tested with two published metabarcoding datasets: a fish dataset obtained from fish faecal samples (Corse et al., 2017), and a bat dataset obtained from bat guano samples (Galan et al., 2018). Both datasets included negative controls, mock samples and three PCR replicates. A fragment of the COI gene was amplified using two overlapping markers in the fish dataset, and one in the bat dataset (see details in the original studies).
Both datasets were analysed by VTAM. The fish dataset was analysed separately for the two markers and the results of both markers were pooled together.
Both datasets were also analysed with the DADA2 denoising algorithm (Callahan et al., 2016), one of the most widely used methods for metabarcoding data curation. The output of DADA2 was filtered by LULU (Frøslev et al., 2017) to further eliminate probable false positive occurrences. Then the three replicates of each sample were pooled (as in VTAM), only accepting the occurrence if it was present in at least two replicates (Supporting information 2).
We compared the α-diversity and β-diversity obtained for the environmental samples to address the effect of the curation pipelines on diversity estimations. α-diversity was estimated using both ASV richness and cluster richness (clusters aggregate ASVs with <3% divergence), and β-diversity was summarized using the Bray-Curtis pairwise dissimilarity index (Supporting information 3).
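For readers unfamiliar with these indices, a minimal Python sketch of ASV richness and of the Bray-Curtis dissimilarity between two samples is given below. It assumes read counts as input and hypothetical function names; the exact protocol used in this study is the one described in Supporting information 3.

```python
def asv_richness(sample_counts):
    """Number of ASVs with at least one read in the sample."""
    return sum(1 for count in sample_counts.values() if count > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).

    counts_a, counts_b: dict ASV -> read count for two samples.
    """
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(
        min(counts_a[v], counts_b[v]) for v in set(counts_a) & set(counts_b)
    )
    return 1.0 - 2.0 * shared / total
```

The index is 0 for two samples with identical counts and 1 for two samples that share no ASV.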
In the fish dataset, all expected variants in the mock samples were validated by both pipelines. However, in the bat dataset, two expected variants had very low read abundance (2-18 reads/replicate), which were in the range of the number of reads in the negative controls (ten out of the 19 negative controls had at least one read count greater than 18). Therefore, we ignored these two expected variants in the Bulk France mock sample, and we optimized the VTAM parameters to retain all other expected occurrences.
After filtering with VTAM, the number of false positives in the mock samples was markedly lower than with DADA2 (Table 2). Similarly, ASV and cluster richness were on average two times lower with VTAM than with DADA2 in environmental samples (Fig. 2A and B). In contrast, dissimilarities between samples were higher with VTAM (Fig. 2D). In both pipelines, most clusters contained a single ASV (Supporting information 3; Fig. 2C).
4 Discussion
Metabarcoding is known to be prone to two types of errors: false negatives and false positives. Based on controls (negative and mock samples), VTAM aims to find a compromise between these two error types by minimizing false positive occurrences while retaining expected variants in mock samples to avoid false negatives. Therefore, the mock samples should contain both well and weakly amplified taxa, where the abundance, i.e. the number of reads, of weakly amplified taxa is marginally higher than that found in negative samples. This should ensure finding filtering parameter values that simultaneously minimize false positives and false negatives. Additionally, in large-scale studies with more than one sequencing run, the use of identical mock samples in all runs can ensure comparability among runs if they consistently yield the same results.
The use of technical replicates is another important tool to limit false positives and false negatives (Alberdi et al., 2018; Corse et al., 2017). False positives can be strongly reduced by only accepting variants in a sample if they are present in at least a certain number of replicates. This strategy is strongly advised to reduce experimental stochasticity and validate ASV occurrences. Furthermore, removing replicates with radically different compositions (Renkonen filter) further reduces the effect of experimental stochasticity (De Barba et al., 2014). Additionally, false negatives can be further reduced by amplifying several markers (Corse et al., 2019). If the different markers overlap, VTAM can pool sequences that are identical in their overlapping regions. This integrates the results of different markers unambiguously.
While false positive occurrences due to sequencing and PCR errors are generally well detected by denoising pipelines such as DADA2, tag-jump and cross-sample contamination are rarely taken into account (but see Boyer et al., 2016; Edgar, 2016a). However, failing to filter out these artefacts is likely to inflate false positive occurrences and artificially increase inter-sample similarities. In fact, the DADA2-based pipeline produced ASV and cluster richness per sample that was on average twice as high as with VTAM and even higher for some samples (Fig. 2A, B). On the other hand, dissimilarities between samples were lower after DADA2 filtration. Additionally, the near 1:1 correlation between ASV and cluster richness in both pipelines indicated that most clusters contained just one ASV per sample. This supports the notion that diversity inflation in DADA2 resulted from unfiltered tag-jump contaminations rather than PCR or sequencing errors, as this would have produced more ASVs that belong to the same cluster. Our VTAM pipeline, therefore, appears more appropriate for comparing the diversity between samples and for investigating the biological responses to environmental change.
5 Conclusions
The VTAM metabarcoding pipeline aims to address known technical errors during data analysis (Table 1) to validate metabarcoding data. It is a complete pipeline from raw FASTQ data to curated ASV tables with taxonomic assignments.
The implementation of VTAM provides several advantages such as using a Conda environment to facilitate the installation, data storage in an SQLite database for traceability and the possibility to run one or several sequencing run-marker combinations using the same command. VTAM includes features rarely considered in most metabarcoding pipelines, and we believe it provides a useful tool for the analysis and validation of metabarcoding data for conducting robust analyses of biodiversity.
Acknowledgements
We thank Diane Zarzoso-Lacoste and Samanta Ortuno Miguel for valuable comments on the use of VTAM, Luc Giffon and Lionel Spinelli for the development of Wopmars and Kurt Villsen for English editing. Centre de Calcul Intensif d’Aix-Marseille is acknowledged for granting access to its high performance computing resources. This work is a contribution to the European project SEAMoBB, funded by ERA-Net Mar-TERA and managed by ANR (number ANR_17_MART-0001_01).
Authors’ contributions
EM, EC, VD conceived the ideas and designed the methodology. EM and AG conceived the software architecture and tested the VTAM. AG, TD and RM developed the VTAM software; AG contributed to the WopMars software development. EM, AG, VD and EC wrote the manuscript. All authors contributed critically to the draft and approved the final version of the manuscript.
Data Availability
VTAM is available at https://github.com/aitgon/vtam. A detailed user manual is found at https://vtam.readthedocs.io.
Empirical data used in this paper are available from the Dryad Digital Repository https://datadryad.org/stash/dataset/doi:10.5061/dryad.f40v5 and https://datadryad.org/stash/dataset/doi:10.5061/dryad.kv02g.
References
Alberdi, A., Aizpurua, O., Gilbert, M. T. P., & Bohmann, K. (2018). Scrutinizing key steps for reliable metabarcoding of environmental samples. Methods in Ecology and Evolution, 9(1), 134–147. doi:10.1111/2041-210X.12849
Bakker, M. G. (2018). A fungal mock community control for amplicon sequencing experiments. Molecular Ecology Resources, 18(3), 541–556. doi:10.1111/1755-0998.12760
Bolyen, E., Rideout, J. R., Dillon, M. R., Bokulich, N. A., Abnet, C. C., Al-Ghalith, G. A., … Caporaso, J. G. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852–857. doi:10.1038/s41587-019-0209-9
Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: a unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176–182. doi:10.1111/1755-0998.12428
Calderón-Sanou, I., Münkemüller, T., Boyer, F., Zinger, L., & Thuiller, W. (2020). From environmental DNA sequences to ecological conclusions: How strong is the influence of methodological choices? Journal of Biogeography, 47(1), 193–206. doi:10.1111/jbi.13681
Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. doi:10.1038/nmeth.3869
Corse, E., Meglécz, E., Archambaud, G., Ardisson, M., Martin, J.-F., Tougard, C., … Dubut, V. (2017). A from-benchtop-to-desktop workflow for validating HTS data and for taxonomic identification in diet metabarcoding studies. Molecular Ecology Resources, 17(6), e146–e159. doi:10.1111/1755-0998.12703
Corse, E., Tougard, C., Archambaud-Suard, G., Agnèse, J.-F., Mandeng, F. D. M., Bilong, C. F. B., … Dubut, V. (2019). One-locus-several-primers: A strategy to improve the taxonomic and haplotypic coverage in diet metabarcoding studies. Ecology and Evolution, 9(8), 4603–4620. doi:10.1002/ece3.5063
Cristescu, M. E., & Hebert, P. D. N. (2018). Uses and Misuses of Environmental DNA in Biodiversity Science and Conservation. Annual Review of Ecology, Evolution, and Systematics, 49(1), 209–230. doi:10.1146/annurev-ecolsys-110617-062306
De Barba, M., Miquel, C., Boyer, F., Mercier, C., Rioux, D., Coissac, E., & Taberlet, P. (2014). DNA metabarcoding multiplexing and validation of data accuracy for diet assessment: application to omnivorous diet. Molecular Ecology Resources, 14(2), 306–323. doi:10.1111/1755-0998.12188
Edgar, R. C. (2016a). UNCROSS: Filtering of high-frequency cross-talk in 16S amplicon reads. BioRxiv, 088666. doi:10.1101/088666
Edgar, R. C. (2016b). UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. BioRxiv, 081257. doi:10.1101/081257
Frøslev, T. G., Kjøller, R., Bruun, H. H., Ejrnæs, R., Brunbjerg, A. K., Pietroni, C., & Hansen, A. J. (2017). Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. Nature Communications, 8(1), 1–11. doi:10.1038/s41467-017-01312-x
Galan, M., Pons, J.-B., Tournayre, O., Pierre, É., Leuchtmann, M., Pontier, D., & Charbonnel, N. (2018). Metabarcoding for the parallel identification of several hundred predators and their prey: Application to bat species diet analysis. Molecular Ecology Resources, 18(3), 474–489. doi:10.1111/1755-0998.12749
Mahé, F., Rognes, T., Quince, C., de Vargas, C., & Dunthorn, M. (2014). Swarm: robust and fast clustering method for amplicon-based studies. PeerJ, 2, e593. doi:10.7717/peerj.593
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal, 17(1), 10–12. doi:10.14806/ej.17.1.200
O’Rourke, D. R., Bokulich, N. A., Jusino, M. A., MacManes, M. D., & Foster, J. T. (2020). A total crapshoot? Evaluating bioinformatic decisions in animal diet metabarcoding analyses. Ecology and Evolution, 10(18), 9721–9739. doi:10.1002/ece3.6594
Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. doi:10.7717/peerj.2584
Zinger, L., Bonin, A., Alsos, I. G., Bálint, M., Bik, H., Boyer, F., … Taberlet, P. (2019). DNA metabarcoding—Need for robust experimental designs to draw sound ecological conclusions. Molecular Ecology, 28(8), 1857–1862. doi:10.1111/mec.15060
Figures and tables
Figure 1. An example of a data structure with one run, two markers and three replicates for each sample. S1-R1: Replicate 1 of Sample 1. Replicates are not essential but strongly recommended. Samples should include at least one mock sample and one negative control.

Figure 2. Diversity estimates from the fish and bat datasets, based on the VTAM and DADA2-based pipelines. A) ASV richness per sample. B) Cluster richness per sample. C) The correlation between ASV and cluster richness. P-value indicates a significant slope difference between the two pipelines. D) β-diversity was estimated using the Bray-Curtis dissimilarity index calculated for each pairwise sample comparison. Solid lines indicate linear regression lines, hatched lines are the 1:1 reference lines.
Table 1. List of VTAM commands and their roles.

| VTAM command | VTAM step (name in Corse et al. 2017) | Role | Error type |
| --- | --- | --- | --- |
| merge | | Merges paired-end reads and quality filtering | Sequencing errors |
| sortreads | | Assigns reads to samples | Sequencing errors |
| filter | Dereplicate | Dereplicates | |
| filter | Delete singletons | Deletes singletons | Sequencing errors, highly spurious sequences |
| filter | LFN_variant filter (LFNtag) | Deletes low frequency errors | Tag jump, inter-sample contamination |
| filter | LFN_read_count filter (LFNneg) | Deletes low frequency errors | Sequencing error, light contamination |
| filter | LFN_sample_replicate filter (LFNpos) | Deletes low frequency errors | Sequencing error, light contamination |
| filter | FilterMinReplicateNumber | Ensures consistency between replicates | PCR heterogeneity |
| filter | FilterPCRerror (Obliclean) | Eliminates PCR errors (even if frequent) | PCR errors |
| filter | FilterChimera | Eliminates chimeras | Chimeras |
| filter | FilterRenkonen | Eliminates aberrant replicates | Dysfunctional PCRs |
| filter | FilterIndel (Pseudogene filter) | Eliminates pseudogenes | Pseudogenes, spurious sequences |
| filter | FilterCodonStop (Pseudogene filter) | Eliminates pseudogenes | Pseudogenes, spurious sequences |
| taxassign | (LTG) | Assigns variants to taxa | Highly spurious sequences |
| optimize | OptimizeLFNsampleReplicate | Finds the optimal parameter for the LFN-sample-replicate filter | |
| optimize | OptimizePCRerror | Finds the optimal parameter for FilterPCRerror | |
| optimize | OptimizeLFNreadCountAndLFNvariant | Finds the optimal value for LFN-read-count and LFN-variant filters | |
| pool | | Pools the results from different runs/markers | |
Table 2. Number of false positive occurrences compared to the total number of occurrences. In negative control and mock samples, the count of false positives is precise, since the sample composition is known.

| | VTAM Fish | DADA Fish | VTAM Bat | DADA Bat |
| --- | --- | --- | --- | --- |
| Negative controls | 0/0 (0%) | 32/32 (100%) | 2/2 (100%) | 19/19 (100%) |
| Mock samples | 5/17 (29%) | 37/49 (75%) | 22/61 (36%) | 73/114 (65%) |
Supporting Information
SuppInfo1.pdf
Description of the taxonomic assignation and its custom database.
SuppInfo2.pdf
Commands, user input files, and the final ASV tables produced by VTAM and the DADA based pipeline for the fish and the bat datasets.
SuppInfo3.pdf
Diversity estimation protocol