DuplicationDetector, a light weight tool for duplication detection using NGS data


HAL Id: hal-03070294

https://hal.archives-ouvertes.fr/hal-03070294

Submitted on 28 Apr 2021


Distributed under a Creative Commons Attribution 4.0 International License


G. Djedatin, C. Monat, S. Engelen, François Sabot

To cite this version:

G. Djedatin, C. Monat, S. Engelen, François Sabot. DuplicationDetector, a light weight tool for duplication detection using NGS data. Current Plant Biology, Elsevier, 2017, Plant Development, 9-10, pp.23-28. ⟨10.1016/j.cpb.2017.07.001⟩. ⟨hal-03070294⟩


Contents lists available at ScienceDirect

Current Plant Biology

journal homepage: www.elsevier.com/locate/cpb

DuplicationDetector, a light weight tool for duplication detection using NGS data 夽

Gustave Djedatin a,b,∗∗, Cécile Monat b,c,d,1, Stefan Engelen e, Francois Sabot b,c,d,∗

a BIOGENOM Laboratory, FAST/DASSA, BP 14 Dassa-Zoumé, Benin

b DIADE UMR IRD/UM – Centre IRD de Montpellier, 911 av Agropolis BP 604501, F-34394 Montpellier Cedex 5, France

c South Green Bioinformatics Platform, Agropolis Campus, Montpellier, France

d Université de Montpellier, Place Eugène Bataillon, 34000, Montpellier, France

e Commissariat à l’Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, F-91057 Evry, France

Article info

Article history: Received 19 May 2017; Received in revised form 18 July 2017; Accepted 19 July 2017

Keywords: Duplication, NGS, Rice

Abstract

Duplications are one of the main evolutionary forces in angiosperms, especially in Poaceae. A large number of genes involved in various metabolisms and pathways originate from such duplications (whole genome, segmental or single gene). However, detecting such duplications may be complicated and costly, and generally requires heavy human and material investments. Here, we propose an alternative approach for detecting putative recent segmental duplications in haploid or diploid homozygous organisms based on NGS data. We rely on abusive mappings of paralogous sequences, which increase the apparent number of heterozygous points at a given locus, to identify such duplicated genomic regions. We tested our tool on simulated data, then on true rice genomic sequences, and were able to identify about 200 candidate duplicated genes in the African rice (Oryza glaberrima) lineage compared to the Asian one (O. sativa).

1. Introduction

Duplication is an important feature of plant genome architecture, and can involve a single gene, a chromosome segment, an entire chromosome or even the whole genome [1]. It was shown, for instance, that angiosperms underwent large-scale duplications and multiple whole genome duplications all along their evolution [2]. Whole genome duplication, i.e. doubling the amount of the complete genetic material of an individual without crossing, appears as a source of evolution and biological complexity [1,3,4]. In the same way, segmental duplications create local variations offering new opportunities for natural selection to occur [5]. Therefore, gene and genomic duplications play an important role in the evolution of plant phenotypes [6], and duplicated genes can follow different fates: (i) neofunctionalization, i.e. retention of both divergent copies but with a new function for one of them; (ii) subfunctionalization, i.e. retention of both copies with conserved function but in another tissue/organ/time frame for one; or (iii) nonfunctionalization/pseudogenization, i.e. accumulation of a large number of mutations in one of the copies [1]. The first two cases (neo- and subfunctionalization) may lead to a new expression pattern, or even a new regulatory pathway [7].

夽 This article is part of a special issue entitled “Plant Development”.

∗ Corresponding author at: DIADE UMR IRD/UM – Centre IRD de Montpellier, 911 av Agropolis BP 604501, F-34394 Montpellier Cedex 5, France.

∗∗ Corresponding author at: BIOGENOM Laboratory, FAST/DASSA, BP 14 Dassa-Zoumé, Benin.

E-mail addresses: [email protected] (G. Djedatin), [email protected] (F. Sabot).

1 Current address: Domestication Genomics Group, IPK Gatersleben, OT Gatersleben, Corrensstrasse 3, D-06466 Seeland, Germany.

In cultivated Asian rice (Oryza sativa), for instance, genome duplication provided improved root resistance [8], seed germination and seedling growth under salt stress [8,9]. In addition, tandem duplications were evidenced, amplifying adaptively important resistance genes encoding membrane proteins and functions related to abiotic and biotic stress [10]. Likewise, segmental and tandem duplications led to the duplication of HAP (Heterotrimeric Heme Activator Protein) genes regulating rice heading date [11].

However, detecting genome or segmental duplications is a complex task. Different approaches and techniques are used, such as (time-consuming) molecular techniques, e.g. comparative genome hybridization (CGH) [12], FISH, and array CGH [13]. Recently, owing to the availability of Next Generation Sequencing technologies and their low cost [14–16], more computational sequencing-based approaches were developed (e.g. [17]). These methods rely mainly on Depth of Coverage (DoC) variations to identify duplications relative to a reference genome, as a duplicated region is expected to be sequenced twice as deeply as non-duplicated regions [18]. However, molecular approaches as well as DoC methods require highly precise experiments, repetitions and heavy computational times; they are designed to compare one target individual to a given reference one, and thus cannot be applied to a large number of individuals.

In this study, we propose a new methodology based on the use of apparent excess of heterozygous loci (AEH) on genomic intervals in autogamous diploid species. This methodology was implemented in a tool called DuplicationDetector, and was tested on a set of simulated data and shown to be fast and robust. In addition, we applied it to cultivated and wild African rices, respectively Oryza glaberrima and O. barthii, to detect duplicated genes compared to the Asian rice Oryza sativa.

2. Materials and methods

2.1. Material

2.1.1. Simulated sequence data for validation

A fragment of chromosome 7 (1 Mb) of Oryza sativa ssp japonica cv Nipponbare IRGSP 1.0 was extracted from position 1,000,000 to 1,999,999 using the extractseq tool from EMBOSS [19]. Three duplications (namely duplications 1–3) were artificially created within this sequence using a home-made Perl script (available on demand), respectively at positions 300–1500; 350,000–390,000 (containing another initial duplication) and 800,000–810,000. A total of nine virtual ‘clones’ were constructed. Clone 1, without duplicated sequences, was considered as identical to the reference. Clone 2 has the three duplicated sequences without mutation. Clones 4–6 have the duplicated sequences and mutations with 3% of supplementary divergence, while clones 7–9 have the duplicated sequences with the same mutations and 6% of divergence. In addition, we added to clones 3–9 additional common mutations in the non-duplicated sequence. All mutations were induced using the Mutate DNA tool from the SMS2 suite [20]. The sequences from each virtual clone were then used to simulate FASTQ data using the ART simulation tool [21], specifying as options a HiSeq 2500 machine, 100 bp paired-end fragments, a depth of 35, an insert size of 200 and 10% of insert size divergence. ART includes an empirical error model that allows a very good simulation of sequencing data [21].

2.1.2. Sequence data for Oryza glaberrima and O. barthii and initial quality control

Eight accessions of African cultivated rice Oryza glaberrima (TOG5307, TOG5307f, TOG5321, TOG5666, TOG5887, TOG7291, UB06, UG26) and six wild relatives Oryza barthii (B88, IG05, IRGC106302, MB323, TB41, TG57) were used in this study (see [22] for more information about those accessions). Asian cultivated rice O. sativa IR64 (ssp indica) and Azucena (ssp japonica) were also included as controls. All samples were sequenced at Genoscope (France), in the frame of the IRIGIN project (http://irigin.org), as follows:

Sequencing: Libraries were prepared using the NEBNext DNA Modules Products (New England Biolabs, MA, USA) with an ‘on beads’ protocol developed at the Genoscope, thus reducing the costs and increasing the yields. Briefly, after gDNA fragmentation with the E210 Covaris instrument (Covaris, Inc., USA), end repair, A-tailing and ligation with adapted concentrations of Nextflex DNA barcodes (Bioo Scientific, Austin, TX) were performed on the same AMPure XP beads that were used for the first purification after end repair. After two consecutive 1x AMPure XP clean-ups, the ligated product was amplified by 12 cycles of PCR using the Kapa Hifi Hotstart NGS library Amplification kit (Kapa Biosystems, Wilmington, MA), followed by a 0.6x AMPure XP purification. Library traces were validated on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and quantified by qPCR using the KAPA Library Quantification Kit (Kapa Biosystems) on a MxPro instrument (Agilent Technologies, USA). Libraries were sequenced on an Illumina HiSeq2000 or HiSeq4000 instrument (Illumina, USA), at 2×101 bp or 2×151 bp, respectively. About 50 billion useful paired-end reads were obtained per run.

QC and initial treatments: Low quality clusters were filtered during the sequencing run by the Real Time Analysis (RTA) software. Filtering steps were performed on whole paired FASTQ files: Illumina adapters and primers were removed, nucleotides with a quality value lower than 20 were trimmed from both ends, and sequences between the second unknown nucleotide (N) and the end of the read were trimmed. Reads shorter than 30 nucleotides after trimming were discarded. These trimming steps were achieved using fastxtend (http://www.genoscope.cns.fr/fastxtend/), a software based on the FASTX library [23]. The filtered reads and their mates that mapped onto run quality control sequences (PhiX genome) were removed using SOAP aligner [24].

2.1.3. Reference sequence & annotation

The reference genome sequence IRGSP-1.0/MSU7.0 and its annotation from MSU v7 [25] were used for the analyses described below. The initial VCF (Variant Call Format) files are available at http://bioinfo-storage.ird.fr/2017/CPB/Djedatin/.

2.2. Methods

2.2.1. Mapping approach & initial SNP calling

For VCF creation, cleaned paired FASTQ data were mapped upon the reference sequence using BWA 0.7.12 (aln/sampe legacy algorithm) [26]. SAM (Sequence Alignment/Map) files were cleaned and filtered for low quality mapping and abnormal mapping, merged and realigned using a combination of SAMtools [27] and Picard Tools [28]. After realignment, SNPs were called using the GATK HaplotypeCaller [29] under standard conditions. Calling was performed per individual chromosome to optimize calculation time. All steps were performed using the TOGGLE pipeline [30] to ensure repeatability and traceability. The standard default values were chosen according to two criteria: (i) the IRIC/3K genomes standards for mapping/calling and (ii) numerous home-made tests and evaluations of conditions using control samples in different analyses (such as in Monat et al. [30], GBE) that provided the best results. Detailed options are shown in SuppData, as well as the TOGGLE configuration file.

2.2.2. Heterozygous SNP recovery

VCF files were filtered to recover the lines containing heterozygous SNPs, based on standard default filters (depth for each sample, maximum number of missing data, minimal calling quality value, maximum MQ0 value, homozygous controls).
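As an illustration of this recovery step, the sketch below parses the per-sample columns of one VCF line and reports, for each sample, whether the genotype call is heterozygous and at which depth. It is a minimal Python sketch and not the DuplicationDetector code: the function names (`parse_sample`, `sample_calls`) and the example line are hypothetical, and it assumes the standard `GT` and `DP` keys in the FORMAT column.

```python
def parse_sample(format_field, sample_field):
    """Return (is_heterozygous, depth) for one sample column of a VCF line."""
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    # Treat phased (|) and unphased (/) genotypes the same way.
    alleles = values.get("GT", "./.").replace("|", "/").split("/")
    called = "." not in alleles
    is_het = called and len(set(alleles)) > 1
    dp = values.get("DP", ".")
    depth = int(dp) if dp.isdigit() else 0
    return is_het, depth


def sample_calls(vcf_line):
    """Parse every sample column (from the 10th field on) of a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt, samples = fields[8], fields[9:]
    return [parse_sample(fmt, sample) for sample in samples]


# Hypothetical VCF data line with four samples.
line = "Chr1\t1042\t.\tA\tG\t812.3\t.\tMQ0=0\tGT:DP\t0/1:35\t0/0:40\t./.:0\t1/1:33"
print(sample_calls(line))
```

Only the first sample of this invented line is reported as heterozygous; missing calls (`./.`) are counted as non-heterozygous with a depth of 0.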

2.2.3. Genomic intervals recomposition

Extracted VCF lines were recompiled in genomic intervals respecting a specified maximal distance between two SNPs to be considered as related, a minimal block size, and a minimal heterozygous SNP density. Resulting files are 3-column BED-like files.
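The recomposition can be sketched as a simple grouping of sorted SNP positions. The Python code below is an illustration under stated assumptions, not the tool's implementation: the function name `build_intervals` is hypothetical, the default thresholds mirror the defaults listed in the Results section (1 kb, 100 bp, 25 bases between SNPs), and the density criterion is interpreted here as a mean spacing (block size divided by number of SNPs), which is an assumption.

```python
def build_intervals(positions, max_gap=1000, min_size=100, max_spacing=25):
    """Group SNP positions of one chromosome into (start, end) intervals."""
    blocks, current = [], []
    for pos in sorted(positions):
        # Start a new block when the next SNP is too far from the previous one.
        if current and pos - current[-1] > max_gap:
            blocks.append(current)
            current = []
        current.append(pos)
    if current:
        blocks.append(current)

    kept = []
    for block in blocks:
        size = block[-1] - block[0] + 1
        if size < min_size:
            continue  # block too short
        if size / len(block) > max_spacing:
            continue  # heterozygous SNPs too sparse (assumed interpretation)
        kept.append((block[0], block[-1]))
    return kept


# Hypothetical heterozygous SNP positions on one chromosome.
positions = [100, 110, 125, 140, 160, 185, 210, 5000, 5010, 9000]
for start, end in build_intervals(positions):
    print(f"Chr1\t{start}\t{end}")  # 3-column BED-like line
```

With these invented positions, only the first cluster (100 to 210) passes both the size and density checks; the pair around 5000 is too short and the isolated SNP at 9000 forms no block.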

2.2.4. Duplicated genes identification and SNP potential effect identification

Genomic interval files were crossed with the GFF file containing the annotation using intersectBed from the BEDTools suite [31]. Selected heterozygous SNPs were then annotated for their potential effect using the snpEff software [32]. A schematic view of the whole pipeline is detailed in SuppData.
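The crossing of intervals with gene annotations amounts to an interval-overlap test. The pipeline uses intersectBed for this; the pure-Python sketch below is a swapped-in illustration of the same idea, not a reimplementation of BEDTools. The function name `overlapping_genes` and the interval coordinates are hypothetical; the gene coordinates are those reported later in the text for LOC_Os07g09900.

```python
def overlapping_genes(intervals, genes):
    """Return (gene_id, chrom, start, end) for each interval/gene overlap.

    intervals: list of (chrom, start, end)
    genes:     list of (chrom, start, end, gene_id)
    """
    hits = []
    for chrom, start, end in intervals:
        for g_chrom, g_start, g_end, gene_id in genes:
            # Two closed ranges overlap when each starts before the other ends.
            if chrom == g_chrom and start <= g_end and g_start <= end:
                hits.append((gene_id, chrom, start, end))
    return hits


genes = [("Chr7", 5263409, 5267310, "LOC_Os07g09900")]
# Invented AEH intervals, for illustration only.
intervals = [("Chr7", 5264000, 5264800), ("Chr7", 6000000, 6000500)]
print(overlapping_genes(intervals, genes))
```

The nested loop is quadratic and only suitable for small examples; for whole-genome annotation files a dedicated tool such as the one used in the pipeline is the better choice.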


Fig. 1. Apparent Excess of Heterozygous loci will appear if reads coming from a duplicated region are abusively mapped on a reference genome without the duplication.

2.3. Availability

All codes, installation instructions and manual for DuplicationDetector are available, under the GPLv3/CeCILL-B double licenses, on the GitHub of the project: https://github.com/SouthGreenPlatform/duplicationDetector.

3. Results

3.1. Description of the approach

We rely on AEH genomic intervals to detect duplicated sequences (Fig. 1). Basically, we detect abusive mapping of reads coming from duplicated regions in a sequenced individual when they are mapped on a reference genome without the duplication (i.e. harboring only a single copy). Such abusive mapping will lead the SNP calling to produce an Apparent Excess of Heterozygous loci, i.e. too many heterozygous loci in a short region. If many individuals are sequenced and mapped in the same experiment, such AEH loci will appear for (almost) each individual at the same location, thus indicating that the region is duplicated in the sequenced genomes compared to the reference one. Users can manage in DuplicationDetector the level of stringency for the selection of AEH loci using different criteria:

• Minimal depth per individual (default at 30)

• Minimal number of individuals to be heterozygous for a point to be chosen (default at 10)

• Maximal number of MQ0 reads (that can be mapped at two positions with identical score; default at 0)

• Control individuals (some samples that must not be heterozygous)

The genomic interval creation can also be set up using the following criteria:

• Maximum size between two heterozygous loci (default at 1 kb)

• Minimum size of the genomic interval (default at 100 bp)

• Minimum density of heterozygous loci in the genomic interval (default at 25 bases between each SNP).
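The locus-level criteria listed above can be combined into a single decision, sketched here in Python. This is a hedged illustration rather than the tool's actual logic: the `Criteria` class, its field names and the function `is_aeh_locus` are hypothetical, and the sketch assumes that a heterozygous call below the minimal depth simply does not count towards the required number of individuals.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    min_depth: int = 30            # minimal depth per individual
    min_het_individuals: int = 10  # minimal number of heterozygous individuals
    max_mq0: int = 0               # maximal number of MQ0 reads
    controls: frozenset = frozenset()  # samples that must stay homozygous


def is_aeh_locus(calls, mq0, criteria):
    """calls maps a sample name to (is_heterozygous, depth) at one position."""
    if mq0 > criteria.max_mq0:
        return False
    # A heterozygous control invalidates the locus.
    if any(calls[name][0] for name in criteria.controls if name in calls):
        return False
    n_het = sum(
        1
        for name, (is_het, depth) in calls.items()
        if name not in criteria.controls and is_het and depth >= criteria.min_depth
    )
    return n_het >= criteria.min_het_individuals


african = ["TOG5307", "TOG5307f", "TOG5321", "TOG5666", "TOG5887", "TOG7291",
           "UB06", "UG26", "B88", "IG05", "IRGC106302", "MB323", "TB41", "TG57"]
criteria = Criteria(controls=frozenset({"IR64", "Azucena"}))

# Invented calls: all African samples heterozygous at 35x, controls homozygous.
calls = {name: (True, 35) for name in african}
calls.update({"IR64": (False, 40), "Azucena": (False, 38)})
print(is_aeh_locus(calls, mq0=0, criteria=criteria))
```

In this invented case the locus is retained; it would be rejected if either control were heterozygous, if any MQ0 read were present, or if fewer than ten non-control samples were heterozygous at sufficient depth.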

If the user provides a GFF file for gene annotation, duplicated genes will be identified through overlap with the identified genomic intervals.

In terms of speed, a complete scan from a raw VCF of 16 African rice individuals (12 chromosomes, ∼380 Mb) at high sequencing depth (∼35x, see Materials and methods), with two Asian rice individuals as controls, is performed on a single recent 64-bit core in less than 2 h.

3.2. Results on simulated data

On simulated data, we were able to partially (∼40%) identify duplication 1 directly, and almost entirely (∼95%) duplication 3 (Table 1) as fragmented blocks. We were able to delimit those two duplications with quite a good resolution, i.e. the capacity to correctly identify the borders (max 500 bp of error in delimitation; see Table 1). The difference in recovery level between the two duplications is mainly due to their size: for duplication 1 (1.2 kb), the non-recovered fragments are 193 bp in 5′ and 522 bp in 3′ out of 1200 bp, while for duplication 3 the non-recovered sizes are 142/355 bp. The fragmentation effect may be due to the fact that duplication 3 is a large block but with a low variation density, and the AEH loci density parameter is quite conservative.
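The recovery percentages reported in Table 1 are consistent with the span between the recovered start and stop divided by the simulated duplication length. The short Python check below reproduces them from the positions given in Table 1; the function name `recovery` is hypothetical and the formula is inferred from the reported values, not taken from the tool's code.

```python
def recovery(start, stop, rec_start, rec_stop):
    """Percentage of the simulated duplication spanned by the recovered region."""
    return 100.0 * (rec_stop - rec_start) / (stop - start)


dup1 = round(recovery(300, 1500, 493, 978), 2)
dup3 = round(recovery(800_000, 810_000, 800_142, 809_675), 2)
print(dup1, dup3)
```

For duplication 3, this span-based figure does not account for gaps between the eight fragmented blocks, so it should be read as an upper bound on the bases actually covered.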

Duplication 2 was not detected, as it contains another older (non-simulated) nested duplication, and AEH loci in this region were removed based on the maximum MQ0 parameter. Indeed, reads coming from a region already duplicated in the reference genome could be mapped on any of the two copies on the reference with the same probability. This will increase the MQ0 level, and thus those regions will be filtered out with the current version of DuplicationDetector.

Table 1
Statistics about simulated data. Positions are given in base pairs.

| Duplication | Start | Stop | Start (recovered) | Stop (recovered) | Resolution in bp / Nb blocks | Recovery |
|---|---|---|---|---|---|---|
| 1 | 300 | 1500 | 493 | 978 | 193<->522 / 1 block | 40.42% |
| 2 | 350,000 | 390,000 | NA | NA | NA | NA |
| 3 | 800,000 | 810,000 | 800,142 | 809,675 | 142<->355 / 8 blocks | 95.33% |

Table 2
Statistics about duplicated regions in African rices compared to Asian. Sizes are given in base pairs.

| Chromosome | Block size Min | Block size Mean | Block size Max | AEH loci Min | AEH loci Mean | AEH loci Max | AEH loci freq Min | AEH loci freq Mean | AEH loci freq Max |
|---|---|---|---|---|---|---|---|---|---|
| Chr1 | 142 | 795 | 2770 | 9 | 37 | 111 | 9.33 | 10.99 | 13.03 |
| Chr2 | 106 | 623 | 2031 | 5 | 31 | 106 | 8.58 | 10.75 | 12.75 |
| Chr3 | 110 | 293 | 825 | 5 | 15 | 40 | 8.78 | 10.67 | 13.3 |
| Chr4 | 103 | 794 | 2079 | 6 | 40 | 132 | 8.33 | 10.31 | 13 |
| Chr5 | 119 | 719 | 2668 | 6 | 37 | 113 | 9.62 | 11.07 | 13.25 |
| Chr6 | 107 | 1057 | 3560 | 7 | 50 | 169 | 8.44 | 11.02 | 13.15 |
| Chr7 | 147 | 903 | 3198 | 7 | 49 | 154 | 8.94 | 11.19 | 13.17 |
| Chr8 | 111 | 925 | 3336 | 8 | 44 | 158 | 8.1 | 11.29 | 13.36 |
| Chr9 | 127 | 634 | 1057 | 7 | 35 | 64 | 10.07 | 11.37 | 13 |
| Chr10 | 127 | 937 | 2817 | 7 | 47 | 170 | 9.14 | 11.21 | 13.29 |
| Chr11 | 124 | 716 | 2055 | 6 | 35 | 92 | 8.68 | 10.91 | 12.89 |
| Chr12 | 101 | 744 | 1804 | 5 | 37 | 85 | 8.79 | 10.62 | 12.05 |

No false positive regions were obtained from the simulated data.

3.3. Results on experimental data

We then tested our tool on an experimental set of African rice individuals sequenced at relatively high depth (see Materials and methods). African cultivated rice Oryza glaberrima and its wild relative O. barthii diverged from the ancestor of Asian cultivated rice O. sativa almost 1 million years ago [33,34], and may harbor duplicated regions compared to Asian rice. We thus tested 8 cultivated and 6 wild samples to evaluate our tool on real data. To avoid an individual effect due to the reference genome (i.e. identification of AEH loci only because of a loss of duplication in Nipponbare), we added as controls two other O. sativa samples, IR64 (indica subspecies) and Azucena (japonica subspecies), sequenced as described in 2.2.1. Those two individuals must be homozygous and similar to the reference for an AEH locus on the African samples to be validated as such.

The whole analysis on the raw VCF representing variations on all 12 chromosomes of the Nipponbare reference took less than 2 h using a single core, with a very small memory footprint (max 2 Mbytes per file), and provided 786 putatively duplicated regions in the African rice lineage relative to the Asian one (see Table 2). From those putatively duplicated regions, we could identify 200 annotated gene features (i.e. tagged as ‘gene’ in the MSU7 GFF file; see SuppData). The blocks ranged from 102 to 3560 bases, with 5–170 SNPs (11 SNP/kb on average) with AEH in each block (Table 2).

The duplicated regions are widely spread all along the chromosomes, without any main locations (Fig. 2 and SuppData). The putatively duplicated genes themselves seem to be partially related to stress response, but no general trend was observed concerning Gene Ontology (data not shown). We observed a mean value of 38 AEH loci per region, showing that these duplications are quite recent, but date from before the radiation between O. glaberrima and O. barthii. Indeed, applying a mean value of 1.3 × 10⁻⁸ mutation per million years (as calculated in [35]), the mean age of the putative duplicated regions is ∼800,000 years, spanning from ∼605,000 to 972,000 years, as expected from the separation of the two groups [33,34].

Detailed analysis of the mutations induced by the AEH loci in duplicated genes showed that most of the mutations occurred outside of the gene coding regions (only 7–20% in exonic sequences; SuppData). For exonic mutations, a mean Ka/Ks ratio of 1.53 ± 0.45 (1.01–2.75) was observed, indicating a low level of global divergent selection. The mean observed Transition vs Transversion ratio is around 2 (1.7–2.3), as expected for genomic mutation [36,37].

Fig. 2. Chromosomal location of duplications in African rices compared to Asian on Chromosome 1. In red are symbolized duplications involving annotated genes, in green regions without annotated genes.

When focusing on individual genes, we were able to identify as putatively duplicated the LOC_Os07g09900 gene (Chromosome 7, from 5,263,409 to 5,267,310), Disease resistance protein RPM1, located under a major QTL of resistance to African strains of Xanthomonas (qABB-7, from [38]). As has been shown in other plant species, duplication of resistance genes may increase disease resistance.

3.4. Limits of the approach

In a recent paper, Hutin et al. [39] showed a recent partial duplication of 3.2 kb of the LOC_Os11g31190 gene OsSWEET14 (Chromosome 11, from 18,171,678 to 18,174,478), involved in an increased resistance to Xanthomonas oryzae pv oryzae. A set of 12 high level variant positions, including the 18 bp deletion between the original copy and the duplicated one [39], were identified in the variant selection step. However, post-filtrations (especially the minimal SNP density, here 177 bp between two SNPs instead of 25) did not allow this duplicated block to be recovered in our current test.

4. Discussion

Detection of duplicated regions between individuals is a challenging task using molecular tools, and a costly computing task with sequence data. Up to now, use of the latter approach (mainly using NGS) was based on raw mapping, DoC divergence computation and Copy Number Variation (CNV) analyses. Numerous tools have implemented such approaches, with varying efficiency [14–16], but all of them require a long computation time and cannot compare large numbers of individuals.

In the present study, we rely on abusive mapping and subsequent AEH loci to identify the duplicated regions. Moreover, our tool does not require intense re-calculation or mapping, as it relies directly on the raw VCF data (already generated in numerous genetic analyses) to identify those AEH loci. This approach is fast and allows working on large samples; in addition, it offers the possibility to include negative controls, which allow users to identify duplications existing in only a subsample of their sequenced individuals. Our approach is however quite conservative, as shown on simulated data, and will not detect, for instance, new copies of a sequence already repeated in the reference genome. In the same way, it cannot identify too recent duplications, due to the low number of mutations between the two copies (as for OsSWEET14). However, applied to the divergence between African and Asian rices, we were able to identify more than 200 putatively duplicated genes and almost 780 total regions. The detected genes are widely spread all along the chromosomes, generally related to stress response, at least marginally, and under a quite low divergent selection.

5. Conclusion

DuplicationDetector is thus a very efficient tool to detect duplications in haploid and highly homozygous diploid organisms, such as rice (tested here), but also bacteria, yeasts, autogamous plants, haploid fungi, and so on. Future development of our tool will include the implementation of detection in heterozygous or polyploid organisms or both, as well as additional criteria for filtering (such as hard-clipping level).

Authors’ contribution

GD and FS managed the whole study and wrote the whole pipeline. SE & Genoscope performed the sequencing and initial data treatments and QC. FS performed the simulation, GD and CM performed the basic data analyses, and GD analyzed the results. GD and FS wrote the manuscript, and all authors corrected and approved the current version.

Acknowledgments

GD was supported by an IRD grant (2013–2017 BEST Fellowship). CM was supported by ANR (AfriCrop project #ANR-13-BSV7-0017) and NUMEV labex (LandPanToggle #2015-1-030-LARMANDE). The authors want to thank Genoscope members for the sequencing of all rice data. This work was supported by the France Génomique French National infrastructure, funded as part of the “Investissement d’avenir” program managed by ANR (#ANR-10-INBS-09), in the frame of the IRIGIN project (http://irigin.org).

Conflict of interest statement

The authors have no conflict of interest.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.cpb.2017.07.001.

References

[1] S. Ohno, The enormous diversity in genome sizes of fish as a reflection of nature’s extensive experiments with gene duplication, Trans. Am. Fish. Soc. 99 (1970) 120–130.

[2] S. De Bodt, S. Maere, Y. Van de Peer, Genome duplication and the origin of angiosperms, Trends Ecol. Evol. 20 (2005) 591–597.

[3] R. Aburomia, O. Khaner, A. Sidow, Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail, in: Genome Evolution, Springer, Netherlands, Dordrecht, 2003, pp. 45–52.

[4] J.S. Taylor, J. Raes, Duplication and divergence: the evolution of new genes and old ideas, Annu. Rev. Genet. 38 (2004) 615–643.

[5] R. Chandan, D. Indra, Gene duplication: a major force in evolution and bio-diversity, Int. J. Biodivers. Conserv. 6 (2014) 41–49.

[6] L.E. Flagel, J.F. Wendel, Gene duplication and evolutionary novelty in plants, New Phytol. 183 (2009) 557–564.

[7] C. Feschotte, Transposable elements and the evolution of regulatory networks, Nat. Rev. Genet. 9 (2008) 397–405.

[8] Y. Tu, A. Jiang, L. Gan, M. Hossain, J. Zhang, B. Peng, Y. Xiong, Z. Song, D. Cai, W. Xu, et al., Genome duplication improves rice root resistance to salt stress, Rice 7 (2014) 15.

[9] A. Jiang, L. Gan, Y. Tu, H. Ma, J. Zhang, Z. Song, Y. He, D. Cai, X. Xue, The effect of genome duplication on seed germination and seedling growth of rice under salt stress, Aust. J. Crop Sci. 7 (2013) 1814–1821.

[10] C. Rizzon, L. Ponger, B.S. Gaut, S. Maere, S. Bodt, J. De Raes, T. Casneuf, M. Montagu, G. Van Blanc, K. Wolfe, et al., Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice, PLoS Comput. Biol. 2 (2006) e115.

[11] Q. Li, W. Yan, H. Chen, C. Tan, Z. Han, W. Yao, G. Li, M. Yuan, Y. Xing, Duplication of OsHAP family genes and their association with heading date in rice, J. Exp. Bot. 67 (2016) 1759–1768.

[12] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Döhner, T. Cremer, P. Lichter, Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances, Genes Chromosom. Cancer 20 (1997) 399–407.

[13] C. Shaw-Smith, R. Redon, L. Rickman, M. Rio, L. Willatt, H. Fiegler, H. Firth, D. Sanlaville, R. Winter, L. Colleaux, et al., Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features, J. Med. Genet. 41 (2004) 241–248.

[14] S. Goodwin, J.D. McPherson, W.R. McCombie, Coming of age: ten years of next-generation sequencing technologies, Nat. Rev. Genet. 17 (2016) 333–351.

[15] T.C. Glenn, Field guide to next-generation DNA sequencers, Mol. Ecol. Resour. 11 (2011) 759–769.

[16] T.C. Glenn, <http://molecularecologist.com//>.

[17] S. Newman, K.E. Hermetz, B. Weckselblatt, M.K. Rudd, Next-generation sequencing of duplication CNVs reveals that most are tandem and some create fusion genes at breakpoints, Am. J. Hum. Genet. 96 (2015) 208–220.

[18] P. Guan, W.-K. Sung, Structural variation detection using next-generation sequencing data: a comparative technical review, Methods 102 (1 June) (2016) 36–49, http://dx.doi.org/10.1016/j.ymeth.2016.01.020.


[19] P. Rice, I. Longden, A. Bleasby, EMBOSS: the European molecular biology open software suite, Trends Genet. 16 (2000) 276–277.

[20] P. Stothard, The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences, Biotechniques 28 (2000).

[21] W. Huang, L. Li, J.R. Myers, G.T. Marth, ART: a next-generation sequencing read simulator, Bioinformatics 28 (2012) 593–594.

[22] J. Orjuela, F. Sabot, S. Chéron, Y. Vigouroux, H. Adam, H. Chrestin, K. Sanni, M. Lorieux, A. Ghesquière, An extensive analysis of the African rice genetic diversity through a global genotyping, Theor. Appl. Genet. 127 (10) (2014) 2211–2223, http://dx.doi.org/10.1007/s00122-014-2374-z.

[23] FASTX-Toolkit, <http://hannonlab.cshl.edu/fastxtoolkit/index.html/>.

[24] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, J. Wang, SOAP2: an improved ultrafast tool for short read alignment, Bioinformatics 25 (2009) 1966–1967.

[25] Y. Kawahara, M. de la Bastide, J.P. Hamilton, H. Kanamori, W.R. McCombie, S. Ouyang, D.C. Schwartz, T. Tanaka, J. Wu, S. Zhou, et al., Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data, Rice 6 (2013) 4.

[26] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics 26 (2010) 589–595.

[27] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The sequence alignment/map format and SAMtools, Bioinformatics 25 (2009) 2078–2079.

[28] Picard Tools – By Broad Institute.

[29] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res. 20 (2010) 1297–1303.

[30] C. Monat, C. Tranchant-Dubreuil, A. Kougbeadjo, C. Farcy, E. Ortega-Abboud, S. Amanzougarene, S. Ravel, M. Agbessi, J. Orjuela-Bouniol, M. Summo, et al., TOGGLE: toolbox for generic NGS analyses, BMC Bioinform. 16 (2015) 374.

[31] A.R. Quinlan, I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26 (2010) 841–842.

[32] P. Cingolani, A. Platts, L.L. Wang, M. Coon, T. Nguyen, L. Wang, S.J. Land, X. Lu, D.M. Ruden, A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff, Fly (Austin) 6 (2012) 80–92.

[33] D.A. Vaughan, B.-R. Lu, N. Tomooka, The evolving story of rice evolution, Plant Sci. 174 (2008) 394–408.

[34] D.A. Vaughan, K. Kadowaki, A. Kaga, N. Tomooka, On the phylogeny and biogeography of the genus Oryza, Breed. Sci. 55 (2005) 113–122.

[35] J. Ma, J.L. Bennetzen, Rapid recent growth and divergence of rice nuclear genomes, Proc. Natl. Acad. Sci. U.S.A. 101 (2004) 12404–12410.

[36] Z. Yang, D. Yodera, Estimation of the transition/transversion rate bias and species sampling, J. Mol. Evol. 48 (1999) 274–283.

[37] S. Duchêne, S. Ho, E.C. Holmes, T. Jukes, C. Cantor, W. Brown, E. Prager, A. Wang, A. Wilson, R. Lewontin, et al., Declining transition/transversion ratios through time reveal limitations to the accuracy of nucleotide substitution models, BMC Evol. Biol. 15 (2015) 36.

[38] G. Djedatin, M.-N. Ndjiondjop, A. Sanni, M. Lorieux, V. Verdier, A. Ghesquiere, Identification of novel major and minor QTLs associated with Xanthomonas oryzae pv. oryzae (African strains) resistance in rice (Oryza sativa L.), Rice (N. Y.) 9 (2016) 18.

[39] M. Hutin, F. Sabot, A. Ghesquière, R. Koebnik, B. Szurek, A knowledge-based molecular screen uncovers a broad spectrum OsSWEET14 resistance allele to bacterial blight from wild rice, Plant J. 84 (4) (2015) 694–703, http://dx.doi.org/10.1111/tpj.13042.