
HAL Id: hal-03070294

https://hal.archives-ouvertes.fr/hal-03070294

Submitted on 28 Apr 2021


Distributed under a Creative Commons Attribution 4.0 International License

DuplicationDetector, a light weight tool for duplication

detection using NGS data

G. Djedatin, C. Monat, S. Engelen, François Sabot

To cite this version:

G. Djedatin, C. Monat, S. Engelen, François Sabot. DuplicationDetector, a light weight tool for

duplication detection using NGS data. Current Plant Biology, Elsevier, 2017, Plant Development,

9-10, pp.23-28. ⟨10.1016/j.cpb.2017.07.001⟩. ⟨hal-03070294⟩

Contents lists available at ScienceDirect

Current Plant Biology

journal homepage: www.elsevier.com/locate/cpb

DuplicationDetector, a light weight tool for duplication detection using NGS data

Gustave Djedatin a,b,∗∗, Cécile Monat b,c,d,1, Stefan Engelen e, Francois Sabot b,c,d,∗

a BIOGENOM Laboratory, FAST/DASSA, BP 14 Dassa-Zoumé, Benin

b DIADE UMR IRD/UM – Centre IRD de Montpellier, 911 av Agropolis BP 604501, F-34394 Montpellier Cedex 5, France

c South Green Bioinformatics Platform, Agropolis Campus, Montpellier, France

d Université de Montpellier, Place Eugène Bataillon, 34000, Montpellier, France

e Commissariat à l’Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP 5706, F-91057 Evry, France

Article info

Article history: Received 19 May 2017; Received in revised form 18 July 2017; Accepted 19 July 2017

Keywords: Duplication, NGS, Rice

Abstract

Duplications are one of the main evolutionary forces in angiosperms, especially in Poaceae. A large number of genes involved in various metabolisms and pathways originate from such duplications (whole genome, segmental or single gene). However, detecting such duplications may be complicated and costly, and generally requires heavy human and material investments. Here, we propose an alternative approach for detecting putative recent segmental duplications in haploid or diploid homozygous organisms based on NGS data. We rely on abusive mappings of paralogous sequences, which increase the number of apparent heterozygous points at a given locus, to identify such duplicated genomic regions. We tested our tool on simulated data, then on true rice genomic sequences, and were able to identify about 200 candidate duplicated genes in the African rice (Oryza glaberrima) lineage compared to the Asian one (O. sativa).

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Duplication is an important feature of plant genome architecture, and can involve a single gene, a chromosome segment, an entire chromosome or even the whole genome [1]. It was shown, for instance, that angiosperms underwent large-scale duplications and multiple whole genome duplications all along their evolution [2]. Whole genome duplication, i.e. doubling the amount of the complete genetic material of an individual without crossing, appears as a source of evolution and biological complexity [1,3,4]. In the same way, segmental duplications create local variations offering new opportunities for natural selection to occur [5]. Therefore, gene and genomic duplications play an important role in the evolution of plant phenotypes [6], and duplicated genes can follow different fates: (i) neofunctionalization, i.e. retention of both divergent copies but with a new function for one of them; (ii) subfunctionalization, i.e. retention of both copies with conserved function but in another tissue/organ/time frame for one; or (iii) nonfunctionalization/pseudogenization, i.e. accumulation of a large number of mutations in one of the copies [1]. The first two cases (neo- and subfunctionalization) may lead to a new expression pattern, or even a new regulatory pathway [7].

☆ This article is part of a special issue entitled “Plant Development”.

∗ Corresponding author at: DIADE UMR IRD/UM – Centre IRD de Montpellier, 911 av Agropolis BP 604501, F-34394 Montpellier Cedex 5, France.

∗∗ Corresponding author at: BIOGENOM Laboratory, FAST/DASSA, BP 14 Dassa-Zoumé, Benin.

E-mail addresses: djedatingustave@yahoo.fr (G. Djedatin), francois.sabot@ird.fr (F. Sabot).

1 Current address: Domestication Genomics Group, IPK Gatersleben, OT Gatersleben, Corrensstrasse 3, D-06466 Seeland, Germany.

In cultivated Asian rice (Oryza sativa), for instance, genome duplication provided improved root resistance [8], seed germination and seedling growth under salt stress [8,9]. In addition, tandem duplications were evidenced, amplifying adaptively important resistance genes encoding membrane proteins and functions related to abiotic and biotic stress [10]. Similarly, segmental and tandem duplications led to the duplication of HAP (Heterotrimeric Heme Activator Protein) genes regulating rice heading date [11].

However, detecting genome or segmental duplications is a complex task. Different approaches and techniques are used, such as (time-consuming) molecular techniques, e.g. comparative genome hybridization (CGH) [12], FISH, and array CGH [13]. Recently, owing to the availability of Next Generation Sequencing technologies and their low cost [14–16], more computational sequencing-based approaches were developed (e.g. [17]). These methods rely mainly on Depth of Coverage (DoC) variations to identify duplications relative to a reference genome, as duplicated regions are expected to be sequenced twice as deeply as non-duplicated regions [18]. However, molecular approaches and DoC methods require highly precise experiments, repetitions and heavy computational times, are designed to compare one target individual to a given reference, and thus cannot be applied to a large number of individuals.

In this study, we propose a new methodology based on the apparent excess of heterozygous loci (AEH) over genomic intervals in autogamous diploid species. This methodology was implemented in a tool called DuplicationDetector, which was tested on a set of simulated data and shown to be fast and robust. In addition, we applied it to cultivated and wild African rices, respectively Oryza glaberrima and O. barthii, to detect duplicated genes compared to the Asian rice Oryza sativa.

2. Materials and methods

2.1. Material

2.1.1. Simulated sequence data for validation

A fragment of chromosome 7 (1 Mb) of Oryza sativa ssp japonica cv Nipponbare IRGSP 1.0 was extracted from position 1,000,000 to 1,999,999 using the extractseq tool from EMBOSS [19]. Three duplications (namely duplications 1–3) were artificially created within this sequence using a home-made Perl script (available on demand), respectively at positions 300–1500; 350,000–390,000 (containing another initial duplication) and 800,000–810,000. A total of nine virtual ‘clones’ were constructed. Clone 1, without duplicated sequences, was considered identical to the reference. Clone 2 has the three duplicated sequences without mutation. Clones 4–6 have the duplicated sequences and mutations with 3% of supplementary divergence, while clones 7–9 have the duplicated sequences with the same mutations and 6% of divergence. In addition, we added to clones 3–9 common mutations in the non-duplicated sequence. All mutations were induced using the Mutate DNA tool from the SMS2 suite [20]. The sequences from each virtual clone were then used to simulate FASTQ data using the ART simulation tool [21], specifying as options a HiSeq 2500 machine, 100 bp paired-end fragments, a depth of 35, an insert size of 200 and 10% of insert size divergence. ART includes an empirical error model that allows a very good simulation of sequencing data [21].

2.1.2. Sequence data for Oryza glaberrima and O. barthii and initial quality control

Eight accessions of African cultivated rice Oryza glaberrima (TOG5307, TOG5307f, TOG5321, TOG5666, TOG5887, TOG7291, UB06, UG26), and six wild relatives Oryza barthii (B88, IG05, IRGC106302, MB323, TB41, TG57) were used in this study (see [22] for more information about those accessions). Asian cultivated rice O. sativa IR64 (ssp indica) and Azucena (ssp japonica) were also included as controls. All samples were sequenced at Genoscope (France), in the frame of the IRIGIN project (http://irigin.org), as follows:

Sequencing: Libraries were prepared using the NEBNext DNA Modules Products (New England Biolabs, MA, USA) with an ‘on beads’ protocol developed at the Genoscope, thus reducing the costs and increasing the yields. Briefly, after gDNA fragmentation with the E210 Covaris instrument (Covaris, Inc., USA), end repair, A-tailing and ligation with adapted concentrations of Nextflex DNA barcodes (Bioo Scientific, Austin, TX) were performed on the same AMPure XP beads that were used for the first purification after end repair. After two consecutive 1x AMPure XP clean-ups, the ligated product was amplified by 12 cycles of PCR using the Kapa Hifi Hotstart NGS library Amplification kit (Kapa Biosystems, Wilmington, MA), followed by a 0.6x AMPure XP purification. Library traces were validated on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and quantified by qPCR using the KAPA Library Quantification Kit (Kapa Biosystems) on a MxPro instrument (Agilent Technologies, USA). Libraries were sequenced on an Illumina HiSeq 2000 or HiSeq 4000 instrument (Illumina, USA), at 2 × 101 bp or 2 × 151 bp, respectively. About 50 billion useful paired-end reads were obtained per run.

QC and initial treatments: Low-quality clusters were filtered during the sequencing run by the Real Time Analysis (RTA) software. Filtering steps were then performed on whole paired FASTQ files: Illumina adapters and primers were removed, nucleotides with a quality value lower than 20 were trimmed from both ends, and sequences between the second unknown nucleotide (N) and the end of the read were trimmed. Reads shorter than 30 nucleotides after trimming were discarded. These trimming steps were achieved using fastxtend (http://www.genoscope.cns.fr/fastxtend/), a software based on the FASTX library [23]. The filtered reads and their mates that mapped onto run quality control sequences (PhiX genome) were removed using SOAP aligner [24].
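The trimming rules described above can be illustrated with a minimal Python sketch. It is not the fastxtend implementation: the function name `trim_read`, the representation of qualities as a list of integers, and the choice to drop the second N itself are assumptions made here for illustration only. Adapter removal is not covered.

```python
def trim_read(seq, quals, min_q=20, min_len=30):
    """Return (seq, quals) after trimming, or None if the read is discarded.

    seq   : read sequence (str)
    quals : per-base quality values (list of int), same length as seq
    """
    # 1) Trim bases with quality < min_q from both ends.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # 2) Cut the read at the second unknown nucleotide (N), if any.
    #    Assumption: the second N itself is removed with the tail.
    first_n = seq.find("N")
    if first_n != -1:
        second_n = seq.find("N", first_n + 1)
        if second_n != -1:
            seq, quals = seq[:second_n], quals[:second_n]

    # 3) Discard reads that became too short.
    if len(seq) < min_len:
        return None
    return seq, quals
```

For paired data, a pair would typically be dropped or handled specially when one mate is discarded; that policy is not stated in the text and is left out of the sketch.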

2.1.3. Reference sequence & annotation

The reference genome sequence IRGSP-1.0/MSU7.0 and its annotation from MSU v7 [25] were used for analysis as described above. The initial VCF (Variant Call Format) files are available at http://bioinfo-storage.ird.fr/2017/CPB/Djedatin/.

2.2. Methods

2.2.1. Mapping approach & initial SNP calling

For VCF creation, cleaned paired FASTQ data were mapped upon the initial reference sequence using BWA 0.7.12 (aln/sampe legacy algorithm) [26]. SAM (Sequence Alignment/Map) files were cleaned and filtered for low-quality and abnormal mappings, merged and realigned using a combination of SAMtools [27] and Picard Tools [28]. After realignment, SNPs were called using the GATK HaplotypeCaller [29] under standard conditions. Calling was performed per individual chromosome to optimize calculation time. All steps were performed using the TOGGLE pipeline [30] to ensure repeatability and traceability. The standard default values were chosen respecting two criteria: (i) the IRIC/3K genomes standards for mapping/calling and (ii) numerous homemade tests and evaluations of conditions using control samples in different analyses (such as in Monat et al. [30], GBE) that provided the best results. Detailed options are shown in SuppData, as well as the TOGGLE configuration file.

2.2.2. Heterozygous SNP recovery

VCF files were filtered to recover the lines containing heterozygous SNPs, based on standard default filters (depth for each sample, maximum number of missing data, minimal calling quality value, maximum MQ0 value, homozygous controls).
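As an illustration of this filtering step, the sketch below keeps a VCF record only if enough non-control samples are heterozygous at sufficient depth, no control is heterozygous, and the MQ0 count is low enough. It is a simplified Python sketch, not the DuplicationDetector code: the function names are hypothetical, the presence of `GT`/`DP` in the FORMAT column and of `MQ0` in the INFO column is assumed, and the missing-data and calling-quality filters are omitted. The default thresholds are those reported in the Results section (depth 30, 10 heterozygous individuals, MQ0 of 0).

```python
def parse_record(vcf_line, sample_names):
    """Split one VCF data line into (chrom, pos, info dict, per-sample calls)."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    keys = fields[8].split(":")
    calls = {name: dict(zip(keys, raw.split(":")))
             for name, raw in zip(sample_names, fields[9:])}
    return fields[0], int(fields[1]), info, calls


def is_het(gt):
    """True for a diploid genotype with two different, non-missing alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def keep_site(info, calls, controls, min_depth=30, min_het=10, max_mq0=0):
    """Apply the depth / heterozygous-count / MQ0 / control filters to one site."""
    if int(info.get("MQ0", 0)) > max_mq0:
        return False
    n_het = 0
    for name, call in calls.items():
        gt = call.get("GT", "./.")
        if name in controls:
            if is_het(gt):          # a heterozygous control rejects the site
                return False
            continue
        dp = call.get("DP", ".")
        if is_het(gt) and dp.isdigit() and int(dp) >= min_depth:
            n_het += 1
    return n_het >= min_het
```

Counting only heterozygous calls that also pass the depth threshold is one possible reading of "minimal depth per individual"; rejecting the whole site when any sample is below the threshold would be an equally plausible variant.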

2.2.3. Genomic intervals recomposition

Extracted VCF lines were recompiled into genomic intervals respecting a specified maximal distance between two SNPs to be considered as related, a minimal block size, and a minimal heterozygous SNP density. The resulting files are 3-column BED-like files.
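The recomposition step can be sketched as a single pass over the sorted positions of the retained heterozygous SNPs on one chromosome. The Python function below is a hypothetical illustration, not the tool's code. In particular, the density criterion is interpreted here as "interval length divided by number of SNPs must not exceed 25 bases", which is an assumption; the defaults (1 kb, 100 bp, 25 bases) are those reported in the Results section.

```python
def close_block(block, intervals, min_size, max_bases_per_snp):
    """Append the block as (start, end, n_snps) if it passes size and density."""
    if not block:
        return
    start, end = block[0], block[-1]
    size = end - start + 1
    if size >= min_size and size / len(block) <= max_bases_per_snp:
        intervals.append((start, end, len(block)))


def build_intervals(positions, max_gap=1000, min_size=100, max_bases_per_snp=25):
    """Group SNP positions of one chromosome into candidate AEH intervals."""
    intervals = []
    block = []
    for pos in sorted(set(positions)):
        # A gap larger than max_gap ends the current block.
        if block and pos - block[-1] > max_gap:
            close_block(block, intervals, min_size, max_bases_per_snp)
            block = []
        block.append(pos)
    close_block(block, intervals, min_size, max_bases_per_snp)
    return intervals
```

With 20 SNPs spaced every 10 bases, the block is kept; two SNPs 600 bases apart form a block that is large enough but too sparse, and two SNPs 10 bases apart form a block that is too short.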

2.2.4. Duplicated genes identification and SNP potential effect identification

Genomic interval files were crossed with the GFF file containing the annotation using intersectBed from the BEDtools suite [31]. Selected heterozygous SNPs were then annotated for their potential effect using the snpEff software [32]. A schematic view of the whole pipeline is detailed in SuppData.
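The study performs this crossing with intersectBed; the pure-Python sketch below only illustrates the underlying overlap test and does not reproduce the BEDtools behaviour or options. The function name, the use of 1-based inclusive coordinates for the intervals, and the gene identifiers in the usage note are assumptions introduced for illustration.

```python
def genes_overlapping(gff_lines, intervals):
    """Return the attribute strings of 'gene' features overlapping any interval.

    gff_lines : iterable of GFF lines (9 tab-separated columns)
    intervals : list of (chrom, start, end), 1-based inclusive (assumption)
    """
    hits = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        chrom, g_start, g_end = fields[0], int(fields[3]), int(fields[4])
        for i_chrom, i_start, i_end in intervals:
            # Two closed ranges overlap when each starts before the other ends.
            if i_chrom == chrom and i_start <= g_end and i_end >= g_start:
                hits.add(fields[8])
                break
    return sorted(hits)
```

For example, with a hypothetical gene `ID=geneA` annotated on Chr1 from 1000 to 2000, the interval `("Chr1", 1900, 2100)` reports that gene, while an interval on another chromosome reports nothing.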

Fig. 1. Apparent Excess of Heterozygous will appear if reads coming from a duplicated region are abusively mapped on a reference genome without the duplication.

2.3. Availability

All codes, installation instructions and manual for DuplicationDetector are available, under the GPLv3/CeCILL-B double licenses, on the GitHub of the project: https://github.com/SouthGreenPlatform/duplicationDetector.

3. Results

3.1. Description of the approach

We rely on AEH genomic intervals to detect duplicated sequences (Fig. 1). Basically, we detect abusive mapping of reads coming from duplicated regions in a sequenced individual when they are mapped on a reference genome without the duplication (i.e. harboring only a single copy). Such abusive mapping will lead the SNP calling to produce an Apparent Excess of Heterozygous loci, i.e. too many heterozygous loci in a short region. If many individuals are sequenced and mapped in the same experiment, such AEH loci will appear for (almost) each individual at the same location, thus indicating that the region is duplicated in the sequenced genomes compared to the reference one. In DuplicationDetector, users can tune the level of stringency for the selection of AEH loci using different criteria:

• Minimal depth per individual (default at 30)

• Minimal number of individuals to be heterozygous for a point to be chosen (default at 10)

• Maximal number of MQ0 reads (that can be mapped at two positions with identical score; default at 0)

• Control individuals (some samples that may not be heterozygous)

The genomic interval creation can also be set up using the following criteria:

• Maximum size between two heterozygous loci (default at 1 kb)

• Minimum size of the genomic interval (default at 100 bp)

• Minimum density of heterozygous loci in the genomic interval (default at 25 bases between each SNP).

If the user provides a GFF file for gene annotation, duplicated genes will be identified through their overlap with the identified genomic intervals.

In terms of speed, a complete scan from a raw VCF of 16 African rice individuals (12 chromosomes, ∼380 Mb) at high sequencing depth (∼35x, see Materials and methods), with two Asian rice individuals as controls, is performed on a single recent 64-bit core in less than 2 h.

3.2. Results on simulated data

On simulated data, we were able to identify duplication 1 partially (∼40%) as a single block and duplication 3 almost entirely (∼95%) as fragmented blocks (Table 1). We were able to delimit those two duplications with a quite good resolution, i.e. the capacity to correctly identify the borders (max 500 bp of error in delimitation; see Table 1). The difference in recovery level between the two duplications is mainly due to their size: for duplication 1 (1.2 kb), the non-recovered fragment is 193 bp in 5′ and 522 bp in 3′ out of 1200 bp, while for duplication 3 the non-recovered size is 142/355 bp. The fragmentation effect may be due to the fact that duplication 3 is a large block with a low variation density, while the AEH loci density parameter is quite conservative.
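The recovery percentages can be reproduced from the simulated and recovered coordinates listed in Table 1. The short Python sketch below is only a worked check of that arithmetic; the function names are introduced here for illustration.

```python
def recovery_percent(start, stop, rec_start, rec_stop):
    """Share of the simulated duplication spanned by the recovered interval."""
    return 100.0 * (rec_stop - rec_start) / (stop - start)


def missed_flanks(start, stop, rec_start, rec_stop):
    """Non-recovered lengths on the 5' and 3' sides, in bp."""
    return rec_start - start, stop - rec_stop


# Duplication 1: simulated 300-1500, recovered 493-978
dup1 = round(recovery_percent(300, 1500, 493, 978), 2)           # 40.42
flanks1 = missed_flanks(300, 1500, 493, 978)                     # (193, 522)

# Duplication 3: simulated 800,000-810,000, recovered 800,142-809,675
dup3 = round(recovery_percent(800000, 810000, 800142, 809675), 2)  # 95.33
```

For duplication 1, 485 recovered bases out of 1200 give 40.42%, with 193 bp missed in 5′ and 522 bp in 3′, as stated in the text.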

Duplication 2 was not detected, as it contains another, older (non-simulated) nested duplication, and AEH loci in this region were removed based on the maximum MQ0 parameter. Indeed, reads coming from a region already duplicated in the reference genome can be mapped on either of the two copies on the reference with the same probability. This will increase the MQ0 level, and thus those regions will be filtered out with the current version of DuplicationDetector.

No false positive regions were obtained from the simulated data.

Table 1
Statistics about simulated data. Positions are given in base pairs.

Duplication  Start    Stop     Start (recovered)  Stop (recovered)  Resolution in bp / Nb blocks  Recovery
1            300      1500     493                978               193 <-> 522 / 1 block         40.42%
2            350,000  390,000  NA                 NA                NA                            NA
3            800,000  810,000  800,142            809,675           142 <-> 355 / 8 blocks        95.33%

Table 2
Statistics about duplicated regions in African rices compared to Asian. Sizes are given in base pairs.

        Block size            AEH loci            AEH loci freq
        Min    Mean   Max     Min   Mean   Max    Min     Mean    Max
Chr1    142    795    2770    9     37     111    9.33    10.99   13.03
Chr2    106    623    2031    5     31     106    8.58    10.75   12.75
Chr3    110    293    825     5     15     40     8.78    10.67   13.3
Chr4    103    794    2079    6     40     132    8.33    10.31   13
Chr5    119    719    2668    6     37     113    9.62    11.07   13.25
Chr6    107    1057   3560    7     50     169    8.44    11.02   13.15
Chr7    147    903    3198    7     49     154    8.94    11.19   13.17
Chr8    111    925    3336    8     44     158    8.1     11.29   13.36
Chr9    127    634    1057    7     35     64     10.07   11.37   13
Chr10   127    937    2817    7     47     170    9.14    11.21   13.29
Chr11   124    716    2055    6     35     92     8.68    10.91   12.89
Chr12   101    744    1804    5     37     85     8.79    10.62   12.05

3.3. Results on experimental data

We then tested our tool on an experimental set of African rice individuals sequenced at relatively high depth (see Materials and methods). African cultivated rice Oryza glaberrima and its wild relative O. barthii diverged from the ancestor of Asian cultivated rice O. sativa almost 1 million years ago [33,34], and may harbor duplicated regions compared to Asian rice. We thus tested 8 cultivated and 6 wild samples to evaluate our tool on real data. To avoid individual effects due to the reference genome (i.e. identification of AEH loci only because of a loss of duplication in Nipponbare), we added as controls two other O. sativa samples, IR64 (indica subspecies) and Azucena (japonica subspecies), sequenced as described in 2.2.1. Those two individuals must be homozygous and similar to the reference for an AEH locus in the African samples to be validated as such.

The whole analysis on the raw VCF representing variations on all 12 chromosomes of the Nipponbare reference took less than 2 h using a single core, with a very small memory footprint (max 2 Mbytes per file), and provided 786 putatively duplicated regions in the African rice lineage relative to the Asian one (see Table 2). From those putatively duplicated regions, we could identify 200 annotated gene features (i.e. tagged as ‘gene’ in the MSU7 GFF file; see SuppData). The blocks ranged from 102 to 3560 bases, with 5–170 SNPs (11 SNP/kb on average) with AEH in each block (Table 2).

The duplicated regions are widely spread all along the chromosomes, without any main location (Fig. 2 and SuppData). The putatively duplicated genes themselves seem to be partially related to stress response, but no general trend was observed concerning Gene Ontology (data not shown). We observed a mean value of 38 AEH loci per region, showing that these duplications are quite recent, but date from before the radiation between O. glaberrima and O. barthii. Indeed, applying a mean value of 1.3 × 10⁻⁸ mutation per million years (as calculated in [35]), the mean age of the putative duplicated regions is ∼800,000 years, spanning from ∼605,000 to 972,000 years, as expected from the separation of the two groups [33,34].

Detailed analysis of the mutations induced by the AEH loci in duplicated genes showed that most of the mutations occurred outside of the gene coding regions (only 7–20% in exonic sequences; SuppData). For exonic mutations, a mean Ka/Ks ratio of 1.53 ± 0.45 (1.01–2.75) was observed, indicating a low level of global divergent selection. The mean observed Transition vs Transversion ratio is around 2 (1.7–2.3), as expected for genomic mutations [36,37].

Fig. 2. Chromosomal location of duplications in African rices compared to Asian on Chromosome 1. In red are symbolized duplications involving annotated genes, in green regions without annotated genes.
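For readers unfamiliar with the transition/transversion ratio, the following Python sketch shows how it can be computed from a list of biallelic SNPs given as (reference, alternate) base pairs. It is a generic illustration with hypothetical function names, not the procedure used in the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps two purines (A<->G) or two pyrimidines (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) single-base SNPs."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


example = [("A", "G"), ("C", "T"), ("A", "C")]
ratio = ts_tv_ratio(example)  # two transitions, one transversion
```

The three example SNPs are made up; they contain two transitions and one transversion, hence a ratio of 2.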

When focusing on individual genes, we were able to identify as putatively duplicated the LOC_Os07g09900 gene (Chromosome 7, from 5,263,409 to 5,267,310), Disease resistance protein RPM1, located under a major QTL of resistance to African strains of Xanthomonas (qABB-7, from [38]). As has been shown in other plant species, duplication of resistance genes may increase disease resistance.

3.4. Limits of the approach

In a recent paper, Hutin et al. [39] showed a recent partial duplication of 3.2 kb of the LOC_Os11g31190 gene OsSWEET14 (Chromosome 11, from 18,171,678 to 18,174,478), involved in an increased resistance to Xanthomonas oryzae pv oryzae. A set of 12 high-level variant positions, including the 18 bp deletion between the original copy and the duplicated one [39], was identified in the variant selection step. However, post-filtrations (especially the minimal SNP density, here of 177 bp between two SNPs instead of 25) did not allow this duplicated block to be recovered in our current test.

4. Discussion

Detection of duplicated regions between individuals is a challenging task using molecular tools, and a computationally costly task with sequence data. Up to now, use of the latter approach (mainly using NGS) was based on raw mapping, DoC divergence computation and Copy Number Variation (CNV) analyses. Numerous tools implemented such approaches, with more or less efficiency [14–16], but all of them require a long computation time and cannot compare large numbers of individuals.

In the present study, we rely on abusive mapping and the subsequent AEH loci to identify duplicated regions. Moreover, our tool does not require intensive re-calculation or mapping, as it relies directly on the raw VCF data (already generated in numerous genetic analyses) to identify those AEH loci. This approach is fast and can handle large samples; in addition, it offers the possibility to include negative controls, which allow users to identify duplications existing in only a subsample of their sequenced individuals. Our approach is however quite conservative, as shown on simulated data, and will not detect, for instance, new copies of a sequence already repeated in the reference genome. In the same way, it cannot identify too recent duplications, due to the low number of mutations between the two copies (as for OsSWEET14). However, applied to the divergence between African and Asian rices, we were able to identify more than 200 putatively duplicated genes and almost 780 total regions. The detected genes are widely spread all along the chromosomes, generally related to stress response, at least marginally, and under quite low divergent selection.

5. Conclusion

DuplicationDetector is thus a very efficient tool to detect duplications in haploid and highly homozygous diploid organisms, such as rice (tested here), but also bacteria, yeasts, autogamous plants, haploid fungi, and so on. Future development of our tool will include the implementation of detection in heterozygous or polyploid organisms or both, as well as additional criteria for filtering (such as hard-clipping level).

Author’s contribution

GD and FS managed the whole study and wrote the whole pipeline. SE & Genoscope performed the sequencing and initial data treatments and QC. FS performed the simulation, GD and CM performed the basic data analyses, and GD analyzed the results. GD and FS wrote the manuscript, and all authors corrected and approved the current version.

Acknowledgments

GD was supported by an IRD grant (2013–2017 BEST Fellowship). CM was supported by ANR (AfriCrop project #ANR-13-BSV7-0017) and NUMEV labex (LandPanToggle #2015-1-030-LARMANDE). The authors want to thank the Genoscope members for the sequencing of all rice data. This work was supported by the France Génomique French National infrastructure, funded as part of the “Investissement d’avenir” program managed by ANR (#ANR-10-INBS-09), in the frame of the IRIGIN project (http://irigin.org).

Conflict of interest statement

The authors have no conflict of interest.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.cpb.2017.07.001.

References

[1] S. Ohno, The enormous diversity in genome sizes of fish as a reflection of nature’s extensive experiments with gene duplication, Trans. Am. Fish. Soc. 99 (1970) 120–130.

[2] S. De Bodt, S. Maere, Y. Van de Peer, Genome duplication and the origin of angiosperms, Trends Ecol. Evol. 20 (2005) 591–597.

[3] R. Aburomia, O. Khaner, A. Sidow, Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail, in: Genome Evolution, Springer, Netherlands, Dordrecht, 2003, pp. 45–52.

[4] J.S. Taylor, J. Raes, Duplication and divergence: the evolution of new genes and old ideas, Annu. Rev. Genet. 38 (2004) 615–643.

[5] R. Chandan, D. Indra, Gene duplication: a major force in evolution and bio-diversity, Int. J. Biodivers. Conserv. 6 (2014) 41–49.

[6] L.E. Flagel, J.F. Wendel, Gene duplication and evolutionary novelty in plants, New Phytol. 183 (2009) 557–564.

[7] C. Feschotte, Transposable elements and the evolution of regulatory networks, Nat. Rev. Genet. 9 (2008) 397–405.

[8] Y. Tu, A. Jiang, L. Gan, M. Hossain, J. Zhang, B. Peng, Y. Xiong, Z. Song, D. Cai, W. Xu, et al., Genome duplication improves rice root resistance to salt stress, Rice 7 (2014) 15.

[9] A. Jiang, L. Gan, Y. Tu, H. Ma, J. Zhang, Z. Song, Y. He, D. Cai, X. Xue, The effect of genome duplication on seed germination and seedling growth of rice under salt stress, Aust. J. Crop Sci. 7 (2013) 1814–1821.

[10] C. Rizzon, L. Ponger, B.S. Gaut, S. Maere, S. Bodt, J. De Raes, T. Casneuf, M. Montagu, G. Van Blanc, K. Wolfe, et al., Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice, PLoS Comput. Biol. 2 (2006) e115.

[11] Q. Li, W. Yan, H. Chen, C. Tan, Z. Han, W. Yao, G. Li, M. Yuan, Y. Xing, Duplication of OsHAP family genes and their association with heading date in rice, J. Exp. Bot. 67 (2016) 1759–1768.

[12] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Döhner, T. Cremer, P. Lichter, Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances, Genes Chromosom. Cancer 20 (1997) 399–407.

[13] C. Shaw-Smith, R. Redon, L. Rickman, M. Rio, L. Willatt, H. Fiegler, H. Firth, D. Sanlaville, R. Winter, L. Colleaux, et al., Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features, J. Med. Genet. 41 (2004) 241–248.

[14] S. Goodwin, J.D. McPherson, W.R. McCombie, Coming of age: ten years of next-generation sequencing technologies, Nat. Rev. Genet. 17 (2016) 333–351.

[15] T.C. Glenn, Field guide to next-generation DNA sequencers, Mol. Ecol. Resour. 11 (2011) 759–769.

[16] T.C. Glenn, <http://molecularecologist.com//>.

[17] S. Newman, K.E. Hermetz, B. Weckselblatt, M.K. Rudd, Next-generation sequencing of duplication CNVs reveals that most are tandem and some create fusion genes at breakpoints, Am. J. Hum. Genet. 96 (2015) 208–220.

[18] P. Guan, W.-K. Sung, Structural variation detection using next-generation sequencing data: a comparative technical review, Methods 102 (1 June) (2016) 36–49, http://dx.doi.org/10.1016/j.ymeth.2016.01.020.

[19] P. Rice, I. Longden, A. Bleasby, EMBOSS: the European molecular biology open software suite, Trends Genet. 16 (2000) 276–277.

[20] P. Stothard, The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences, Biotechniques 28 (2000).

[21] W. Huang, L. Li, J.R. Myers, G.T. Marth, ART: a next-generation sequencing read simulator, Bioinformatics 28 (2012) 593–594.

[22] J. Orjuela, F. Sabot, S. Chéron, Y. Vigouroux, H. Adam, H. Chrestin, K. Sanni, M. Lorieux, A. Ghesquière, An extensive analysis of the African rice genetic diversity through a global genotyping, Theor. Appl. Genet. 127 (10) (2014) 2211–2223, http://dx.doi.org/10.1007/s00122-014-2374-z.

[23] FASTX-Toolkit, <http://hannonlab.cshl.edu/fastxtoolkit/index.html/>.

[24] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, J. Wang, SOAP2: an improved ultrafast tool for short read alignment, Bioinformatics 25 (2009) 1966–1967.

[25] Y. Kawahara, M. de la Bastide, J.P. Hamilton, H. Kanamori, W.R. McCombie, S. Ouyang, D.C. Schwartz, T. Tanaka, J. Wu, S. Zhou, et al., Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data, Rice 6 (2013) 4.

[26] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics 26 (2010) 589–595.

[27] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The sequence alignment/map format and SAMtools, Bioinformatics 25 (2009) 2078–2079.

[28] Picard Tools – By Broad Institute.

[29] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res. 20 (2010) 1297–1303.

[30] C. Monat, C. Tranchant-Dubreuil, A. Kougbeadjo, C. Farcy, E. Ortega-Abboud, S. Amanzougarene, S. Ravel, M. Agbessi, J. Orjuela-Bouniol, M. Summo, et al., TOGGLE: toolbox for generic NGS analyses, BMC Bioinform. 16 (2015) 374.

[31] A.R. Quinlan, I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26 (2010) 841–842.

[32] P. Cingolani, A. Platts, L.L. Wang, M. Coon, T. Nguyen, L. Wang, S.J. Land, X. Lu, D.M. Ruden, A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff, Fly (Austin) 6 (2012) 80–92.

[33] D.A. Vaughan, B.-R. Lu, N. Tomooka, The evolving story of rice evolution, Plant Sci. 174 (2008) 394–408.

[34] D.A. Vaughan, K. Kadowaki, A. Kaga, N. Tomooka, On the phylogeny and biogeography of the genus Oryza, Breed. Sci. 55 (2005) 113–122.

[35] J. Ma, J.L. Bennetzen, Rapid recent growth and divergence of rice nuclear genomes, Proc. Natl. Acad. Sci. U.S.A. 101 (2004) 12404–12410.

[36] Z. Yang, D. Yodera, Estimation of the transition/transversion rate bias and species sampling, J. Mol. Evol. 48 (1999) 274–283.

[37] S. Duchêne, S. Ho, E.C. Holmes, T. Jukes, C. Cantor, W. Brown, E. Prager, A. Wang, A. Wilson, R. Lewontin, et al., Declining transition/transversion ratios through time reveal limitations to the accuracy of nucleotide substitution models, BMC Evol. Biol. 15 (2015) 36.

[38] G. Djedatin, M.-N. Ndjiondjop, A. Sanni, M. Lorieux, V. Verdier, A. Ghesquiere, Identification of novel major and minor QTLs associated with Xanthomonas oryzae pv. oryzae (African strains) resistance in rice (Oryza sativa L.), Rice (N. Y.) 9 (2016) 18.

[39] M. Hutin, F. Sabot, A. Ghesquière, R. Koebnik, B. Szurek, A knowledge-based molecular screen uncovers a broad spectrum OsSWEET14 resistance allele to bacterial blight from wild rice, Plant J. 84 (4) (2015) 694–703, http://dx.doi.org/10.1111/tpj.13042.
