HAL Id: hal-03070294
https://hal.archives-ouvertes.fr/hal-03070294
Submitted on 28 Apr 2021
Distributed under a Creative Commons Attribution 4.0 International License
DuplicationDetector, a light weight tool for duplication
detection using NGS data
G. Djedatin, C. Monat, S. Engelen, François Sabot
To cite this version:
G. Djedatin, C. Monat, S. Engelen, François Sabot. DuplicationDetector, a light weight tool for
duplication detection using NGS data. Current Plant Biology, Elsevier, 2017, Plant Development,
9-10, pp.23-28. doi:10.1016/j.cpb.2017.07.001. hal-03070294
Current Plant Biology
journal homepage: www.elsevier.com/locate/cpb
DuplicationDetector, a light weight tool for duplication detection using NGS data ☆
Gustave Djedatin a,b,∗∗, Cécile Monat b,c,d,1, Stefan Engelen e, François Sabot b,c,d,∗
a BIOGENOM Laboratory, FAST/DASSA, BP14 Dassa-Zoumé, Benin
b DIADE UMR IRD/UM – Centre IRD de Montpellier, 911 av Agropolis BP604501, F-34394 Montpellier Cedex 5, France
c South Green Bioinformatics Platform, Agropolis Campus, Montpellier, France
d Université de Montpellier, Place Eugène Bataillon, 34000, Montpellier, France
e Commissariat à l’Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, F-91057 Evry, France
Article info
Article history: Received 19 May 2017; Received in revised form 18 July 2017; Accepted 19 July 2017
Keywords: Duplication, NGS, Rice
Abstract
Duplications are one of the main evolutionary forces in angiosperms, especially in Poaceae. A large number of genes involved in various metabolisms and pathways originate from such duplications (whole genome, segmental or single gene). However, detecting such duplications may be complicated and costly, and generally requires heavy human and material investments. Here, we propose an alternative approach for detecting putative recent segmental duplications in haploid or diploid homozygous organisms based on NGS data. We rely on abusive mappings of paralogous sequences, which increase the number of apparent heterozygous points at a given locus, to identify such duplicated genomic regions. We tested our tool on simulated data, then on true rice genomic sequences, and were able to identify about 200 candidate duplicated genes in the African rice (Oryza glaberrima) lineage compared to the Asian one (O. sativa).
© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Duplication is an important feature of plant genome architecture, and can involve a single gene, a chromosome segment, an entire chromosome or even the whole genome [1]. It was shown, for instance, that angiosperms underwent large-scale duplications and multiple whole genome duplications throughout their evolution [2]. Whole genome duplication, i.e. doubling the complete genetic material of an individual without crossing, appears as a source of evolution and biological complexity [1,3,4]. In the same way, segmental duplications create local variations offering new opportunities for natural selection to occur [5]. Therefore, gene and genomic duplications play an important role in the evolution of plant phenotypes [6], and duplicated genes can follow different fates: (i) neofunctionalization (retention of both divergent copies, with a new function for one of them), (ii) subfunctionalization (retention of both copies with conserved function, but in another tissue/organ/time frame for one of them), or (iii) nonfunctionalization/pseudogenization (accumulation of a large number of mutations in one of the copies) [1]. The first two cases (neo- and subfunctionalization) may lead to a new expression pattern, or even a new regulatory pathway [7].
☆ This article is part of a special issue entitled “Plant Development”.
∗ Corresponding author at: DIADE UMR IRD/UM – Centre IRD de Montpellier, 911 av Agropolis BP604501, F-34394 Montpellier Cedex 5, France.
∗∗ Corresponding author at: BIOGENOM Laboratory, FAST/DASSA, BP14 Dassa-Zoumé, Benin.
E-mail addresses: djedatingustave@yahoo.fr (G. Djedatin), francois.sabot@ird.fr (F. Sabot).
1 Current address: Domestication Genomics Group, IPK Gatersleben, OT Gatersleben, Corrensstrasse 3, D-06466 Seeland, Germany.
In cultivated Asian rice (Oryza sativa), for instance, genome duplication improved root resistance [8], seed germination and seedling growth under salt stress [8,9]. In addition, tandem duplications were evidenced, amplifying adaptively important resistance genes encoding membrane proteins and functions related to abiotic and biotic stress [10]. Likewise, segmental and tandem duplications led to the duplication of HAP (Heterotrimeric Heme Activator Protein) genes regulating rice heading date [11].
However, detecting genome or segmental duplications is a complex task. Different approaches and techniques are used, such as (time-consuming) molecular techniques, e.g. comparative genome hybridization (CGH) [12], FISH, and array CGH [13]. Recently, owing to the availability of Next Generation Sequencing technologies and their low cost [14–16], more computational sequencing-based approaches were developed (e.g. [17]). These methods rely mainly on Depth of Coverage (DoC) variations to identify duplications relative to a reference genome, as a duplicated region is expected to be sequenced twice as much as non-duplicated regions [18]. However, molecular approaches and DoC methods require highly precise experiments, repetitions and heavy computation times, are designed to compare one target individual to a given reference, and thus cannot be applied to a large number of individuals.
https://doi.org/10.1016/j.cpb.2017.07.001
G. Djedatin et al. / Current Plant Biology 9–10 (2017) 23–28
In this study, we propose a new methodology based on the use of apparent excess of heterozygous loci (AEH) over genomic intervals in autogamous diploid species. This methodology was implemented in a tool called DuplicationDetector, which was tested on a set of simulated data and shown to be fast and robust. In addition, we applied it to cultivated and wild African rices, respectively Oryza glaberrima and O. barthii, to detect duplicated genes compared to the Asian rice Oryza sativa.
2. Materials and methods
2.1. Material
2.1.1. Simulated sequence data for validation
A fragment of chromosome 7 (1 Mb) of Oryza sativa ssp japonica cv Nipponbare IRGSP 1.0 was extracted from position 1,000,000 to 1,999,999 using the extractseq tool from EMBOSS [19]. Three duplications (namely duplications 1–3) were artificially created within this sequence using a home-made Perl script (available on demand), respectively at positions 300–1500, 350,000–390,000 (containing another initial duplication) and 800,000–810,000. A total of nine virtual ‘clones’ were constructed. Clone 1, without duplicated sequences, was considered identical to the reference. Clone 2 has the three duplicated sequences without mutation. Clones 4–6 have the duplicated sequences and mutations with 3% of supplementary divergence, while clones 7–9 have the duplicated sequences with the same mutations and 6% of divergence. In addition, we added common mutations in the non-duplicated sequence of clones 3–9. All mutations were induced using the Mutate DNA tool from the SMS2 suite [20]. The sequences from each virtual clone were then used to simulate FASTQ data using the ART simulation tool [21], specifying as options HiSeq 2500 machine, 100 bp paired-end fragments, depth of 35, insert size of 200, and 10% of insert size divergence. ART includes an empirical error model that allows a very good simulation of sequencing data [21].
2.1.2. Sequence data for Oryza glaberrima and O. barthii and initial quality control
Eight accessions of African cultivated rice Oryza glaberrima (TOG5307, TOG5307f, TOG5321, TOG5666, TOG5887, TOG7291, UB06, UG26) and six wild relatives Oryza barthii (B88, IG05, IRGC106302, MB323, TB41, TG57) were used in this study (see [22] for more information about those accessions). Asian cultivated rice O. sativa IR64 (ssp indica) and Azucena (ssp japonica) were also included as controls. All samples were sequenced at Genoscope (France), in the frame of the IRIGIN project (http://irigin.org), as follows:
Sequencing: Libraries were prepared using the NEBNext DNA Modules Products (New England Biolabs, MA, USA) with an ‘on beads’ protocol developed at the Genoscope, thus reducing the costs and increasing the yields. Briefly, after gDNA fragmentation with the E210 Covaris instrument (Covaris, Inc., USA), end repair, A-tailing and ligation with adapted concentrations of Nextflex DNA barcodes (Bioo Scientific, Austin, TX) were performed on the same AMPure XP beads that were used for the first purification after end repair. After two consecutive 1x AMPure XP cleanups, the ligated product was amplified by 12 cycles of PCR using the Kapa Hifi Hotstart NGS library Amplification kit (Kapa Biosystems, Wilmington, MA), followed by a 0.6x AMPure XP purification. Library traces were validated on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and quantified by qPCR using the KAPA Library Quantification Kit (Kapa Biosystems) on a MxPro instrument (Agilent Technologies, USA). Libraries were sequenced on an Illumina HiSeq2000 or HiSeq4000 instrument (Illumina, USA), at 2×101 bp or 2×151 bp, respectively. About 50 billion useful paired-end reads were obtained per run.
QC and initial treatments: Low quality clusters were filtered during the sequencing run by the Real Time Analysis (RTA) software. Filtering steps were performed on whole paired FASTQ files: Illumina adapters and primers were removed, nucleotides with a quality value lower than 20 were trimmed from both ends, and sequences between the second unknown nucleotide (N) and the end of the read were trimmed. Reads shorter than 30 nucleotides after trimming were discarded. These trimming steps were achieved using fastxtend (http://www.genoscope.cns.fr/fastxtend/), a software based on the FASTX library [23]. The filtered reads and their mates that mapped onto run quality control sequences (PhiX genome) were removed using SOAP aligner [24].
2.1.3. Reference sequence & annotation
The reference sequenced genome IRGSP-1.0/MSU7.0 and its annotation from MSUv7 [25] were used for analysis as described above. The initial VCF (Variant Call Format) files are available at http://bioinfo-storage.ird.fr/2017/CPB/Djedatin/.
2.2. Methods
2.2.1. Mapping approach & initial SNP calling
For VCF creation, cleaned paired FASTQ data were mapped onto the reference sequence using BWA 0.7.12 (aln/sampe legacy algorithm) [26]. SAM (Sequence Alignment/Map) files were cleaned and filtered for low quality mapping and abnormal mapping, merged and realigned using a combination of SAMtools [27] and Picard Tools [28]. After realignment, SNPs were called using the GATK HaplotypeCaller [29] under standard conditions. Calling was performed per individual chromosome to optimize calculation time. All steps were performed using the TOGGLE pipeline [30] to ensure repeatability and traceability. The standard default values were chosen according to two criteria: (i) the IRIC/3K genomes standards for mapping/calling and (ii) numerous homemade tests and evaluations of conditions using control samples in different analyses (such as in Monat et al. [30], GBE) that provided the best results. Detailed options are shown in SuppData, as well as the TOGGLE configuration file.
2.2.2. Heterozygous SNP recovery
VCF files were filtered to recover the lines containing heterozygous SNPs, based on standard default filters (depth for each sample, maximum number of missing data, minimal calling quality value, maximum MQ0 value, homozygous controls).
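As an illustration only, the filters listed above could be expressed as follows. This is a minimal Python sketch and not the DuplicationDetector code itself: the function names `is_het` and `keep_line`, the parameter names, and the choice of reading `MQ0` from the INFO column and `GT`/`DP` from the sample columns are assumptions. The missing-data and calling-quality filters are omitted for brevity.

```python
def is_het(gt):
    """Return True for a called diploid genotype with two different alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def keep_line(line, samples, controls, min_depth=30, min_het=10, max_mq0=0):
    """Decide whether one VCF data line is a candidate AEH locus.

    samples  : sample names, in the order of the VCF sample columns
    controls : names of samples that must be homozygous for the reference
    """
    fields = line.rstrip("\n").split("\t")
    # INFO column (8th): key=value pairs separated by ';'
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    if int(info.get("MQ0", 0)) > max_mq0:
        return False
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
    n_het = 0
    for name, raw in zip(samples, fields[9:]):
        call = dict(zip(keys, raw.split(":")))
        gt = call.get("GT", "./.")
        if name in controls:
            # Assumption: a control must be called homozygous reference
            if gt.replace("|", "/") != "0/0":
                return False
            continue
        dp = call.get("DP", ".")
        if is_het(gt) and dp.isdigit() and int(dp) >= min_depth:
            n_het += 1
    return n_het >= min_het
```

In such a sketch, a locus is discarded as soon as one control is not homozygous for the reference allele, which mirrors the role of the control samples described in this section.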
2.2.3. Genomic intervals recomposition
Extracted VCF lines were recompiled into genomic intervals according to a specified maximal distance between two SNPs for them to be considered as related, a minimal block size, and a minimal heterozygous SNP density. Resulting files are 3-column BED-like files.
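The grouping of retained SNP positions into intervals can be sketched as below. This is a hypothetical Python illustration, not the tool's implementation: the function name `build_intervals` and its parameters are invented for the example, and the density criterion is interpreted here as a maximum mean number of bases per SNP within the block, which is an assumption.

```python
def build_intervals(chrom, positions, max_gap=1000, min_size=100, max_spacing=25):
    """Group SNP positions of one chromosome into candidate intervals.

    max_gap     : maximal distance between two consecutive SNPs of a block
    min_size    : minimal block size, in bases
    max_spacing : maximal mean number of bases per SNP (assumed density rule)
    Returns a list of (chrom, start, end) tuples, using the input coordinates.
    """
    intervals = []
    block = []

    def flush():
        # Keep the current block only if it passes the size and density rules
        if not block:
            return
        size = block[-1] - block[0] + 1
        if size >= min_size and size / len(block) <= max_spacing:
            intervals.append((chrom, block[0], block[-1]))

    for pos in sorted(positions):
        if block and pos - block[-1] > max_gap:
            flush()
            block.clear()
        block.append(pos)
    flush()
    return intervals


# Eleven SNPs spaced by 10 bases form one block; the isolated SNP is dropped.
print(build_intervals("Chr1", list(range(100, 201, 10)) + [5000]))
# [('Chr1', 100, 200)]
```

Each returned tuple corresponds to one line of a 3-column BED-like file. Note that true BED coordinates are 0-based and half-open, a conversion this sketch does not perform.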
2.2.4. Duplicated genes identification and SNP potential effect identification
Genomic interval files were crossed with the GFF file containing the annotation using intersectBed from the BEDtools suite [31]. Selected heterozygous SNPs were then annotated for their potential effect using the snpEff software [32]. A schematic view of the whole pipeline is detailed in SuppData.
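The crossing step itself is delegated to intersectBed. For readers who want to see the underlying logic, a pure-Python stand-in is sketched below. It is not the BEDtools implementation: the function name `genes_overlapping`, the GFF source field and the interval coordinates in the usage example are hypothetical. Only the gene coordinates come from the Results section.

```python
def genes_overlapping(intervals, gff_lines):
    """Report 'gene' features of a GFF that overlap candidate intervals.

    intervals : iterable of (chrom, start, end), 1-based inclusive
    gff_lines : iterable of GFF text lines
    Returns a list of (attributes, chrom, interval_start, interval_end).
    """
    hits = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "gene":
            continue
        chrom, start, end = f[0], int(f[3]), int(f[4])
        for c, s, e in intervals:
            # Two closed ranges overlap when each one starts before the other ends
            if c == chrom and s <= end and start <= e:
                hits.append((f[8], c, s, e))
    return hits


# Hypothetical interval falling inside the gene reported on chromosome 7
gff = ["Chr7\texample\tgene\t5263409\t5267310\t.\t.\t.\tID=LOC_Os07g09900"]
print(genes_overlapping([("Chr7", 5264000, 5264500)], gff))
# [('ID=LOC_Os07g09900', 'Chr7', 5264000, 5264500)]
```

This nested loop is quadratic and only suitable for small inputs. BEDtools relies on more efficient interval handling and should be preferred on whole genomes.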
Fig. 1. An Apparent Excess of Heterozygous loci will appear if reads coming from a duplicated region are abusively mapped on a reference genome without the duplication.
2.3. Availability
All codes, installation instructions and manual for DuplicationDetector are available, under the GPLv3/CeCILL-B double licenses, on the GitHub of the project: https://github.com/SouthGreenPlatform/duplicationDetector.
3. Results
3.1. Description of the approach
We rely on AEH genomic intervals to detect duplicated sequences (Fig. 1). Basically, we detect abusive mapping of reads coming from duplicated regions in a sequenced individual when they are mapped on a reference genome without the duplication (i.e. harboring only a single copy). Such abusive mapping will lead the SNP calling to produce an Apparent Excess of Heterozygous loci, i.e. too many heterozygous loci in a short region. If many individuals are sequenced and mapped in the same experiment, such AEH loci will appear for (almost) every individual at the same location, thus indicating that the region is duplicated in the sequenced genomes compared to the reference one. In DuplicationDetector, users can set the level of stringency for the selection of AEH loci using different criteria:
• Minimal depth per individual (default at 30)
• Minimal number of individuals that must be heterozygous for a point to be chosen (default at 10)
• Maximal number of MQ0 reads (reads that can be mapped at two positions with identical score; default at 0)
• Control individuals (some samples that must not be heterozygous)
The genomic interval creation can also be set up using the following criteria:
• Maximum distance between two heterozygous loci (default at 1 kb)
• Minimum size of the genomic interval (default at 100 bp)
• Minimum density of heterozygous loci in the genomic interval (default at 25 bases between each SNP).
If the user provides a GFF file for gene annotation, duplicated genes will be identified through their overlap with the identified genomic intervals.
In terms of speed, a complete scan from a raw VCF of 16 African rice individuals (12 chromosomes, ∼380 Mb) at high sequencing depth (∼35x, see Materials and methods), with two Asian rice individuals as controls, is performed on a single recent 64-bit core in less than 2 h.
3.2. Results on simulated data
On simulated data, we were able to partially (∼40%) identify duplication 1 directly, and to identify duplication 3 almost entirely (∼95%) as fragmented blocks (Table 1). We were able to delimit those two duplications with a fairly good resolution, i.e. capacity to correctly identify the borders (max 500 bp of error in delimitation; see Table 1). The difference in recovery level between the two duplications is mainly due to their size: for duplication 1 (1.2 kb), the non-recovered fragments are 193 bp in 5′ and 522 bp in 3′ out of 1200 bp, while for duplication 3 the non-recovered sizes are 142/355 bp. The fragmentation effect may be due to the fact that duplication 3 is a large block but with a low variation density, and the AEH loci density parameter is quite conservative.
Duplication 2 was not detected, as it contains another older (non-simulated) nested duplication, and AEH loci in this region were removed based on the maximum MQ0 parameter. Indeed, reads coming from a region already duplicated in the reference genome can be mapped on either of the two copies in the reference with the same probability. This will increase the MQ0 level, and thus those regions will be filtered out with the current version of DuplicationDetector.

Table 1
Statistics about simulated data. Positions are given in base pairs.
Duplication  Start    Stop     Start (recovered)  Stop (recovered)  Resolution in bp / Nb blocks  Recovery
1            300      1500     493                978               193<->522 / 1 block           40.42%
2            350,000  390,000  NA                 NA                NA                            NA
3            800,000  810,000  800,142            809,675           142<->355 / 8 blocks          95.33%

Table 2
Statistics about duplicated regions in African rices compared to Asian rice. Sizes are given in base pairs.
        Block size           AEH loci            AEH loci freq
        Min   Mean   Max     Min   Mean  Max     Min    Mean   Max
Chr1    142   795    2770    9     37    111     9.33   10.99  13.03
Chr2    106   623    2031    5     31    106     8.58   10.75  12.75
Chr3    110   293    825     5     15    40      8.78   10.67  13.3
Chr4    103   794    2079    6     40    132     8.33   10.31  13
Chr5    119   719    2668    6     37    113     9.62   11.07  13.25
Chr6    107   1057   3560    7     50    169     8.44   11.02  13.15
Chr7    147   903    3198    7     49    154     8.94   11.19  13.17
Chr8    111   925    3336    8     44    158     8.1    11.29  13.36
Chr9    127   634    1057    7     35    64      10.07  11.37  13
Chr10   127   937    2817    7     47    170     9.14   11.21  13.29
Chr11   124   716    2055    6     35    92      8.68   10.91  12.89
Chr12   101   744    1804    5     37    85      8.79   10.62  12.05
No false positive regions were obtained from the simulated data.
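The recovery percentages of Table 1 can be reproduced from the simulated and recovered coordinates. The short Python sketch below shows the arithmetic, assuming that recovery is the recovered span divided by the simulated span; the function names `recovery_percent` and `border_errors` are hypothetical.

```python
def recovery_percent(start, stop, rec_start, rec_stop):
    """Recovered span as a percentage of the simulated duplication span."""
    return round(100 * (rec_stop - rec_start) / (stop - start), 2)


def border_errors(start, stop, rec_start, rec_stop):
    """Non-recovered bases on the 5' and 3' sides of the duplication."""
    return rec_start - start, stop - rec_stop


print(recovery_percent(300, 1500, 493, 978))              # 40.42
print(recovery_percent(800_000, 810_000, 800_142, 809_675))  # 95.33
print(border_errors(300, 1500, 493, 978))                 # (193, 522)
```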
3.3. Results on experimental data
We then tested our tool on an experimental set of African rice individuals sequenced at relatively high depth (see Materials and methods). African cultivated rice Oryza glaberrima and its wild relative O. barthii diverged from the ancestor of Asian cultivated rice O. sativa almost 1 million years ago [33,34], and may harbor duplicated regions compared to Asian rice. We thus tested 8 cultivated and 6 wild samples to evaluate our tool on real data. To avoid individual effects due to the reference genome (i.e. identification of AEH loci only because of a loss of duplication in Nipponbare), we added as controls two other O. sativa samples, IR64 (indica subspecies) and Azucena (japonica subspecies), sequenced as described in 2.2.1. Those two individuals must be homozygous and similar to the reference for an AEH locus in the African samples to be validated as such.
The whole analysis of the raw VCF representing variations on all 12 chromosomes of the Nipponbare reference took less than 2 h using a single core, with a very small memory footprint (max 2 Mbytes per file), and provided 786 putatively duplicated regions in the African rice lineage relative to the Asian one (see Table 2). From those putatively duplicated regions, we could identify 200 annotated gene features (i.e. tagged as ‘gene’ in the MSU7 GFF file; see SuppData). The blocks ranged from 102 to 3560 bases, with 5–170 SNPs (11 SNP/kb on average) with AEH in each block (Table 2).
The duplicated regions are widely spread all along the chromosomes, without any main locations (Fig. 2 and SuppData). The putatively duplicated genes themselves seem to be partially related to stress response, but no general trend was observed concerning Gene Ontology (data not shown). We observed a mean value of 38 AEH loci per region, showing that these duplications are quite recent, but date from before the radiation between O. glaberrima and O. barthii. Indeed, applying a mean value of 1.3 × 10⁻⁸ mutation per million years (as calculated in [35]), the mean age of the putative duplicated regions is ∼800,000 years, spanning from ∼605,000 to 972,000 years, as expected from the separation of the two groups [33,34].
Detailed analysis of the mutations induced by the AEH loci in duplicated genes showed that most of the mutations occurred outside of the gene coding regions (only 7–20% in exonic sequences; SuppData). For exonic mutations, a mean Ka/Ks ratio of 1.53 ± 0.45 (1.01–2.75) was observed, indicating a low level of global divergent selection. The mean observed Transition vs Transversion ratio is around 2 (1.7–2.3), as expected for genomic mutations [36,37].
Fig. 2. Chromosomal location of duplications in African rices compared to Asian rice on Chromosome 1. Duplications involving annotated genes are shown in red, regions without annotated genes in green.
When focusing on individual genes, we were able to identify as putatively duplicated the LOC_Os07g09900 gene (Chromosome 7, from 5,263,409 to 5,267,310), Disease resistance protein RPM1, located under a major QTL of resistance to African strains of Xanthomonas (qABB-7, from [38]). As has been shown in other plant species, duplication of resistance genes may increase disease resistance.
3.4. Limits of the approach
In a recent paper, Hutin et al. [39] showed a recent partial duplication of 3.2 kb of the LOC_Os11g31190 gene OsSWEET14 (Chromosome 11, from 18,171,678 to 18,174,478), involved in an increased resistance to Xanthomonas oryzae pv oryzae. A set of 12 high level variant positions, including the 18 bp deletion between the original copy and the duplicated one [39], were identified in the variant selection step. However, the post-filtering steps (especially the minimal SNP density, here 177 bp between two SNPs instead of 25) did not allow this duplicated block to be recovered in our current test.
4. Discussion
Detection of duplicated regions between individuals is a challenging task using molecular tools, and a costly computing task with sequence data. Up to now, use of the latter approach (mainly using NGS) was based on raw mapping, DoC divergence computation and Copy Number Variation (CNV) analyses. Numerous tools have implemented such approaches, with varying efficiency [14–16], but all of them require a long computation time and cannot compare large numbers of different individuals.
In the present study, we rely on abusive mapping and subsequent AEH loci to identify the duplicated regions. Moreover, our tool does not require intensive re-calculation or mapping, as it relies directly on the raw VCF data (already generated in numerous genetic analyses) to identify those AEH loci. This approach is fast and allows working on large samples; in addition, it offers the possibility to include negative controls, which allow users to identify duplications existing in only a subsample of their sequenced individuals. Our approach is however quite conservative, as shown on simulated data, and will not detect, for instance, new copies of a sequence that is already repeated in the reference genome. In the same way, it cannot identify very recent duplications, due to the low number of mutations between the two copies (as for OsSWEET14). However, applied to the divergence between African and Asian rices, we were able to identify more than 200 putatively duplicated genes and almost 780 total regions. The detected genes are widely spread all along the chromosomes, generally related to stress response, at least marginally, and under a quite low divergent selection.
5. Conclusion
DuplicationDetector is thus a very efficient tool to detect duplications in haploid and highly homozygous diploid organisms, such as rice (tested here), but also bacteria, yeasts, autogamous plants, haploid fungi, and so on. Future development of our tool will include the implementation of detection in heterozygous or polyploid organisms or both, as well as additional criteria for filtering (such as hard-clipping level).
Authors’ contributions
GD and FS managed the whole study and wrote the whole pipeline. SE & Genoscope performed the sequencing and initial data treatments and QC. FS performed the simulation, GD and CM performed the basic data analyses, and GD analyzed the results. GD and FS wrote the manuscript, and all authors corrected and approved the current version.
Acknowledgments
GD was supported by an IRD grant (2013–2017 BEST FellowShip). CM was supported by ANR (AfriCrop project #ANR-13-BSV7-0017) and NUMEV labex (LandPanToggle #2015-1-030-LARMANDE). The authors want to thank Genoscope members for the sequencing of all rice data. This work was supported by the France Génomique French National infrastructure, funded as part of the “Investissement d’avenir” program managed by ANR (#ANR-10-INBS-09), in the frame of the IRIGIN project (http://irigin.org).
Conflict of interest statement
The authors have no conflict of interest.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.cpb.2017.07.001.
References
[1] S. Ohno, The enormous diversity in genome sizes of fish as a reflection of nature’s extensive experiments with gene duplication, Trans. Am. Fish. Soc. 99 (1970) 120–130.
[2] S. De Bodt, S. Maere, Y. Van de Peer, Genome duplication and the origin of angiosperms, Trends Ecol. Evol. 20 (2005) 591–597.
[3] R. Aburomia, O. Khaner, A. Sidow, Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail, in: Genome Evolution, Springer, Netherlands, Dordrecht, 2003, pp. 45–52.
[4] J.S. Taylor, J. Raes, Duplication and divergence: the evolution of new genes and old ideas, Annu. Rev. Genet. 38 (2004) 615–643.
[5] R. Chandan, D. Indra, Gene duplication: a major force in evolution and bio-diversity, Int. J. Biodivers. Conserv. 6 (2014) 41–49.
[6] L.E. Flagel, J.F. Wendel, Gene duplication and evolutionary novelty in plants, New Phytol. 183 (2009) 557–564.
[7] C. Feschotte, Transposable elements and the evolution of regulatory networks, Nat. Rev. Genet. 9 (2008) 397–405.
[8] Y. Tu, A. Jiang, L. Gan, M. Hossain, J. Zhang, B. Peng, Y. Xiong, Z. Song, D. Cai, W. Xu, et al., Genome duplication improves rice root resistance to salt stress, Rice 7 (2014) 15.
[9] A. Jiang, L. Gan, Y. Tu, H. Ma, J. Zhang, Z. Song, Y. He, D. Cai, X. Xue, The effect of genome duplication on seed germination and seedling growth of rice under salt stress, Aust. J. Crop Sci. 7 (2013) 1814–1821.
[10] C. Rizzon, L. Ponger, B.S. Gaut, S. Maere, S. Bodt, J. De Raes, T. Casneuf, M. Montagu, G. Van Blanc, K. Wolfe, et al., Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice, PLoS Comput. Biol. 2 (2006) e115.
[11] Q. Li, W. Yan, H. Chen, C. Tan, Z. Han, W. Yao, G. Li, M. Yuan, Y. Xing, Duplication of OsHAP family genes and their association with heading date in rice, J. Exp. Bot. 67 (2016) 1759–1768.
[12] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Döhner, T. Cremer, P. Lichter, Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances, Genes Chromosom. Cancer 20 (1997) 399–407.
[13] C. Shaw-Smith, R. Redon, L. Rickman, M. Rio, L. Willatt, H. Fiegler, H. Firth, D. Sanlaville, R. Winter, L. Colleaux, et al., Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features, J. Med. Genet. 41 (2004) 241–248.
[14] S. Goodwin, J.D. McPherson, W.R. McCombie, Coming of age: ten years of next-generation sequencing technologies, Nat. Rev. Genet. 17 (2016) 333–351.
[15] T.C. Glenn, Field guide to next-generation DNA sequencers, Mol. Ecol. Resour. 11 (2011) 759–769.
[16] T.C. Glenn, <http://molecularecologist.com//>.
[17] S. Newman, K.E. Hermetz, B. Weckselblatt, M.K. Rudd, Next-generation sequencing of duplication CNVs reveals that most are tandem and some create fusion genes at breakpoints, Am. J. Hum. Genet. 96 (2015) 208–220.
[18] P. Guan, W.-K. Sung, Structural variation detection using next-generation sequencing data: a comparative technical review, Methods 102 (1 June) (2016) 36–49, http://dx.doi.org/10.1016/j.ymeth.2016.01.020.
[19] P. Rice, I. Longden, A. Bleasby, EMBOSS: the European molecular biology open software suite, Trends Genet. 16 (2000) 276–277.
[20] P. Stothard, The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences, Biotechniques 28 (2000).
[21] W. Huang, L. Li, J.R. Myers, G.T. Marth, ART: a next-generation sequencing read simulator, Bioinformatics 28 (2012) 593–594.
[22] J. Orjuela, F. Sabot, S. Chéron, Y. Vigouroux, H. Adam, H. Chrestin, K. Sanni, M. Lorieux, A. Ghesquière, An extensive analysis of the African rice genetic diversity through a global genotyping, Theor. Appl. Genet. 127 (10) (2014) 2211–2223, http://dx.doi.org/10.1007/s00122-014-2374-z.
[23] FASTX-Toolkit, <http://hannonlab.cshl.edu/fastxtoolkit/index.html/>.
[24] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, J. Wang, SOAP2: an improved ultrafast tool for short read alignment, Bioinformatics 25 (2009) 1966–1967.
[25] Y. Kawahara, M. de la Bastide, J.P. Hamilton, H. Kanamori, W.R. McCombie, S. Ouyang, D.C. Schwartz, T. Tanaka, J. Wu, S. Zhou, et al., Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data, Rice 6 (2013) 4.
[26] H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics 26 (2010) 589–595.
[27] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The sequence alignment/map format and SAMtools, Bioinformatics 25 (2009) 2078–2079.
[28] Picard Tools – By Broad Institute.
[29] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res. 20 (2010) 1297–1303.
[30] C. Monat, C. Tranchant-Dubreuil, A. Kougbeadjo, C. Farcy, E. Ortega-Abboud, S. Amanzougarene, S. Ravel, M. Agbessi, J. Orjuela-Bouniol, M. Summo, et al., TOGGLE: toolbox for generic NGS analyses, BMC Bioinform. 16 (2015) 374.
[31] A.R. Quinlan, I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics 26 (2010) 841–842.
[32] P. Cingolani, A. Platts, L.L. Wang, M. Coon, T. Nguyen, L. Wang, S.J. Land, X. Lu, D.M. Ruden, A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff, Fly (Austin) 6 (2012) 80–92.
[33] D.A. Vaughan, B.-R. Lu, N. Tomooka, The evolving story of rice evolution, Plant Sci. 174 (2008) 394–408.
[34] D.A. Vaughan, K. Kadowaki, A. Kaga, N. Tomooka, On the phylogeny and biogeography of the genus Oryza, Breed. Sci. 55 (2005) 113–122.
[35] J. Ma, J.L. Bennetzen, Rapid recent growth and divergence of rice nuclear genomes, Proc. Natl. Acad. Sci. U.S.A. 101 (2004) 12404–12410.
[36] Z. Yang, D. Yodera, Estimation of the transition/transversion rate bias and species sampling, J. Mol. Evol. 48 (1999) 274–283.
[37] S. Duchêne, S. Ho, E.C. Holmes, T. Jukes, C. Cantor, W. Brown, E. Prager, A. Wang, A. Wilson, R. Lewontin, et al., Declining transition/transversion ratios through time reveal limitations to the accuracy of nucleotide substitution models, BMC Evol. Biol. 15 (2015) 36.
[38] G. Djedatin, M.-N. Ndjiondjop, A. Sanni, M. Lorieux, V. Verdier, A. Ghesquiere, Identification of novel major and minor QTLs associated with Xanthomonas oryzae pv. oryzae (African strains) resistance in rice (Oryza sativa L.), Rice (N. Y.) 9 (2016) 18.
[39] M. Hutin, F. Sabot, A. Ghesquière, R. Koebnik, B. Szurek, A knowledge-based molecular screen uncovers a broad spectrum OsSWEET14 resistance allele to bacterial blight from wild rice, Plant J. 84 (4) (2015) 694–703, http://dx.doi.org/10.1111/tpj.13042.