CurrentPlantBiology7–8(2016)6–9
ContentslistsavailableatScienceDirect
Current
Plant
Biology
j ourn a l h o m e pa g e :w w w . e l s e v i e r . c o m / l o c a t e / c p b
The
South
Green
portal:
a
comprehensive
resource
for
tropical
and
Mediterranean
crop
genomics
夽
South
Green
collaborators
a,b,c,d,e,f,1aBioversityInternational,ParcScientifiqueAgropolisII,34397MontpellierCedex5,France bUMRAGAPCIRAD/INRA/SupAgro,AvenueAgropolis,34398MontpellierCedex5,France
CUMRBGPICIRAD/INRA/SupAgro,CampusInternationaldeBaillarguet,34398Montpellier,Cedex5,France dUMRDIADEIRD/UM,AvenueAgropolis,34934MontpellierCedex5,France
eUMRInterTrypCIRAD/IRD,CampusInternationaldeBaillarguet,34398Montpellier,Cedex5,France fUMRIPMEIRD/UM/CIRAD,AvenueAgropolis,34394Montpellier,Cedex5,France
a
r
t
i
c
l
e
i
n
f
o
Articlehistory:Received28October2016
Receivedinrevisedform3December2016 Accepted3December2016
Keywords: Cropdatabases Genomics Galaxy
NextGenerationSequencing Pipeline
Sequencevariants Workflow
a
b
s
t
r
a
c
t
TheSouthGreenWebportal(http://www.southgreen.fr/)providesaccesstoalargepanelofpublic databases,analyticalworkflowsandbioinformaticsresourcesdedicatedtothegenomicsoftropicaland Mediterraneancrops.Theportalcontainscurrentlyabout20informationsystemsandtoolsandtargets abroadrangeofcropssuchasBanana,Cacao,Cassava,Coconut,Coffee,Grape,RiceandSugarcane.
©2016TheAuthors.PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBY-NC-ND license(http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Analysisandvisualizationofmassivegenomicsdatasetsarean ongoingtrendinplantsciences,especiallyfortropicalcropswhere thissubjectcanhelptotacklechallengeslinkedtohumanactivities andClimateChange:conservationandanalysisofbiodiversity(loss becauseofhumanactivity),foodsecurity(the9billionpeople ques-tion),newusageofplantmaterial,etc.Intherecentyears,because ofsuccessivelylowercostsofgenomicsequencing,alarge num-berofgroupshavedevelopedmassiveresourcesanddatawhich requirehighperformancecomputationalresourcesandnew ana-lyticalapproaches[1,2].
TheSouthGreenportalisanecosystemoftoolsthatwere orig-inallydeveloped as independententities to fulfillthe need for specificprojectsorcrops,buthaveevolvedovertimetogeneric toolstocomprehensivelystudycropgenomics.Thosegenerictools areadaptabletoawiderangeofotherorganisms,especially culti-vated,wildororphanplants.
夽 Thisarticleispartofaspecialissueentitled“Genomicresourcesanddatabases”, publishedinthejournalCurrentPlantBiology7–8,2016.
1 ThelistofparticipantsandtheiraffiliationsareprovidedinAppendixA
2. Specializedresources
Some resources from the South Green portal are dedicated tospecializeddatasets.AmongtheGene-baseddatabases, Green-PhylDB[3]providesaccesstoresourcesforcomparativegenomics andorthologyidentificationinplantgenomes,andOryGenesDB[4]
isadatabasedevelopedforricereversegenetics,andcontaining FSTs (flanking sequence tags) of various mutagens and func-tionalgenomicsdata,collectedfrombothinternationalinsertion collectionsand theliterature.Otherresourcesare− (i) Marker-based databases like TropGeneDB [5] which connects data on molecularmarkers(e.g.SimpleSequenceRepeats,DiversityArrays Technology),Quantitative TraitLoci,geneticandphysicalmaps, phenotypingstudies,andinformationongeneticresources (geo-graphicorigin,parentage,collection),(ii)SNiPlay[6]whichallows queryingbothSNPs(Single-NucleotidePolymorphism)andInDels derivedfromNGS(Next-GenerationSequencing)projects (Whole-Genome Sequencing,Genotyping bySequencing, RNA-Seq) and computing a set of web analytical workflows for the resulting variants(diversity,populationstratification,Genome-Wide Associ-ationStudy),andproposesgraphicalrepresentationsoftheresults, and(iii) Gigwa[7]providingexploration ofverylarge genotyp-ingstudiesbyfilteringthembasedonnotonlyvariantfeatures
http://dx.doi.org/10.1016/j.cpb.2016.12.002
2214-6628/©2016TheAuthors.PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4. 0/).
/CurrentPlantBiology7–8(2016)6–9 7
Fig.1. SouthGreenresourcesandstrategytoprocess,analyzeanddisplayNext-GenerationSequencingdatasetsinthecontextofgenomicvariationstudies.
(SNP,Indels),includingfunctionalannotations,butalsogenotype patterns,inordertoextractsubsetsinvariouspopularexport for-mats.Currentdevelopmentsareonthewayfordatabasesaimed tohost/pathogeninteractions,catalogsofplantpathogensstrains andsoon.Recently,basedonourexperiencewithWebSemantic technologies[8],wedevelopedAgroLD(www.agrold.org)anRDF (ResourceDescriptionFramework)knowledgebasethatconsistsof integrateddatafromavarietyofplantresourcesandontologies.
AcompletesummaryoftheSouthGreendatabaseswiththeir descriptionisavailableathttp://www.southgreen.fr/databases.
3. GenomeHubs
Our participation in several reference genome sequencing projects[9–11] hasledusto develop crop-specificinformation systems,socalledGenome Hubs, tomanagethecorresponding genomeannotationsandlinkeddatasets.Dataavailableareunder differentforms,fromcompletegenomesequencealongwithgene structure, gene product information, metabolicpathways, gene families, transcriptomicassays (ExpressedSequence Tags, RNA-Seq),geneticmarkersaswellasgeneticandphysicalmaps.Several GenomeHubs werereleased: Banana[12], Cassava,Cacao, Cof-fee[13],andothers(e.g.Rice,Magnaporthae)arecurrentlyunder development.GenomeHubsarepoweredbymajorGMOD(Generic ModelOrganismDatabase)components(i.e.Chado,Cmap,Jbrowse,
Tripal,Galaxy,Pathwaytools)andcomplementedbyresourcesand toolsdevelopedwithintheSouthGreenframeworksuchas Green-PhylDB,SNiPlayandTropGeneDB.
4. Workflowanalyses
Targetusersofbioinformaticapplicationsareusuallydivided betweenpeoplewhousecommand-lineandthosewhodonot.Our strategyhasbeentoaddressbothcategoriesbyoffering comple-mentarysolutionstoperformdataanalyses(Fig.1).
4.1. Galaxyworkflows
Galaxy [14] is a web-based service that allows an easy access to the bioinformatic applications and strongly supports reproducibility of analysis steps. Through its graphical inter-face, non-bioinformatician scientists are able to conduct small scale as well as medium range analyses in a user-friendly manner.Inadditiontothegenerictoolsprovidedwiththe stan-dard installation of Galaxy, the South Green Galaxy instance (http://galaxy.southgreen.fr/galaxy/)containsalargecollectionof exclusivetools.TheGalaxyToolShedallowsthesemodulestobe easily sharedwithany otherGalaxy instance. In addition, pre-configuredworkflowsweredesignedforrecurrentanalysesinplant genomicssuchasNGSmapping/cleaning,SNPcallingand
filter-8 /CurrentPlantBiology7–8(2016)6–9
ing,Genome-WideAssociation Study,basicpopulationgenetics, structuralvariationsandphylogenetics.
4.2. TOGGLE
TOGGLE[15]isacommand-linetooldesignedtoallow biolo-giststocreatelargeworkflowswithoutknowinghowtocodeit (https://github.com/SouthGreenPlatform/).Usingeditable config-urationfiles,itproposesamorein-depthcontrolofcustomizable pipelinesandallowsthemanagementofmanymoresamplesthan Galaxy(ex:massiveresequencingprojects,morethan50samples), aswellasthemanagement oftoolsversions.It canbeusedon widearrayofanalyses,suchasNGSdatacleaning,DNAseq,RNAseq, restrictionsite-associatedDNAsequencing(RADseq)/Genotyping bySequencing(GBS),genomeand transcriptomeassembly,SNP calling,structuralvariationsdetection,andsoon.Thecurrent ver-sioncanbedeployedonlaptop,largemachineaswellasaHPC cluster,usingnormalsystemofDockermachines,andis scheduler-aware.Alreadycustomizedpipelinesarealsoavailableforclassical NGSanalyses.
4.3. HandlingbigdatasetsinRice
Ourgoalwastosetupasimpleframeworktohelpuserstowork efficientlyandeasilywiththehugevolumeofgenomicresources generatedintheRicedatastorebasedonthe3,000ricegenomes
[16]andadditionalresources(i.e.High DensityRiceArray700k SNPsandIRIGIN projecthttp://irigin.org).Galaxymoduleswere implemented toquery efficiently millions of genomic variants, tofilterongenomicannotations, tofilteronlistof individuals, topredict coding effects of genomic variants, and toexport in severalformats (see Rice Variant Analysisin theTool Panel of
http://galaxy.southgreen.fr/galaxy/).Outputscanbeeasilysentto anyGalaxyworkflows.TheequivalentTOGGLEimplementationis alsoontheway.
5. Otherdevelopments
Wedevelopedalargenumberofothercodes,snippets, mod-ulesandpluginsfordatabasemanagements,duplicationdetection andrepresentation, fastafile manipulationand soon.Allthose codes are available under GPLv3 license on the GitHub, at
https://github.com/SouthGreenPlatform.
6. ExamplesofstudiesusingtheSouthGreenplatform
VariousgroupsusedtheSouthGreeninfrastructuretoobtain theirdataandresults,andwereabletopublishhigh-quality bio-logicalinformation,onCoffeegenome[9],Banana[10],Cocoa[11], Africanrice[17]orlargetranscriptomeresources[18].Tools devel-oped for these studies are adaptable to a wide range of other organisms.
7. Conclusion
TheSouthGreenportalisdevelopedbyalocalnetworkof sci-entistsgathering bioinformatics and genomicsskills locatedon theAgropoliscampus fromMontpellierasa jointcollaboration betweenCIRAD,IRD,INRA,SupAgroandBioversityinternational. Based on this strong local community in the field of agricul-ture,foodandbiodiversity,variousbioinformaticsapplicationsand resourcesdedicatedto genomicsoftropical and Mediterranean plantshavebeendeveloped,publishedandprovidedtothe inter-nationalcommunity. Sourcecode canbealso foundonGitHub (https://github.com/SouthGreenPlatform).
Funding
ThisworkwassupportedbyAgropolisFondation/AGROLabex (ID ARCAD 0900-001, ID RTB Multi-Genomes Hub 1403-018), the French Institute of Bioinformatics (IFB), ANR (GNPAnnot, MusaTract, CoffeaSeq,AfriCrop #NR-13-BSV7-0017), CIRAD and IRD (computer infrastructures), the CGIAR Research Programs on Roots, Tubers and Bananas and GRiSP/RICE, IRIGIN/France Génomique(Nationalinfrastructure,fundedaspartof “Investisse-mentd’avenir”programmanagedbyANR#ANR-10-INBS-09),and NUMEVLabex(LandPanToggle#2015-1-030-LARMANDE).
Acknowledgements
WethanktheScientificCommitteeofSouthGreenfortheir help-fuladviceandsuggestions,aswellasourdifferentinstitutionsfor theirsupport.
AppendixA.
Authors’information
Bioversity International: Guignon Valentin; Hueber Yann; RouardMathieu(email:m.rouard@cgiar.org).
UMR AGAP (CIRAD/INRA/SupAgro): Bocs Stephanie; Couvin David;deLamotteFrédéric;DrocGaëtan;DufayardJean-Franc¸ois; ElHassouniNordine;FarcyCédric;GkanogiannisAnestis;Hamelin Chantal;Larivière Delphine; MartinGuillaume;Ortega Enrique; PitollatBertrand;PointetStéphanie;RuizManuel(email:manuel. ruiz@cirad.fr);SarahGautier(email:gautier.sarah@supagro.inra. fr);SummoMarilyne;ThisDominique.
UMRBGPI(CIRAD):RavelSébastien.
UMR DIADE (IRD): Larmande Pierre; Monat Cécile; Sabot Franc¸ois;TandoNdomassi;Tranchant-DubreuilChristine(email:
christine.tranchant@ird.fr).
UMRInterTryp(CIRAD):SempéréGuilhem. UMRIPME(IRD):DereeperAlexis.
References
[1]O.Spjuth,E.Bongcam-Rudloff,J.Dahlberg,M.Dahlo,A.Kallio,L.Pireddu,F. Vezzi,E.Korpelainen,Recommendationsone-infrastructuresfor next-generationsequencing,GigaScience5(2016)26.
[2]R.R.Gullapalli,K.V.Desai,L.Santana-Santos,J.A.Kant,M.J.Becich,Next generationsequencinginclinicalmedicine:Challengesandlessonsfor pathologyandbiomedicalinformatics,JPatholInform3(2012)40.
[3]M.Rouard,V.Guignon,C.Aluome,M.-A.Laporte,G.Droc,C.Walde,C.M. Zmasek,C.Périn,M.G.Conte,GreenPhylDBv2.0:comparativeandfunctional genomicsinplants,NucleicAcidsRes.39(2011)D1095–1102.
[4]G.Droc,C.Périn,S.Fromentin,P.Larmande,OryGenesDB2008update: databaseinteroperabilityforfunctionalgenomicsofrice,NucleicAcidsRes. 37(2009)D992–995.
[5]C.Hamelin,G.Sempere,V.Jouffe,M.Ruiz,TropGeneDB,themulti-tropical cropinformationsystemupdatedandextended,NucleicAcidsRes.41(2013) D1172–1175.
[6]A.Dereeper,F.Homa,G.Andres,G.Sempere,G.Sarah,Y.Hueber,J.-F. Dufayard,M.Ruiz,SNiPlay3:aweb-basedapplicationforexplorationand largescaleanalysesofgenomicvariations,NucleicAcidsRes.43(2015) W295–300.
[7]G.Sempéré,F.Philippe,A.Dereeper,M.Ruiz,G.Sarah,P.Larmande, Gigwa-Genotypeinvestigatorforgenome-wideanalyses,GigaScience5 (2016)25.
[8]J.Wollbrett,P.Larmande,F.deLamotte,M.Ruiz,Clevergenerationofrich SPARQLqueriesfromannotatedrelationalschema:applicationtoSemantic WebServicecreationforbiologicaldatabases,BMCBioinformatics14(2013) 126.
[9]F.Denoeud,L.Carretero-Paulet,A.Dereeper,G.Droc,R.Guyot,M.Pietrella,C. Zheng,A.Alberti,F.Anthony,G.Aprea,etal.,Thecoffeegenomeprovides insightintotheconvergentevolutionofcaffeinebiosynthesis,Science345 (2014)1181–1184.
[10]A.D’Hont,D.AngéliqueD.h.France,A.Jean-Marc,B.Franc-Christophe,C. Franc¸oise,G.Olivier,N.Benjamin,B.Stéphanie,D.Gaëtan,etal.,Thebanana
/CurrentPlantBiology7–8(2016)6–9 9
(Musaacuminata)genomeandtheevolutionofmonocotyledonousplants, Nature488(2012)213–217.
[11]X.Argout,J.Salse,J.-M.Aury,M.J.Guiltinan,G.Droc,J.Gouzy,M.Allegre,C. Chaparro,T.Legavre,S.N.Maximova,etal.,ThegenomeofTheobromacacao, Nat.Genet.43(2011)101–108.
[12]G.Droc,D.Larivière,V.Guignon,N.Yahiaoui,D.This,O.Garsmeur,A. Dereeper,C.Hamelin,X.Argout,J.-F.Dufayard,etal.,Thebananagenomehub, Database2013(2013)bat035.
[13]A.Dereeper,S.Bocs,M.Rouard,V.Guignon,S.Ravel,C.Tranchant-Dubreuil,V. Poncet,O.Garsmeur,P.Lashermes,G.Droc,Thecoffeegenomehub:a resourceforcoffeegenomes,NucleicAcidsRes.43(2015)D1028–1035.
[14]J.Goecks,A.Nekrutenko,J.Taylor,T.Galaxy,Galaxy:acomprehensive approachforsupportingaccessible,reproducible,andtransparent computationalresearchinthelifesciences,GenomeBiol.11(2010)R86.
[15]C.Monat,C.Tranchant-Dubreuil,A.Kougbeadjo,C.Farcy,E.Ortega-Abboud,S. Amanzougarene,S.Ravel,M.Agbessi,J.Orjuela-Bouniol,M.Summo,etal., TOGGLE:toolboxforgenericNGSanalyses,BMCBioinformatics16(2015)374.
[16]r.g.project,The3000ricegenomesproject,GigaScience3(2014)7.
[17]B.Nabholz,G.Sarah,F.Sabot,M.Ruiz,H.Adam,S.Nidelet,A.Ghesquiere,S. Santoni,J.David,S.Glemin,Transcriptomepopulationgenomicsreveals severebottleneckanddomesticationcostintheAfricanrice(Oryza glaberrima),MolEcol23(2014)2210–2227.
[18]G.Sarah,F.Homa,S.Pointet,S.Contreras,F.Sabot,B.Nabholz,S.Santoni,L. Saune,M.Ardisson,N.Chantret,etal.,Alargesetof26newreference transcriptomesdedicatedtocomparativepopulationgenomicsincropsand wildrelatives,Molecularecologyresources.(2016).