The South Green portal: A comprehensive resource for tropical and Mediterranean crop genomics

(1)

CurrentPlantBiology7–8(2016)6–9

ContentslistsavailableatScienceDirect

Current

Plant

Biology

j ourn a l h o m e pa g e :w w w . e l s e v i e r . c o m / l o c a t e / c p b

The

South

Green

portal:

a

comprehensive

resource

for

tropical

and

Mediterranean

crop

genomics

夽

South

Green

collaborators

a,b,c,d,e,f,1

a_Bioversity_{International,}_Parc_{Scientiﬁque}_Agropolis_II,₃₄₃₉₇_Montpellier_Cedex_5,_France b_UMR_AGAP_{CIRAD/INRA/SupAgro,}_Avenue_Agropolis,₃₄₃₉₈_Montpellier_Cedex_5,_France

C_UMR_BGPI_{CIRAD/INRA/SupAgro,}_Campus_{International}_de_Baillarguet,₃₄₃₉₈_Montpellier,_Cedex_5,_France d_UMR_DIADE_IRD/UM,_Avenue_Agropolis,₃₄₉₃₄_Montpellier_Cedex_5,_France

e_UMR_InterTryp_CIRAD/IRD,_Campus_{International}_de_Baillarguet,₃₄₃₉₈_Montpellier,_Cedex_5,_France f_UMR_IPME_{IRD/UM/CIRAD,}_Avenue_Agropolis,₃₄₃₉₄_Montpellier,_Cedex_5,_France

a

r

t

i

c

l

e

i

n

f

o

Articlehistory:

Received28October2016

Receivedinrevisedform3December2016 Accepted3December2016

Keywords: Cropdatabases Genomics Galaxy

NextGenerationSequencing Pipeline

Sequencevariants Workﬂow

a

b

s

t

r

a

c

t

TheSouthGreenWebportal(http://www.southgreen.fr/)providesaccesstoalargepanelofpublic databases,analyticalworkﬂowsandbioinformaticsresourcesdedicatedtothegenomicsoftropicaland Mediterraneancrops.Theportalcontainscurrentlyabout20informationsystemsandtoolsandtargets abroadrangeofcropssuchasBanana,Cacao,Cassava,Coconut,Coffee,Grape,RiceandSugarcane.

1. Introduction

Analysisandvisualizationofmassivegenomicsdatasetsarean ongoingtrendinplantsciences,especiallyfortropicalcropswhere thissubjectcanhelptotacklechallengeslinkedtohumanactivities andClimateChange:conservationandanalysisofbiodiversity(loss becauseofhumanactivity),foodsecurity(the9billionpeople ques-tion),newusageofplantmaterial,etc.Intherecentyears,because ofsuccessivelylowercostsofgenomicsequencing,alarge num-berofgroupshavedevelopedmassiveresourcesanddatawhich requirehighperformancecomputationalresourcesandnew ana-lyticalapproaches[1,2].

TheSouthGreenportalisanecosystemoftoolsthatwere orig-inallydeveloped as independententities to fulﬁllthe need for speciﬁcprojectsorcrops,buthaveevolvedovertimetogeneric toolstocomprehensivelystudycropgenomics.Thosegenerictools areadaptabletoawiderangeofotherorganisms,especially culti-vated,wildororphanplants.

夽 Thisarticleispartofaspecialissueentitled“Genomicresourcesanddatabases”, publishedinthejournalCurrentPlantBiology7–8,2016.

1 _The_list_of_participants_and_their_{afﬁliations}_are_provided_in_Appendix_A

2. Specializedresources

Some resources from the South Green portal are dedicated tospecializeddatasets.AmongtheGene-baseddatabases, Green-PhylDB[3]providesaccesstoresourcesforcomparativegenomics andorthologyidentiﬁcationinplantgenomes,andOryGenesDB[4]

isadatabasedevelopedforricereversegenetics,andcontaining FSTs (flanking sequence tags) of various mutagens and func-tionalgenomicsdata,collectedfrombothinternationalinsertion collectionsand theliterature.Otherresourcesare− (i) Marker-based databases like TropGeneDB [5] which connects data on molecularmarkers(e.g.SimpleSequenceRepeats,DiversityArrays Technology),Quantitative TraitLoci,geneticandphysicalmaps, phenotypingstudies,andinformationongeneticresources (geo-graphicorigin,parentage,collection),(ii)SNiPlay[6]whichallows queryingbothSNPs(Single-NucleotidePolymorphism)andInDels derivedfromNGS(Next-GenerationSequencing)projects (Whole-Genome Sequencing,Genotyping bySequencing, RNA-Seq) and computing a set of web analytical workflows for the resulting variants(diversity,populationstratification,Genome-Wide Associ-ationStudy),andproposesgraphicalrepresentationsoftheresults, and(iii) Gigwa[7]providingexploration ofverylarge genotyp-ingstudiesbyfilteringthembasedonnotonlyvariantfeatures

http://dx.doi.org/10.1016/j.cpb.2016.12.002

(2)

/CurrentPlantBiology7–8(2016)6–9 7

Fig.1. SouthGreenresourcesandstrategytoprocess,analyzeanddisplayNext-GenerationSequencingdatasetsinthecontextofgenomicvariationstudies.

(SNP,Indels),includingfunctionalannotations,butalsogenotype patterns,inordertoextractsubsetsinvariouspopularexport for-mats.Currentdevelopmentsareonthewayfordatabasesaimed tohost/pathogeninteractions,catalogsofplantpathogensstrains andsoon.Recently,basedonourexperiencewithWebSemantic technologies[8],wedevelopedAgroLD(www.agrold.org)anRDF (ResourceDescriptionFramework)knowledgebasethatconsistsof integrateddatafromavarietyofplantresourcesandontologies.

AcompletesummaryoftheSouthGreendatabaseswiththeir descriptionisavailableathttp://www.southgreen.fr/databases.

3. GenomeHubs

Our participation in several reference genome sequencing projects[9–11] hasledusto develop crop-speciﬁcinformation systems,socalledGenome Hubs, tomanagethecorresponding genomeannotationsandlinkeddatasets.Dataavailableareunder differentforms,fromcompletegenomesequencealongwithgene structure, gene product information, metabolicpathways, gene families, transcriptomicassays (ExpressedSequence Tags, RNA-Seq),geneticmarkersaswellasgeneticandphysicalmaps.Several GenomeHubs werereleased: Banana[12], Cassava,Cacao, Cof-fee[13],andothers(e.g.Rice,Magnaporthae)arecurrentlyunder development.GenomeHubsarepoweredbymajorGMOD(Generic ModelOrganismDatabase)components(i.e.Chado,Cmap,Jbrowse,

Tripal,Galaxy,Pathwaytools)andcomplementedbyresourcesand toolsdevelopedwithintheSouthGreenframeworksuchas Green-PhylDB,SNiPlayandTropGeneDB.

4. Workﬂowanalyses

Targetusersofbioinformaticapplicationsareusuallydivided betweenpeoplewhousecommand-lineandthosewhodonot.Our strategyhasbeentoaddressbothcategoriesbyoffering comple-mentarysolutionstoperformdataanalyses(Fig.1).

4.1. Galaxyworkﬂows

Galaxy [14] is a web-based service that allows an easy access to the bioinformatic applications and strongly supports reproducibility of analysis steps. Through its graphical inter-face, non-bioinformatician scientists are able to conduct small scale as well as medium range analyses in a user-friendly manner.Inadditiontothegenerictoolsprovidedwiththe stan-dard installation of Galaxy, the South Green Galaxy instance (http://galaxy.southgreen.fr/galaxy/)containsalargecollectionof exclusivetools.TheGalaxyToolShedallowsthesemodulestobe easily sharedwithany otherGalaxy instance. In addition, pre-conﬁguredworkﬂowsweredesignedforrecurrentanalysesinplant genomicssuchasNGSmapping/cleaning,SNPcallingand

(3)

ﬁlter-8 /CurrentPlantBiology7–8(2016)6–9

ing,Genome-WideAssociation Study,basicpopulationgenetics, structuralvariationsandphylogenetics.

4.2. TOGGLE

TOGGLE[15]isacommand-linetooldesignedtoallow biolo-giststocreatelargeworkflowswithoutknowinghowtocodeit (https://github.com/SouthGreenPlatform/).Usingeditable config-urationfiles,itproposesamorein-depthcontrolofcustomizable pipelinesandallowsthemanagementofmanymoresamplesthan Galaxy(ex:massiveresequencingprojects,morethan50samples), aswellasthemanagement oftoolsversions.It canbeusedon widearrayofanalyses,suchasNGSdatacleaning,DNAseq,RNAseq, restrictionsite-associatedDNAsequencing(RADseq)/Genotyping bySequencing(GBS),genomeand transcriptomeassembly,SNP calling,structuralvariationsdetection,andsoon.Thecurrent ver-sioncanbedeployedonlaptop,largemachineaswellasaHPC cluster,usingnormalsystemofDockermachines,andis scheduler-aware.Alreadycustomizedpipelinesarealsoavailableforclassical NGSanalyses.

4.3. HandlingbigdatasetsinRice

Ourgoalwastosetupasimpleframeworktohelpuserstowork efﬁcientlyandeasilywiththehugevolumeofgenomicresources generatedintheRicedatastorebasedonthe3,000ricegenomes

[16]andadditionalresources(i.e.High DensityRiceArray700k SNPsandIRIGIN projecthttp://irigin.org).Galaxymoduleswere implemented toquery efficiently millions of genomic variants, tofilterongenomicannotations, tofilteronlistof individuals, topredict coding effects of genomic variants, and toexport in severalformats (see Rice Variant Analysisin theTool Panel of

http://galaxy.southgreen.fr/galaxy/).Outputscanbeeasilysentto anyGalaxyworkﬂows.TheequivalentTOGGLEimplementationis alsoontheway.

5. Otherdevelopments

Wedevelopedalargenumberofothercodes,snippets, mod-ulesandpluginsfordatabasemanagements,duplicationdetection andrepresentation, fastaﬁle manipulationand soon.Allthose codes are available under GPLv3 license on the GitHub, at

https://github.com/SouthGreenPlatform.

6. ExamplesofstudiesusingtheSouthGreenplatform

VariousgroupsusedtheSouthGreeninfrastructuretoobtain theirdataandresults,andwereabletopublishhigh-quality bio-logicalinformation,onCoffeegenome[9],Banana[10],Cocoa[11], Africanrice[17]orlargetranscriptomeresources[18].Tools devel-oped for these studies are adaptable to a wide range of other organisms.

7. Conclusion

TheSouthGreenportalisdevelopedbyalocalnetworkof sci-entistsgathering bioinformatics and genomicsskills locatedon theAgropoliscampus fromMontpellierasa jointcollaboration betweenCIRAD,IRD,INRA,SupAgroandBioversityinternational. Based on this strong local community in the ﬁeld of agricul-ture,foodandbiodiversity,variousbioinformaticsapplicationsand resourcesdedicatedto genomicsoftropical and Mediterranean plantshavebeendeveloped,publishedandprovidedtothe inter-nationalcommunity. Sourcecode canbealso foundonGitHub (https://github.com/SouthGreenPlatform).

Funding

ThisworkwassupportedbyAgropolisFondation/AGROLabex (ID ARCAD 0900-001, ID RTB Multi-Genomes Hub 1403-018), the French Institute of Bioinformatics (IFB), ANR (GNPAnnot, MusaTract, CoffeaSeq,AfriCrop #NR-13-BSV7-0017), CIRAD and IRD (computer infrastructures), the CGIAR Research Programs on Roots, Tubers and Bananas and GRiSP/RICE, IRIGIN/France Génomique(Nationalinfrastructure,fundedaspartof “Investisse-mentd’avenir”programmanagedbyANR#ANR-10-INBS-09),and NUMEVLabex(LandPanToggle#2015-1-030-LARMANDE).

Acknowledgements

WethanktheScientiﬁcCommitteeofSouthGreenfortheir help-fuladviceandsuggestions,aswellasourdifferentinstitutionsfor theirsupport.

AppendixA.

Authors’information

Bioversity International: Guignon Valentin; Hueber Yann; RouardMathieu(email:m.rouard@cgiar.org).

UMR AGAP (CIRAD/INRA/SupAgro): Bocs Stephanie; Couvin David;deLamotteFrédéric;DrocGaëtan;DufayardJean-Franc¸ois; ElHassouniNordine;FarcyCédric;GkanogiannisAnestis;Hamelin Chantal;Larivière Delphine; MartinGuillaume;Ortega Enrique; PitollatBertrand;PointetStéphanie;RuizManuel(email:manuel. ruiz@cirad.fr);SarahGautier(email:gautier.sarah@supagro.inra. fr);SummoMarilyne;ThisDominique.

UMRBGPI(CIRAD):RavelSébastien.

UMR DIADE (IRD): Larmande Pierre; Monat Cécile; Sabot Franc¸ois;TandoNdomassi;Tranchant-DubreuilChristine(email:

christine.tranchant@ird.fr).

UMRInterTryp(CIRAD):SempéréGuilhem. UMRIPME(IRD):DereeperAlexis.

References

[1]O.Spjuth,E.Bongcam-Rudloff,J.Dahlberg,M.Dahlo,A.Kallio,L.Pireddu,F. Vezzi,E.Korpelainen,Recommendationsone-infrastructuresfor next-generationsequencing,GigaScience5(2016)26.

[2]R.R.Gullapalli,K.V.Desai,L.Santana-Santos,J.A.Kant,M.J.Becich,Next generationsequencinginclinicalmedicine:Challengesandlessonsfor pathologyandbiomedicalinformatics,JPatholInform3(2012)40.

[3]M.Rouard,V.Guignon,C.Aluome,M.-A.Laporte,G.Droc,C.Walde,C.M. Zmasek,C.Périn,M.G.Conte,GreenPhylDBv2.0:comparativeandfunctional genomicsinplants,NucleicAcidsRes.39(2011)D1095–1102.

[4]G.Droc,C.Périn,S.Fromentin,P.Larmande,OryGenesDB2008update: databaseinteroperabilityforfunctionalgenomicsofrice,NucleicAcidsRes. 37(2009)D992–995.

[5]C.Hamelin,G.Sempere,V.Jouffe,M.Ruiz,TropGeneDB,themulti-tropical cropinformationsystemupdatedandextended,NucleicAcidsRes.41(2013) D1172–1175.

[6]A.Dereeper,F.Homa,G.Andres,G.Sempere,G.Sarah,Y.Hueber,J.-F. Dufayard,M.Ruiz,SNiPlay3:aweb-basedapplicationforexplorationand largescaleanalysesofgenomicvariations,NucleicAcidsRes.43(2015) W295–300.

[7]G.Sempéré,F.Philippe,A.Dereeper,M.Ruiz,G.Sarah,P.Larmande, Gigwa-Genotypeinvestigatorforgenome-wideanalyses,GigaScience5 (2016)25.

[8]J.Wollbrett,P.Larmande,F.deLamotte,M.Ruiz,Clevergenerationofrich SPARQLqueriesfromannotatedrelationalschema:applicationtoSemantic WebServicecreationforbiologicaldatabases,BMCBioinformatics14(2013) 126.

[9]F.Denoeud,L.Carretero-Paulet,A.Dereeper,G.Droc,R.Guyot,M.Pietrella,C. Zheng,A.Alberti,F.Anthony,G.Aprea,etal.,Thecoffeegenomeprovides insightintotheconvergentevolutionofcaffeinebiosynthesis,Science345 (2014)1181–1184.

[10]A.D’Hont,D.AngéliqueD.h.France,A.Jean-Marc,B.Franc-Christophe,C. Franc¸oise,G.Olivier,N.Benjamin,B.Stéphanie,D.Gaëtan,etal.,Thebanana

(4)

/CurrentPlantBiology7–8(2016)6–9 9

(Musaacuminata)genomeandtheevolutionofmonocotyledonousplants, Nature488(2012)213–217.

[11]X.Argout,J.Salse,J.-M.Aury,M.J.Guiltinan,G.Droc,J.Gouzy,M.Allegre,C. Chaparro,T.Legavre,S.N.Maximova,etal.,ThegenomeofTheobromacacao, Nat.Genet.43(2011)101–108.

[12]G.Droc,D.Larivière,V.Guignon,N.Yahiaoui,D.This,O.Garsmeur,A. Dereeper,C.Hamelin,X.Argout,J.-F.Dufayard,etal.,Thebananagenomehub, Database2013(2013)bat035.

[13]A.Dereeper,S.Bocs,M.Rouard,V.Guignon,S.Ravel,C.Tranchant-Dubreuil,V. Poncet,O.Garsmeur,P.Lashermes,G.Droc,Thecoffeegenomehub:a resourceforcoffeegenomes,NucleicAcidsRes.43(2015)D1028–1035.

[14]J.Goecks,A.Nekrutenko,J.Taylor,T.Galaxy,Galaxy:acomprehensive approachforsupportingaccessible,reproducible,andtransparent computationalresearchinthelifesciences,GenomeBiol.11(2010)R86.

[15]C.Monat,C.Tranchant-Dubreuil,A.Kougbeadjo,C.Farcy,E.Ortega-Abboud,S. Amanzougarene,S.Ravel,M.Agbessi,J.Orjuela-Bouniol,M.Summo,etal., TOGGLE:toolboxforgenericNGSanalyses,BMCBioinformatics16(2015)374.

[16]r.g.project,The3000ricegenomesproject,GigaScience3(2014)7.

[17]B.Nabholz,G.Sarah,F.Sabot,M.Ruiz,H.Adam,S.Nidelet,A.Ghesquiere,S. Santoni,J.David,S.Glemin,Transcriptomepopulationgenomicsreveals severebottleneckanddomesticationcostintheAfricanrice(Oryza glaberrima),MolEcol23(2014)2210–2227.

[18]G.Sarah,F.Homa,S.Pointet,S.Contreras,F.Sabot,B.Nabholz,S.Santoni,L. Saune,M.Ardisson,N.Chantret,etal.,Alargesetof26newreference transcriptomesdedicatedtocomparativepopulationgenomicsincropsand wildrelatives,Molecularecologyresources.(2016).