The South Green Bioinformatics
platform, a comprehensive resource
for crop genomics
Who are we?
Montpellier
Collaborative network of bioinformaticians from different
institutes: CIRAD, IRD, INRA and Bioversity International.
Banana, Cacao, Coconut, Coffee, Cotton,
Oil Palm, Rice, Sorghum, Sugarcane, …
Objectives of South Green
• Promote the original tools developed within the network,
• Propose a single web portal to access the tools,
• Promote exchange and collaborative developments within
the network,
• Promote interoperability between applications,
• Provide training for biologists and bioinformaticians,
• Promote high-quality processes within the network (e.g.
certifications),
• Promote South Green visibility at the national and
international level.
Reference genomes
• Nature Genetics, 2010
• Nature, 2012
• Nature Biotechnology, 2014
• 2014
Recent outputs
• 2017. The cacao Criollo genome v2.0: an improved version of the genome for genetic and functional genomic studies. BMC Genomics
• 2016. Improvement of the banana “Musa acuminata” reference sequence using NGS data and semi-automated bioinformatics methods. BMC Genomics 17:243.
• 2017. De Novo Assemblies of Three Oryza glaberrima Accessions Provide First Insights about Pan-Genome of African Rices. Genome Biol Evol 9:1–6.
• 2017. New Insights on Leucine-Rich Repeats Receptor-Like Kinase Orthologous Relationships in Angiosperms. Front Plant Sci 8.
• 2017. Evolutionary forces affecting synonymous variations in plant genomes. PLoS Genet 13.
Challenges with Big Genomics Data
Volume
Tens of TB of raw sequence data.
Hundreds of GB of processed and analyzed data.
Velocity
New and improved assemblies and annotation.
New sequencing technologies and lower cost.
Variety
Genomic sequences, RNASeq, RADSeq, etc.
Annotation
Variants
Metabolomics, etc.
• Design techniques that scale up to big data sources
• Integrate the metadata available across data sources, and the data sources themselves
• Metadata: data structure, data semantics based on ontologies, and other useful information about data provenance (publisher, tools, methods, etc.)
• Develop flexible and scalable workflows, from NGS sequence cleaning and assembly to more global genetic analyses (whole-genome QTL analyses, genome-wide association studies)
Provide a graphical interface allowing
non-bioinformaticians to:
• Host / share large amounts of genotyping
data, possibly enriched with metadata
• Apply real-time filters to this data
• Browse selected variants and genotypes
• Re-export the subsets of data thus defined into
various popular formats
A user-friendly tool providing a graphical interface for
filtering large variation datasets.
Sempéré G, Gigascience. 2016 Jun 6;5:25
• Responsive with gigabytes of data
• Single instance handling multiple databases
• 2 setup modes: local instance / data portal
• Data privacy / user permissions implemented
• Variant filtering based on:
– marker features, functional annotations
– genotype patterns, data quality
• Support for haploid, diploid, polyploid data
• No loss of phasing information when provided
• Instant variant density graph display
• Easy connection with IGV (Integrative Genomics Viewer)
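The filtering criteria listed above (marker features, functional annotations, genotype patterns, data quality) can be illustrated with a minimal sketch. This is not the tool's implementation or API: the `Variant` record, its field names and the `filter_variants` parameters are hypothetical, chosen only for illustration. Phased calls such as `0|1` are kept as plain strings, so phasing information is not lost.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout, for illustration only.
    chrom: str
    pos: int
    effect: str        # functional annotation, e.g. "missense"
    genotypes: dict    # sample name -> call such as "0/1", "0|1" or "./."
    quality: float


def missing_rate(variant):
    """Fraction of samples whose genotype call is missing ("./." or ".|.")."""
    calls = list(variant.genotypes.values())
    if not calls:
        return 0.0
    return sum(call.startswith(".") for call in calls) / len(calls)


def filter_variants(variants, chrom=None, effects=None,
                    min_quality=0.0, max_missing=1.0):
    """Yield the variants that pass every filter that was requested."""
    for v in variants:
        if chrom is not None and v.chrom != chrom:
            continue  # marker feature: chromosome
        if effects is not None and v.effect not in effects:
            continue  # functional annotation
        if v.quality < min_quality:
            continue  # data quality
        if missing_rate(v) > max_missing:
            continue  # genotype pattern: too many missing calls
        yield v


# Made-up example data.
EXAMPLE_VARIANTS = [
    Variant("chr1", 100, "missense", {"A": "0/1", "B": "1/1"}, 50.0),
    Variant("chr1", 200, "synonymous", {"A": "./.", "B": "0/0"}, 60.0),
    Variant("chr2", 300, "missense", {"A": "0|1", "B": "0|0"}, 10.0),
]

selected = list(filter_variants(EXAMPLE_VARIANTS, chrom="chr1",
                                effects={"missense"}, min_quality=30))
print([v.pos for v in selected])
```

Writing the filter as a generator means a subset can be streamed straight to an export step without holding the whole dataset in memory, which matters at the data volumes mentioned above.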
Agronomic Linked Data
• Knowledge base that integrates data from a variety of plant
resources.
• Integrate information at different levels.
AgroLD: apply Semantic Web and Linked Data approaches
Data integration challenges
Semantic Web: to remedy the fragmentation of all potentially useful information on the web.
Ontologies: Gene Ontology (GO), Plant Ontology (PO), Plant Trait Ontology (TO), Crop Ontology (CO), Plant Environment Ontology (EO)…
• Retrieve the local neighbourhood of the Oryza sativa japonica protein IAA16 (auxin-responsive protein)
• Identify rice proteins that are involved in root development.
• Retrieve Proteins associated with a given QTL: DTHD (days to
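Questions like the ones above amount to matching patterns over a graph of (subject, predicate, object) statements. The toy sketch below shows the idea in plain Python. It is not the AgroLD schema or query interface: every identifier (`ex:…`, `protein:…`) is hypothetical, and a real linked-data knowledge base would typically be queried with SPARQL rather than by scanning a set in memory.

```python
# Toy triple store: each statement is (subject, predicate, object).
# All identifiers are made up for illustration.
TRIPLES = {
    ("protein:IAA16", "ex:taxon", "Oryza sativa japonica"),
    ("protein:IAA16", "ex:participates_in", "ex:root_development"),
    ("protein:IAA16", "ex:encoded_by", "gene:G1"),
    ("protein:P2", "ex:taxon", "Oryza sativa japonica"),
    ("protein:P2", "ex:participates_in", "ex:flowering"),
    ("protein:P3", "ex:taxon", "Zea mays"),
    ("protein:P3", "ex:participates_in", "ex:root_development"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def neighbourhood(triples, node):
    """Every statement in which the node is the subject or the object."""
    return sorted(t for t in triples if t[0] == node or t[2] == node)


def proteins_in_process(triples, taxon, process):
    """Proteins of a taxon annotated as participating in a process."""
    in_taxon = {s for s, _, _ in match(triples, p="ex:taxon", o=taxon)}
    in_process = {s for s, _, _ in
                  match(triples, p="ex:participates_in", o=process)}
    return sorted(in_taxon & in_process)


print(neighbourhood(TRIPLES, "protein:IAA16"))
print(proteins_in_process(TRIPLES, "Oryza sativa japonica",
                          "ex:root_development"))
```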
Galaxy: an online analysis platform with multiple advantages
• Direct access to computing facilities
• Usable from any computer connected to the internet
• Workflow engine that allows tools to be chained
• Centralisation and sharing of data
• Centralisation and sharing of procedures/analyses between partners => tracking
[Figure: Galaxy interface, with tool box and history panels]
8 highlighted workflows
An effort to spotlight our main workflows, relevant to our research themes => 8 preconfigured and validated workflows:
Highlight 1: Phylogeny and Gene families
M. Rouard JF. Dufayard
S. Bocs D. Larivière
Build gene families:
i) by full proteome clustering
ii) from sequences of interest, searching for homologs
Compute phylogenetic trees
Integrate data around gene families (syntenies, protein domains, gene annotations, gene expression…)
Transfer annotation by comparative analyses
Visualize results
Highlight 2: Structural variations
Highlight 3: SNP analysis and GWAS workflow
G. Andres
G. Sarah
Documentation on our workflows
Galaxy pages including:
- workflow(s)
- datasets
- history of analysis
- documentation
Efforts for development of visualization
• Circos: circular overview of genome content
• JBrowse
• Structure analysis
• Networks (Cytoscape)
Galaxy as a training platform
Data sharing facilities between trainees
Workflows that we hope are “intuitive”
To the North
To the South: Senegal (2012-2013), Vietnam (2015), Burkina Faso (2016)
Galaxy IFB - Elixir - CGIAR : Perspectives
Towards a national Galaxy service:
http://usegalaxy.fr
+ instances focused on specific themes
+ national repository of Galaxy instances
Galaxy as a platform for CGIAR
=> Bioinformatics hackathon at IRRI (Feb 2018)
… and a European Galaxy service:
http://usegalaxy.eu
Galaxy as an Elixir “use case” (French contribution)
Example of contributions: Cassava
A. Gkanogiannis, A. Dereeper, L. A. Becerra
• Diversity, Evolution
• Functional genomics
• Pests and Pathogens
Diversity, Evolution
FST values (in sliding windows and per marker) comparing 17 accessions from RADSeq - R/S Green Mite
Gene | Chromosome | Start | End | Annotation
Manes.17G042000.1.p | Chromosome17 | 17455060 | 17455813 | Toll-Interleukin-Resistance (TIR) domain family protein
Manes.17G042100.1.p | Chromosome17 | 17458851 | 17462826 | disease resistance protein (TIR class), putative
Manes.17G055200.1.p | Chromosome17 | 19287448 | 19291467 | LRR and NB-ARC domains-containing disease resistance protein
Manes.17G072100.1.p | Chromosome17 | 21123079 | 21129453 | LRR and NB-ARC domains-containing disease resistance protein
Manes.17G081500.1.p | Chromosome17 | 22133101 | 22133931 | PLANT CADMIUM RESISTANCE 2
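The step from per-marker FST values to sliding-window FST values can be sketched as follows. This is a simplified illustration, not the analysis actually used: it averages per-marker values inside each window, whereas window-level FST is often computed as a ratio of summed variance components. The function name, window size, step and input values are all assumptions.

```python
def sliding_window_mean(markers, window_size, step):
    """Average per-marker FST in sliding windows along one chromosome.

    markers: list of (position, fst) tuples for a single chromosome.
    Returns a list of (window_start, window_end, mean_fst);
    windows that contain no marker are skipped.
    """
    if not markers:
        return []
    last_position = max(pos for pos, _ in markers)
    windows = []
    start = 0
    while start <= last_position:
        end = start + window_size
        values = [fst for pos, fst in markers if start <= pos < end]
        if values:
            windows.append((start, end, sum(values) / len(values)))
        start += step
    return windows


# Made-up per-marker values, for illustration only.
example_markers = [(100, 0.25), (150, 0.75), (1200, 0.5)]
for window in sliding_window_mean(example_markers, window_size=1000, step=500):
    print(window)
```

Windows with a high mean can then be intersected with gene coordinates, such as the start and end positions in the table above, to list candidate genes.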
New challenges: the characterization
of plant mosaic genomes, the impact
of genomic structural variations, the
Focus
→ Inter(sub)specific hybridization events were frequent during the history of cultivated plants (also animals, humans)
→ Lack of bioinformatic tools to efficiently address these questions
→ Mosaic genomes, hybrids between (sub)species
[Figure: mosaic genome, Chr 1 to Chr 7; sub-species 1 to 4; translocation, inversion]
→ Impact on gene expression and thus phenotypes
→ Impact on gene and thus character transmission
→ Impact on genetic analysis (QTL, GWAS, ...)
→ Impact on fertility
→ Impact on chromosome recombination and transmission
Contribution of genetic groups along the genome?
Idea: perform “chromosome painting” of sequenced accessions
Assign alleles to ancestral groups
Constraints:
• Heterozygosity level varies depending on the Musa acuminata ssp.
• Probable introgressions in some ssp. representatives
Method: multivariate and clustering approaches
• COA
• K-means clustering
Analysis performed on a reduced accession set: polymorphic sites of accessions that are homogeneous according to admixture.
100 713 polymorphic sites
From G. Martin et al.
“Chromosome painting”
Normalization
Simulating ancestral populations based on accessible representatives to infer grouped alleles along chromosomes
* Calculating expected group alleles along chromosomes based on simulation - window size of 400 SNPs
* Calculating this proportion along each accession's chromosomes
[Figure, chr01: expected allele number if both haplotypes are of Gp1 origin; expected allele number if one haplotype is of Gp1 origin; observed allele number of Gp1 origin in the accession; allele number of Gp1 origin found in simulated data not implying Gp1 ancestors. Groups: Gp1 (green), Gp2 (yellow), Gp3 (blue), Gp4 (red)]
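The windowed comparison between observed and expected Gp1 allele numbers can be sketched as below. This is an illustration under stated assumptions, not the published method: windows are taken as non-overlapping, the expected values (which the slides derive from simulation) are passed in as plain numbers, and calling the number of haplotypes by the nearest expected value is a simplification introduced here. Function names are hypothetical.

```python
def window_sums(allele_counts, window_size=400):
    """Sum per-SNP counts of alleles assigned to one group (0, 1 or 2 for
    a diploid accession) over consecutive non-overlapping windows."""
    return [
        sum(allele_counts[i:i + window_size])
        for i in range(0, len(allele_counts) - window_size + 1, window_size)
    ]


def call_haplotypes(observed, expected_none, expected_one, expected_two):
    """Return 0, 1 or 2: the number of haplotypes whose expected allele
    number is closest to the observed one (assumed decision rule)."""
    expected = {0: expected_none, 1: expected_one, 2: expected_two}
    return min(expected, key=lambda n: abs(observed - expected[n]))


# Made-up counts with a tiny window so the example stays readable.
counts = [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
sums = window_sums(counts, window_size=4)
calls = [call_haplotypes(s, expected_none=0, expected_one=4, expected_two=8)
         for s in sums]
print(sums, calls)
```

Here `expected_none` plays the role of the allele number of Gp1 origin found in simulated data not implying Gp1 ancestors, that is, the background level against which a real contribution is judged.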
“Chromosome painting”
Final representation
Monat C, Tranchant-Dubreuil C, Kougbeadjo A et al. (2015) TOGGLE: toolbox for generic NGS analyses. BMC Bioinformatics 16, 374.
From Monat et al, in prep
Comparison of the k-mer
contents of different rice
reference genomes
→ To develop a data
management system
for analyzing collections
of similar genomes
Our hypothesis
- In rice genomes, the number of distinct k-mers (those that appear at least once)
tends to stabilize beyond a certain number x of genomes.
- Adding a new genome to an existing index built on 1000 genomes will be
equivalent to adding position information only.
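The hypothesis can be made concrete with a toy k-mer index. The sketch below is only an illustration: the function name, the dictionary layout and the sequences are assumptions, and it ignores reverse complements and the compact data structures a real index over many genomes would need. It counts how many new distinct k-mers each genome contributes, while always recording positions.

```python
def add_to_index(index, genome_id, sequence, k):
    """Add every k-mer of a sequence to the index.

    index maps k-mer -> list of (genome_id, position).
    Returns the number of k-mers that were not in the index before.
    """
    new_kmers = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer not in index:
            index[kmer] = []
            new_kmers += 1
        index[kmer].append((genome_id, i))
    return new_kmers


# Made-up toy "genomes".
genomes = {"g1": "ACGTACGT", "g2": "ACGTACGA", "g3": "ACGTACGT"}

index = {}
distinct_after_each = []
for genome_id, sequence in genomes.items():
    add_to_index(index, genome_id, sequence, k=4)
    distinct_after_each.append(len(index))

print(distinct_after_each)
```

In this toy case the third genome brings no new k-mer: the set of keys is unchanged and only the position lists grow, which is the behaviour the hypothesis expects once enough genomes have been indexed.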
Labex CeMEB
Centre méditerrané en environnement et biodiversité
Labex agro
Agronomie et Développement Durable
Labex NUMEV
Solutions Numériques, Matérielles et Modélisation pour
l’Environnement et le Vivant
ID EVODYN
Phylogenetics and graph approaches
Population genetics, statistical approaches
CRP-RPB
CRP-GRISP/IRRI/CIAT/AfricaRice
CRP-Legume