
(1)

The South Green Bioinformatics platform, a comprehensive resource for crop genomics

(2)

Who are we?

(3)

Montpellier

Collaborative network of bioinformaticians from different institutes: CIRAD, IRD, INRA and Bioversity International.

Banana, Cacao, Coconut, Coffee, Cotton, Oil Palm, Rice, Sorghum, Sugarcane, …

(4)

Objectives of South Green

• Promote the original tools developed within the network,
• Propose a single web portal to access the tools,
• Promote exchange and collaborative developments within the network,
• Promote interoperability between applications,
• Provide training for biologists and bioinformaticians,
• Promote high-quality processes within the network (e.g. certifications),
• Promote South Green visibility at the national and international level.

(5)

Reference Genomes

Nature Genetics, 2010
Nature Biotechnology, 2014
Nature, 2012
2014

Recent Outputs

2017. The cacao Criollo genome v2.0: an improved version of the genome for genetic and functional genomic studies. BMC Genomics.
2016. Improvement of the banana “Musa acuminata” reference sequence using NGS data and semi-automated bioinformatics methods. BMC Genomics 17:243.
2017. De Novo Assemblies of Three Oryza glaberrima Accessions Provide First Insights about Pan-Genome of African Rices. Genome Biol Evol 9:1–6.
2017. New Insights on Leucine-Rich Repeats Receptor-Like Kinase Orthologous Relationships in Angiosperms. Front Plant Sci 8.
2017. Evolutionary forces affecting synonymous variations in plant genomes. PLoS Genet 13.

(6)

(7)

Challenges with Big Genomics Data

Volume

Tens of TB of raw sequence data.

Hundreds of GB of processed and analyzed data.

Velocity

New and improved assemblies and annotation. New sequencing technologies and lower cost.

Variety

Genomic sequences, RNASeq, RADSeq, etc.

Annotation

Variants

Metabolomic data, etc.

(8)

• Design techniques that scale up to big data sources
• Integrate the metadata available over data sources, and the data sources themselves
• Metadata: data structure and data semantics, based on ontologies, plus other useful information about data provenance (publisher, tools, methods, etc.)
• Develop flexible and scalable workflows, from NGS sequence cleaning and assembly to more global genetic analyses (whole genome QTL analyses, genome wide association studies)

(9)

Providing a graphical interface allowing non-bioinformaticians to:

• Host / share large amounts of genotyping data, possibly enriched with metadata
• Apply real-time filters on this data
• Browse selected variants and genotypes
• Re-export the subsets of data thus defined into various popular formats

(10)

A user-friendly tool providing a graphical interface for filtering large variation datasets.

Sempéré G, GigaScience. 2016 Jun 6;5:25

• Responsive with gigabytes of data
• Single instance handling multiple databases
• 2 setup modes: local instance / data portal
• Data privacy / user permissions implemented

(11)

• Variant filtering based on:
  – marker features, functional annotations
  – genotype patterns, data quality
• Support for haploid, diploid, polyploid data
• No loss of phasing information when provided
• Instant variant density graph display
• Easy connection with IGV (Integrative Genomics Viewer)
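To make the filtering criteria listed above concrete, here is a minimal sketch in plain Python. It is not the tool's actual implementation: the `Variant` class, the field names, and the thresholds are all hypothetical and chosen only for illustration. Genotypes are stored as tuples of allele indices of any length, so haploid, diploid, and polyploid calls are handled the same way.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical in-memory variant record (not the tool's data model)."""
    chrom: str
    pos: int
    var_type: str    # marker feature, e.g. "SNP" or "INDEL"
    genotypes: dict  # sample name -> tuple of allele indices, or None if missing


def missing_rate(variant):
    """Fraction of samples without a genotype call (a data quality criterion)."""
    calls = list(variant.genotypes.values())
    return sum(g is None for g in calls) / len(calls)


def minor_allele_freq(variant):
    """Frequency of the least common observed allele; 0.0 if monomorphic."""
    counts = {}
    for g in variant.genotypes.values():
        if g is None:
            continue
        for allele in g:
            counts[allele] = counts.get(allele, 0) + 1
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    return min(counts.values()) / total


def filter_variants(variants, var_type=None, max_missing=1.0, min_maf=0.0):
    """Keep variants that satisfy every requested criterion."""
    kept = []
    for v in variants:
        if var_type is not None and v.var_type != var_type:
            continue
        if missing_rate(v) > max_missing:
            continue
        if minor_allele_freq(v) < min_maf:
            continue
        kept.append(v)
    return kept


# Made-up example data: two diploid SNPs and one tetraploid indel.
variants = [
    Variant("chr1", 100, "SNP", {"A": (0, 0), "B": (0, 1), "C": (1, 1)}),
    Variant("chr1", 250, "SNP", {"A": (0, 0), "B": None, "C": None}),
    Variant("chr2", 40, "INDEL",
            {"A": (0, 1, 1, 1), "B": (0, 0, 0, 1), "C": (0, 0, 1, 1)}),
]
selected = filter_variants(variants, var_type="SNP", max_missing=0.5, min_maf=0.1)
print([(v.chrom, v.pos) for v in selected])
```

A real deployment would push such filters down to the database rather than loop in memory; the sketch only shows how the criteria combine.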

(12)

(13)

(14)


(15)

(16)

Agronomic Linked Data (AgroLD)

• Knowledge base that integrates data from a variety of plant resources.
• Integrates information at different levels.
• Applies Semantic Web and Linked Data approaches.

(17)

(18)

(19)

Data integration challenges

Semantic Web: to remedy the fragmentation of all potentially useful information on the web.

Ontologies: Gene Ontology (GO), Plant Ontology (PO), Plant Trait Ontology (TO), Crop Ontology (CO), Plant Environment Ontology (EO)…

(20)

• Retrieve the local neighbourhood of Oryza sativa japonica protein: IAA16 - Auxin-responsive protein
• Identify rice proteins that are involved in root development.
• Retrieve proteins associated with a given QTL: DTHD (days to
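The kinds of questions listed above can be illustrated on a toy set of subject-predicate-object triples. This is only a sketch of the linked data idea in plain Python: the identifiers, predicate names, and triples below are all invented for the example and do not reflect the actual knowledge base schema or its content. A real query would be written in SPARQL against the endpoint.

```python
# Toy triples (subject, predicate, object). Every identifier and predicate
# here is hypothetical; only the IAA16 label and taxon come from the slide.
TRIPLES = [
    ("protein:IAA16", "label", "IAA16 - Auxin-responsive protein"),
    ("protein:IAA16", "taxon", "Oryza sativa japonica"),
    ("gene:G1", "encodes", "protein:IAA16"),
    ("protein:P2", "involved_in", "process:root_development"),
    ("protein:P3", "involved_in", "process:root_development"),
    ("protein:P4", "involved_in", "process:flower_development"),
]


def neighbourhood(triples, node):
    """Local neighbourhood: every triple where `node` is subject or object."""
    return [t for t in triples if t[0] == node or t[2] == node]


def subjects_with(triples, predicate, obj):
    """All subjects linked to `obj` through `predicate`, sorted by name."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


for triple in neighbourhood(TRIPLES, "protein:IAA16"):
    print(triple)
print(subjects_with(TRIPLES, "involved_in", "process:root_development"))
```

The first call mirrors the "local neighbourhood" question and the second the "proteins involved in root development" question. Both reduce to pattern matching over triples, which is what a graph query language does at scale.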

(21)

(22)

Galaxy: an online analysis platform with multiple advantages

Direct access to computing facilities

Usable from any computer connected to the internet

Workflow engine for chaining tools

Centralisation and sharing of data

Centralisation and sharing of procedures/analyses between partners => tracking

[Screenshot labels: tool box, history]

(23)

(24)

8 highlighted workflows

Effort to spotlight our main workflows, relevant to our research themes: => 8 preconfigured and validated workflows:

(25)

Highlight 1: Phylogeny and Gene families

M. Rouard, JF. Dufayard, S. Bocs, D. Larivière

Build gene families:

i) by full proteome clustering

ii) from sequences of interest, searching for homologs

Compute phylogenetic trees

Integrate data around gene families (syntenies, protein domains, gene annotations, gene expression…)

Transfer annotation by comparative analyses

Visualize results

(26)

Highlight 2: Structural variations

(27)

Highlight 3: SNP analysis and GWAS workflow

G. Andres

(28)

8 preconfigured and validated workflows

G. Sarah

(29)

8 preconfigured and validated workflows

(30)

Documentation on our workflows

Galaxy pages including:

- workflow(s)
- datasets
- history of analysis
- documentation

(31)

Efforts for development of visualization

(32)

Efforts for development of visualization

Circos - Circular overview of genome content

(33)

Efforts for development of visualization

JBrowse

Circos - Circular overview of genome content

Structure analysis

(34)

Efforts for development of visualization

Networks (Cytoscape)

Circos - Circular overview of genome content

Structure analysis

(35)

Galaxy as a training platform

Data sharing facilities between trainees

Workflows that we hope are “intuitive”

To the North

To the South: Senegal (2012-2013), Vietnam (2015), Burkina Faso (2016)

(36)

Galaxy IFB - Elixir - CGIAR: Perspectives

Towards a national Galaxy service:

http://usegalaxy.fr

+ instances focused on specific themes

+ national repository of Galaxy instances

… and a European Galaxy service:

http://usegalaxy.eu

Galaxy as a platform for CGIAR

=> Bioinformatics hackathon at IRRI (Feb 2018)

Galaxy as an Elixir “Use case” (French contribution)

(37)

Example of contributions: Cassava

A. Gkanogiannis, A. Dereeper, L. A. Becerra

• Diversity, Evolution

• Functional genomics

• Pests and Pathogens

(38)

(39)

Diversity, Evolution

(40)

(41)

(42)

(43)

FST values (in sliding windows and per marker) comparing 17 accessions from RADSeq - R/S Green Mite

Manes.17G042000.1.p  Chromosome17  17455060  17455813  Toll-Interleukin-Resistance (TIR) domain family protein
Manes.17G042100.1.p  Chromosome17  17458851  17462826  disease resistance protein (TIR class), putative
Manes.17G055200.1.p  Chromosome17  19287448  19291467  LRR and NB-ARC domains-containing disease resistance protein
Manes.17G072100.1.p  Chromosome17  21123079  21129453  LRR and NB-ARC domains-containing disease resistance protein
Manes.17G081500.1.p  Chromosome17  22133101  22133931  PLANT CADMIUM RESISTANCE 2
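As a reminder of what "FST per marker and in sliding windows" involves, here is a minimal sketch. The slide does not say which FST estimator was used, so this example assumes the basic heterozygosity-based definition, FST = (H_T - H_S) / H_T, for biallelic markers and two equally weighted groups. The allele frequencies, window size, and step are made up.

```python
def fst_components(p1, p2):
    """Return (H_T - H_S, H_T) for one biallelic marker.

    p1, p2: frequency of the same allele in each of the two groups.
    Assumes both groups carry equal weight (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                        # total heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-group
    return h_t - h_s, h_t


def fst_per_marker(p1, p2):
    """FST of a single marker; 0.0 when the marker is monomorphic overall."""
    num, den = fst_components(p1, p2)
    return num / den if den > 0 else 0.0


def fst_sliding_windows(markers, window, step):
    """Windowed FST as a ratio of sums over the markers in each window.

    markers: list of (position, p1, p2) on one chromosome, sorted by position.
    Returns a list of (window_start, window_end, fst); empty windows are skipped.
    """
    if not markers:
        return []
    results = []
    start = markers[0][0]
    last = markers[-1][0]
    while start <= last:
        end = start + window
        inside = [fst_components(p1, p2)
                  for pos, p1, p2 in markers if start <= pos < end]
        if inside:
            num = sum(n for n, _ in inside)
            den = sum(d for _, d in inside)
            results.append((start, end, num / den if den > 0 else 0.0))
        start += step
    return results


# Made-up allele frequencies in two groups of accessions.
markers = [(100, 1.0, 0.0), (200, 0.5, 0.5), (1500, 0.2, 0.2)]
print([fst_per_marker(p1, p2) for _, p1, p2 in markers])
print(fst_sliding_windows(markers, window=1000, step=1000))
```

Using a ratio of sums rather than a mean of per-marker ratios is a design choice of this sketch; it limits the influence of markers with very low overall diversity. Peaks in the windowed values can then be intersected with gene annotations such as the table above.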

(44)

New challenges: the characterization of plant mosaic genomes, the impact of genomic structural variations, the

(45)

Focus

→ Inter(sub)specific hybridization events were frequent during the history of cultivated plants (and also of animals and humans)

→ Mosaic genomes: hybrids between (sub)species

→ Bioinformatic tools to efficiently address these questions are lacking

→ Impact on gene expression and thus phenotypes

→ Impact on gene and thus character transmission

→ Impact on genetic analysis (QTL, GWAS, ...)

→ Impact on fertility

→ Impact on chromosome recombination and transmission

[Figure: mosaic chromosomes Chr 1 to Chr 7 coloured by origin (sub-species 1 to 4), with examples of translocation and inversion]

(46)

Contribution of each genetic group along the genome?

Idea: Performing “chromosome painting” of sequenced accessions

• Assign alleles to ancestral groups

Constraints:

• Heterozygosity level variable depending on Musa acuminata ssp
• Probable introgressions in some ssp representatives

Method: Multivariate and clustering approaches

• COA
• K-means clustering

Analysis performed on a reduced accession set: polymorphic sites of accessions found to be homogeneous according to admixture analysis.

• 100 713 polymorphic sites

From G Martin et al.

(47)

“Chromosome painting”: Normalization

Simulating ancestral populations based on accessible representatives to infer grouped alleles along chromosomes

* Calculating expected group alleles along chromosome based on simulation - window size of 400 SNPs

* Calculating this proportion along each accession's chromosomes

[Figure: chr01. Legend:]

- Expected allele number if both haplotypes from Gp1 origin
- Expected allele number if one haplotype from Gp1 origin
- Observed allele number of Gp1 origin in the accession
- Allele number of Gp1 origin found in simulated data not implying Gp1 ancestors
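The normalization step can be sketched as follows. This is an illustration under stated assumptions, not the published method: windows are taken as non-overlapping blocks of consecutive SNPs (the slide gives only the size, 400 SNPs), the expected counts are assumed to come from the simulation, and the rule of calling the number of haplotypes by the nearest expected level is a simplification introduced here. All function and parameter names are hypothetical.

```python
def window_sums(values, size):
    """Sum per-SNP values over consecutive, non-overlapping windows."""
    return [sum(values[i:i + size]) for i in range(0, len(values), size)]


def paint_group(observed, expected_one, expected_two, size=400):
    """Compare observed group-specific allele counts with simulated expectations.

    observed:     per-SNP number of alleles of the group seen in the accession
    expected_one: per-SNP expected number if one haplotype is of that origin
    expected_two: per-SNP expected number if both haplotypes are of that origin
    Returns, per window, (observed / expected_two, inferred haplotype count).
    """
    painted = []
    windows = zip(window_sums(observed, size),
                  window_sums(expected_one, size),
                  window_sums(expected_two, size))
    for obs, exp1, exp2 in windows:
        ratio = obs / exp2 if exp2 > 0 else 0.0
        # Simplifying assumption: pick the closest of the three reference levels.
        levels = {0: 0.0, 1: exp1, 2: exp2}
        dose = min(levels, key=lambda k: abs(obs - levels[k]))
        painted.append((ratio, dose))
    return painted


# Made-up counts for 12 SNPs, with a window of 4 SNPs to keep the example small.
observed = [2, 2, 2, 2, 1, 1, 0, 1, 0, 0, 0, 0]
expected_one = [1] * 12
expected_two = [2] * 12
print(paint_group(observed, expected_one, expected_two, size=4))
```

Repeating this for each ancestral group and each chromosome gives, per window, how many of the accession's haplotypes are attributed to each group, which is the information a painted chromosome displays.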

(48)

“Chromosome painting”: Final representation

Legend: Gp1 (green), Gp2 (yellow), Gp3 (blue), Gp4 (red)

(49)

Monat C, Tranchant-Dubreuil C, Kougbeadjo A et al. (2015) TOGGLE: toolbox for generic NGS analyses. BMC Bioinformatics 16, 374.

(50)

From Monat et al, in prep

(51)

Comparison of the k-mer content of different rice reference genomes

→ Develop a data management system for analyzing collections of similar genomes

(52)

Our hypothesis

- In rice genomes, the number of distinct k-mers (those that appear at least once) tends to stabilize beyond a certain number x of genomes.

- Adding a new genome to an existing index built on 1000 genomes will be equivalent to adding position information.
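The first hypothesis can be checked by counting how many distinct k-mers have been seen as genomes are added one by one. The sketch below is illustrative only: the sequences are toy strings, k is tiny, and reverse complements are not merged into canonical k-mers, which a real analysis would probably do. Function names are hypothetical.

```python
def kmers(sequence, k):
    """Set of distinct k-mers (substrings of length k) in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distinct_kmer_growth(genomes, k):
    """Cumulative number of distinct k-mers after adding each genome in turn."""
    seen = set()
    growth = []
    for seq in genomes:
        seen |= kmers(seq, k)
        growth.append(len(seen))
    return growth


# Toy "genomes": the third one brings no k-mer that was not already seen.
genomes = ["ACGTACGT", "ACGTTCGT", "ACGTACGT"]
print(distinct_kmer_growth(genomes, k=3))  # prints [4, 7, 7]
```

A curve that flattens, as in the last step of this toy example, is what the hypothesis predicts: once most k-mers are known, a new genome contributes mainly the positions at which known k-mers occur.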

(53)

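Partners/collaborators

Labex CeMEB: Centre méditerranéen de l’environnement et de la biodiversité (Mediterranean Centre for Environment and Biodiversity)

Labex Agro: Agronomie et Développement Durable (Agronomy and Sustainable Development)

Labex NUMEV: Solutions Numériques, Matérielles et Modélisation pour l’Environnement et le Vivant (Digital and Hardware Solutions and Modelling for the Environment and Life Sciences)

ID EVODYN

Phylogenetics and graph approaches

Population genetics, statistical approaches

CRP-RPB

CRP-GRISP/IRRI/CIAT/AfricaRice

CRP-Legume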

(54)