HAL Id: hal-02734486
https://hal.inrae.fr/hal-02734486
Submitted on 2 Jun 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Building artificial genetical genomic datasets to optimize the choice of gene regulatory network inference methods
Lise Pomies, Louise Gody, Charlotte Penouilh-Suzette, Nicolas Langlade, Brigitte Mangin, Simon de Givry
To cite this version:
Lise Pomies, Louise Gody, Charlotte Penouilh-Suzette, Nicolas Langlade, Brigitte Mangin, et al..
Building artificial genetical genomic datasets to optimize the choice of gene regulatory network in- ference methods. 17th European Conference on Computational Biology (ECCB 2018), Sep 2018, Athènes, Greece. 2018. �hal-02734486�
3 missings 2 falses
1 false 2 missings
1 false
Best Results
Method 1 Method 2 Method 3
Test of different inference methods on artificial expression datasets
The method with the best result on the artificial data set
will be used in the real data set
close properties
g1 g2 g3 g4 g5
h1 h2 h3 h4 h5 h6 h7
Unknown
Real dataset
genotypes (hybrids)
genes
g1 g2 g3 g4 g5
h1 h2 h3 h4 h5 h6 h7
Artificial dataset
genes
genotypes (hybrids)
Lise Pomiès1, Louise Gody2, Charlotte Penouilh-Suzette2, Nicolas Langlade2, Brigitte Mangin2, Simon de Givry1
1 Unité de Mathématiques et Informatique Appliquées de Toulouse
MIAT INRA - UR875 - Chemin de Borde Rouge, 31320 Castanet Tolosan, France 2 Laboratoire des interactions plantes micro-organismes
LIPM INRA - UMR2594 – Chemin de Borde Rouge, 31320 Castanet Tolosan, France
We study the transcriptomic response of sunflower to drought combined to the heterosis phenomenom, across 180 gene expressions on 400 hybrids genotypes, coming from a pool of 72 parents. SNP present on the parental genomes were measured.
Our goal is to infer the gene regulatory network among those genes. However, because of the non- independency of the data, accuracy of inference results is unpredictibl. Therefore, we need to test different methods, to select the best inference method for our biological question.
How to build an artificial dataset with the same biological properties as our real one?
Building artificial genetical genomic datasets to optimize the choice of gene regulatory inference methods
Test different methods on an artificial dataset with known network, expression levels and genotypes (DNA variant). Biological properties of this artificial dataset must be closed to the properties of the real dataset.
1. Build artificial network
based on real biological information available for the same biological process, on a close organism
3. Select and adapt an existing gene expression simulator
emulating the same type of experiment that the one we performed, with steady state measurements on different genotypes
G62
G15 G29G30
G144
G32 G71
G6 G119 G63
G24
G90 G3
G19G22 G20 G111
G21
G136 G72
G141
G41 G7
G116G101
G44G43
G57 G58
G31 G64
G132
G135
G9 G17 G53
G51
G25 G68
G87 G121
G108 G2 G1
G75
G85 G77 G84
G78 G33
G117
G122 G79
G39 G10
G46 G145
G37
G70 G109
G76 G91
G18 G128
G88G89
G45 G93G92 G142
G126 G137
G4 G139
G146
G138
G140
G36
G95 G26
G96
G35
G106
G14
G23 G82
G38 G125 G97
G83 G113
G133
G47 G80G81
G105 G143
G54 G130 G42
G118
G16
G5 G34
G28
G27
G115 G112
G134
G12 G13G73
G99 G59
G55
G100 G60
G49G50 G48
G11 G52
G69
G56 G40 G124
G131 G74
G129 G127
G65
G67 G86
G66 G114
G120 G123
For the homologues on A. thaliana of our 180 genes of interest, we selected regulations described between them in 3 databases.
AtRegNet [1]
AtPID [2]
PlantRegMap [3]
Regulations can be supported by experiments or predicted (----)
Our artifical network based on biological information is composed of 137 genes linked by 364 edges
SNP for each parental genotype are associated to a score 0 if it is like in XRQ-line or 1 if different. The hybrids SNP are obtained by combining locus-per-locus SNP of their parents.
We created artificial hybrids associated to DNA variant on each measured gene. We considered one variant per gene, those DNA variants are based on SNP of the real data.
A. Selection of SNP on each parental genotype
Ex : gene HanXRQChr001g0030841
B. K-medoid clustering
Manhattan distance on the SNP data
Genotypes are classified in 2 groups Cluster with XRQ - DNA variant score = 0
Other Cluster - DNA variant score = 1
XRQ
C. Artifcial hybrids 1 DNA variant per gene
DNA variant of hybrid
= mean score of it parents 3 possible value per gene
Collection of hybrid genotypes, with known DNA variations on our genes of interest.
0 homozygous = XRQ variant 1 homozygous ≠ XRQ variant 0.5 heterozygous
2. Create artificial hybrid genotypes
based on genomic information available for the real hybrids used in the experiment
[mRNA] of gene g basal transcription
rate of gene g degradation rate
constant of gene g
effect of gene g SNP(c) on gene g expression
SNP(c) noise
for each gene k
degradation noise role of gene k
on gene g [mRNA] of
gene k
min [mRNA] of gene k for k to have an effect
on gene g
effect of SNP(t) of k on it activity (more or less
efficient regulator)