
Building artificial genetical genomic datasets to optimize the choice of gene regulatory network inference methods


HAL Id: hal-02734486

https://hal.inrae.fr/hal-02734486

Submitted on 2 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Building artificial genetical genomic datasets to optimize the choice of gene regulatory network inference methods

Lise Pomies, Louise Gody, Charlotte Penouilh-Suzette, Nicolas Langlade, Brigitte Mangin, Simon de Givry

To cite this version:

Lise Pomies, Louise Gody, Charlotte Penouilh-Suzette, Nicolas Langlade, Brigitte Mangin, et al. Building artificial genetical genomic datasets to optimize the choice of gene regulatory network inference methods. 17th European Conference on Computational Biology (ECCB 2018), Sep 2018, Athens, Greece. 2018. ⟨hal-02734486⟩


Test of different inference methods on artificial expression datasets

[Figure: the real dataset (genes g1 to g5 measured on hybrid genotypes h1 to h7, network unknown) next to an artificial dataset with close properties (genes g1 to g5, hybrid genotypes h1 to h7). Three inference methods (Method 1, Method 2, Method 3) are run on the artificial dataset; the example outcomes shown are 3 missing and 2 false edges, 1 false and 2 missing edges, and 1 false edge (best result).]

The method with the best result on the artificial dataset will be used on the real dataset.

Lise Pomiès1, Louise Gody2, Charlotte Penouilh-Suzette2, Nicolas Langlade2, Brigitte Mangin2, Simon de Givry1

1 Unité de Mathématiques et Informatique Appliquées de Toulouse, MIAT INRA - UR875 - Chemin de Borde Rouge, 31320 Castanet Tolosan, France

2 Laboratoire des interactions plantes micro-organismes, LIPM INRA - UMR2594 - Chemin de Borde Rouge, 31320 Castanet Tolosan, France

We study the transcriptomic response of sunflower to drought, combined with the heterosis phenomenon, through the expression of 180 genes measured on 400 hybrid genotypes derived from a pool of 72 parents. The SNPs present in the parental genomes were also measured.

Our goal is to infer the gene regulatory network among those genes. However, because the data are not independent, the accuracy of the inference results is unpredictable. We therefore need to test different methods in order to select the best inference method for our biological question.

How to build an artificial dataset with the same biological properties as our real one?


Test different methods on an artificial dataset with a known network, known expression levels and known genotypes (DNA variants). The biological properties of this artificial dataset must be close to those of the real dataset.

1. Build artificial network

based on real biological information available for the same biological process, on a close organism

3. Select and adapt an existing gene expression simulator

emulating the same type of experiment as the one we performed, with steady-state measurements on different genotypes

[Figure: graph of the artificial gene regulatory network; each node is a gene, labelled by its identifier (for example G62, G15, G29).]

For the A. thaliana homologues of our 180 genes of interest, we selected the regulations between them that are described in three databases:

AtRegNet [2]

AtPID [1]

PlantRegMap [3]

Regulations are either supported by experiments or predicted (shown as dashed lines).

Our artificial network, based on this biological information, is composed of 137 genes linked by 364 edges.

For each parental genotype, every SNP is given a score of 0 if it matches the XRQ line and 1 if it differs. The SNPs of a hybrid are obtained by combining the SNPs of its parents locus by locus.

We created artificial hybrids, each carrying a DNA variant on every measured gene. We considered one variant per gene; these DNA variants are based on the SNPs of the real data.

A. Selection of the SNPs of each parental genotype

Example: gene HanXRQChr001g0030841

B. K-medoid clustering

Manhattan distance on the SNP data. Genotypes are classified into 2 groups:

Cluster containing XRQ: DNA variant score = 0

Other cluster: DNA variant score = 1

C. Artificial hybrids, 1 DNA variant per gene

DNA variant of a hybrid = mean score of its parents, which gives 3 possible values per gene:

0: homozygous, XRQ variant

0.5: heterozygous

1: homozygous, variant different from XRQ

Result: a collection of hybrid genotypes with known DNA variations on our genes of interest.
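Steps B and C above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parent names `P1` to `P3` and their SNP profiles are invented toy data, the helper names are hypothetical, and the exhaustive search over medoid pairs is a simple stand-in for whatever k-medoid implementation was actually used.

```python
from itertools import combinations

def manhattan(a, b):
    """Manhattan distance between two SNP score vectors (0 = like XRQ, 1 = different)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def two_medoid_clusters(profiles):
    """Exhaustive 2-medoid clustering; acceptable for a sketch, too slow for large panels."""
    names = list(profiles)
    best = None
    for m1, m2 in combinations(names, 2):
        cost = sum(min(manhattan(profiles[n], profiles[m1]),
                       manhattan(profiles[n], profiles[m2])) for n in names)
        if best is None or cost < best[0]:
            best = (cost, m1, m2)
    _, m1, m2 = best
    # Assign every genotype to its closest medoid.
    return {n: m1 if manhattan(profiles[n], profiles[m1])
                     <= manhattan(profiles[n], profiles[m2]) else m2
            for n in names}

def parental_variant_scores(profiles, reference="XRQ"):
    """Score 0 for the cluster containing the reference line, 1 for the other cluster."""
    clusters = two_medoid_clusters(profiles)
    return {n: 0 if clusters[n] == clusters[reference] else 1 for n in profiles}

def hybrid_variant(scores, parent_a, parent_b):
    """DNA variant of a hybrid = mean score of its parents (0, 0.5 or 1)."""
    return (scores[parent_a] + scores[parent_b]) / 2

# Toy SNP profiles for one gene (hypothetical data, 4 SNPs per parent).
snp_profiles = {
    "XRQ": [0, 0, 0, 0],
    "P1":  [0, 0, 0, 1],
    "P2":  [1, 1, 1, 1],
    "P3":  [1, 1, 0, 1],
}
scores = parental_variant_scores(snp_profiles)
print(scores)                              # {'XRQ': 0, 'P1': 0, 'P2': 1, 'P3': 1}
print(hybrid_variant(scores, "P1", "P2"))  # 0.5
```

In this toy example the two parents whose SNPs resemble XRQ receive score 0, and a cross between the two clusters yields the heterozygous value 0.5.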

2. Create artificial hybrid genotypes

based on genomic information available for the real hybrids used in the experiment

[Equation figure: SysGenSIM differential equation for the mRNA concentration of gene g. Annotated terms:]

[mRNA] of gene g
basal transcription rate of gene g
degradation rate constant of gene g
effect of the SNP(c) of gene g on the expression of gene g
SNP(c) noise
degradation noise
for each gene k:
  role of gene k on gene g
  [mRNA] of gene k
  minimum [mRNA] of gene k for k to have an effect on gene g
  effect of the SNP(t) of k on its activity (more or less efficient regulator)

SysGenSIM simulates steady-state gene expression using ordinary differential equations.

The simulation is based on a gene network topology and a DNA variant for each gene.

It works only on recombinant inbred lines (RILs), in which both alleles of a gene are identical [4].

We modified the simulator to handle our heterogeneous hybrids and to mimic the allelic dominance caused by the heterosis phenomenon.
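To make the idea of a steady-state simulation concrete, here is a minimal sketch of a Hill-type model integrated with Euler steps until it settles. It is not SysGenSIM itself: the exact equation is given in Pinna et al. [4], the noise terms are left out, and every name and value below (`steady_state`, `V`, `lam`, `Zc`, `Zt`, the two-gene toy network) is an assumption made for illustration.

```python
def steady_state(V, lam, Zc, Zt, regulators, dt=0.01, steps=5000):
    """Integrate a simplified, noise-free expression model until it settles.

    V[g]   basal transcription rate of gene g
    lam[g] degradation rate constant of gene g
    Zc[g]  effect of the cis variant of g on its own expression
    Zt[k]  effect of the trans variant of k on its activity as a regulator
    regulators[g] = {k: (A, K, h)} with A the role of k on g (positive =
    activation), K the threshold [mRNA] of k, and h a Hill coefficient.
    """
    n = len(V)
    G = [1.0] * n  # arbitrary starting concentrations
    for _ in range(steps):
        updated = []
        for g in range(n):
            production = V[g] * Zc[g]
            for k, (A, K, h) in regulators.get(g, {}).items():
                hill = G[k] ** h / (G[k] ** h + (K / Zt[k]) ** h)
                production *= 1 + A * hill
            updated.append(G[g] + dt * (production - lam[g] * G[g]))
        G = updated
    return G

# Toy network (hypothetical): gene 0 activates gene 1.
toy_regulators = {1: {0: (1.0, 1.0, 1)}}
wild_type = steady_state([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0], toy_regulators)
cis_variant = steady_state([1.0, 1.0], [1.0, 1.0], [0.75, 1.0], [1.0, 1.0], toy_regulators)
print(wild_type, cis_variant)
```

In the toy run, lowering `Zc` of gene 0 to 0.75 reduces its own steady-state level and, through the regulation, the level of gene 1 as well.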

SysGenSIM: Z parameter, 2 possible values

wt-wt: 1
m-m: 0.75

Modified SysGenSIM: Z parameter, 3 possible values

wt-wt: 1
m-m: 0.75
wt-m: 0.75 (mutated dominance, 10%), 0.87 (additive effect, 80%) or 1 (wt dominance, 10%)
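The three-valued Z parameter for hybrids can be sketched as follows. This is a minimal illustration, not the modified SysGenSIM code. It assumes that a DNA variant score of 0 corresponds to wt-wt and 1 to m-m, that the 10% / 80% / 10% figures are per-gene draw probabilities, and that the additive value is the midpoint of the two homozygous values (0.875, shown as 0.87 on the poster). The function name `z_parameter` is hypothetical.

```python
import random

Z_WT = 1.0    # Z value for a wt-wt genotype
Z_MUT = 0.75  # Z value for an m-m genotype (example value from the poster)

def z_parameter(variant_score, rng):
    """Return the Z value for one gene of one genotype.

    variant_score: 0 (assumed wt-wt), 1 (assumed m-m) or 0.5 (wt-m).
    rng: a random.Random instance, used only for heterozygous genes.
    """
    if variant_score == 0:
        return Z_WT
    if variant_score == 1:
        return Z_MUT
    u = rng.random()
    if u < 0.10:                   # mutated allele dominant (10%)
        return Z_MUT
    if u < 0.90:                   # additive effect (80%)
        return (Z_WT + Z_MUT) / 2
    return Z_WT                    # wt allele dominant (10%)

rng = random.Random(0)
print([z_parameter(score, rng) for score in (0, 1, 0.5, 0.5, 0.5)])
```

Passing the random generator explicitly keeps the draws reproducible, which helps when the same artificial hybrids must be regenerated for several simulator settings.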

We use a mixed model to estimate the heritability [5].

[Figure: heritability in the artificial dataset and in the real dataset. Labels: heritability, expression variance, hybrid, expression level.]

4. Compare a biological score

obtained on the real and simulated datasets (in our case, the heritability score) to adjust the parameters of the simulator
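One possible way to automate this comparison is to simulate with several candidate settings and keep the one whose heritability distribution is closest to the real one. The sketch below uses the two-sample Kolmogorov-Smirnov distance as the closeness measure; the poster does not say how the distributions were compared, so this choice is an assumption, and all heritability values are invented toy numbers.

```python
def ks_distance(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    def ecdf(sample, x):
        return sum(value <= x for value in sample) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

def closest_setting(real, simulated_by_setting):
    """Return the simulator setting whose distribution is closest to the real one."""
    return min(simulated_by_setting,
               key=lambda setting: ks_distance(real, simulated_by_setting[setting]))

# Toy per-gene heritabilities (hypothetical numbers, not results from the poster).
real_heritability = [0.2, 0.4, 0.5, 0.6, 0.8]
simulated_heritability = {
    0.25: [0.05, 0.10, 0.15, 0.20, 0.30],   # candidate cis effect 25%
    0.50: [0.25, 0.35, 0.50, 0.65, 0.75],   # candidate cis effect 50%
    0.75: [0.70, 0.80, 0.85, 0.90, 0.95],   # candidate cis effect 75%
}
best_setting = closest_setting(real_heritability, simulated_heritability)
print(best_setting)
```

With real data, the heritabilities would come from the mixed model mentioned above rather than from hand-written lists.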

[1] Li et al. (2011). AtPID: The overall hierarchical functional protein interaction network interface and analytic platform for arabidopsis. Nucleic Acids Research, 39(SUPPL. 1), 1130–1133.

[2] Palaniswamy et al. (2006). AGRIS and AtRegNet. A Platform to Link cis-Regulatory Elements and Transcription Factors into Regulatory Networks. Plant Physiology, 140(3), 818–829.

[3] Jin et al. (2017). PlantTFDB 4.0: Toward a central hub for transcription factors and regulatory interactions in plants. Nucleic Acids Research, 45(D1), D1040–D1045.

[4] Pinna et al. (2011). Simulating systems genetics data with SysGenSIM. Bioinformatics, 27(17), 2459–2462.

[5] Mangin et al. (2017). Genomic Prediction of Sunflower Hybrids Oil Content. Frontiers in Plant Science, 8, 1–12.

The artificial dataset produced has the same biological properties as our real dataset. We can now test different network inference methods and assess their accuracy by comparing the networks inferred by the algorithms to the artificial network. The network inference methods with the best results will be used on the experimental dataset to answer our biological question.
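The comparison between an inferred network and the artificial network can be sketched as a count of missing and false edges. This is a minimal illustration with invented edges and method names; ranking the methods by their total number of errors is an assumed criterion, not necessarily the one used by the authors.

```python
def compare_networks(true_edges, inferred_edges):
    """Count missing and false edges of an inferred network against the known one.

    Edges are directed (regulator, target) pairs.
    """
    true_set, inferred_set = set(true_edges), set(inferred_edges)
    correct = true_set & inferred_set
    return {
        "missing": len(true_set - inferred_set),  # true edges that were not inferred
        "false": len(inferred_set - true_set),    # inferred edges that do not exist
        "precision": len(correct) / len(inferred_set) if inferred_set else 0.0,
        "recall": len(correct) / len(true_set) if true_set else 0.0,
    }

# Toy artificial network and toy predictions (hypothetical).
true_network = {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")}
predictions = {
    "Method 1": {("g1", "g2"), ("g5", "g1")},
    "Method 2": {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g2", "g5")},
    "Method 3": {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5"), ("g1", "g3")},
}
results = {name: compare_networks(true_network, edges)
           for name, edges in predictions.items()}
best_method = min(results, key=lambda name: results[name]["missing"] + results[name]["false"])
print(results)
print(best_method)
```

Precision and recall are reported alongside the raw counts because a method that predicts very few edges can have few false edges while still missing most of the network.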

Select inference methods based on artificial datasets

Heritability: % of the expression variance of a gene explained by its parent genotypes

Funded by the SunRise project (2011-2019): http://www.sunrise-project.fr

SysGenSIM parameters are adjusted to obtain the same heritability distribution.

[Figure panels: SNP cis effect 25%, SNP cis effect 50%, SNP cis effect 75%.]
