Reconstructing haplotypes from genotypes in multiparental populations using Artificial Neural Networks


HAL Id: hal-02151672

https://hal.archives-ouvertes.fr/hal-02151672

Submitted on 12 Jun 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Jérémy Andréoletti, Luke Noble, Henrique Teotónio

To cite this version:

Jérémy Andréoletti, Luke Noble, Henrique Teotónio. Reconstructing haplotypes from genotypes in multiparental populations using Artificial Neural Networks. Session Poster pour Immersion Experimentale, May 2019, Paris, France. 2019. ⟨hal-02151672⟩

Aim and methods

• Currently, the best haplotype imputation algorithms are based on Hidden Markov Models (HMM) [4]

• In recent years, geneticists have begun to exploit the power of Convolutional Neural Networks (CNNs) for population genetic inference [5]

• Our goal is to explore the potential and robustness of Neural Networks for haplotype reconstruction, and secondarily for predicting the number of recombination events

• We worked on simulations before applying our model to the C. elegans multiparental experimental evolution (CeMEE) dataset [6], using the Single Nucleotide Polymorphisms (SNPs) of chromosome I as a case study

Jérémy ANDRÉOLETTI, Luke NOBLE and Henrique TEOTÓNIO

Team Experimental Evolutionary Genetics

Artificial Neural Networks

• Previous methods (HMMs) require the non-trivial computation of statistical indicators, whereas Neural Networks (NNs) only need the genotypes

• Training data: thousands of genotypes and their associated haplotypes, obtained by recombining simulated or real founders

• Input: a matrix constructed by alternating one descendant with the 16 founders, together with the genetic map (= probability of recombination between markers)

• Types of Neural Networks employed:

1. CNN = Convolutional Neural Network ⇢ suited to analysing an image piece by piece; reduces the dimensionality

2. LSTM = Long Short-Term Memory ⇢ suited to sequence analysis; returns sequences and refines the results of the CNNs

3. Dense Neural Network ⇢ classical fully connected NN, more efficient on reduced input; returns a multi-categorical classification (= the haplotype)
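The input described above can be sketched in a few lines of Python. This is only one possible reading of "alternating one descendant with the 16 founders": the descendant row is repeated before each founder row, and the genetic map is appended as a final row. The function name `build_input_matrix`, the row order, and the convention that the map has one entry per marker (with 0.0 for the first marker) are assumptions made for illustration, not details given on the poster.

```python
from typing import List, Sequence


def build_input_matrix(
    descendant: Sequence[int],
    founders: Sequence[Sequence[int]],
    genetic_map: Sequence[float],
) -> List[list]:
    """Interleave the descendant genotype with each founder genotype.

    Assumed layout (hypothetical): descendant, founder 1, descendant,
    founder 2, ..., followed by one row holding the genetic map.
    """
    n_markers = len(descendant)
    if any(len(f) != n_markers for f in founders):
        raise ValueError("every founder needs one allele per marker")
    if len(genetic_map) != n_markers:
        raise ValueError("genetic map needs one entry per marker")

    rows: List[list] = []
    for founder in founders:
        rows.append(list(descendant))  # descendant row, repeated
        rows.append(list(founder))     # followed by one founder row
    rows.append(list(genetic_map))     # recombination probabilities
    return rows


# Toy example with 2 founders and 3 markers (the poster uses 16 founders).
matrix = build_input_matrix(
    descendant=[0, 1, 1],
    founders=[[0, 0, 1], [1, 1, 0]],
    genetic_map=[0.0, 0.1, 0.2],
)
```

With 16 founders and M markers this layout gives a 33 × M matrix, which a convolutional layer can scan piece by piece like an image.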


References:

[1] Crow and Kimura. An Introduction to Population Genetics Theory, 1970 (Harper & Row)

[2] Wakeley. Coalescent Theory: an introduction, 2009 (Roberts & Company Pub.)

[3] Rakshit S., Rakshit A. & Patil. Journal of Genetics, 2012 (91: 111)

[4] Zheng, Boer, Eeuwijk. Genetics, 2015 (vol. 200 no. 4 1073-1087) and 2018 (vol. 210 no. 1 71-82)

[5] Flagel, Brandvain, Schrider. Molecular Biology and Evolution, 2019 (vol. 36, p. 220–238)

[6] Noble, Chelo et al. Genetics, 2017 (vol. 207 no. 4 1663-1685)

Results

Simulations (= random founders): 87% accuracy after 100 epochs (= passes over the training data)

Real data (= similar founders): 43% accuracy after 100 epochs (48% if related founders are merged)

[Figure: accuracies for varying genotype similarities between founders]

[Figure: comparisons between a predicted haplotype (1) and the real one (2); axis labels: SNPs along the chromosome, 16 founders]

Conclusion and prospects

• Neural Networks are efficient at predicting haplotypes from genotypes, but the upcoming comparison with HMMs will be crucial to assess their value

• Our model is robust to variation, within a limited range, in the mean or variance of genotype similarity between founders, and it is likely to give accurate predictions for other datasets

• A “scalable” model that requires less memory is also being developed

• Future models should be trainable without the founders, and they may be optimised with a meta-modelling approach to further improve their accuracy

Introduction

• A major goal in population genetics is to predict the genetic history of contemporary populations from sequence data [1] [2]

• In experimental and agricultural genetics there are many cases where multiple founders (of known genotypes) are combined to produce recombinant progeny [3]

• For each descendant, reconstructing its haplotype means finding which regions along the genome descend from which founder

• Limiting factors for haplotype reconstruction are the number of founders, relatedness between them, and the number of generations to the focal descendants
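To make the notion of a reconstructed haplotype concrete, the sketch below represents it as one founder label per marker and compresses that into contiguous segments. The helper name `haplotype_segments` and the integer founder labels are illustrative assumptions, not the poster's own data format.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def haplotype_segments(labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Run-length encode per-marker founder labels.

    Returns (founder, first_marker, last_marker) tuples, with marker
    indices starting at 0 and both ends inclusive.
    """
    segments = []
    start = 0
    for founder, run in groupby(labels):
        length = len(list(run))
        segments.append((founder, start, start + length - 1))
        start += length
    return segments


# A descendant that inherits markers 0-2 from founder 3,
# markers 3-4 from founder 7 and markers 5-8 from founder 1.
labels = [3, 3, 3, 7, 7, 1, 1, 1, 1]
segments = haplotype_segments(labels)
# segments == [(3, 0, 2), (7, 3, 4), (1, 5, 8)]
```

Each boundary between two segments corresponds to one inferred recombination breakpoint.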

Simulation of haplotypes and genotypes, mimicking the experimental design [6]
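A heavily simplified version of such a simulation can be sketched as follows. It draws a single mosaic chromosome by walking along the markers and switching founder with the recombination probability of each interval. It does not reproduce the crossing scheme of [6], diploidy, or genotyping error; the function name, the biallelic 0/1 coding, and the uniform recombination rate are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def simulate_descendant(
    founders: Sequence[Sequence[int]],
    recomb_probs: Sequence[float],
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Return (haplotype, genotype) for one simulated descendant.

    haplotype[i] is the index of the founder that marker i descends from;
    genotype[i] is the allele copied from that founder.
    recomb_probs[i] is the assumed probability of a recombination between
    marker i and marker i + 1, so it has len(founders[0]) - 1 entries.
    """
    n_markers = len(founders[0])
    current = rng.randrange(len(founders))
    haplotype = [current]
    for i in range(1, n_markers):
        if rng.random() < recomb_probs[i - 1]:
            # Recombination: continue on a different founder.
            others = [k for k in range(len(founders)) if k != current]
            current = rng.choice(others)
        haplotype.append(current)
    genotype = [founders[h][i] for i, h in enumerate(haplotype)]
    return haplotype, genotype


rng = random.Random(0)
n_markers = 50
# "Random founders": 16 founders with independent 0/1 alleles.
founders = [[rng.randint(0, 1) for _ in range(n_markers)] for _ in range(16)]
haplotype, genotype = simulate_descendant(founders, [0.1] * (n_markers - 1), rng)
```

Repeating the call thousands of times yields genotype and haplotype pairs of the kind used as training data; only the genotype would be shown to the network, with the haplotype as the target.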

Architecture of our most accurate neural network model for haplotype reconstruction:

A = input formatting, B = dimensionality reduction, C = core analysis, D = prediction checking
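Stage B (dimensionality reduction) can be illustrated with a toy strided 1D convolution followed by max pooling, written in plain Python. The poster does not give kernel sizes, strides or layer counts, so the values below are arbitrary, and a real model would use learned kernels in a deep learning framework rather than these hypothetical helpers.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """'Valid' 1D convolution (cross-correlation, as in deep learning libraries)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def max_pool1d(signal: Sequence[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


# One row of 8 markers is reduced to 3 values.
row = [1, 0, 1, 1, 0, 1, 0, 0]
features = conv1d(row, kernel=[1, 1])   # 7 values
reduced = max_pool1d(features, size=2)  # 3 values
```

The shorter sequence is what makes the later stages cheaper: an LSTM or dense layer then operates on a few summary values per region instead of on every marker.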

Prediction of the number of recombination events
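The poster does not state whether the number of recombination events is predicted directly by the network or derived from the predicted haplotype. A simple baseline, assuming the haplotype is available as one founder label per marker, is to count the positions where the label changes; `count_recombinations` is a hypothetical helper for that baseline.

```python
from typing import Sequence


def count_recombinations(labels: Sequence[int]) -> int:
    """Count founder switches between adjacent markers."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


true_haplotype = [3, 3, 3, 7, 7, 1, 1, 1, 1]
predicted_haplotype = [3, 3, 7, 7, 7, 1, 1, 3, 1]

true_count = count_recombinations(true_haplotype)            # 2
predicted_count = count_recombinations(predicted_haplotype)  # 4
```

The example also shows a weakness of this baseline: a single mislabelled marker inside a segment adds two spurious switches, so isolated prediction errors inflate the count.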