Récemment recherché

Aucun résultat trouvé

Étiquettes

Aucun résultat trouvé

Document

Aucun résultat trouvé

Accueil Écoles Thèmes

Connexion

Clustering strings with mutations using an expectation-maximization algorithm In the context of RNA structure prediction

Partager "Clustering strings with mutations using an expectation-maximization algorithm In the context of RNA structure prediction"

N/A

N/A

Protected

Année scolaire: 2021

Info

Protected

Academic year: 2021

Partager "Clustering strings with mutations using an expectation-maximization algorithm In the context of RNA structure prediction"

Copied!

2

0

0

2

0

0

Chargement.... (Voir le texte intégral maintenant)

Télécharger maintenant ( 2 Page )

Texte intégral

(1)

HAL Id: hal-02332313

https://hal.inria.fr/hal-02332313

Submitted on 24 Oct 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Clustering strings with mutations using an

expectation-maximization algorithm In the context of RNA structure prediction

Afaf Saaidi, Yann Ponty, Mireille Regnier

To cite this version:

Afaf Saaidi, Yann Ponty, Mireille Regnier. Clustering strings with mutations using an expectation- maximization algorithm In the context of RNA structure prediction. 34th Clemson Mini-Conference on Discrete Mathematics and Algorithms, Oct 2019, Clemson, United States. �hal-02332313�

(2)

,

Clustering strings with mutations using an expectation-maximization algorithm

In the context of RNA structure prediction

Afaf Saaidi ^1,2 Yann Ponty ² Mireille R´egnier ²

1 School of Mathematics, Georgia Institue of Technology , ² Laboratory of Computer Sciences LIX, Ecole Polytechnique, France

,

Outline

In comparative analysis, an RNA structure (a set of base pairs and unpaired nu- cleotides) is predicted from a set of RNA variants (similar sequences) under the as- sumption of the conservation of the structure during evolution.

The combination of RNA variants with Experimental data informing about the local (nucleotide) structure may lead to more accurate structure prediction.

The experimental protocol consists of mutating nucleotides likely to be ’unpaired’.

A simultaneous reading of RNA variants sequences that underwent the experimental mutation protocol lead to the following issue:

How to cluster ’mutated’ substrings of similar parent strings such that each substring is correctly assigned to its parent string?

We developed an Expectation Maximization algorithm that uses Mutational profiles (mutation distributions) to assign the substrings to their strings of origin.

RNA structure

I RNA is key to understand many biological processes (As in viral RNA).

I RNA maintains a stable functional structure during its evolution.

I Computational methods allow to have accurate 2D structure predictions (PPV ≈ 75%), less accurate predictions for long RNA.

I + Experimental probing data informing about local (nucleotide) structure improve predictions.

Figure 1: RNA secondary structure model of the HIV1 Gag-IRES with the projection of Experimental probing data [1]

Mutational profiles

A set of 3 similar strings with symbols in the Alphabet Σ = {A, C , G , U } p ₁ p ₂ p ₃ p ₄ p ₅ p ₆ p ₇ p ₈ p ₉ p ₁₀

WT C G A C G C U C U U V ₁ C G A C U C A C U U V ₂ U A A C G C U C U U

WT

V1 V2 Mutation of ’unpaired’ Nucleotide

The assignment problem

Given the substring r ₁ CAAC:

Variant V Locus (r ₁ , V ) Distance(r ₁ , V )

WT p ₁ 1

V ₁ p ₁ 1

V ₂ p ₁ 1

Q.: Since r ₁ is showing the same nucleotide distance to all the variants, what is its string of origin ?

→ Mutational Profiles with the Expectation Maximization algorithm.

The objective

Our goal is to resolve the assignment problem: given sub- strings derived from a set of similar sequences and where structural mutations happen, what are the original strings of each substring?

Key idea of using the Expectation Maximization algorithm I Circular dependency between the mutational profiles and the

assignments of substrings [2]

→ Performs a joint inference of the mutational profile and the assignment

s _i e _i WT r _i

L(θ; x , z ) =

r

Y

i =1

V

Y

j =1

f (x _i ; M _j ) V

1

¹

_zi =j

I θ: Mutation probabilities θ = {M _j } ^V _j ₌₁

I M: Mutational profile Mj : [1, n] × {m, m} → ¯ [0, 1] for v _j I Density function:

f (x _i ; M _j ) = Q

k ∈[s _i ,e _i ]

M _j (k , x _i _,k _−s _i ) I Assignment probability:

T _j ^(t _,i ⁾ := P (Z _i = j | X _i = x _i ; θ ^(t ⁾ ) =

f

x _i ; M _j ^(t ⁾

V

P

j ⁰ =1

f

x _i ; M _j ^(t 0 ⁾

Steps on EM for the substrings assignment

1. Initialization: For each substring j choose the initial estimates M _j (k , c ) ⁽⁰⁾ , c ∈ N , and compute the initial loglikelihood.

2. E step: Compute the density function f (x _i ; M _j ^(t ⁾ ).

3. M step: Update T _j _,i then, compute the new estimates M _j (k , c ) ^(t ⁺¹⁾ , c ∈ N = {m, m}. ¯

4. Iterate until convergence.

Results on simulated substrings with mutation

EM-assignment TMAP (MapQ ≤ 1) Set Iteration ET(s) correct incorrect correct incorrect

1 1 292 2144 239 148

2 295 2141

117 1444 992

1000 5713.5 1538 898

2 1 983 1243 215 2

2 1096 1130

295 1122 1104

1000 5537.8 1138 1088

References

[1] J. Deforges, S. de Breyne, M. Ameur, N. Ulryck, N. Chamond, A. Saaidi, Y. Ponty, T. Ohlmann, and B. Sargueil.

Two ribosome recruitment sites direct multiple translation events within HIV1 Gag.

Nucleic Acids Research, 2017.

[2] Afaf Saaidi.

Multi-dimensional probing for rna secondary structure(s) prediction.

Ph.D. thesis, Chap.6, 2018.

Acknowledgment

I Ph.D. grant (2015-2018), Fondation pour la Recherche M´edicale, France.

I Post-doc NIH grant (2019-2022).

math.gatech.edu School of Mathematics [email protected]

Références

Télécharger maintenant ( PDF - 2 Page - 0.91 MB )

Documents relatifs

A generic structure-from-motion algorithm for cross-camera scenarios

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Using Visible Light Communication in the Smart City Context

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Non-convex onion peeling using a shape hull algorithm

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Source and channel adaptive rate control for multicast layered video transmission based on a clustering algorithm

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Variable selection for model-based clustering using the integrated complete-data likelihood

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Protein structure mining using a structural alphabet.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Aircraft ground traffic optimisation using a genetic algorithm

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

On an algorithm for optimal control using Pontryagin's maximum principle

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

Documents relatifs

A parallel transitive closure algorithm using hash-based clustering

A parallel transitive closure algorithm using hash-based clustering

26

0

0

Constrained Clustering using SAT

Constrained Clustering using SAT

13

0

0

Post-Retrieval Clustering Using Third-Order Similarity Measures

Post-Retrieval Clustering Using Third-Order Similarity Measures

7

0

0

Supercapacitors aging diagnosis using least square algorithm

Supercapacitors aging diagnosis using least square algorithm

6

0

0

MLE for partially observed diffusions : direct maximization vs. The EM algorithm

MLE for partially observed diffusions : direct maximization vs. The EM algorithm

47

0

0

Using conceptual clustering to learn about function, structure and behaviour in design

Using conceptual clustering to learn about function, structure and behaviour in design

22

0

0

An Algorithm for Estimating all Matches Between Two Strings

An Algorithm for Estimating all Matches Between Two Strings

18

0

0

A classification EM algorithm for clustering and two stochastic versions

A classification EM algorithm for clustering and two stochastic versions

19

0

0