HAL Id: hal-01558227
https://hal.archives-ouvertes.fr/hal-01558227
Submitted on 7 Jul 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Computational methods for comparing and integrating multiple probing assays to predict RNA secondary
structure
Afaf Saaidi, Delphine Allouche, Yann Ponty, Bruno Sargueil, Mireille Regnier
To cite this version:
Afaf Saaidi, Delphine Allouche, Yann Ponty, Bruno Sargueil, Mireille Regnier. Computational meth- ods for comparing and integrating multiple probing assays to predict RNA secondary structure . Doctorial journey Interface, Ecole polytechnique, Palaiseau, Nov 2016, Palaiseau, France. 10. �hal- 01558227�
Computational methods for comparing
and integrating multiple probing assays to predict RNA secondary structure
Afaf Saaidi
1,2,3, Delphine Allouche
3,4Supervisors: Yann Ponty
1,2,3, Bruno sargueil
3,4, Mireille Regnier
1,21-AMIB team, Inria Saclay 2-LIX ,3-CNRS ,4-Laboratoire de cristallographie et RMN Biologiques, Paris Descartes
1-Introduction
• RNA is key to understand many biological processes.
• RNA maintains a stable tertiary structure.
• The determination of the structure allows understanding its operating mechanism.
• We study the 444nt long VIH1 Gag-IRES.
RNA Structure determination
• 3D structure can be resolved experimentally [remains expensive and time-consuming].
• Computational methods allows to have accurate
secondary structure predictions (PPV ≈ 75%). Less accurate predictions for long RNA.
• + Experimental Data [Chemical(SHAPE) \ Enzymatic]
improve predictions.
State of the art
Evolution of computational approaches to predict the RNA secondary structure:
1981
Thermo
dynamic
model
[4]
1990
Partition
function
[McCaskill]
1999
Structures
sampling
app roach
2003
Sto
chastic samp
ling [2]
2004
Exp
erimental data
probing
2009
SHAPE data
as
pseudo-free
energy [1]
2016
Ongoing
work
>
Objectives
• Apply sampling approach with SHAPE data to predict the RNA secondary structure.
2-Material & Methods 2-1 Experimental data
SHAPE-Map experiments High Throughput Sequencing
SHAPE reactivity calculation
Reactivity(n) = mutShape(n) − mutControl(n)
mutDenatured(n) [3] V-Enzymatic cleavage targets paired nucleotides.
T-Enzymatic cleavage reveals unpaired nucleotides.
Boltzmann probability to observe a structure S: P (S) = e−EkTS
Z .
with Z the partition function:Z = X
S0∈S e
−ES0 kT .
For a cluster C
• Coherence ≡ Base pairs Mean Distance in C.
• Diversity ≡ Number of present conditions.
• Stability ≡ Sum of Boltzmann probabilities.
2-2 Sampling/Clustering workflow
experimental data from different conditions
A B C D
Data processing
A B C D
Structure sampling
***** ***** ***** *****
Set of ensemble structures
** ** ** ~~** ** ~~**
** ** ~~** ** ** **
Clustering[Affinity propagation]
** ****** **** *****
Optimal clusters?
Diversity
Maximization –> Pareto Frontier
Optimal Centroid Structures
Stability Coherence
3-Results
Optimal centroid structures from 140. [8000 structures]
A UG GG U GC G A G A G C G U C G G U A U U A A G CGGG G GA G A A U U A G A U A A A U GG GA A A A A A U U C G GUUAA G G C CA GG GG GA A A G A A A C A A U A U A A A C U A A A ACA U A U A G U A U G G GCA A GCA G GGA G C U A G A A C G A U U C GCA G U U A A U C C U GG C CU U U U A G A G A C A U C A G A A G G C UGU A G A C A A A U A C U G G G A C A GCU A C A A C C AUC C C U U C A G A C A G G A U C A G A A G A A CUU A GAUCA U U AUA U A A U A C A A U A G C A G U C C U C U A U U G UGU G CA UC A A A G G A U A G A U G UAA A A G ACA C C A A G G A A G C C U U A G A U A A G A U AG A G GA A G A G C A A AAC A A A A G U A A G A A A A A G G C ACA GCA A GC AGCA GCU G A C A C A G G A A A C A ACA GCC A G G UCA GC CA A A A U U A C C C U A U A G U G C A G A A C C U C C A G G G G C A A AU G G UA C A U C A G G CCA U A UCA
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 444
-5.02.4060000000000006
HIV
A UG GG U GC G A G A G C G U C G G U A U U A A G CGGG G GA G A A U U A G A U A A A U GG GA A A A A A U U C G GUUAA G G C CA GG GG GA A A G A A A C A A U A U A A A C U A A A ACA U A U A G U A U G G GCA A GCA G GGA G C U A G A A C G A U U C GCA G U U A A U C C U GG C CU U U U A G A G A C A U C A G A A G G C UGU A G A C A A A U A C U G G G A C A GCU A C A A C C AUC C C U U C A G A C A G G A U C A G A A G A A CUU A GAUCA U U AUA U A A U A C A A U A G C A G U C C U C U A U U G UGU G CA UC A A A G G A U A G A U G UAA A A G ACA C C A A G G A A G C C U U A G A U A A G A U AG A G GA A G A G C A A AAC A A A A G U A A G A A A A A G G C ACA GCA A GC AGCA GCU G A C A C A G G A A A C A ACA GCC A G G UCA GC CA A A A U U A C C C U A U A G U G C A G A A C C U C C A G G G G C A A AU G G UA C A U C A G G CCA U A UCA
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 444
-5.02.4060000000000006
HIV
A UG GG U GC G A G A G C G U C G G U A U U A A G CGGG G GA G A A U U A G A U A A A U GG GA A A A A A U U C G GUUAA G G C CA GG GG GA A A G A A A C A A U A U A A A C U A A A ACA U A U A G U A U G G GCA A GCA G GGA G C U A G A A C G A U U C GCA G U U A A U C C U GG C CU U U U A G A G A C A U C A G A A G G C UGU A G A C A A A U A C U G G G A C A GCU A C A A C C AUC C C U U C A G A C A G G A U C A G A A G A A CUU A GAUCA U U AUA U A A U A C A A U A G C A G U C C U C U A U U G UGU G CA UC A A A G G A U A G A U G UAA A A G ACA C C A A G G A A G C C U U A G A U A A G A U AG A G GA A G A G C A A AAC A A A A G U A A G A A A A A G G C ACA GCA A GC AGCA GCU G A C A C A G G A A A C A ACA GCC A G G UCA GC CA A A A U U A C C C U A U A G U G C A G A A C C U C C A G G G G C A A AU G G UA C A U C A G G CCA U A UCA
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 444
-5.02.4060000000000006
HIV
A UG GG U GC G A G A G C G U C G G U A U U A A G CGGG G GA G A A U U A G A U A A A U GG GA A A A A A U U C G GUUAA G G C CA GG GG GA A A G A A A C A A U A U A A A C U A A A ACA U A U A G U A U G G GCA A GCA G GGA G C U A G A A C G A U U C GCA G U U A A U C C U GG C CU U U U A G A G A C A U C A G A A G G C UGU A G A C A A A U A C U G G G A C A GCU A C A A C C AUC C C U U C A G A C A G G A U C A G A A G A A CUU A GAUCA U U AUA U A A U A C A A U A G C A G U C C U C U A U U G UGU G CA UC A A A G G A U A G A U G UAA A A G ACA C C A A G G A A G C C U U A G A U A A G A U AG A G GA A G A G C A A AAC A A A A G U A A G A A A A A G G C ACA GCA A GC AGCA GCU G A C A C A G G A A A C A ACA GCC A G G UCA GC CA A A A U U A C C C U A U A G U G C A G A A C C U C C A G G G G C A A AU G G UA C A U C A G G CCA U A UCA
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 444
-5.02.4060000000000006
HIV
A UG GG U GC G A G A G C G U C G G U A U U A A G CGGG G GA G A A U U A G A U A A A U GG GA A A A A A U U C G GUUAA G G C CA GG GG GA A A G A A A C A A U A U A A A C U A A A ACA U A U A G U A U G G GCA A GCA G GGA G C U A G A A C G A U U C GCA G U U A A U C C U GG C CU U U U A G A G A C A U C A G A A G G C UGU A G A C A A A U A C U G G G A C A GCU A C A A C C AUC C C U U C A G A C A G G A U C A G A A G A A CUU A GAUCA U U AUA U A A U A C A A U A G C A G U C C U C U A U U G UGU G CA UC A A A G G A U A G A U G UAA A A G ACA C C A A G G A A G C C U U A G A U A A G A U AG A G GA A G A G C A A AAC A A A A G U A A G A A A A A G G C ACA GCA A GC AGCA GCU G A C A C A G G A A A C A ACA GCC A G G UCA GC CA A A A U U A C C C U A U A G U G C A G A A C C U C C A G G G G C A A AU G G UA C A U C A G G CCA U A UCA
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 444
-5.02.4060000000000006
HIV
A UG GG U GC G A G A G C G U C G G U A U U A A G CGGG G GA G A A U U A G A U A A A U GG GA A A A A A U U C G GUUAA G G C CA GG GG GA A A G A A A C A A U A U A A A C U A A A ACA U A U A G U A U G G GCA A GCA G GGA G C U A G A A C G A U U C GCA G U U A A U C C U GG C CU U U U A G A G A C A U C A G A A G G C UGU A G A C A A A U A C U G G G A C A GCU A C A A C C AUC C C U U C A G A C A G G A U C A G A A G A A CUU A GAUCA U U AUA U A A U A C A A U A G C A G U C C U C U A U U G UGU G CA UC A A A G G A U A G A U G UAA A A G ACA C C A A G G A A G C C U U A G A U A A G A U AG A G GA A G A G C A A AAC A A A A G U A A G A A A A A G G C ACA GCA A GC AGCA GCU G A C A C A G G A A A C A ACA GCC A G G UCA GC CA A A A U U A C C C U A U A G U G C A G A A C C U C C A G G G G C A A AU G G UA C A U C A G G CCA U A UCA
1 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 444
-5.02.4060000000000006
HIV
4-Conclusion & perspectives
• We have obtained a set of models supported by our integrative approach, those models are subject to validation.
• Some of centroid structures have shown high
compatibility with existing proposed structural models.
• We will extend the approach to the simultaneous analysis of probing data for a set of RNA variants.
References
[1] Katherine E. Deigan, Tian W. Li, David H. Mathews, and Kevin M. Weeks.
Accurate SHAPE-directed RNA structure determination.
106(1):97–102, jan 2009.
[2] Lawrence CE. Ding.
A statistical sampling algorithm for RNA secondary structure prediction.
Nucleic Acids Research, 31(24):7280–7301, 2003.
[3] Matthew J Smola, Greggory M Rice, Steven Busan, Nathan A Siegfried, and Kevin M Weeks.
Selective 2’-hydroxyl acylation analyzed by primer extension and mutational profiling (SHAPE-MaP) for direct, versatile and accurate RNA structure analysis.
Nature protocols, 10(11):1643–1669, 2015.
[4] Michael Zuker and Patrick Stiegler.
Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information.
Nucleic Acids Research, 9(1):133–148, jan 1981.
Acknowledgments
PhD funded by the "Fondation pour la Recherche Médicale".
Contact address
saaidi.afaf@polytechnique.edu