HAL Id: hal-02818506
https://hal.inrae.fr/hal-02818506
Submitted on 6 Jun 2020
The PROTEUS project: Improving large-scale structure prediction for proteins
T. Simonson, R. Andonov, Jean-François Gibrat, J. Pothier, Yh Sanejouand
To cite this version:
T. Simonson, R. Andonov, Jean-François Gibrat, J. Pothier, Yh Sanejouand. The PROTEUS project: Improving large-scale structure prediction for proteins. journée restitution ANR Calcul Intensif et Simulation, Feb 2010, Paris, France. 39 p. ⟨hal-02818506⟩
T Simonson, R Andonov, JF Gibrat, J Pothier, YH Sanejouand
Ecole Polytechnique, IRISA-Rennes, INRA-Jouy, Paris VII, ENS Lyon
http://migale.jouy.inra.fr/proteus
http://biology.polytechnique.fr/proteinsathome
2010: year XV of the genomic era
1995 Complete sequence of Haemophilus influenzae (2 × 10^6 nucleotides)
1996 Complete sequence of yeast (10^7 nucleotides, ~5000 genes)
2000 Arabidopsis thaliana; fruit fly (10^8 nucleotides)
2004 Homo sapiens, complete sequence (3 × 10^9 nucleotides, 25000 genes)
~4000 genomes sequenced; ~10^7 proteins, but:
[Figure: Sequence, Structure, Function]
The folding problem: can we predict protein structure?
[Figure: an unfolded protein and the folded protein, with chain segments numbered 1 to 7]
Protein fold space is finite and discrete
K L H G G P M L D S D Q K F W R T P A A L H Q N E G F T
?
The fold recognition problem: for a given sequence,
determine to which fold it belongs.
Classic (but difficult) strategy: test the sequence against a library of structural models
[Figure: a sequence of unknown structure is optimally “aligned” with (or “threaded” onto) a simplified 3D structural model from the library, followed by energy evaluation]
The lowest energy alignment defines the optimal 3D model.
Improved structural libraries
Improved sequence/structure alignment methods
Improved scoring functions
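The classic strategy above can be illustrated with a minimal sketch. It assumes a toy template in which each structural position is only labeled buried (`B`) or exposed (`E`), and a made-up energy that rewards hydrophobic residues in buried sites. The names `site_energy`, `thread` and `recognize_fold`, the gap penalty and all energy values are hypothetical; this is not the scoring function used in the project.

```python
# Toy threading: align a sequence onto a structural template by dynamic
# programming, minimizing a made-up environment energy. Illustrative only.

HYDROPHOBIC = set("AVLIMFWC")
GAP_PENALTY = 1.0  # hypothetical cost of skipping a residue or a template site


def site_energy(residue: str, environment: str) -> float:
    """Hypothetical energy of placing `residue` in a buried ('B') or exposed ('E') site."""
    buried = environment == "B"
    hydrophobic = residue in HYDROPHOBIC
    return -1.0 if buried == hydrophobic else 1.0


def thread(sequence: str, template: str) -> float:
    """Return the lowest alignment energy of `sequence` on `template`."""
    n, m = len(sequence), len(template)
    # best[i][j]: lowest energy for aligning sequence[:i] with template[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * GAP_PENALTY
    for j in range(1, m + 1):
        best[0][j] = j * GAP_PENALTY
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = min(
                best[i - 1][j - 1] + site_energy(sequence[i - 1], template[j - 1]),
                best[i - 1][j] + GAP_PENALTY,  # residue left unaligned
                best[i][j - 1] + GAP_PENALTY,  # template site left empty
            )
    return best[n][m]


def recognize_fold(sequence: str, library: dict[str, str]) -> str:
    """Pick the library model with the lowest threading energy."""
    return min(library, key=lambda name: thread(sequence, library[name]))
```

With a two-model library such as `{"fold_a": "BEBE", "fold_b": "EBEB"}`, the sequence `LKLK` is assigned to `fold_a`, because every residue then sits in its preferred environment.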
“Inverse” strategy (new, but speculative): generate theoretical sequences from a library of structural models; test experimental sequences against these
Search sequence and conformation space for preferred sequences
Generate sequences using DIRECTED EVOLUTION: random mutagenesis + selective pressure
[Figure: example sequences: DKPAIFTDLGDWV..., EKPLEVDDAAEWS..., MKPVTLTDVAEYA..., QKPVSLSDVGEFA..., AHGSQNTTlLILP..., PLIKRYWWNAQAG..., GHYILKQSANCCM..., FKPIEASDIAEFV...]
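The "random mutagenesis + selective pressure" loop can be sketched as a Metropolis Monte Carlo search in sequence space. This is one common way to implement in silico directed evolution, not necessarily the project's exact procedure. The energy function `toy_energy` (mismatches to a target sequence) is a hypothetical stand-in for a real folding energy, and the temperature and step count are arbitrary.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(sequence: str, target: str) -> float:
    """Hypothetical stand-in for a folding energy: mismatches to `target`."""
    return float(sum(a != b for a, b in zip(sequence, target)))


def evolve(start, energy, steps=2000, temperature=0.5, seed=0):
    """Return the lowest-energy sequence visited and its energy."""
    rng = random.Random(seed)
    current = list(start)
    current_e = energy("".join(current))
    best, best_e = "".join(current), current_e
    for _ in range(steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)  # random mutagenesis
        new_e = energy("".join(current))
        delta = new_e - current_e
        # Selective pressure: always accept improvements, accept
        # deteriorations with Boltzmann probability (Metropolis rule).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = "".join(current), current_e
        else:
            current[pos] = old  # reject the mutation
    return best, best_e
```

Calling `evolve("A" * 10, lambda s: toy_energy(s, "DKPAIFTDLG"))` drives the population of one sequence toward low-energy variants; in a real application the low-energy sequences visited along the way would be collected as the theoretical sequence ensemble.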
Test similarity against the sequence of unknown structure:
K L H G G P M L D S D Q K F W R T P A A L H Q N E G F T ?
The inverse protein folding problem: can we predict
which sequences adopt a given fold?
The fold recognition problem: for a given sequence,
determine to which fold it belongs.
Trp-cage protein = 20 amino acids:
20^20 ≈ 10^26 possible sequences
1.9 × 10^46 possible conformations
100 amino acids:
1.3 × 10^130 sequences
2.4 × 10^231 conformations
Even with a discretized conformational space for the sidechains, the size of the problem is enormous.
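These orders of magnitude can be checked with a few lines of arithmetic. The sequence counts are simply 20^N. For the conformation counts, the sketch below assumes about 206 rotamers per position, all amino acid types combined; that value is back-calculated here from the quoted 1.9 × 10^46 and is not stated in the source.

```python
N_TYPES = 20                 # amino acid types
ROTAMERS_PER_POSITION = 206  # assumption: back-calculated, not from the source


def sci(n: int) -> str:
    """Format a positive integer as 'm.m x 10^e' (one decimal in the mantissa)."""
    exponent = len(str(n)) - 1
    mantissa = n / 10 ** exponent
    return f"{mantissa:.1f} x 10^{exponent}"


for length in (20, 100):
    print(length, "residues:",
          sci(N_TYPES ** length), "sequences,",
          sci(ROTAMERS_PER_POSITION ** length), "conformations")
```

Under this assumption the same per-position count reproduces both quoted conformation numbers (20 and 100 residues), which suggests they come from one rotamer library.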
Inverse folding problem:
for a given “backbone structure”, or fold,
which sidechains “fit”?
Interaction energy = molecular mechanics
= van der Waals + Coulomb
+ dielectric continuum solvent
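As a rough illustration of such an energy function, the sketch below adds a Lennard-Jones (van der Waals) term and a Coulomb term for one pair of atoms. The solvent is reduced to a uniform dielectric constant, which is a much cruder treatment than a real continuum solvent model; the function name `pair_energy` and all parameter values are assumptions chosen for the example.

```python
COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), standard conversion factor


def pair_energy(distance, q1, q2, epsilon_lj, sigma, dielectric=80.0):
    """Interaction energy (kcal/mol) of two atoms at `distance` (Angstrom).

    q1, q2     : partial charges in units of e
    epsilon_lj : Lennard-Jones well depth (kcal/mol)
    sigma      : Lennard-Jones contact distance (Angstrom)
    dielectric : uniform dielectric constant, a crude stand-in for the solvent
    """
    ratio6 = (sigma / distance) ** 6
    van_der_waals = 4.0 * epsilon_lj * (ratio6 ** 2 - ratio6)
    coulomb = COULOMB_CONSTANT * q1 * q2 / (dielectric * distance)
    return van_der_waals + coulomb
```

For example, `pair_energy(4.0, 0.55, -0.55, 0.1, 3.5)` is lower than the same call with two charges of equal sign, as expected for an attractive pair.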
[Figure: sidechain chemical groups annotated with atomic partial charges (.35, -.30, -.55, .55, -.35, .25)]
Discrete conformational space: sidechain rotamers
[Figure: a structure with positions 1, 2, 3, each allowing types A and B in rotamers 1 and 2, and the corresponding energy matrix indexed by position, type and rotamer]
Download our screensaver from biology.polytechnique.fr/proteinsathome and help predict protein structures!
T. Simonson, D. Mignon, M. Schmidt am Busch, A. Lopes, C. Bathelt, 2008, in Distributed Computing: Principles and Applications, Tektum Press, Berlin. See also Chemical & Engineering News, April 2007.
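The energy matrix idea (positions 1 to 3, types A and B, rotamers 1 and 2) can be sketched as a precomputed lookup table: once every pair of sidechain states has an energy, the energy of any sequence and conformation is a sum of table entries. The pair energy `toy_pair_energy` below is a hypothetical placeholder, and self (sidechain-backbone) terms are left out for brevity.

```python
from itertools import combinations, product

POSITIONS = (1, 2, 3)
TYPES = ("A", "B")
ROTAMERS = (1, 2)
# One state = (position, type, rotamer): 12 states in this toy system.
STATES = list(product(POSITIONS, TYPES, ROTAMERS))


def toy_pair_energy(s, t):
    """Hypothetical pair energy: favorable when the two types differ."""
    return -1.0 if s[1] != t[1] else 0.5


def build_energy_matrix(pair_energy):
    """Precompute energies for all pairs of states at different positions."""
    matrix = {}
    for s, t in combinations(STATES, 2):
        if s[0] != t[0]:
            energy = pair_energy(s, t)
            matrix[(s, t)] = energy
            matrix[(t, s)] = energy
    return matrix


def total_energy(assignment, matrix):
    """Sum pair energies for one state per position (table lookups only)."""
    return sum(matrix[(s, t)] for s, t in combinations(assignment, 2))


def best_assignment(matrix):
    """Exhaustive search, feasible only for toy sizes."""
    choices = [[s for s in STATES if s[0] == p] for p in POSITIONS]
    return min(product(*choices), key=lambda a: total_energy(a, matrix))
```

Exhaustive search is only possible here because the toy system has 4^3 = 64 combinations; for real proteins the matrix is searched with heuristic or stochastic methods.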
I. Quality control for the theoretical sequences: SH3 domains as a test family
767 known sequences
67 known 3D structures
Generation of 25 × 450,000 theoretical sequences
Evaluation by standard fold recognition tools: 81% of low energy sequences recognized as SH3 domains
More recently: 24 SH2 domains, 17 PDZ domains, 8 Kunitz domains, 12 chemokines, 6 caspases
Generation of 67 × 200,000 theoretical sequences
Evaluation by standard fold recognition tools: ~85% of low energy sequences recognized as SH3 domains
II. Using the theoretical sequences to classify natural proteins: “structure prediction”
Generation of 25 × 450,000 theoretical sequences
Simple statistical model representing the SH3 family (mean sequence)
Query experimental sequence databases (10^7 proteins)
675 SH3 sequences retrieved (88% coverage + 5% false positives)
No new SH3 sequences :(
767 known SH3 sequences
67 known 3D structures
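A minimal sketch of this workflow: build a position-specific profile from a set of theoretical sequences, score database sequences against it, and compute coverage. Reading "mean sequence" as a log-odds profile with a uniform background is an assumption made for this example, as are the function names and the pseudocount. The sequences are the truncated examples shown on the slides, used here only as toy data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(sequences, pseudocount=1.0):
    """Per-position log-odds scores relative to a uniform background."""
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*sequences):  # iterate over alignment columns
        counts = Counter(column)
        profile.append({
            aa: math.log((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(query, profile):
    """Sum of log-odds scores of `query` against the profile (no gaps)."""
    return sum(column[aa] for aa, column in zip(query, profile))


def coverage(retrieved, known):
    """Fraction of known family members that were retrieved."""
    return retrieved / known


theoretical = ["DKPAIFTDLGDWV", "EKPLEVDDAAEWS", "MKPVTLTDVAEYA", "QKPVSLSDVGEFA"]
profile = build_profile(theoretical)
print(score("FKPIEASDIAEFV", profile) > score("PLIKRYWWNAQAG", profile))
print(round(coverage(675, 767), 2))  # the 88% coverage quoted above
```

A real search would also need gapped alignment and a score threshold; the threshold choice is what trades coverage against false positives.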
Two more test families: Kunitz domains and chemokines
91% coverage, no false positives :))
Theoretical sequences can assist fold recognition for some families
Generation of 20 × 200,000 theoretical sequences
Simple statistical model representing the family (mean sequence)
Query experimental sequence databases (10^7 proteins)
[Figure: structure with secondary structure elements labeled S1, S4, S5, S6, S7, H1, H2; positions 17, 19, 96, 63, 42]
Pattern frequencies:
HDWWW 16.4%
DLWWL 9.6%
WDWWW 6.0%
DLWWH 5.7%
LLWWW 5.4%
DDWWH 3.5%
LDWWW 3.0%
HDDDW 2.5%
HDWAW 1.9%
WLWWW 1.8%
Mean sequence is not an optimal descriptor
of the theoretical sequence ensemble, as
shown by covariance analysis
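One simple way to see what a mean sequence misses is to measure the mutual information between two alignment columns: a per-position profile treats columns independently, so any nonzero value is information it cannot represent. Mutual information is used here as an illustrative covariance measure; the source does not say which measure was used, and the two-letter sequences are toy data.

```python
import math
from collections import Counter


def mutual_information(sequences, i, j):
    """Mutual information (bits) between alignment columns i and j."""
    n = len(sequences)
    count_i = Counter(s[i] for s in sequences)
    count_j = Counter(s[j] for s in sequences)
    count_ij = Counter((s[i], s[j]) for s in sequences)
    return sum(
        (c / n) * math.log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )


correlated = ["HD", "HD", "WL", "WL"]   # H always pairs with D, W with L
independent = ["HD", "HL", "WD", "WL"]  # same column frequencies, no coupling
print(mutual_information(correlated, 0, 1))
print(mutual_information(independent, 0, 1))
```

Both toy alignments have identical per-column frequencies, hence the same mean sequence, yet only the first one carries a coupling between the two positions.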
Directed evolution yields good quality sequences, which can “fool” standard fold recognition tools.
The sequences reproduce the physical-chemical properties and the diversity of the natural families.
Their performance for retrieving natural homologues is good for some families.
Improved descriptors of the theoretical sequence ensembles should lead to improved homologue retrieval.
Volunteer distributed computing provides a unique computational resource.
Polytechnique: Thomas Simonson; Najette Amara (PhD student); Audrey Sedano (PhD student); David Mignon (engineer, IE EP); Marcel Schmidt am Busch (postdoc); Anne Lopes (PhD student); Christine Bathelt (postdoc); G Archontis (University of Cyprus)
MIG/Jouy INRA: JF Gibrat (INRA); Sophie Schbath (INRA); Francois Rodophe (INRA); Afshin Fayyaz (postdoc); Valentin Loux (engineer, IE INRA); Christophe Caron (engineer, IE INRA)
ABI/U. Paris 7: Joel Pothier; Guillaume Santini (postdoc); Anne-Laure Abraham (PhD student)
IRISA/Rennes: Rumen Andonov; Francois Coste (INRIA); Y Vutov (engineer, IE CDD); Goulven Kerbellec (PhD student); Noel Malod-Dognin (PhD student); Guillaume Collet (PhD student); Tristan Feildel (M2 student); Guillaume Launay (postdoc); Nicola Yanev (University of Sofia)
ENS Lyon: Yves-Henri Sanejouand (CR1 CNRS); Brice Juanico (postdoc)
PUBLICATIONS
Inverse folding problem
Lopes, Aleksandrov, Bathelt, Archontis, Simonson, Proteins 2007
Schmidt am Busch, Lopes, Mignon, Simonson, J Comp Chem 2008
Schmidt am Busch, Lopes, Amara, Bathelt, Simonson, BMC Bioinf 2008
Schmidt am Busch, Mignon, Simonson, Proteins 2009
Lopes, Schmidt am Busch, Simonson, J Comp Chem 2009
Schmidt am Busch, Sedano, Simonson, PLoS One 2010
Mignon, Schmidt am Busch, Sedano, Sanejouand, Simonson, in preparation
Classic fold recognition
Sam, Tai, Garnier, Gibrat, Lee, Munson, BMC Bioinf 2008
Andonov, Yanev, Malod-Dognin, Lect Notes Bioinf 2008
Yanev, Andonov, Veber, et al., Computers & Math with Applications 2008
Abraham, Rocha, Pothier, Bioinf 2008
Taly, Marin, Gibrat, BMC Bioinf 2008
Andonov, Collet, Gibrat, Poirriez, Yanev, Grid Comp for Bioinformatics & Compl Biol 2008
Collet, Andonov, Yanev, Gibrat, Discrete Appl Math, submitted
Malod-Dognin, Andonov, Yanev, J Comp Biol, submitted
Download our screensaver from
biology.polytechnique.fr/proteinsathome
and help design new proteins
Search sequence and structure space for new functional proteins
A powerful technology for protein engineering: in silico directed evolution
[Figure: example sequences: DKPAIFTDLGDWV..., EKPLEVDDAAEWS..., MKPVTLTDVAEYA..., QKPVSLSDVGEFA..., AHGSQNTTlLILP..., PLIKRYWWNAQAG..., GHYILKQSANCCM..., FKPIEASDIAEFV...]
Structural model of both states
Free energy function
Sequence score: ΔG = G_folded − G_unfolded
[Figure: Escherichia coli tyrosyl-tRNA synthetase (TyrRS); L-Tyr bound to TyrRS]
We are using computational design and directed evolution
Optimizing and testing the energy function
Complete redesign of ~100 proteins
Identity scores compared to the native proteins (earlier work / our work / shared proteins):
Baker, 2005: 37% / 33%
Handel, 2005: 31-41% / 32% / 1
Pande, 2003: 26% / 38% / 1
Wodak, 2002: 24% / 35% / 8
Koehl, Levitt, 1999: 26% / 32% / 2
Mayo, Dahiyat, 1997: 36% / 52% / 1
PSI-BLAST is profile-based and carries no information on covariances: we need a better statistical model to describe our sequences.
Parameterization of profile HMMs with the SAM package (David M)
Analyze correlations within our designed sequences (YHS)
Identify important patterns, and group sequences accordingly.
Construct a PSSM from each group.
Perform PSI-BLAST searches separately with each PSSM.
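The grouping step can be sketched as follows: read the residues at a few chosen positions, use the resulting pattern as a group key, and report how frequent each pattern is. Each group would then be turned into its own PSSM for a separate search. The positions, function names and toy sequences are hypothetical; how the important positions are identified is not covered here.

```python
from collections import defaultdict


def group_by_pattern(sequences, positions):
    """Group sequences by the residues found at `positions` (0-based)."""
    groups = defaultdict(list)
    for seq in sequences:
        pattern = "".join(seq[p] for p in positions)
        groups[pattern].append(seq)
    return dict(groups)


def pattern_frequencies(groups):
    """Fraction of sequences in each group, most frequent first."""
    total = sum(len(members) for members in groups.values())
    ranked = sorted(groups.items(), key=lambda item: -len(item[1]))
    return {pattern: len(members) / total for pattern, members in ranked}


toy_sequences = ["HADWW", "HCDWW", "DALWL", "WADWW"]
groups = group_by_pattern(toy_sequences, positions=(0, 2))
print(pattern_frequencies(groups))
```

Because each group is homogeneous at the correlated positions, a per-group profile keeps the couplings that a single family-wide profile would average out.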
Detection by SUPERFAMILY profile HMMs: mean detection rate 81% (100% with reset PBPs)
Sequence entropy: mean entropy 2.8 (experimental), 3.0 (designed)
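Sequence entropy can be computed per alignment column and averaged, as in the sketch below. The source does not state the exact definition or units behind the quoted values, so both the Shannon entropy in nats and its exponential (an effective number of amino acid types per position) are shown; which one the slide uses is not assumed.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (nats) of the residue distribution in one column."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())


def mean_entropy(sequences):
    """Average column entropy over an ungapped alignment."""
    entropies = [column_entropy(column) for column in zip(*sequences)]
    return sum(entropies) / len(entropies)


def effective_types(sequences):
    """exp(mean entropy): effective number of residue types per position."""
    return math.exp(mean_entropy(sequences))
```

Comparing the value for designed sequences with the value for the natural family indicates whether the designed ensemble is as diverse as the experimental one.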
FROST analysis
Family: SH3
# of sequences: 2800
No hit: 433
Incorrect hits: 8
Correct hits: 2360
Number of correct hits: 5.2
Number of incorrect hits: 0.1