
(1)

HAL Id: hal-02818506

https://hal.inrae.fr/hal-02818506

Submitted on 6 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

The PROTEUS project: Improving large-scale structure prediction for proteins

T. Simonson, R. Andonov, Jean-François Gibrat, J. Pothier, Yh Sanejouand

To cite this version:

T. Simonson, R. Andonov, Jean-François Gibrat, J. Pothier, Yh Sanejouand. The PROTEUS project: Improving large-scale structure prediction for proteins. journée restitution ANR Calcul Intensif et Simulation, Feb 2010, Paris, France. 39 p. ⟨hal-02818506⟩

(2)

T Simonson, R Andonov, JF Gibrat, J Pothier, YH Sanejouand

Ecole Polytechnique, IRISA-Rennes, INRA-Jouy, Paris VII, ENS Lyon

http://migale.jouy.inra.fr/proteus

http://biology.polytechnique.fr/proteinsathome

The PROTEUS project: Improving large-scale structure prediction for proteins

(3)

2010: year XV of the genomic era

1995 Complete sequence of Haemophilus influenzae (2 × 10^6 nucleotides)

1996 Complete sequence of yeast (10^7 nucleotides, ~5000 genes)

2000 Arabidopsis thaliana; fruit fly (10^8 nucleotides)

2004 Homo sapiens, complete sequence (3 × 10^9 nucleotides, 25000 genes)

~4000 genomes sequenced; ~10^7 proteins, but:

(4)

Sequence → Structure → Function

(5)

The folding problem: can we predict protein structure?

[Figure: an unfolded protein, with segments numbered 1 to 7, and the corresponding folded protein]

(7)

Protein fold space is finite and discrete

(8)

K L H G G P M L D S D Q K F W R T P A A L H Q N E G F T

?

The fold recognition problem: for a given sequence,

determine to which fold it belongs.

(9)

Classic (but difficult) strategy: test the sequence against a library of structural models

[Figure: a sequence of unknown structure is optimally “aligned” with a (simplified) 3D structural model from the library (“alignment” or “threading”), followed by an energy evaluation]

The lowest energy alignment defines the optimal 3D model.
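The threading idea can be illustrated with a deliberately small sketch. It assumes a gapless alignment and a toy environment energy (hydrophobic residues favoured in buried sites); the burial labels, the energy values and the function names are illustrative assumptions, not the project's scoring function or alignment method.

```python
# Toy gapless threading: slide a template along a sequence, keep the lowest energy.
# All energies and burial labels below are illustrative assumptions.

HYDROPHOBIC = set("AVILMFWC")


def residue_energy(aa, buried):
    """Hypothetical environment energy for one residue in one template site."""
    if buried:
        return -1.0 if aa in HYDROPHOBIC else 1.0
    return 0.5 if aa in HYDROPHOBIC else -0.5


def thread(sequence, burial):
    """Return (offset, energy) of the best gapless placement of the template."""
    n = len(burial)
    if len(sequence) < n:
        raise ValueError("sequence is shorter than the template")
    best = None
    for offset in range(len(sequence) - n + 1):
        energy = sum(
            residue_energy(sequence[offset + i], buried)
            for i, buried in enumerate(burial)
        )
        if best is None or energy < best[1]:
            best = (offset, energy)
    return best


def recognize_fold(sequence, library):
    """Pick the template whose best threading has the lowest energy."""
    scores = {name: thread(sequence, burial) for name, burial in library.items()}
    return min(scores.items(), key=lambda item: item[1][1])


# Hypothetical two-template library: True marks a buried site.
library = {
    "alt": [True, False, True, False],
    "core": [True, True, False, False],
}
print(recognize_fold("DKVLDK", library))
```

Real threading methods allow gaps and use much richer energy terms, which is what makes the classic strategy difficult.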

(10)

Improved structural libraries

Improved sequence/structure alignment methods

Improved scoring functions

(13)

“Inverse” strategy (new, but speculative): generate theoretical sequences from a library of structural models; test experimental sequences against these

Search sequence- and conformation-space for preferred sequences

Generate sequences using DIRECTED EVOLUTION: random mutagenesis + selective pressure

DKPAIFTDLGDWV...
EKPLEVDDAAEWS...
MKPVTLTDVAEYA...
QKPVSLSDVGEFA...
AHGSQNTTlLILP...
DKPAIFTDLGDWV...
EKPLEVDDAAEWS...
PLIKRYWWNAQAG...
MKPVTLTDVAEYA...
GHYILKQSANCCM...
FKPIEASDIAEFV...
QKPVSLSDVGEFA...

Test similarity: sequence of unknown structure
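A minimal sketch of "random mutagenesis + selective pressure" in silico, assuming a Metropolis acceptance rule as the selection step (the slides do not say which rule is used). The energy function here is a placeholder that counts mismatches to a hypothetical target; in practice it would be a physical folding energy.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq, target):
    """Placeholder energy: mismatches to a hypothetical target sequence."""
    return float(sum(a != b for a, b in zip(seq, target)))


def evolve(start, energy, steps=2000, temperature=0.5, seed=0):
    """Point mutations accepted or rejected with the Metropolis criterion."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        pos = rng.randrange(len(current))
        mutant = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        e_mutant = energy(mutant)
        delta = e_mutant - e_current
        # Selective pressure: always accept improvements, sometimes accept losses.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = mutant, e_mutant
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


target = "DKPAIFTDLGDWV"  # used only as an example string
start = "A" * len(target)
best, e_best = evolve(start, lambda s: toy_energy(s, target))
print(best, e_best)
```

The temperature controls how strongly selection acts: lower values give a greedier search, higher values give more diverse sequences.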

(14)

K L H G G P M L D S D Q K F W R T P A A L H Q N E G F T

?

The inverse protein folding problem: can we predict

which sequences adopt a given fold?

The fold recognition problem: for a given sequence,

determine to which fold it belongs.

(15)

Inverse folding problem: for a given “backbone structure”, or fold, which sidechains “fit”?

Trpcage protein = 20 amino acids: 20^20 = 10^26 possible sequences; 1.9 × 10^46 possible conformations

100 amino acids: 1.3 × 10^130 sequences; 2.4 × 10^231 conformations

Even with a discretized conformational space for the sidechains, the size of the problem is enormous
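These counts can be checked with exact integer arithmetic. The number of sequences is 20^L. For the conformations, the sketch assumes about 206 (type, rotamer) states per position; that value is not stated on the slide and is only inferred here because it reproduces the quoted 1.9 × 10^46 and 2.4 × 10^231.

```python
N_TYPES = 20
# Assumption: total (type, rotamer) states per position, inferred from the slide's numbers.
STATES_PER_POSITION = 206


def n_sequences(length):
    """Number of amino acid sequences of a given length."""
    return N_TYPES ** length


def n_conformations(length):
    """Number of sequence/rotamer combinations under the assumed state count."""
    return STATES_PER_POSITION ** length


def order_of_magnitude(n):
    """Power of ten of a positive integer, from its number of digits."""
    return len(str(n)) - 1


for length in (20, 100):
    print(
        length,
        "sequences ~10^%d" % order_of_magnitude(n_sequences(length)),
        "conformations ~10^%d" % order_of_magnitude(n_conformations(length)),
    )
```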

(16)

Interaction energy = molecular mechanics = van der Waals + Coulomb + dielectric continuum solvent

Discrete conformational space: sidechain rotamers

[Figure: sidechain and backbone atoms (CH2, CH, N, H, C, O) labelled with partial charges such as .35, -.30, -.55, .55, -.35, .25]
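As a rough illustration of the energy terms, the sketch below evaluates a Lennard-Jones (van der Waals) term and a Coulomb term for one atom pair. The solvent is reduced to a uniform dielectric constant, which is a much cruder stand-in than a real continuum solvent model, and the epsilon and sigma values are placeholder assumptions.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2), for charges in units of e.
COULOMB_CONSTANT = 332.0637


def lennard_jones(r, epsilon, sigma):
    """12-6 van der Waals energy at distance r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy of two point charges, screened by a uniform dielectric."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def pair_energy(atom_a, atom_b, epsilon=0.1, sigma=3.4, dielectric=80.0):
    """Sum of both terms for two atoms given as {'xyz': (x, y, z), 'q': charge}."""
    r = math.dist(atom_a["xyz"], atom_b["xyz"])
    return lennard_jones(r, epsilon, sigma) + coulomb(
        r, atom_a["q"], atom_b["q"], dielectric
    )


# Two atoms carrying partial charges of the kind shown on the slide.
oxygen = {"xyz": (0.0, 0.0, 0.0), "q": -0.55}
carbon = {"xyz": (4.0, 0.0, 0.0), "q": 0.55}
print(pair_energy(oxygen, carbon))
```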

(17)

Structure: positions 1, 2, 3; each position can carry type A or B, each in rotamer 1 or 2 (Rot 1, Rot 2)

Energy matrix: rows and columns are indexed by (position, type, rotamer)

Position   1  1  1  1  2  2  2  2  3  3  3  3
Type       A  A  B  B  A  A  B  B  A  A  B  B
Rotamer    1  2  1  2  1  2  1  2  1  2  1  2
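The point of the energy matrix is that, once it is filled, the energy of any sequence/rotamer assignment is a sum of table lookups. The sketch below uses the slide's toy layout (3 positions, types A and B, rotamers 1 and 2); the stored values are random placeholders, and treating the diagonal as a self term is an assumption.

```python
import itertools
import random

POSITIONS = (1, 2, 3)
STATES = [(t, r) for t in "AB" for r in (1, 2)]  # (type, rotamer)


def build_matrix(seed=0):
    """Symmetric lookup table filled with placeholder energies (not physics)."""
    rng = random.Random(seed)
    matrix = {}
    for p in POSITIONS:
        for s in STATES:
            matrix[(p, s), (p, s)] = rng.uniform(-1, 1)  # assumed self term
    for p, q in itertools.combinations(POSITIONS, 2):
        for s in STATES:
            for t in STATES:
                e = rng.uniform(-1, 1)
                matrix[(p, s), (q, t)] = e
                matrix[(q, t), (p, s)] = e
    return matrix


def total_energy(assignment, matrix):
    """assignment maps position -> (type, rotamer); energy is a sum of lookups."""
    items = sorted(assignment.items())
    energy = sum(matrix[i, i] for i in items)
    energy += sum(matrix[i, j] for i, j in itertools.combinations(items, 2))
    return energy


def best_assignment(matrix):
    """Exhaustive search, feasible only at this toy size (4^3 = 64 assignments)."""
    candidates = (
        dict(zip(POSITIONS, combo))
        for combo in itertools.product(STATES, repeat=len(POSITIONS))
    )
    return min(candidates, key=lambda a: total_energy(a, matrix))


matrix = build_matrix()
best = best_assignment(matrix)
print(best, total_energy(best, matrix))
```

For real proteins the exhaustive search is impossible, which is why a stochastic search over sequences and rotamers is used instead.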

(18)

Download our screensaver from

biology.polytechnique.fr/proteinsathome and help predict protein structures!

(19)

T. Simonson, D. Mignon, M. Schmidt am Busch, A. Lopes, C. Bathelt, 2008, in Distributed Computing: Principles and Applications, Tektum Press, Berlin. See also Chemical & Engineering News, April 2007.

(20)

I. Quality control for the theoretical sequences: SH3 domains as a test family

767 known sequences

67 known 3D structures

Generation of 25 x 450,000 theoretical sequences

Evaluation by standard fold recognition tools

81% of low energy sequences recognized as SH3 domains

(21)

More recently: 24 SH2 domains, 17 PDZ domains, 8 Kunitz domains, 12 chemokines, 6 caspases

Generation of 67 x 200,000 theoretical sequences

Evaluation by standard fold recognition tools

~85% of low energy sequences recognized as SH3 domains

(22)

II. Using the theoretical sequences to classify natural proteins: “structure prediction”

Generation of 25 x 450,000 theoretical sequences

Simple statistical model, representing the SH3 family (mean sequence)

Query experimental sequence databases (10^7 proteins)

675 SH3 sequences retrieved (88% coverage + 5% false positives)

No new SH3 sequences :-(

767 known SH3 sequences

67 known 3D structures
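A minimal sketch of a "mean sequence" model, assuming it can be approximated by a position-specific log-odds profile built from aligned theoretical sequences and used to score database sequences. The uniform background, the pseudocount and the function names are assumptions; the actual searches used standard tools.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(sequences, pseudocount=1.0):
    """Per-position log-odds scores against a uniform background."""
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*sequences):
        counts = Counter(column)
        profile.append(
            {
                aa: math.log((counts[aa] + pseudocount) / total * len(AMINO_ACIDS))
                for aa in AMINO_ACIDS
            }
        )
    return profile


def score(profile, query):
    """Sum of position-specific scores for an ungapped query of the same length."""
    return sum(column[aa] for column, aa in zip(profile, query))


def coverage(n_retrieved, n_known):
    """Fraction of the known family members that were retrieved."""
    return n_retrieved / n_known


# Short example strings, standing in for aligned theoretical sequences.
designed = ["DKPAIF", "EKPLEV", "MKPVTL", "QKPVSL"]
profile = build_profile(designed)
print(score(profile, "DKPVSL"), score(profile, "WWWWWW"))
print(round(100 * coverage(675, 767)), "% coverage")
```

The last line reproduces the slide's figure: 675 retrieved out of 767 known SH3 sequences is 88% coverage.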

(23)

Two more test families: Kunitz domains and chemokines

91% coverage, no false positives :-))

Theoretical sequences can assist fold recognition for some families

Generation of 20 x 200,000 theoretical sequences

Simple statistical model, representing the family (mean sequence)

Query experimental sequence databases (10^7 proteins)

(24)

[Figure: structure with secondary structure elements labelled S1, S4, S5, S6, S7, H1, H2]

Residue patterns at positions 17, 19, 96, 63, 42, with their frequencies:

HDWWW  16.4%
DLWWL   9.6%
WDWWW   6.0%
DLWWH   5.7%
LLWWW   5.4%
DDWWH   3.5%
LDWWW   3.0%
HDDDW   2.5%
HDWAW   1.9%
WLWWW   1.8%

Mean sequence is not an optimal descriptor of the theoretical sequence ensemble, as shown by covariance analysis
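One way to see why a mean sequence loses information is to measure how strongly two positions co-vary. The sketch below uses mutual information between alignment columns, which is one possible covariance measure and not necessarily the one used in the project, and it also tabulates multi-position patterns like those listed above.

```python
import math
from collections import Counter


def mutual_information(sequences, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(sequences)
    pair = Counter((s[i], s[j]) for s in sequences)
    left = Counter(s[i] for s in sequences)
    right = Counter(s[j] for s in sequences)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in pair.items()
    )


def pattern_frequencies(sequences, positions):
    """Frequency of each residue pattern at the chosen positions, most common first."""
    n = len(sequences)
    counts = Counter("".join(s[p] for p in positions) for s in sequences)
    return [(pattern, count / n) for pattern, count in counts.most_common()]


# Toy ensemble: columns 0 and 1 are perfectly correlated, column 2 is constant.
toy = ["HDW", "HDW", "DLW", "DLW"]
print(mutual_information(toy, 0, 1))   # correlated pair
print(mutual_information(toy, 0, 2))   # no information
print(pattern_frequencies(toy, (0, 1)))
```

A profile built from this toy ensemble would give H and D equal weight at column 0 and D and L equal weight at column 1, and so would score the never-observed pattern "HL" as highly as "HD".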

(25)

Directed evolution yields good quality sequences, which can “fool” standard fold recognition tools.

The sequences reproduce the physicochemical properties and the diversity of the natural families.

Their performance for retrieving natural homologues is good for some families.

Improved descriptors of the theoretical sequence ensembles should lead to improved homologue retrieval.

Volunteer distributed computing provides a unique computational resource.

(26)

Polytechnique: Thomas Simonson; Najette Amara, PhD student; Audrey Sedano, PhD student; David Mignon, IE-EP; Marcel Schmidt am Busch, postdoc; Anne Lopes, PhD student; Christine Bathelt, postdoc; G Archontis, U. of Cyprus

MIG/Jouy INRA: JF Gibrat, INRA; Sophie Schbath, INRA; François Rodolphe, INRA; Afshin Fayyaz, postdoc; Valentin Loux, IE-INRA; Christophe Caron, IE-INRA

ABI/U. Paris 7: Joel Pothier; Guillaume Santini, postdoc; Anne-Laure Abraham, PhD student

IRISA/Rennes: Rumen Andonov; Francois Coste, INRIA; Y Vutov, IE-CDD; Goulven Kerbellec, PhD student; Noel Malod-Dognin, PhD student; Guillaume Collet, PhD student; Tristan Feildel, M2; Guillaume Launay, postdoc; Nicola Yanev, U. of Sofia

ENS Lyon: Yves-Henri Sanejouand, CR1 CNRS; Brice Juanico, postdoc

(27)

PUBLICATIONS

Inverse folding problem

Lopes, Aleksandrov, Bathelt, Archontis, Simonson, Proteins 2007
Schmidt am Busch, Lopes, Mignon, Simonson, J Comp Chem 2008
Schmidt am Busch, Lopes, Amara, Bathelt, Simonson, BMC Bioinf 2008
Schmidt am Busch, Mignon, Simonson, Proteins 2009
Lopes, Schmidt am Busch, Simonson, J Comp Chem 2009
Schmidt am Busch, Sedano, Simonson, Plos One 2010
Mignon, Schmidt am Busch, Sedano, Sanejouand, Simonson, in preparation

Classic fold recognition

Sam, Tai, Garnier, Gibrat, Lee, Munson, BMC Bioinf 2008
Andonov, Yanev, Malod-Dognin, Lect Notes Bioinf 2008
Yanev, Andonov, Veber, et al., Computers & Math with Applications 2008
Abraham, Rocha, Pothier, Bioinf 2008
Taly, Marin, Gibrat, BMC Bioinf 2008
Andonov, Collet, Gibrat, Poirriez, Yanev, Grid Comp for Bioinformatics & Compl Biol 2008
Collet, Andonov, Yanev, Gibrat, Discrete Appl Math, submitted
Malod-Dognin, Andonov, Yanev, J Comp Biol, submitted

(28)
(29)

Download our screensaver from

biology.polytechnique.fr/proteinsathome

 

and help design new proteins

(30)

Search sequence- and structure-space for new functional proteins

A powerful technology for protein engineering: in silico directed evolution

DKPAIFTDLGDWV...
EKPLEVDDAAEWS...
MKPVTLTDVAEYA...
QKPVSLSDVGEFA...
AHGSQNTTlLILP...
DKPAIFTDLGDWV...
EKPLEVDDAAEWS...
PLIKRYWWNAQAG...
MKPVTLTDVAEYA...
GHYILKQSANCCM...
FKPIEASDIAEFV...
QKPVSLSDVGEFA...

Structural model of both states

Free energy function

Sequence score: ΔG = G_folded − G_unfolded
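The sequence score, the folded-state free energy minus the unfolded-state free energy, can be sketched as follows. The unfolded state is assumed here to be a sum of per-residue reference energies, a common simplification that the slide does not spell out; all numerical values are placeholders.

```python
def unfolded_energy(sequence, reference):
    """Unfolded state as a sum of per-residue reference energies (assumption)."""
    return sum(reference[aa] for aa in sequence)


def sequence_score(sequence, folded_energy, reference):
    """Folding free energy: G_folded - G_unfolded. Lower is better."""
    return folded_energy(sequence) - unfolded_energy(sequence, reference)


# Placeholder values for a two-letter alphabet.
reference = {"A": -1.0, "K": -2.0}


def toy_folded_energy(sequence):
    """Hypothetical folded-state energy, proportional to length."""
    return -2.5 * len(sequence)


print(sequence_score("AKA", toy_folded_energy, reference))
```

Subtracting the unfolded term keeps the search from simply favouring residue types with low intrinsic energies regardless of the fold.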

(31)

Escherichia coli Tyrosyl-tRNA synthetase

L-Tyr bound to TyrRS

We are using computational design and directed evolution

(32)

Complete redesign of ~100 proteins

Identity scores compared to the native proteins

                       Earlier work   Our work   Shared proteins
Baker, 2005            37%            33%
Handel, 2005           31-41%         32%        1
Pande, 2003            26%            38%        1
Wodak, 2002            24%            35%        8
Koehl, Levitt, 1999    26%            32%        2
Mayo, Dahiyat, 1997    36%            52%        1

Optimizing and testing the energy function
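The identity score in the table is the percentage of positions at which a designed sequence carries the same residue as the native one. A minimal sketch, assuming the two sequences are already aligned without gaps (the two strings below are only example fragments from an earlier slide, not a designed/native pair):

```python
def percent_identity(designed, native):
    """Percentage of aligned positions with identical residues."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(designed, native))
    return 100.0 * matches / len(native)


print(round(percent_identity("DKPAIFTDLGDWV", "EKPLEVDDAAEWS"), 1))
```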

(33)

PSI-BLAST is profile-based: no information on covariances: we need a better statistical model to describe our sequences

Parameterization of profile-HMMs (David M): SAM package

Analyze correlations within our designed sequences (YHS):

Identify important patterns, and group sequences accordingly.

Construct a PSSM from each group.

Perform PSI-BLAST searches separately with each PSSM.
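The grouping step can be sketched as follows: sequences are binned by the residue pattern they carry at a few correlated positions, and one matrix is built per bin. Here a plain frequency matrix stands in for a real PSSM, the choice of positions and the minimum group size are assumptions, and the PSI-BLAST searches themselves are not shown.

```python
from collections import Counter, defaultdict


def group_by_pattern(sequences, positions):
    """Bin sequences by the residues found at the chosen positions."""
    groups = defaultdict(list)
    for seq in sequences:
        groups["".join(seq[p] for p in positions)].append(seq)
    return dict(groups)


def frequency_matrix(sequences):
    """Per-column residue frequencies (a simplified stand-in for a PSSM)."""
    n = len(sequences)
    return [
        {aa: count / n for aa, count in Counter(column).items()}
        for column in zip(*sequences)
    ]


def matrices_per_group(sequences, positions, min_size=2):
    """One matrix per pattern group, skipping groups that are too small."""
    return {
        pattern: frequency_matrix(members)
        for pattern, members in group_by_pattern(sequences, positions).items()
        if len(members) >= min_size
    }


toy = ["HDWA", "HDWC", "DLWA", "DLWC", "WWWW"]
matrices = matrices_per_group(toy, positions=(0, 1))
print(sorted(matrices))
```

Each group's matrix keeps the residues that occur together, so the correlation between the pattern positions is preserved instead of being averaged away.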

(34)
(35)

Detection by SUPERFAMILY profile-HMMs: mean detection rate 81% (100% with reset PBPs)

Sequence entropy: mean entropy, experimental: 2.8; designed: 3.0
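Sequence entropy measures the diversity of residue types at each alignment position. The slide does not say which definition or units were used, so the sketch below shows two common choices as assumptions: the Shannon entropy of a column (in nats) and its exponential, the effective number of residue types.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (nats) of the residue distribution in one column."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())


def effective_types(column):
    """Exponentiated entropy: effective number of residue types in the column."""
    return math.exp(column_entropy(column))


def mean_effective_types(sequences):
    """Average of the effective number of types over all alignment columns."""
    columns = list(zip(*sequences))
    return sum(effective_types(col) for col in columns) / len(columns)


toy = ["AKW", "CKW", "DRW", "ERW"]
print(mean_effective_types(toy))
```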

(36)

FROST analysis

Family   # of sequences   No hit   Incorrect hits   Correct hits   Number of correct hits   Number of incorrect hits
SH3      2800             433      8                2360           5.2                      0.1

(37)

Homologues retrieved from SwissProt using designed and random PSSMs

[Figure values: 47%, 53%, 38%, 53%]

(38)

Randomize sequences according to SUPERFAMILY profile (for a particular template)

4 SH2 and 4 SH3 proteins, total of 2800 sequences

For each sequence, explore possible sidechain rotamers using Proteus

For each sequence, spectrum of folding free energies.

Compare to our designed sequences.
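Randomizing according to a profile amounts to drawing each position independently from that position's residue probabilities. A minimal sketch, assuming the profile is given as one probability table per position (a real SUPERFAMILY profile-HMM also has insert and delete states, which are ignored here):

```python
import random


def sample_sequence(profile, rng):
    """Draw one residue per position from that position's probabilities."""
    return "".join(
        rng.choices(list(column.keys()), weights=list(column.values()), k=1)[0]
        for column in profile
    )


def randomize(profile, n, seed=0):
    """Generate n sequences; positions are sampled independently."""
    rng = random.Random(seed)
    return [sample_sequence(profile, rng) for _ in range(n)]


# Hypothetical three-position profile.
profile = [{"A": 1.0}, {"K": 0.5, "R": 0.5}, {"W": 1.0}]
sequences = randomize(profile, 50)
print(sequences[:5])
```

Because every position is drawn independently, such sequences match the profile but carry no correlations between positions, which is the property that the comparison with the designed sequences probes.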

(40)

Sequences randomized according to a SUPERFAMILY profile

[Figure panels: SH3, SH2]

The profile does not contain information on sidechain-sidechain interactions/correlations
