The PROTEUS project: Improving large-scale structure prediction for proteins

(1)

HAL Id: hal-02818506

https://hal.inrae.fr/hal-02818506

Submitted on 6 Jun 2020


The PROTEUS project: Improving large-scale structure prediction for proteins

T. Simonson, R. Andonov, Jean-François Gibrat, J. Pothier, Yh Sanejouand

To cite this version:

T. Simonson, R. Andonov, Jean-François Gibrat, J. Pothier, Yh Sanejouand. The PROTEUS project: Improving large-scale structure prediction for proteins. journée restitution ANR Calcul Intensif et Simulation, Feb 2010, Paris, France. 39 p. ⟨hal-02818506⟩

(2)

T Simonson, R Andonov, JF Gibrat, J Pothier, YH Sanejouand

Ecole Polytechnique, IRISA-Rennes, INRA-Jouy, Paris VII, ENS Lyon

http://migale.jouy.inra.fr/proteus

http://biology.polytechnique.fr/proteinsathome

The PROTEUS project: Improving large-scale structure prediction for proteins

(3)

2010: year XV of the genomic era

1995 Complete sequence of Haemophilus influenzae (2 × 10⁶ nucleotides)

1996 Complete sequence of yeast (10⁷ nucleotides, ~5000 genes)

2000 Arabidopsis thaliana; fruit fly (10⁸ nucleotides)

2004 Homo sapiens, complete sequence (3 × 10⁹ nucleotides, 25000 genes)

~4000 genomes sequenced; ~10⁷ proteins, but:

(4)

Sequence → Structure → Function

(5)

The folding problem: can we predict protein structure?

[Figure: an unfolded protein chain with segments numbered 1 to 7, and the same segments arranged in the folded protein]

(7)

Protein fold space is finite and discrete

(8)

K L H G G P M L D S D Q K F W R T P A A L H Q N E G F T

?

The fold recognition problem: for a given sequence, determine to which fold it belongs.

(9)

Classic (but difficult) strategy: test the sequence against a library of structural models

[Figure: a sequence of unknown structure is optimally “aligned” with a (simplified) 3D structural model from the library (“alignment” or “threading”), followed by an energy evaluation]

The lowest energy alignment defines the optimal 3D model.
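The threading idea on this slide can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: each structural model is reduced to a string of buried (`b`) and exposed (`e`) sites, the energy is a crude hydrophobic burial term, alignments are gapless, and the names `fold_A` and `fold_B` are hypothetical. It is not the scoring function or the alignment method used in the project.

```python
HYDROPHOBIC = set("AVILMFWC")  # assumed hydrophobic residue types for this toy model

def position_energy(residue, environment):
    """Toy energy: -1 if hydrophobicity matches burial ('b' = buried), +1 otherwise."""
    buried = environment == "b"
    hydrophobic = residue in HYDROPHOBIC
    return -1.0 if buried == hydrophobic else 1.0

def thread(sequence, template):
    """Slide the sequence along the template without gaps.

    Returns (lowest energy, offset), or None if the template is too short.
    """
    best = None
    for offset in range(len(template) - len(sequence) + 1):
        energy = sum(position_energy(r, template[offset + i])
                     for i, r in enumerate(sequence))
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best

def recognize_fold(sequence, library):
    """Return (fold name, energy, offset) of the lowest energy alignment."""
    results = []
    for name, template in library.items():
        hit = thread(sequence, template)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    energy, name, offset = min(results)
    return name, energy, offset

# Hypothetical two-model library and query sequence.
library = {"fold_A": "ebbbeeebbe", "fold_B": "bbbbeeeebb"}
query = "KLVIDS"
print(recognize_fold(query, library))
```

A realistic implementation would allow gaps and use a physical or statistical energy; the sketch only shows the "lowest energy alignment wins" logic.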

(10)

Improved structural libraries

Improved sequence/structure alignment methods

Improved scoring functions

(11)

“Inverse” strategy (new, but speculative): generate theoretical sequences from a library of structural models;

(12)

Search sequence and conformation space for preferred sequences

Generate sequences using DIRECTED EVOLUTION: random mutagenesis + selective pressure

DKPAIFTDLGDWV... EKPLEVDDAAEWS... MKPVTLTDVAEYA... QKPVSLSDVGEFA... AHGSQNTTlLILP... DKPAIFTDLGDWV... EKPLEVDDAAEWS... PLIKRYWWNAQAG... MKPVTLTDVAEYA... GHYILKQSANCCM... FKPIEASDIAEFV... QKPVSLSDVGEFA...

“Inverse” strategy (new, but speculative): generate theoretical sequences from a library of structural models; test experimental sequences against these
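As a rough illustration of "random mutagenesis + selective pressure", the sketch below mutates one position at a time and accepts or rejects the mutation with a Metropolis criterion. The Metropolis rule, the temperature, the burial string and the toy energy are all assumptions chosen to keep the example small; the slides do not specify the search algorithm at this level of detail.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # assumed classification for the toy energy

def toy_energy(sequence, burial):
    """Placeholder energy: -1 per residue whose hydrophobicity matches burial, else +1."""
    return sum(-1.0 if (r in HYDROPHOBIC) == (b == "b") else 1.0
               for r, b in zip(sequence, burial))

def evolve(burial, steps=2000, temperature=0.5, seed=0):
    """In silico directed evolution on a fixed backbone (here: a burial pattern).

    Returns (best sequence found, its energy, energy of the random start).
    """
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in burial]
    e_current = toy_energy(current, burial)
    e_start = e_current
    best, e_best = list(current), e_current
    for _ in range(steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)          # random mutagenesis
        e_new = toy_energy(current, burial)
        delta = e_new - e_current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            e_current = e_new                            # mutation survives selection
            if e_new < e_best:
                best, e_best = list(current), e_new
        else:
            current[pos] = old                           # mutation rejected
    return "".join(best), e_best, e_start

designed, e_best, e_start = evolve("ebbbeeebbe")
print(designed, e_best, e_start)
```

Running `evolve` with different seeds yields different low energy sequences for the same backbone, which is the sense in which a single structural model generates a whole ensemble of theoretical sequences.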

(13)

[Figure: test similarity between the theoretical sequences and the sequence of unknown structure]

(14)

K L H G G P M L D S D Q K F W R T P A A L H Q N E G F T

?

The inverse protein folding problem: can we predict which sequences adopt a given fold?

The fold recognition problem: for a given sequence, determine to which fold it belongs.

(15)

Trp-cage protein = 20 amino acids
20²⁰ = 10²⁶ possible sequences
1.9 × 10⁴⁶ possible conformations

100 amino acids
1.3 × 10¹³⁰ sequences
2.4 × 10²³¹ conformations

Even with a discretized conformational space for the sidechains, the size of the problem is enormous.

Inverse folding problem: for a given “backbone structure”, or fold, which sidechains “fit”?
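The sequence counts on this slide follow directly from 20 amino acid types per position, as the short check below shows. The last two lines back-calculate how many states per position the quoted conformation counts would imply; that per-position figure is an inference from the slide's numbers, not something the slide states.

```python
import math

def as_scientific(log10_value):
    """Return (mantissa rounded to 1 decimal, exponent) for 10**log10_value."""
    exponent = math.floor(log10_value)
    mantissa = round(10 ** (log10_value - exponent), 1)
    return mantissa, exponent

def log10_sequences(length, alphabet_size=20):
    """log10 of the number of sequences of a given length."""
    return length * math.log10(alphabet_size)

def implied_states_per_position(total, length):
    """States per position that would give `total` combinations over `length` positions."""
    return 10 ** (math.log10(total) / length)

print(as_scientific(log10_sequences(20)))    # 20 amino acids (Trp-cage)
print(as_scientific(log10_sequences(100)))   # 100 amino acids
print(round(implied_states_per_position(1.9e46, 20)))
print(round(implied_states_per_position(2.4e231, 100)))
```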

(16)

Interaction energy = molecular mechanics = van der Waals + Coulomb + dielectric continuum solvent

[Figure: sidechain and backbone groups (CH₂, CH, N, H, C, O) labelled with atomic partial charges: .35, -.30, -.55, .55, -.35, .25]

Discrete conformational space: sidechain rotamers
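The energy terms named on this slide can be sketched for a single pair of atoms as follows. The Lennard-Jones 12-6 form and the Coulomb term are the standard molecular mechanics expressions; the uniform `dielectric` constant is only a crude stand-in for a continuum solvent model, and the parameter values in the usage line are hypothetical, not the project's force field.

```python
COULOMB_CONSTANT = 332.06  # approximate conversion factor, kcal·Å/(mol·e²)

def lennard_jones(r, epsilon, sigma):
    """van der Waals energy (12-6 form) at distance r, in the units of epsilon."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)

def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy of two partial charges, screened by a uniform dielectric (assumption)."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

def pair_energy(r, q1, q2, epsilon, sigma, dielectric=1.0):
    """Sum of the van der Waals and electrostatic contributions for one atom pair."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2, dielectric)

# Hypothetical pair: charges of +0.55 and -0.55 at 3.0 Å, in a high dielectric medium.
print(pair_energy(3.0, 0.55, -0.55, epsilon=0.1, sigma=3.2, dielectric=80.0))
```

Summing such pair terms over all atoms of two sidechain rotamers gives one entry of a rotamer-rotamer energy table.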

(17)

[Figure: a structure with three positions (1, 2, 3), each allowing two amino acid types (A, B) with two rotamers each (Rot 1, Rot 2), and the corresponding energy matrix, whose rows and columns are indexed by position, type and rotamer]
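The layout in this figure, 3 positions × 2 types × 2 rotamers = 12 states, suggests a precomputed table from which the energy of any sequence and rotamer choice is obtained by lookup. The sketch below assumes such a table with self-energies on the diagonal and pair energies off the diagonal; the numerical values are made up for the example.

```python
def build_states(n_positions, types=("A", "B"), rotamers=(1, 2)):
    """Enumerate (position, type, rotamer) states in the order shown in the figure."""
    return [(p, t, r) for p in range(1, n_positions + 1)
            for t in types for r in rotamers]

def total_energy(assignment, matrix):
    """Energy of one state per position, read from the precomputed matrix.

    matrix[(s, s)] is the self-energy of state s; matrix[(s1, s2)] is the
    pair energy of two states at different positions (s1 listed before s2).
    """
    energy = 0.0
    for a, s1 in enumerate(assignment):
        energy += matrix[(s1, s1)]
        for s2 in assignment[a + 1:]:
            energy += matrix[(s1, s2)]
    return energy

states = build_states(3)

# Hypothetical energies: self term -1.0; pair term +0.5 for equal types, -0.5 otherwise.
matrix = {}
for i, s1 in enumerate(states):
    matrix[(s1, s1)] = -1.0
    for s2 in states[i + 1:]:
        if s1[0] != s2[0]:
            matrix[(s1, s2)] = 0.5 if s1[1] == s2[1] else -0.5

assignment = [(1, "A", 1), (2, "A", 2), (3, "B", 1)]
print(len(states), total_energy(assignment, matrix))
```

Because the matrix is computed once per backbone, each step of the sequence search only costs a few lookups.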

(18)

Download our screensaver from biology.polytechnique.fr/proteinsathome and help predict protein structures!

(19)

T. Simonson, D. Mignon, M. Schmidt am Busch, A. Lopes, C. Bathelt, 2008, in Distributed Computing: Principles and Applications, Tektum Press, Berlin. See also Chemical & Engineering News, April 2007.

(20)

I. Quality control for the theoretical sequences: SH3 domains as a test family

767 known sequences
67 known 3D structures

Generation of 25 x 450,000 theoretical sequences

Evaluation by standard fold recognition tools

81% of low energy sequences recognized as SH3 domains

(21)

More recently: 24 SH2 domains, 17 PDZ domains, 8 Kunitz domains, 12 chemokines, 6 caspases

Generation of 67 x 200,000 theoretical sequences

Evaluation by standard fold recognition tools

~85% of low energy sequences recognized as SH3 domains

(22)

II. Using the theoretical sequences to classify natural proteins: “structure prediction”

767 known SH3 sequences
67 known 3D structures

Generation of 25 x 450,000 theoretical sequences

Simple statistical model representing the SH3 family (mean sequence)

Query experimental sequence databases (10⁷ proteins)

675 SH3 sequences retrieved (88% coverage + 5% false positives)

No new SH3 sequences :(
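The coverage figure on this slide is the fraction of known family members that the search retrieves (675 / 767 ≈ 88%). The sketch below computes it from sets of identifiers. The definition of the false positive percentage (share of retrieved hits that are not known members) and the count of 36 spurious hits are assumptions made so the example reproduces roughly 5%; the slide gives neither.

```python
def coverage(retrieved, known):
    """Fraction of known family members found among the retrieved hits."""
    return len(retrieved & known) / len(known)

def false_positive_fraction(retrieved, known):
    """Fraction of retrieved hits that are not known family members (assumed definition)."""
    return len(retrieved - known) / len(retrieved)

# Hypothetical identifiers: 767 known SH3 sequences, 675 of them retrieved,
# plus 36 spurious hits (illustrative number).
known = set(range(767))
retrieved = set(range(675)) | {1000 + i for i in range(36)}

print(round(100 * coverage(retrieved, known)))
print(round(100 * false_positive_fraction(retrieved, known)))
```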

(23)

Two more test families: Kunitz domains and chemokines

Generation of 20 x 200,000 theoretical sequences

Simple statistical model representing the family (mean sequence)

Query experimental sequence databases (10⁷ proteins)

91% coverage, no false positives :))

Theoretical sequences can assist fold recognition for some families

(24)

[Figure: structure with secondary structure elements labelled S1, S4, S5, S6, S7, H1, H2; positions 17, 19, 96, 63, 42 highlighted]

Pattern   Frequency
HDWWW     16.4%
DLWWL      9.6%
WDWWW      6.0%
DLWWH      5.7%
LLWWW      5.4%
DDWWH      3.5%
LDWWW      3.0%
HDDDW      2.5%
HDWAW      1.9%
WLWWW      1.8%

Mean sequence is not an optimal descriptor of the theoretical sequence ensemble, as shown by covariance analysis
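Why a mean sequence can miss information is easy to see on a toy ensemble: if two positions always change together, the per-position averages are identical to those of an ensemble where they vary independently. The sketch below tabulates pattern frequencies at chosen positions and measures the coupling with mutual information. Mutual information is used here as one possible covariance measure; the slides do not say which measure was used, and the two-letter sequences are invented.

```python
import math
from collections import Counter

def pattern_frequencies(sequences, positions):
    """Frequency of each residue pattern observed at the given positions."""
    patterns = Counter("".join(seq[p] for p in positions) for seq in sequences)
    total = len(sequences)
    return {pattern: count / total for pattern, count in patterns.most_common()}

def mutual_information(sequences, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(sequences)
    p_i = Counter(s[i] for s in sequences)
    p_j = Counter(s[j] for s in sequences)
    p_ij = Counter((s[i], s[j]) for s in sequences)
    return sum((c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())

coupled = ["HW", "HW", "DL", "DL"]      # position 0 determines position 1
independent = ["HW", "HL", "DW", "DL"]  # same column frequencies, no coupling

print(pattern_frequencies(coupled, [0, 1]))
print(mutual_information(coupled, 0, 1), mutual_information(independent, 0, 1))
```

Both toy ensembles have the same column frequencies (50% H or D, 50% W or L), so any per-position model treats them alike, while the mutual information separates them.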

(25)

Directed evolution yields good quality sequences, which can “fool” standard fold recognition tools.

The sequences reproduce the physical-chemical properties and the diversity of the natural families.

Their performance for retrieving natural homologues is good for some families.

Improved descriptors of the theoretical sequence ensembles should lead to improved homologue retrieval.

Volunteer distributed computing provides a unique computational resource.

(26)

Polytechnique
Thomas Simonson
Najette Amara, PhD student
Audrey Sedano, PhD student
David Mignon, IE EP
Marcel Schmidt am Busch, postdoc
Anne Lopes, PhD student
Christine Bathelt, postdoc
G Archontis, U. of Cyprus

MIG/Jouy INRA
JF Gibrat, INRA
Sophie Schbath, INRA
Francois Rodolphe, INRA
Afshin Fayyaz, postdoc
Valentin Loux, IE INRA
Christophe Caron, IE INRA

ABI/U. Paris 7
Joel Pothier
Guillaume Santini, postdoc
Anne-Laure Abraham, PhD student

IRISA/Rennes
Rumen Andonov
Francois Coste, INRIA
Y Vutov, IE CDD
Goulven Kerbellec, PhD student
Noel Malod-Dognin, PhD student
Guillaume Collet, PhD student
Tristan Feildel, M2
Guillaume Launay, postdoc
Nicola Yanev, U. of Sofia

ENS Lyon
Yves-Henri Sanejouand, CR1 CNRS
Brice Juanico, postdoc

(27)

PUBLICATIONS

Inverse folding problem
Lopes, Aleksandrov, Bathelt, Archontis, Simonson, Proteins 2007
Schmidt am Busch, Lopes, Mignon, Simonson, J Comp Chem 2008
Schmidt am Busch, Lopes, Amara, Bathelt, Simonson, BMC Bioinf 2008
Schmidt am Busch, Mignon, Simonson, Proteins 2009
Lopes, Schmidt am Busch, Simonson, J Comp Chem 2009
Schmidt am Busch, Sedano, Simonson, PLoS One 2010
Mignon, Schmidt am Busch, Sedano, Sanejouand, Simonson, in preparation

Classic fold recognition
Sam, Tai, Garnier, Gibrat, Lee, Munson, BMC Bioinf 2008
Andonov, Yanev, Malod-Dognin, Lect Notes Bioinf 2008
Yanev, Andonov, Veber, et al., Computers & Math with Applications 2008
Abraham, Rocha, Pothier, Bioinf 2008
Taly, Marin, Gibrat, BMC Bioinf 2008
Andonov, Collet, Gibrat, Poirriez, Yanev, Grid Comp for Bioinformatics & Compl Biol 2008
Collet, Andonov, Yanev, Gibrat, Discrete Appl Math, submitted
Malod-Dognin, Andonov, Yanev, J Comp Biol, submitted

(28)

(29)

Download our screensaver from biology.polytechnique.fr/proteinsathome and help design new proteins

(30)

Search sequence and structure space for new functional proteins

A powerful technology for protein engineering: in silico directed evolution

Structural model of both states

Free energy function

Sequence score: $\Delta G = G_{\mathrm{folded}} - G_{\mathrm{unfolded}}$

(31)

Escherichia coli tyrosyl-tRNA synthetase

L-Tyr bound to TyrRS

We are using computational design and directed evolution

(32)

Complete redesign of ~100 proteins

Identity scores compared to the native proteins

                      Earlier work   Our work   Shared proteins
Baker, 2005           37%            33%
Handel, 2005          31-41%         32%        1
Pande, 2003           26%            38%        1
Wodak, 2002           24%            35%        8
Koehl, Levitt, 1999   26%            32%        2
Mayo, Dahiyat, 1997   36%            52%        1

Optimizing and testing the energy function

(33)

PSI-BLAST is profile-based: no information on covariances: we need a better statistical model to describe our sequences

Parameterization of profile HMMs (David M)

SAM package

Analyze correlations within our designed sequences (YHS)

Identify important patterns, and group sequences accordingly.

Construct a PSSM from each group.

Perform PSI-BLAST searches separately with each PSSM.
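The grouping and per-group PSSM steps listed above might look like the following sketch. The choice of pattern positions, the pseudocount, the uniform background frequency and the four-letter sequences are all assumptions for illustration; the search step itself is not shown, since it would be run with PSI-BLAST on each resulting matrix.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def group_by_pattern(sequences, positions):
    """Group aligned sequences by the residues they carry at the pattern positions."""
    groups = defaultdict(list)
    for seq in sequences:
        groups["".join(seq[p] for p in positions)].append(seq)
    return dict(groups)

def build_pssm(sequences, pseudocount=1.0, background=1.0 / 20):
    """Position-specific scoring matrix: log2 odds of each residue in each column."""
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(sequences[0])):
        counts = Counter(seq[col] for seq in sequences)
        pssm.append({a: math.log2((counts[a] + pseudocount) / total / background)
                     for a in AMINO_ACIDS})
    return pssm

def pssms_per_group(sequences, positions):
    """One PSSM per pattern group, so that correlated residues stay together."""
    return {pattern: build_pssm(members)
            for pattern, members in group_by_pattern(sequences, positions).items()}

# Hypothetical designed sequences; positions 0 and 1 define the pattern.
designed = ["HDWA", "HDWC", "DLWA", "DLWG"]
pssms = pssms_per_group(designed, positions=[0, 1])
print(sorted(pssms))
```

Splitting the ensemble this way keeps each PSSM profile-based, as PSI-BLAST requires, while the set of PSSMs as a whole retains the strongest correlations.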

(34)

(35)

Detection by SUPERFAMILY profile HMMs: mean detection rate 81% (100% with reset PBPs)

Sequence entropy: mean entropy, experimental 2.8; designed 3.0

(36)

FROST analysis

Family   # of sequences   No hit   Incorrect hits   Correct hits   Number of correct hits   Number of incorrect hits
SH3      2800             433      8                2360           5.2                      0.1