Interfacing sequences and structures of proteins: applications to protein annotation and sequence feature visualization

Thesis

DAVID, Fabrice Pierre André


DAVID, Fabrice Pierre André. Interfacing sequences and structures of proteins: applications to protein annotation and sequence feature visualization. Doctoral thesis: Univ. Genève, 2009, no. Sc. 4147

URN : urn:nbn:ch:unige-238313

DOI : 10.13097/archive-ouverte/unige:23831

Available at:

http://archive-ouverte.unige.ch/unige:23831


UNIVERSITÉ DE GENÈVE

FACULTÉ DES SCIENCES, Department of Computer Science: Professor R. Appel

FACULTÉ DE MÉDECINE, Department of Structural Biology and Bioinformatics: Professor A. Bairoch, Doctor Y.L. Yip

UNIVERSITÉ DE LAUSANNE

FACULTÉ DE BIOLOGIE ET MÉDECINE: Professor O. Michielin

Interfacing sequences and structures of proteins: applications to protein annotation and sequence feature visualization

THESIS

presented to the Faculté des sciences of the Université de Genève for the degree of Docteur ès sciences (Doctor of Sciences), specialization in bioinformatics

by

Fabrice Pierre André DAVID

from Le Creusot (France)

Thesis no. 4147

GENEVA

DII-I Reprographie EPFL

2009

Acknowledgements

First of all, I warmly thank my supervisors, Dr Yum Lina Yip and Dr Anne-Lise Veuthey, for their invaluable help and sound advice throughout this thesis work. Both were decisive in carrying out this work and in writing the present document. I express my deep gratitude to my thesis directors, Prof. Amos Bairoch and Prof. Olivier Michielin, for their continuous support during these years. I also wish to express my gratitude to Prof. Ron Appel and Prof. Joel Sussman for agreeing to be members of the thesis jury.

I could not have carried out this work without the expertise of Ursula Hinz. I am deeply grateful to her, and to all the Swiss-Prot annotators who had the patience to test the main annotation tool developed in this thesis (SAALSA). In particular, I thank Elisabeth Coudert, Lydie Lane and Ursula Hinz for the time they spent testing and validating this tool, and for their constructive ideas, which contributed to its development. Thanks also to the members of the Software Development Unit and of the IT service, with whom I greatly enjoyed working and from whom I learned a great deal about programming. Special thanks to Marc Feuermann for our many discussions and for his encouragement, which helped me a lot. More generally, I thank everyone at Swiss-Prot for accompanying me during these thesis years and for making the working atmosphere so pleasant and stimulating.

I thank the interns with whom I had the pleasure of working, Gregory Loichot, Harris Procopiou and Diego Poggioli, who contributed greatly to extending the applications of and prospects for SSMap. Many thanks also to Christoph Gille, author of STRAP, for our recent collaboration, which also helps to promote this thesis work within the scientific community.

I would also like to thank my long-standing friends who have always encouraged me during this work, in particular Jérôme, Eric and Olivier.

I end these acknowledgements by dedicating this thesis to my parents, to my sister Lauriane and to my grandfather André, who have always faithfully supported me and provided essential moral support.


Table of contents

Aim
Part 1 - Mapping structures to sequences
  1.1 Introduction
  1.2 Methods
    1.2.1 Production pipeline of the mapping
      1.2.1.1 PDB parsing and reconstruction of PDB sequences
      1.2.1.2 UniProtKB sequences
      1.2.1.3 Searching PDB reconstructed sequences similar to each UniProtKB sequence
      1.2.1.4 Attribution of PDB chains to UniProtKB sequences
      1.2.1.5 Post-processing of the mapping
    1.2.2 Architecture of the SSMap database
    1.2.3 Update procedure
    1.2.4 Programmatic access to the mapping
    1.2.5 Evaluation of the mapping
  1.3 Results
    1.3.1 SSMap statistics
    1.3.2 SSMap evaluation
      1.3.2.1 Comparison of mappings
      1.3.2.2 Comparison of PDB chain attributions in the different mapping sources
      1.3.2.3 Comparison of local alignments supporting the mapping
  1.4 Discussion
Part 2 - Applications related to UniProtKB annotations
  2.1 Introduction
  2.2 Methods and implementation
    2.2.1 Semi-Automatic Annotation based on Local Structural Analysis (SAALSA)
      2.2.1.1 Validation of disulfide bonds
      2.2.1.2 Pre-computation of the local environment of PDB ligands
      2.2.1.3 Design of the tool
      2.2.1.4 Recorded data
      2.2.1.5 Definition of non-redundant environments of PDB ligands
      2.2.1.6 Automatic filters and checks to produce the final annotation
    2.2.2 Procedure for regular checks and updates of DR PDB lines
      2.2.2.1 Overview
      2.2.2.2 Conditions to automatically integrate cross-references in UniProtKB/Swiss-Prot
      2.2.2.3 Formatting of DR PDB lines
    2.2.3 UniProtKB 3D Viewer (UP3D)
      2.2.3.1 Integration of biological unit information
      2.2.3.2 Design of the tool
  2.3 Description of tools
    2.3.1 SAALSA
      2.3.1.1 General workflow for the annotation through SAALSA
      2.3.1.2 Overview of available structural data and related annotations
      2.3.1.3 Check of entry level and residue level mappings
      2.3.1.4 Check and definition of ligand name mappings
      2.3.1.5 Definition of ligand binding sites
      2.3.1.6 Production of the final annotation
      2.3.1.7 Some usage statistics
    2.3.2 Quality control and automatic integration of cross-references
    2.3.3 Visualization of existing annotations by UniProtKB users
  2.4 Discussion
    2.4.1 SAALSA provides key information for a manual expertise
      2.4.1.1 Integration of mappings and browsing alignments
      2.4.1.2 Highlighting data inconsistencies and information of interest for annotation
      2.4.1.3 Grouping information
    2.4.2 Finding a reasonable equilibrium between manual and automatic tasks
      2.4.2.1 Applying automatically rules
      2.4.2.2 Allowing manual modifications along the annotation process
    2.4.3 A framework for collaborative and integrated annotation
      2.4.3.1 Keeping trace of annotation processes and sharing annotations
      2.4.3.2 Integration of the automatic and manual annotation of PDB cross-references
    2.4.4 Discussion about the UniProtKB 3D viewer
Conclusion and perspectives
References
Annexes
  Annex 1. PDB flat file example (2??B)
  Annex 2. Annotation rules for cross-references
  Annex 3. Annotation rules for the definition of ligand binding sites
  Annex 4. Annotation exceptions for ligand binding sites
  Annex 5. Analysis of annotated and computed disulfide bonds
  Annex 6. Graphical features in SAALSA
  Annex 7. SSMap database schemata (in alphabetical order)


Summary

The annotation in UniProtKB (UniProt Knowledgebase) and the 3D structures in PDB (Protein Data Bank) are important and complementary data sources for describing the function and characteristics of proteins. On the one hand, a fraction of the sequence annotation in UniProtKB is generated from structural data. On the other hand, transferring sequence annotations onto structures can often provide a higher level of understanding of their functional role. Both cases require a sequence-structure interface. The main subject of this thesis was to create a platform for handling protein sequences and structures together, in order to simplify and support the two types of information transfer mentioned above. The fundamental element of such a system is a bidirectional, up-to-date correspondence table (or mapping) between protein sequences and structures.

In the first part of this thesis, we describe SSMap, a new method developed to associate sequences with structures (and vice versa). SSMap was compared with the existing mappings (SIFTS and PDBSWS). Both the attribution of each PDB chain to a UniProtKB entry and the residue-level correspondences (between sequences and structures) were evaluated. This comparison showed that SSMap shares about 80% of entry attributions with the other mapping sources. Practically all the new attributions proposed by SSMap are correct. In more than 60% of the cases where SSMap and another mapping source attribute a PDB chain to different UniProtKB entries, the solution proposed by SSMap is correct.

In the second part, we illustrate the usefulness of SSMap through different applications for the semi-automatic creation of new UniProtKB annotations from PDB structures, and for exploring protein sequence features in their structural context. The main tool for annotation work is SAALSA (Semi-Automatic Annotation from Local Structural Analysis), a web application designed to assist protein annotation using structural data. SAALSA aims to gather the 3D information available for a given protein and convert it into annotations. An expert annotator can use different views to accept or reject these annotations. An important part of the work was to define the best possible balance between automatic and manual actions throughout this annotation process. Integrating annotation rules and UniProtKB-PDB mappings into SAALSA reinforces the homogeneity and consistency of the annotation in UniProtKB/Swiss-Prot. In addition to SAALSA, we developed an application that provides automatic updates of PDB cross-references in UniProtKB. Finally, we present a 3D viewer that helps to visualize UniProtKB sequence annotations (or any set of residues belonging to a UniProtKB sequence) on 3D structures sharing at least 70% sequence identity with the UniProtKB sequence.

To conclude, this work provides a solid basis for working with structural and sequence data together, and opens new perspectives for the visualization and interpretation of protein features in their structural context.


Abstract

UniProtKB (UniProt Knowledgebase) annotation and 3D structures in PDB (Protein Data Bank) provide important and complementary information on protein features and function. On the one hand, part of the annotation of protein features in UniProtKB is imported from 3D structural data. On the other hand, transferring sequence features onto structures often provides an additional level of understanding of protein function. Both tasks require an interface between sequences and structures. The main subject of this thesis is the creation of an integrated platform that allows easy interplay between protein sequences and structures, in order to simplify these two tasks and assist the scientists who perform them. The foundation of such a system is a unique, bidirectional and up-to-date mapping between sequences and structures.

In the first part of the thesis, we describe SSMap, a new method that we developed to map sequences to structures (and vice versa). SSMap was then compared with the existing mapping resources (SIFTS and PDBSWS). We evaluated both the correctness of the attribution of PDB chains to UniProtKB entries and the detailed residue-to-residue mapping supported by pairwise alignments. The comparison showed that SSMap shares around 80% of entry-level mappings with the other mapping resources. Almost all the new entry-level mappings proposed by SSMap were correct. When the entry-level mappings of different resources disagreed, over 60% of the alternative mappings proposed by SSMap were correct.

In the second part, we present a series of tools built on top of SSMap to create new UniProtKB annotations semi-automatically from PDB structures and to explore existing features in their structural context. The most important tool for the annotation task is SAALSA (Semi-Automatic Annotation from Local Structural Analysis), a web-based application designed to support the annotation of proteins using structural data. SAALSA gathers 3D information, maps it onto the protein sequence, and finally proposes annotations for various protein features. The curator can accept or reject any of these annotations, based on the different views provided by SAALSA. One of the main tasks was to find a balance between automatic and manual actions throughout the annotation pipeline. The integration of annotation rules and UniProtKB-PDB mappings into SAALSA enforces the consistency of UniProtKB/Swiss-Prot annotation. Apart from SAALSA, we have also developed an application that provides automatic updates of PDB cross-references in UniProtKB. Finally, we present a UniProtKB 3D Viewer that helps scientists visualize UniProtKB features, or any sequence position, on 3D structures having at least 70% sequence identity with the UniProtKB sequence.

To conclude, this work provides a strong basis for scientists to work with sequence and structural data together, and opens new perspectives for the visualization and interpretation of protein features in their structural context.


Background

To understand life, researchers study the behavior of the complex molecular components of cells. All known living cells, and even viruses, are built from amino acids, lipids, sugars and nucleic acids. Each of these molecular entities can polymerize in more or less complex and varied ways. Within a biopolymer (or macromolecule), the elementary units are called residues. Proteins (or polypeptides) are among the most ubiquitous biopolymers in cells, which reflects their importance in life processes. Chemically, proteins are directed linear chains of amino acids covalently linked by peptide bonds. Twenty different amino acids¹ can be incorporated into protein chains during their biosynthesis [Lodish H, 2007] (Figure 1).

Protein chain length varies from a few residues (i.e. peptides) to several tens of thousands of residues. In general, however, protein chains are composed of several hundred residues. The amino acid sequence of a given protein corresponds to its primary structure and partially determines its function(s).

Figure 1 - The basic 20 amino acids constituting proteins (source: Wikimedia). The amino group, the carboxyl group and the lateral chain of each amino acid are displayed in blue, green and red, respectively.

1 Two additional amino acids are rarely incorporated during protein biosynthesis: selenocysteine in some enzymes and pyrrolysine in enzymes of methanogenic archaea.


However, this relationship between sequence and function is not obvious at first glance, and predicting the function of an uncharacterized protein requires comparing its sequence with those of proteins that have already been characterized. Indeed, proteins sharing high sequence similarity can be suspected to share a similar function (and are then considered homologous proteins). To predict the function of a protein from its sequence, one can therefore perform multiple sequence alignments with homologous proteins or query databases such as PROSITE, which archives sequence patterns [Bairoch A, 1992] and/or profiles [Hulo N, 2006], or Pfam [Bateman A, 2002], which stores multiple alignments and hidden Markov models (HMMs). Sequences sharing the same pattern or matching the same profile/HMM are likely to belong to the same protein family and thus to have a common, or at least a related, function. Conserved patterns are also useful for understanding various biological mechanisms, such as subcellular location or interaction with other proteins.
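As a rough illustration of how a sequence pattern can be matched against a protein sequence, the following Python sketch converts a pattern written in PROSITE syntax into a regular expression. The function name `prosite_to_regex` and the toy sequence are assumptions made for this example only. The pattern used is the N-glycosylation site pattern N-{P}-[ST]-{P}, and the conversion covers only the most common syntax elements.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Only the common elements are handled: 'x' (any residue), [ABC] (allowed
    residues), {ABC} (forbidden residues), (n) or (n,m) repetitions, and the
    '<' / '>' anchors for the sequence termini.
    """
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("<", "^").replace(">", "$")
        element = element.replace("x", ".")
        # Braces must be converted before parentheses, because repetition
        # counts written as (n,m) become {n,m} in regex syntax.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        parts.append(element)
    return "".join(parts)

# Toy sequence invented for the example (not a real protein).
regex = re.compile(prosite_to_regex("N-{P}-[ST]-{P}."))
toy_sequence = "MKNGSAANPSTNQTL"
sites = [m.start() + 1 for m in regex.finditer(toy_sequence)]
print(sites)  # 1-based start positions of non-overlapping matches
```

Note that `finditer` reports non-overlapping matches only; a lookahead-based expression would be needed to report overlapping sites.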

Unfortunately, the functional importance of residues cannot always be detected using sequence similarity. Any single residue, whether or not it is conserved among homologous proteins, can be key to understanding the behavior of a protein.

The functional importance of single residues can be revealed, among other means, by site-directed mutagenesis experiments [Wu HF, 1993] or by the association of Mendelian diseases with single amino-acid polymorphisms (SAPs). The range of functions that can be attributed to individual residues is broad; we cite only a few here. Some residues can be modified (post-translational modifications, or PTMs) and provide proteins with new functionalities. Others are critical for the interaction of the protein with other proteins or other molecular species, or simply for the stability of the protein. When they are not conserved in sequence, these residues and their associated functions are mainly identified experimentally.

All these sequential features, computationally or experimentally characterized, need to be reported in sequence databases to provide the scientific community with a comprehensive and overall view of the relations between sequence and function.

Annotation of protein features is also a necessary basis for further analysis of new protein sequences by homology, as described previously.


UniProtKB, a protein sequence resource

Protein primary sequence information is comprehensively stored in the UniProt Knowledgebase (UniProtKB) [Bairoch A, 2005][UniProt consortium, 2008]. This resource is complemented by UniParc, an archive of all existing protein sequences, and UniRef, which provides sets of protein clusters according to sequence identity thresholds (UniRef100, UniRef90 and UniRef50 for 100, 90 and 50 percent sequence identity thresholds, respectively).

UniProtKB contains a set of protein entries, each identified by a unique accession number. It consists of the automatically annotated UniProtKB/TrEMBL section and the manually annotated UniProtKB/Swiss-Prot section.

UniProtKB/TrEMBL aims to keep pace with the incoming flow of data obtained mainly from sequencing projects. It is produced by in-silico translation of transcripts stored in the EMBL Nucleotide Sequence database. In contrast, in UniProtKB/Swiss-Prot, biological information about the protein product(s) of one gene is gathered into a single entry. Since release 56.1, UniProtKB/Swiss-Prot has provided annotations for the complete Homo sapiens proteome, which consists of 20,320 entries. The proteomes of three other model organisms, Escherichia coli, Saccharomyces cerevisiae and Schizosaccharomyces pombe, are also fully represented in UniProtKB/Swiss-Prot.

UniProtKB/Swiss-Prot is well known for providing high-quality annotations about the relevant biological properties and functions of proteins. To guarantee this quality, professional database curators follow a precise series of procedures for each protein, with the help of dedicated tools. We first give an overview of this annotation task.

Firstly, curators define a canonical (or master) sequence from the various sequences (mRNA, genomic) available, and report possible alternative sequences, as well as well-characterized sequence variations [Yip YL, 2008] or conflicts.

Secondly, as a core task, a general review of the literature is performed to collect the current scientific knowledge about the features and functions of the protein and to summarize it into annotations. Reviews and articles describing 3D structures of the protein have a high potential to bring relevant information [Hinz U, 2008], and constitute a key criterion to prioritize annotation. We will present specifically the different sources of information for structural data and how they are incorporated into UniProtKB entries (cf. “How to explore structural data?” and “Usage of 3D structures for the UniProtKB annotation” paragraphs, respectively). Another important source of information is obtained by running sequence analysis programs such as predictors of various protein features (e.g. transmembrane helix regions, glycosylation sites, signal sequences and subcellular location targeting).

Thirdly, curators report the list of literature references and cross-references to other databases that were used to generate annotations. The most important of these are nucleotide sequence databases, domain family databases, structural databases, disease databases and numerous organism-specific databases. The connectivity to other databases is essential to access additional information related to a given protein.

To ensure that the UniProtKB/Swiss-Prot format is respected during the manual transfer of information, controlled vocabularies exist for several fields, and numerous annotation rules, either intrinsic to a data type or constraining the relations between data types, are implemented in a syntax checker [Bairoch A, 2004].

The annotation in UniProtKB entries is distributed over several sections (or blocks), each describing a different type of information:

- Header containing mainly the nomenclature of the protein and of the associated gene.

- References to relevant articles in the literature.

- General annotation about the biological behavior and functions of the protein (CC block). Diverse keys (line types) in this CC block allow annotators to report various properties of proteins, for example the spatio-temporal expression, the biological function, the subcellular location or the pharmaceutical interest of the protein.

- Cross-references to numerous other databases (DR block).

- Keywords (KW block).

- Description of sequential protein features (FT block), including sequence variations, active sites, binding sites and post-translational modifications (PTMs). We will present later several kinds of sequential features that were of particular interest for this thesis work.

- The master sequence.
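The organization of an entry into blocks can be sketched in Python by grouping the lines of a flat-file entry according to their two-letter line code. This is only a minimal sketch: the entry below is a hypothetical, heavily shortened example whose accession, PDB identifier and values are invented, and the function name `group_by_line_type` is an assumption of this example.

```python
from collections import defaultdict

# Hypothetical, heavily shortened entry; all values are invented for illustration.
TOY_ENTRY = """\
ID   TOY_YEAST               Reviewed;         12 AA.
AC   X00000;
DE   RecName: Full=Toy protein;
CC   -!- FUNCTION: Invented function text.
DR   PDB; 0XXX; X-ray; 2.00 A; A=1-12.
KW   3D-structure.
FT   ACT_SITE        5
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

def group_by_line_type(entry_text):
    """Return a dict mapping each two-letter line code to its list of line bodies."""
    blocks = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        # Sequence data lines start with blanks instead of a line code.
        code = line[:2].strip() or "SEQ"
        blocks[code].append(line[5:].rstrip())
    return dict(blocks)

blocks = group_by_line_type(TOY_ENTRY)
sequence = "".join(blocks["SEQ"]).replace(" ", "")
print(sorted(blocks))
print(sequence)
```

Keeping the line bodies grouped by code makes it straightforward to hand each block (CC, DR, KW, FT) to a dedicated parser later on.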


This description shows that a UniProtKB entry provides most of the kinds of information needed to describe the function of a single protein. Selected parts of each block of the UniProtKB entry P12807 (describing a yeast peroxisomal copper amine oxidase) are shown in Figure 5.

Understanding how amino-acid sequences are represented in UniProtKB is of primary importance for the present work. In a UniProtKB entry, only one sequence is stored and displayed: the master sequence. The positional annotations in the entry (mainly in the FT block) refer to this sequence. However, in a number of cases, the active/functional proteins generated from one gene do not correspond to the master sequence. Thus, additional annotations are required to describe the sequences of the different protein products.

Alternative products of the same gene generated by alternative splicing² (or isoforms) are described in the general annotation section (CC ALTERNATIVE PRODUCTS). The corresponding sequences are not directly recorded in the entry, but can be reconstructed from FT VAR_SEQ records, which describe the sequence variations relative to the master sequence. Each of these alternative sequences can be accessed through a unique code: the corresponding UniProtKB accession number followed by a dash and a number. For example, the isoforms of the myelin basic protein (UniProt:P02686) have sequence identifiers ranging from P02686-1 (the master sequence in this case) to P02686-7.
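The reconstruction of an isoform sequence from variations relative to the master sequence can be sketched as follows. This is a minimal Python sketch under stated assumptions: the tuple representation of a VAR_SEQ-like record, the function names and the toy sequence are all invented for illustration and do not reproduce the actual UniProtKB record syntax.

```python
def apply_var_seq(master, edits):
    """Rebuild an isoform sequence from a master sequence.

    Each edit is (start, end, replacement), with 1-based inclusive positions
    on the master sequence; an empty replacement means the segment is missing.
    Edits are assumed not to overlap.
    """
    isoform = master
    # Apply edits from the C-terminus backwards so earlier coordinates stay valid.
    for start, end, replacement in sorted(edits, reverse=True):
        isoform = isoform[:start - 1] + replacement + isoform[end:]
    return isoform

def split_isoform_id(identifier):
    """Split 'P02686-3' into ('P02686', 3); a bare accession gives (accession, None)."""
    accession, _, number = identifier.partition("-")
    return accession, int(number) if number else None

toy_master = "ACDEFGHIKLMNPQRSTVWYA"          # arbitrary toy sequence
toy_edits = [(3, 6, ""), (15, 17, "GG")]      # invented VAR_SEQ-like records
print(apply_var_seq(toy_master, toy_edits))
print(split_isoform_id("P02686-3"))
```

Applying the edits in reverse positional order is a design choice that avoids recomputing coordinates after each insertion or deletion.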

In other cases, UniProtKB entries correspond to precursor proteins, which can be cut into smaller products having various functions. For example, in the cells of the anterior pituitary vesicle, enzymes cleave corticotropin-lipotropin precursors (UniProt:P01189) to produce mainly adrenocorticotrophin and beta-lipotropin.

2 In eukaryotic cells, a general mechanism known as splicing consists of the excision of some regions of mRNAs (i.e. introns) and the juxtaposition of the remaining regions (i.e. exons) to produce mature mRNAs used as templates for the synthesis of proteins. In addition, in higher eukaryotic cells (e.g. human cells), splicing events can occur alternatively in some cases and lead to the production of different mature mRNAs and thus to the production of different proteins.

Another frequently encountered case is that of viral polyproteins. In UniProtKB, these protein products are not associated with a specific sequence identifier, unlike alternative splicing products. However, the boundaries of each product are indicated in FT PEPTIDE lines. The corresponding names are indicated in the “Protein and gene names” section.

Finally, in rare cases, mainly in archaea and bacteria, proteins can excise a part of their own sequence (i.e. inteins [Gogarten JP, 2002]) and rejoin the two flanking boundaries. These intein regions are indicated in FT CHAIN records.

From sequences to structures

In their biological environment, proteins do not behave as linear chains of amino acids. In the cellular medium, the nascent chain of amino acids folds progressively into higher-order structures and thereby minimizes its free energy. As demonstrated several decades ago [Anfinsen, 1973], the information required for the folding of proteins, which is essential for their function, is encoded in their sequence.

Chaperones bind nascent polypeptide chains and help them to acquire their native fold. They also bind to partially unfolded proteins and protect cells against problems due to the accumulation of misfolded and non-functional proteins [Young JC, 2004].

The first step of the folding of proteins is the acquisition of well-characterized secondary structure elements: alpha-helices, beta-sheets, turns [Martin J, 2005]. For a given amino acid sequence, preferential conformations of the main chain drive the local formation of alpha helices and beta strands [Kleywegt GJ, 1996]. Then, through the probable formation of intermediate and unstable structures (molten globules) [Skolnick J, 1993][Kataoka M, 1997], secondary structures organize themselves into folds, leading to the tertiary structure of the protein.

Proteins can often associate with partner proteins (same or different proteins) to form active multimeric complexes. These stable protein-protein interactions (or assemblies) often play a role in regulating the function of proteins. For example, in adult humans, hemoglobin is a well-known biologically active complex resulting from the association of 2 alpha and 2 beta globins (UniProt:P69905 and UniProt:P68871, respectively). In this case, the association plays a role in an allosteric mechanism. The structural knowledge of these strong associations of proteins into functional complexes is also known as the quaternary structure of proteins.

3D-structural knowledge is essential for the understanding of biological mechanisms

The shape of a protein is directly related to its sequence and to its function. Knowledge of the 3D structure of proteins gives a more precise idea of their real behavior than sequence information alone. Indeed, the proximity of residues in space does not necessarily depend on their proximity in sequence. For example, residues forming the catalytic site of an enzyme (a protein catalyzing a chemical reaction) are rarely contiguous in sequence. If the structure of the enzyme has been obtained in the presence of its substrate, the 3D-structure unveils the precise location of these residues. Detailed analysis of protein 3D-structures is often the key to the physical understanding of enzymatic reactions [Ghosh D, 2007][Calderone V, 2006][Rutten L, 2006], of interactions with macromolecules (e.g. protein-protein or protein-nucleic acid chain interactions) or compounds [Russell RB, 2004][Huang YJ, 2008][Ban N, 2000], of functional conformational changes [Janssen BJ, 2006], and to the characterization of crosslinks between residues, as in the case of disulfide bonds. As an example, the crystal structure of Glutathione Reductase (UniProt:P00390) [Thieme R, 1981][Karplus PA, 1987] has enabled the understanding of the experimentally proven glutathione reduction activity by revealing the catalytic center of this enzyme. Moreover, knowledge at atomic detail is necessary for many applications, from the engineered enhancement of enzyme capabilities to the in silico refinement of compounds for drug design.
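The idea that residues distant in sequence can be close in space can be illustrated with a short sketch. The function name, the toy coordinates and both thresholds (8 Å between C-alpha atoms, at least 10 positions apart in sequence) are assumptions chosen for illustration only; they are not values taken from this work.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def long_range_contacts(
    ca_coords: Dict[int, Coord],
    max_distance: float = 8.0,   # assumed distance cutoff, in angstroms
    min_separation: int = 10,    # assumed minimal separation in sequence
) -> List[Tuple[int, int]]:
    """Return residue pairs that are close in space but far in sequence.

    ca_coords maps a residue position to the coordinates of its
    C-alpha atom.
    """
    positions = sorted(ca_coords)
    contacts = []
    for index, i in enumerate(positions):
        for j in positions[index + 1:]:
            far_in_sequence = (j - i) >= min_separation
            close_in_space = math.dist(ca_coords[i], ca_coords[j]) <= max_distance
            if far_in_sequence and close_in_space:
                contacts.append((i, j))
    return contacts


# Toy coordinates: residues 10 and 150 are 5 angstroms apart.
toy_coords = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    150: (5.0, 0.0, 0.0),
    200: (50.0, 0.0, 0.0),
}
toy_contacts = long_range_contacts(toy_coords)
```

With real data, the coordinates would come from the ATOM records of a PDB chain, and the residue positions would first have to be translated into the numbering of the sequence of interest.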

Structural data and dedicated repositories

There are two main methods to obtain protein structures: Nuclear Magnetic Resonance (NMR) [Wuthrich KJ, 1990] and X-ray crystallography [Kendrew JC, 1958]. Both have advantages and drawbacks [Doerr A, 2006][Snyder DA, 2005]. On the one hand, NMR structures can provide information on protein motion and conformational changes and, in particular, are useful for studying 3D structures in solution. On the other hand, X-ray crystallography can offer high-resolution structures of proteins without size limits, but its major limitation is that one has to obtain crystals of the protein, which is often difficult, especially for proteins that have intrinsically unstructured regions. Once obtained, structures are deposited and archived in the Protein Data Bank (PDB) [Berman HM, 2000]. Because of the constant improvement of techniques (mostly thanks to high-throughput apparatus for determining crystallization conditions [Stevens RC, 2000][Hui R, 2003] and synchrotron light sources [Hendrickson WA, 2000]), as well as the efficiency of recent large-scale structural genomics projects [Stevens RC, 2001][Kouranov A, 2006], the number of structures in PDB has undergone an exponential growth in recent years.

PDB structures are available in different formats. The most frequently used one is the PDB flat file format.

PDB flat files are mainly composed of two sections: a header section and a coordinate section. Annex 1 presents selected parts of a PDB file (PDB:2WEB). The header section (Figure 2-a) contains general information about the macromolecules and compounds used for the experiment, as well as details about the experiment itself and data about the structure quality. In particular, SEQRES records show the sequence of each protein present in the structure, and SEQADV records describe sequence variations compared to sequences obtained from other PDB entries or other databases (mainly UniProtKB and NCBI RefSeq). Atom by atom, the coordinate section successively describes each protein (or chain) composing the structure (Figure 3). Several models are successively proposed in the case of NMR structures. The main type of field encountered in this section is ATOM/HETATM (Figure 2-b). For X-ray structures, each of these lines describes the position, the chain label (a one-character code to identify each PDB chain), the coordinates, the temperature factor and the occupancy of one atom mapped to the electron density (Figure 3). For NMR structures, only the coordinates are indicated.
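As an illustration of the fixed-column layout of the coordinate section, the sketch below reads one ATOM/HETATM line. The column ranges follow the published PDB flat file format; the `AtomRecord` type and the `parse_atom_line` function are hypothetical names, and the parser deliberately ignores fields (alternate location, element, charge) that are not discussed here.

```python
from typing import NamedTuple, Optional


class AtomRecord(NamedTuple):
    record: str                   # "ATOM" or "HETATM"
    serial: int                   # atom serial number
    atom_name: str
    res_name: str                 # three-letter residue name
    chain_id: str                 # one-character chain label
    res_seq: int                  # residue number in the structure
    x: float
    y: float
    z: float
    occupancy: Optional[float]    # may be absent, e.g. for NMR models
    temp_factor: Optional[float]  # may be absent, e.g. for NMR models


def _optional_float(text: str) -> Optional[float]:
    text = text.strip()
    return float(text) if text else None


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one ATOM/HETATM line using fixed column positions."""
    line = line.rstrip("\n").ljust(80)
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {record!r}")
    return AtomRecord(
        record=record,
        serial=int(line[6:11]),          # columns 7-11
        atom_name=line[12:16].strip(),   # columns 13-16
        res_name=line[17:20].strip(),    # columns 18-20
        chain_id=line[21],               # column 22
        res_seq=int(line[22:26]),        # columns 23-26
        x=float(line[30:38]),            # columns 31-38
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
        occupancy=_optional_float(line[54:60]),    # columns 55-60
        temp_factor=_optional_float(line[60:66]),  # columns 61-66
    )
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in PDB flat files are not always separated by a space.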

In PDB flat files, the rigid column format is known to be a strong limitation for the description of structures. PDB flat files are also very difficult to parse because of numerous inconsistencies. Typical problems include violations of the PDB format specifications, inconsistent residue numbering and missing values for experimental parameters. In addition, the consistency between different data types is not always respected. While most of these types of errors were introduced during data submission, the data model itself has also been criticized [Schierz AC, 2007].

Members of the wwPDB consortium [Berman HM, 2003] have recently concentrated their efforts on enforcing the quality control of PDB files [Berman HM, 2007][Henrick K, 2008]. In 2006, the consortium released remediated PDB files that contain fewer syntax problems than older PDB file versions, notably concerning taxonomy information. Another major change was the replacement of the chain label ‘<space>’ by ‘A’ when only one chain is present in the PDB structure. Despite the recent efforts of data remediation, the PDB files still contain inconsistencies and errors. The existence of new formats like mmCIF is promising for the future validation of PDB data but does not solve the inconsistency problems present in the data submitted with the previous standard.

Figure 2 - General organization of PDB flat files into a header section and an atomic coordinate section.

Figure 3 - Structural data hierarchy and description of the format for the representation of atomic details of 3D-structures in PDB flat files.

Known folds (i.e. represented in existing PDB structures) are catalogued in the SCOP [Andreeva A, 2008] and CATH [Orengo CA, 1997] [Cuff AL, 2009] databases.

Finally, biological units are built from existing PDB structures and archived in the PQS database [Henrick K, 1998] and PISA [Collaborative Computational Project, Number 4, 1994][Krissinel E, 2007]. In this repository, assemblies are stored in separate files in PDB flat file format.

How to explore structural data?

Given their complexity, structural data as presented in PDB files are not directly usable. Obviously, the list of atomic coordinates is not directly interpretable in terms of residue interactions, protein-protein interfaces, the presence of clefts, the specificities of folds, etc. Apart from this problem of 3D visualization, information located in different places of the file cannot be compared easily by manual inspection. For example, boundaries of secondary structure elements are described separately from the SEQRES sequence, in specific HELIX/SHEET/TURN lines.

Textual information in the header of PDB flat files is written entirely in uppercase characters and is not presented in a user-friendly manner. Therefore, there is an important need to help users browse and mine structural data. Many tools have been developed to summarize structural data for a given PDB entry. Among them, the new official PDB website [Bourne PE, 2004] provides different views of structures and, notably, a sequence-structure data view that displays either the SEQRES sequence or the related UniProtKB sequence together with the secondary structure information.

PDBsum [Laskowski RA, 2005] is another site that provides useful information related to PDB entries in one main page and presents different specific views to describe structural features. As an example, PDBsum provides a user-friendly 2D representation of ligand binding sites computed with LIGPLOT [Wallace AC, 1995] (Figure 4). It also shows several 3D views of complete structures focusing on specific features (e.g. clefts, folds, surface) and provides links to numerous other specialized structural databases. Table 1 summarizes the type of information that can be retrieved from the PDB website, PDBsum and MMDB at NCBI [Wang Y, 2007].

Apart from such PDB browsers, specialized tools like MSD web services [Velankar S, 2005] are useful to extract information from PDB through complex queries. For example, through the MSDsite interface, it is possible to retrieve structures that present a binding site for a given ligand, within a certain distance and interacting with specific amino-acid types. Similarly, iMoltalk [Diemand AV, 2005] addresses the need to identify and highlight biochemically important regions in protein structures. It offers a tool to extract general information from PDB files, such as generic header information or the sequence derived from three-dimensional coordinates. It also provides various tools to map corresponding residues from sequence to structure; to search for contacts of residues (amino or nucleic acids) or heterogeneous groups to the protein, present cofactors and substrates; and to identify protein-protein interfaces between chains in a structure.

Recently, the emerging ability to integrate 3D structures into PDF publications [Kumar P, 2008] and the release of a new resource, Proteopedia [Hodis E, 2008], have also helped scientists interpret 3D-structures directly in articles. Proteopedia uses the most advanced functionalities of the Jmol viewer [Willighagen E, 2007], like animation and dynamic labeling of residues or atoms.

Figure 4 - 2D representation of the binding site in PDBsum. PDB ligand PP4³ for PDB entry 2WEB.

3 (Methyl(2S)-[1-((N-formy)-L-valyl)amino-2-(2-naphtyl)ethyl)hydroxyphosphinyloxy]-3-phenyl propanoate).

| | PDB website | PDBsum | MMDB |
|---|---|---|---|
| Experimental information | Type, Resolution, Rfree, etc | Type, Resolution, Rfree, etc | - |
| Literature | Summary | Summary + abstract + related figures | Summary |
| Secondary structures | Yes, on a separate page | Yes (small drawing); boundaries not specified | - |
| Folds | Full hierarchy in CATH and SCOP | - | - |
| Domains | - | Pfam | 3D domains |
| Ligands | Non redundant list + display | Non redundant list, number of occurrences, 2D rendering of the ligand and its environment | Non redundant list, number of occurrences, 2D rendering of the ligand |
| GO terms | Full hierarchy | Terminal nodes and links to access full information | - |
| Other additional information | - | Procheck, surface, literature referencing the structure | - |

Table 1 - Specificities of main web browsers for PDB entries: PDB website (www.pdb.org), PDBsum (www.ebi.ac.uk/pdbsum/) and MMDB (www.ncbi.nlm.nih.gov/sites/entrez?db=structure).

Usage of 3D structures for the UniProtKB annotation

In UniProtKB/Swiss-Prot, the use of 3D-structures is an important complement to the literature for the high-quality annotation of proteins [Bairoch A, 2005][Boeckmann B, 2005][Hinz U, 2008]. The database strives to provide users with the existing structural information. Therefore, manual annotation of protein structures in UniProtKB/Swiss-Prot involves not only ensuring a complete and accurate coverage of cross-references to these structures in the PDB, but also exploiting the structural data to annotate protein functions and features (Figure 5).

Cross-references to PDB are indicated in the cross-reference block (DR PDB lines in the UniProtKB flat file). Each DR PDB line references one taxonomically matched structure available from PDB. The line indicates the PDB entry code, the name of the PDB chain(s) matched to the reference sequence (the sequence shown in the UniProtKB entry), the experimental method used, the resolution (when available) and the boundaries of these matches on the UniProtKB reference sequence. The detailed format is reported in Annex 2. Both the correct association of a PDB chain to a UniProtKB entry from the same species and the alignment between the UniProtKB sequence and the sequence derived from the PDB chain are essential for cross-referencing. Except in the case of fusion proteins, a PDB chain can be mapped to only one UniProtKB entry (in DR PDB lines).
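The content of a DR PDB line can be read with a few lines of code. The sketch below assumes the layout `DR   PDB; <code>; <method>; <resolution>; <chains>=<start>-<end>.`, with `-` standing for a missing resolution and `/` separating chains that share the same boundaries; this layout is an assumption about the flat file syntax, and Annex 2 remains the reference. The example line, the `PdbCrossReference` class and the `parse_dr_pdb` function are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PdbCrossReference:
    pdb_id: str
    method: str
    resolution: Optional[float]          # in angstroms, None when absent
    chains: List[Tuple[str, int, int]]   # (chain label, start, end)


def parse_dr_pdb(line: str) -> PdbCrossReference:
    """Parse one DR PDB line under the layout assumed above."""
    if not re.match(r"DR\s+PDB;", line):
        raise ValueError("not a DR PDB line")
    body = line.split(";", 1)[1].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4:
        raise ValueError(f"unexpected number of fields: {fields}")
    pdb_id, method, resolution_text, chain_text = fields

    resolution = None
    if resolution_text != "-":
        resolution = float(resolution_text.split()[0])  # "2.00 A" -> 2.0

    chains = []
    for group in chain_text.split(","):
        match = re.fullmatch(r"([^=]+)=(\d+)-(\d+)", group.strip())
        if match is None:
            raise ValueError(f"unexpected chain field: {group!r}")
        labels, start, end = match.groups()
        for label in labels.split("/"):
            chains.append((label, int(start), int(end)))
    return PdbCrossReference(pdb_id, method, resolution, chains)


# Made-up line used only to exercise the parser.
example = parse_dr_pdb("DR   PDB; 1XYZ; X-ray; 2.00 A; A/B=1-120, C=30-95.")
```

Expanding grouped chains into one tuple per chain makes the result directly usable as a list of (UniProtKB entry, PDB chain) pairs with their boundaries.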

The growing amount of information coming from PDB makes a manual update of PDB cross-references across the entire UniProt Knowledgebase impossible. For this reason, PDB cross-references are mostly generated automatically, although curators can modify them. Changes in UniProtKB sequences may require a simultaneous update of the boundaries indicated in existing cross-references. In other cases, the automatically generated cross-references can simply be wrong and need to be manually modified.

From PDB structures and the literature, UniProtKB curators extract protein features. To explore structures, the tools presented in the previous section are widely used, in particular PDBsum, which provides data in the most concise way.

In UniProtKB/Swiss-Prot entries, the following sequence features can be directly generated from knowledge brought by PDB structures:

- For enzymes, active sites described in FT ACT_SITE fields;

- Small ligand and metal binding sites (FT REGION, FT BINDING, FT NP_BIND, FT METAL);

- Other important residues are reported in FT SITE (e.g. critical residues involved in protein-protein interactions or residues involved in ligand binding specificity);

- Post-translational modifications:

o Modified residues described in FT MOD_RES;

o Covalent linkage of sugars and lipids reported in FT CARBOHYD and FT LIPID lines, respectively;

o Intra- and inter-chain disulfide bonds (FT DISULFID);

o Atypical covalent bonds between residues (FT CROSSLNK).

- Secondary structure assignments (i.e. FT STRAND, FT HELIX, FT TURN).

- Sequence variations detected through high-resolution crystallography data (FT MUTAGEN mostly and more exceptionally FT VARIANT and FT CONFLICT).

Annotation of protein features and PDB cross-references follows a set of rules. Annex 2 and Annex 3 describe the annotation rules used to produce DR PDB cross-references and feature annotations related to ligand binding sites, respectively. Some annotation exceptions are reported in Annex 4 of this thesis.

From the knowledge of 3D-structures obtained from the literature and online tools, different kinds of comments (CC block) can be added to the entry to describe protein features in more detail and in their biological context. For example, comments can be added to provide information about the biological quaternary structure of the protein (CC SUBUNIT), the exact functions of the protein (CC FUNCTION), details about the binding of cofactors (CC COFACTOR) or details on post-translational modifications (CC PTM) (Figure 5). Protein-protein interactions (PPIs) validated by another experimental method such as gel filtration are manually reported in CC SUBUNIT lines. Usually, PPIs are not described in the FT block, even when they are demonstrated by a 3D-structure. In the UniProtKB entry, specific keywords are related to information extracted from 3D-structures. For example, the “3D-structure” keyword indicates that the entry is mapped to at least one PDB structure and the “Metal binding” keyword indicates that the protein binds at least one metal ion.

Note that the annotation of enzymes is particularly rich in features and comments potentially related to structural information. Indeed, as previously mentioned, enzymes present essential residues close in space, which are directly involved in a given chemical reaction (i.e. the catalytic site). In addition, these proteins can be crystallized at different stages of the enzymatic reaction and in the presence of their substrate(s), inhibitor(s), cofactors or metals, which helps reconstruct the sequence of chemical reactions.

Figure 5 - UniProtKB format and main fields involving 3D structures. An example with entry P12807. In light green, general information about the structure. In dark violet, information about metal binding sites. In orange, modified residue descriptions. In light violet active site definition. In yellow, disulfide bonds, in light blue experimental sequence variations and dark blue secondary structure list.

The limitations of the annotation of structural features are presented below:

As mentioned above, even if the cross-references to PDB are automatically generated from existing mappings, curators may want to add missing cross-references to the UniProtKB entry or modify erroneous ones. This task is not straightforward: specific BLAST searches are required to find potentially interesting PDB information, and additional investigation is necessary for each hit to find the ones matching at the taxonomic level. Usually, the boundaries in the UniProtKB sequence that must be reported in the cross-references to PDB are indicated at the end of the DBREF line in the PDB flat file. However, they are not reliable, probably because of update problems or annotation errors. Consequently, the evaluation and correction of boundaries require the use of several sources of information, such as PDBsum or the PDB website. Sometimes curators even have to look directly at the listed atoms and annotations (e.g. artifactual N- or C-terminal regions can be reported in SEQADV lines) in the PDB flat file, which is time-consuming.

Similarly, during the manual definition of binding sites for feature annotation, curators browse 3D structural information using external resources for each PDB entry of interest. Apart from gathering information from a review of the structure-related literature, curators mostly use PDBsum to help them define active sites, binding sites (mainly using the 2D view presented in Figure 4) and other protein features. PDBsum may also be used to insert additional information in CC lines (e.g. CC COFACTOR). In PDBsum, the information displayed is centered on structures, whereas the main interest for UniProtKB curators is the UniProtKB entry. More concretely, in 3D representations, the numbering of amino acids represents positions in the structure instead of positions in the corresponding UniProtKB sequence. For routine annotation work using 3D structures, it is tedious and time-consuming to deal with numbering differences between structures and sequences.
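The numbering problem can be made concrete with a small sketch that translates residue numbers of a PDB chain into positions on the UniProtKB sequence. The segment representation (first PDB residue number, last PDB residue number, first UniProtKB position), the function names and the toy values are all hypothetical; a real mapping would be derived from an alignment between the two sequences.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (first PDB residue number, last PDB residue number, first UniProtKB position)
Segment = Tuple[int, int, int]


def build_residue_map(segments: Iterable[Segment]) -> Dict[int, int]:
    """Expand gap-free segments into a residue-by-residue lookup table."""
    mapping: Dict[int, int] = {}
    for pdb_start, pdb_end, uniprot_start in segments:
        for offset in range(pdb_end - pdb_start + 1):
            mapping[pdb_start + offset] = uniprot_start + offset
    return mapping


def to_uniprot_positions(
    pdb_residues: Iterable[int], mapping: Dict[int, int]
) -> List[Optional[int]]:
    """Translate PDB residue numbers; None marks residues without a match."""
    return [mapping.get(number) for number in pdb_residues]


# Toy chain: residues 1-50 match positions 25-74, residues 60-80 match 84-104.
toy_map = build_residue_map([(1, 50, 25), (60, 80, 84)])
toy_positions = to_uniprot_positions([1, 50, 55, 60], toy_map)
```

Returning `None` for unmapped residues, rather than guessing an offset, keeps residues such as expression tags or unobserved regions from being silently assigned to a UniProtKB position.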

When incorporating 3D structural information into UniProtKB, multiple 3D structures describing the same protein are often available. Newer structures may provide additional structural knowledge about a given protein in comparison with older ones. They can simply be better resolved, but can also be resolved on different domains, in the presence of different compounds or after specific mutagenesis. In any of these cases, the exploration of a single structure is not sufficient to collect all the features offered by 3D-structures. As a consequence, traditional structural browsers centered on a single PDB entry are not suitable. Without a dedicated interface, curators are not able to browse data from different structures at the same time and to summarize multiple occurrences of the same or a similar observation for a given protein. SAS (Sequences Annotated by Structures) is a pioneering tool that was designed to collect structural features from different structures in order to annotate sequences [Milburn D, 1998]. The interface transfers structural annotations to sequences and presents a multiple sequence alignment of related PDB structures that are retrieved by a FASTA [Pearson WR, 1990] search on all PDB sequences. The interface and the results are clear, but still difficult to assess manually, as an atomic-level description of features is not available and direct visualization of features in the different structures is not possible.

Finally, the existing annotation rules used to produce UniProtKB/Swiss-Prot annotations are complex (cf. Annex 2, Annex 3 and Annex 4) and difficult to follow in a fully manual process. No tool is currently able to apply them automatically from the atomic-level structural information and the knowledge of equivalences between residues in PDB and UniProtKB.

Exploration of UniProtKB annotations in their 3D-structural context

Not only can structural data be used to enrich sequence annotation, but a wide range of sequence information can also benefit from an additional level of understanding once put in its structural context. The structural exploration of positional UniProtKB annotations can provide new insights into the relation between these features and the function of the protein. For example, the analysis of the structural environment of sequence features such as amino acid polymorphisms brings a new dimension to the understanding of the effect of a mutation [Steward RE, 2003]. The structural visualization of UniProtKB features is also often very useful for the understanding of new biological data and can assist in the design of experiments, notably directed mutagenesis [Maillard AP, 2005][Wang R, 2007]. For all of these reasons, the direct visualization of UniProtKB features on 3D structures should be facilitated.

The current UniProt website [UniProt Consortium, 2008] (www.uniprot.org) provides access to UniProtKB data in a very comprehensive and user-friendly manner. 3D-structural information, however, is still not directly accessible for individual features. Users can only use the cross-references in an entry as a starting point and then go to a specialized structural site to visualize the structure, in order to manually locate the sequence features reported in the UniProtKB entry.

Some applications have already been developed to help visualize sequential information on 3D structures:

- SRS 3D [O'Donoghue SI, 2004] is a web service that allows users to easily and rapidly find all related structures for a given target sequence. Through this tool, structures can theoretically be viewed together with sequences, alignments and sequence features (currently from UniProtKB, InterPro and PDB). According to its authors, this tool was designed to be used intuitively by non-experts in structures.

- The Distributed Annotation System (DAS) was presented more recently [Prlic A, 2007] as a network protocol to exchange biological data. DAS provides an interesting distributed solution for the management of the growing number of services and for the exchange between diverse biological data sets. Several extensions to the current DAS 1.5 protocol were set up, notably to share alignments and three-dimensional molecular structure data. Web sites and applications are currently using these new extensions. As an example, SPICE [Prlic A, 2005] is a Java application that aims to visualize PDB structures together with the sequence and structural annotations shared through the DAS network.

The problems encountered during the manual exploration of UniProtKB sequence feature annotations in their 3D-structural context (or, more generally, of any set of amino acids of interest) are basically the same as those encountered by UniProtKB curators when gathering knowledge from 3D structures. Nevertheless, in this case, scientists only need a simple interface dedicated to browsing and visualizing structural and sequence information together. Existing tools (SRS 3D and SPICE) do not meet this minimal requirement.

Aim

UniProtKB and PDB provide complementary data. Users and curators of these databases frequently need to transfer information from one database to the other. The general aim of this thesis work is to build a framework allowing a flexible interplay between sequences and structures.

Any task mixing sequence and structural information requires a full knowledge of the relations between PDB structures and UniProtKB entries. Such a resource as a whole (or the concept) is named a sequence-structure “mapping”. At the entry level, the plural of this term (i.e. “mappings”) will be used to describe the collection of pairs (sequence in a specific UniProtKB entry, PDB chain) constituting the whole mapping.

Mappings at the residue level are simply obtained by aligning the UniProtKB sequence and the PDB chain of each pair. In the first part of this thesis, we will present and evaluate SSMap, a new PDB-UniProtKB mapping method, designed to easily transfer information from sequences to structures and vice-versa (Figure 6).
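How an alignment yields a residue-level mapping can be sketched in a few lines. The example below is not SSMap: it uses `difflib.SequenceMatcher` from the Python standard library as a crude stand-in for a proper sequence alignment, only keeps identically matched blocks, and the function name and toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict


def residue_mapping(uniprot_seq: str, chain_seq: str) -> Dict[int, int]:
    """Map 1-based positions in the chain sequence to 1-based positions
    in the UniProtKB sequence, using identical matching blocks only.

    SequenceMatcher is a stand-in for a real alignment method and may
    give poor results on divergent or repetitive sequences.
    """
    matcher = SequenceMatcher(None, uniprot_seq, chain_seq, autojunk=False)
    mapping: Dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        # block.a indexes uniprot_seq, block.b indexes chain_seq (0-based).
        for offset in range(block.size):
            mapping[block.b + offset + 1] = block.a + offset + 1
    return mapping


# Toy example: the chain covers positions 4-16 of the UniProtKB sequence.
toy_uniprot = "MKTAYIAKQRQISFVKSHFSRQ"
toy_chain = "AYIAKQRQISFVK"
toy_mapping = residue_mapping(toy_uniprot, toy_chain)
```

The mapping is stored from chain to sequence; inverting the dictionary gives the other direction, which is what makes the transfer of information bidirectional.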

As shown in the previous paragraphs, both the creation of UniProtKB annotations from structural information and the exploration of sequence annotations in their 3D-structural context require a reference sequence-structure mapping. In fact, most of the limitations described previously for these two tasks are related to the lack of integration of such a mapping. By taking advantage of the SSMap mapping features, we developed new, adapted solutions to overcome these limitations.

Therefore, in the second part, we will present tools conceived specifically to integrate UniProtKB annotations with 3D structures and thus to help UniProtKB curators and users easily use and explore 3D-structures in their routine tasks:

- To summarize PDB information available for a given protein and generate UniProtKB annotations semi-automatically;

- To map and visualize existing UniProtKB sequence feature annotations onto 3D structures.

Figure 6 - General concept of a bidirectional mapping between PDB structures and UniProtKB sequences. The potential applications require a unique mapping.

Part 1 - Mapping structures to sequence

1.1 Introduction

This chapter describes a new method for the production of a sequence-structure mapping and provides a detailed comparison with available mapping sources.

By definition, a PDB-UniProtKB mapping is composed of two levels: an entry level and a residue level. Building a high-quality mapping requires solving problems at each of these two levels.

At the entry level, the ideal attribution of one PDB chain to one UniProtKB entry is not trivial. A first reason is the existence of extreme cases where identical sequences belong to closely related species, together with the redundancy in the TrEMBL section of UniProtKB. The exact proportion of such cases occurring in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL was unknown. Annotation errors or inconsistencies in the naming of taxa between UniProtKB entries and PDB can also prevent an unambiguous attribution of one UniProtKB entry to one specific PDB chain. A second kind of problem occurs with fusion proteins: different regions of a PDB chain can map to distinct entries.

At the residue level, the main problem comes from the difference in numbering between residues in the structure and in the related sequence. In X-ray crystallographic structures, the presence of gaps is a problem during the reconstruction of sequences from structures. This problem is described in Figure 7.

To correctly reconstruct sequences, as well as the relations between residues in sequence and structure, one has to consider multiple sources of information in the PDB file. Among them, the coordinates, the theoretical sequence provided by the authors in the SEQRES records and some additional annotations about the sequence (e.g. modified residues in MODRES lines) are essential pieces of information.
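As a small illustration of why MODRES annotations matter, the sketch below converts the three-letter residue names of a chain into a one-letter sequence. The function name and the example are hypothetical; the `modres` dictionary stands for the association between a modified residue name and its standard parent residue, as it could be collected from MODRES lines, and unknown names are rendered as `X`.

```python
from typing import Dict, Iterable, Optional

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def one_letter_sequence(
    residue_names: Iterable[str],
    modres: Optional[Dict[str, str]] = None,
) -> str:
    """Build a one-letter sequence from three-letter residue names.

    modres maps a modified residue name to its standard parent residue
    (e.g. selenomethionine "MSE" to "MET").
    """
    modres = modres or {}
    letters = []
    for name in residue_names:
        standard_name = modres.get(name, name)
        letters.append(THREE_TO_ONE.get(standard_name, "X"))
    return "".join(letters)


# Toy chain starting with a selenomethionine.
toy_sequence = one_letter_sequence(["MSE", "LYS", "THR"], {"MSE": "MET"})
```

Without the MODRES information, the modified residue would appear as an unknown residue and would weaken the alignment with the UniProtKB sequence.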
