• Aucun résultat trouvé

During this thesis work, SSMap, a new UniProtKB-PDB mapping resource at the entry and residue level has been constructed. SSMap is the first UniProtKB-PDB mapping resource to be fully compared with other similar mappings. This systematic comparison with other existent resources showed that SSMap is able to provide a set of independent mappings, which complement the current data. The comparison further allows identifying general problems in UniProtKB-PDB mappings so that both the accuracy and coverage of the UniProtKB cross-references to PDB, as well as and the optimal use of structural information for feature annotation can be enhanced.

Compared to other available mappings, SSMap presents also unique features such as bidirectional sequence-structure mapping, as well as isoform specificity.

Moreover, as we showed, residue-level mapping is not only available for entry-level mappings but for any (UniProtKB sequence, PDB chain) pair sharing down to 70%

sequence identity.

Then, the SSMap mapping was currently used for a wide range of applications. We presented in this thesis applications for the production of UniProtKB annotations (mainly SAALSA) from 3D structural information and for the visualization of any protein sequential features in 3D structures.

To facilitate the annotation of UniProtKB entries using PDB structures, we developed two tools that work in concert. The first one is a direct application of the comparison of mappings for the purpose of automatically create DR PDB cross-references to PDB in UniProtKB with a high quality level. To reach this aim, SIFTS, SSMap and existing PDB cross-references are compared and in function of agreement or disagreements, are used to create new DR PDB cross-references. We clearly showed the interest of using SSMap in this procedure to prevent the incorporation of erroneous mappings and to add new cross-references.

The second tool is SAALSA, a rather complex tool that help UniProtKB curators in many ways to produce annotation from 3D-structures. As we shown, the

strength of SAALSA is based on both the usage of SSMap and in the design of the application itself. Indeed, all the structural knowledge available for a given protein is summarized and grouped in specialized views in order to smoothly switch from a structural perspective towards a sequential representation. In SAALSA, we set up at best a compromise between automatic and manual tasks in order to maximize annotation consistency and minimize annotation time. The tool was the subject of an internal test phase, which triggered many improvements, and mainly the development of the part of the tool related to the annotation of ligand binding sites.

The mutual integration of these two first applications has been demonstrated to be critical for the update of cross-references to PDB in UniProtKB and for the consistency of these cross-references regarding to feature annotations. Indeed, manual modification of mappings through SAALSA allows curators first, to record easily their manual choices which will be taken into account primarily during the automatic production of PDB cross-references; and second, to evaluate simultaneously the impact on feature annotations.

The third application for this work was a tool, UP3D, allowing curators and UniProtKB users to visualize in a simply manner UniProtKB features in 3D structures.

The tool has been compared to existing similar ones. The main strength of the interface is that it is intuitive and dynamic, and is able to provide access to quaternary structures, and finally to display results coherent with SAALSA. Links to UP3D have been recently added in the new website for TubercuList (tuberculist.epfl.ch/), a specialized database on Mycobacterium tuberculosis genetics.

Other applications were more recently built on top of SSMap, exploiting the bidirectionality of the mapping. Firstly, we integrated information from the Catalytic Site Atlas (CSA) [Craig T, 2004] to provide an input for a procedure aiming to improve multiple alignments from which PROSITE profiles are defined. In addition, SSMap reconstructed sequences are now used to compute PROSITE profile matches on PDB structures. Secondly, SSMap was used to improve the ModSNP pipeline [Yip YL, 2004] and better integrate the structural information with single amino-acid polymorphism information. In fine, the integration of the mapping allowed us to design an extension to the existing variant pages [Procopiou H, 2007]. Thanks

to the mapping, properties of residues, like sequence conservation, are mapped and directly visualized on 3D-structures and compared with the spatial location of variants. Moreover, the mapping makes it possible to inform users if the variant is localized at a protein-protein interaction interface and to visualize it [Loichot G, 2008].

SSMap was also used as a support to interface a phylogenetic information data browser with structural views [Poggioli D, 2008].

Given the usefulness of the SSMap mapping for the scientific community, we have improved the accessibility of SSMap for external services. In this framework, we recently added a web service for a more flexible usage of SSMap from external programs. The development of this service was triggers by a request from the author of the STRAP tool [Gille C, 2001][Gille C, 2006]. In STRAP, SSMap is used as a reliable resource to find quickly 3D-structures closely related to given UniProtKB entries.

Figure 35 - Global workflow of this thesis work.

For the future, of course, as for any application, numerous improvements could be added to the tools that we developed during this thesis. Among the most important ones, we can note the integration of quaternary structure information in SAALSA.

This integration was already done at the level of the database and applied in the UniProtKB 3D Viewer tool. Taking into account the quaternary structure is of importance for a complete annotation of binding sites, especially in cases where the interface of protein chains are only observed in reconstructed biological units.

Among new perspectives for SSMap, we can note the growing interest of integrating structural information with many different biological data browsers; such as ViralZone (www.expasy.ch/viralzone), which is providing information centered on the viral world, or the Ensembl viewer centered on the genetic information (bbcfbrowser.epfl.ch).

To conclude, we showed that SSMap is an interesting resource for the comparison of sequential-based and structural-based knowledge; and is particularly useful in the context of protein annotation. In this thesis, we explored new solutions based on SSMap to summarize and represent multiple structural information or browse sequential information on 3D-structures. Most importantly, the present work ensures a coherency between UniProtKB annotations, SSMap mappings, and PDB structures at the residue and atomic levels. Finally, the SSMap mapping has been demonstrated to be a useful basis for many applications, and hopefully, will help in the development of future tools and research projects at the frontier between protein sequences and structures.

Documents relatifs