
CHAPTER II: Proteomics: Beyond cDNA

5. Proteome databases

Biological databases can be classified according to the type of information they provide, i.e. protein sequences, nucleotide sequences, patterns/profiles, proteomes, 3-D structures, PTMs, genomic and metabolic data. Historically, the expression “proteome databases” was used only to describe databases holding proteomics data, i.e. the data produced by the technologies described in the previous sections, mainly 2-DE gel images and mass spectra.

However, in line with a more comprehensive definition of current proteomics, the expression now embraces other data resources available to the scientific community. This section briefly describes some of the relevant data resources. A more extensive list is published yearly by the journal Nucleic Acids Research. Here we focus on sequence databases, 2-DE and mass spectrometry databases, and PTM databases.

5.1 Protein sequence databases

The most comprehensive source of protein information is found in protein sequence databases. These can be divided into universal databases, which store protein information from all types of biological sources, and specialized databases, which concentrate on restricted groups of protein families or organisms. Universal protein sequence databases can in turn be categorized into simple repositories of sequence data, mostly translated from DNA sequences, and annotated databases. The latter require the assistance of curators who screen the original literature and review articles as well as electronic archives.

Here we mainly describe Swiss-Prot [51], an annotated universal sequence database, and TrEMBL, an automatically generated sequence database that supplements Swiss-Prot, as well as their integration with other proteomics resources.

Swiss-Prot (http://www.expasy.org/sprot/) is a protein sequence database particularly known for its extensive annotation, minimal level of redundancy and high level of integration with other databases. Swiss-Prot is mainly manually curated, whereas the vast majority of TrEMBL entries are unannotated or automatically annotated. Created in 1986, Swiss-Prot includes, as of release 46.4, more than 170,000 entries from more than 9000 different species. Swiss-Prot is hosted mainly on the ExPASy server and its eight mirror sites.

Swiss-Prot entries consist of different line types that are grouped into sections. The description section includes, among others, the accession number (a unique entry identifier), the update dates, the protein description (its “name”), the gene name and the taxonomic origin. It is followed by the reference section, which, for each bibliographical reference, includes the type of experimental work contributing to the entry (sequencing, 3-D structure determination, mutagenesis studies, etc.), the author list and the literature references. The comment section then provides a variety of textual remarks classified into topics such as function, subunit, similarity, PTM, MS source, etc. The subsequent section holds the database cross-references, which provide active links to more than 60 different biological databases. The links to other proteomics databases such as SWISS-2DPAGE, PROSITE/InterPro and PDB allow rapid access to experimental proteomic data, such as the position and number of protein spots on a 2-DE gel, other members of the same family, or the 3-D structure of the protein. A keyword section is followed by the so-called “feature table” section, which describes regions or sites of interest in the sequence as well as documented PTMs, binding sites, active sites, secondary structures, variants, conflicts, etc. The entry ends with the amino-acid sequence itself, which is that of the unprocessed precursor of the protein, before any post-translational modification or processing. Tools that identify proteins from mass spectra should therefore ideally use the information held within the feature table: to obtain the best approximation of the protein in its mature state, the signal sequence and propeptides should be removed before computing pI and Mr. ExPASy-based proteomics tools such as Aldente and FindMod use the Swiss-Prot annotation to improve their ability to identify and characterize active chains and proteins annotated with PTMs.
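The effect of using the feature table before computing Mr can be illustrated with a minimal Python sketch. It assumes that the feature table has already been parsed into (key, start, end) tuples with 1-based inclusive positions, and that SIGNAL and PROPEP are the keys marking the regions to remove. The precursor sequence, the feature list and the function names are hypothetical, and the average residue masses are commonly tabulated values that should be checked against a reference before being relied upon. This is not the method used by any specific ExPASy tool.

```python
# Average residue masses in daltons (commonly tabulated values; an assumption
# of this sketch, not taken from the chapter).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water molecule is added for the free N- and C-termini


def mature_sequence(precursor, features):
    """Drop residues covered by SIGNAL or PROPEP features (1-based, inclusive)."""
    removed = set()
    for key, start, end in features:
        if key in ("SIGNAL", "PROPEP"):
            removed.update(range(start, end + 1))
    return "".join(
        aa for pos, aa in enumerate(precursor, start=1) if pos not in removed
    )


def average_mass(sequence):
    """Average molecular mass of an unmodified linear peptide chain."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER


# Hypothetical precursor and feature table, for illustration only.
PRECURSOR = "MKLLVAGSPEPTIDEK"
FEATURES = [("SIGNAL", 1, 5), ("PROPEP", 6, 8), ("CHAIN", 9, 16)]

mature = mature_sequence(PRECURSOR, FEATURES)
print(mature, round(average_mass(mature), 2), round(average_mass(PRECURSOR), 2))
```

The Mr of the precursor is necessarily larger than that of the mature chain, which is why a tool that matches experimental masses against the unprocessed sequence may miss the correct protein.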

Since its creation, Swiss-Prot has relied on high-quality manual and computer-assisted annotation. The large number of genome sequencing projects, however, keeps increasing the number of sequences that have to be incorporated into Swiss-Prot. This is where TrEMBL (Translation of EMBL Nucleotide Sequence Database [79]) steps in. TrEMBL was created in 1996 as a supplement to Swiss-Prot and consists of computer-annotated entries in a Swiss-Prot-like format. It is populated with protein sequences translated from the coding sequences (CDS) in EMBL. In a way, it can be considered a waiting room for Swiss-Prot: once manually annotated, entries are transferred to Swiss-Prot.
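As a rough illustration of how a protein sequence is derived from an annotated coding sequence, the following Python sketch translates a CDS with the standard genetic code. It is not the TrEMBL production pipeline: the function name `translate_cds` and the example sequences are hypothetical, and alternative genetic codes, partial CDS and nucleotide ambiguity codes are deliberately ignored (unknown codons are simply reported as X).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA input
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAATGA"))
```

A sequence obtained this way carries no curated annotation, which is the gap that the transfer from TrEMBL to Swiss-Prot is meant to close.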

Since 2003, when the maintainers of Swiss-Prot and TrEMBL (the SIB and the European Bioinformatics Institute) joined forces with the PIR group at Georgetown University to form the UniProt Consortium [80], Swiss-Prot and TrEMBL have together also been known as the “UniProt Knowledgebase”.

As noted at the beginning of this section, many specialized protein sequence databases are available. Their contents vary widely in range of interest, number of entries, type of information and quality of the data. They are listed and detailed in the special database issue of the journal Nucleic Acids Research as well as on the ExPASy server (http://www.expasy.org/links.html).

5.2 2-DE gel databases

Among proteomics databases, those containing 2-DE gel images with identified proteins, also known as reference maps, are widely used. These databases, usually freely accessible to academics through the Internet, contain clickable maps in which each identified spot is linked to its identification method and to the description of the identified protein. SWISS-2DPAGE [3] (http://www.expasy.org/ch2d/) is the oldest and largest such 2-DE database. Created and maintained at the Swiss Institute of Bioinformatics in collaboration with the University Hospital of Geneva, it contained in April 2005 nearly 40 reference maps from various species, including human, mouse and Escherichia coli. More than 1300 entries document over 4000 identified “spots”. The proteins represented by these spots were identified by matching with other gels, by amino-acid composition, by Edman sequencing, by immunoblotting and, mostly, by MS. The text format of each entry is similar to the Swiss-Prot model. It includes specific fields such as the type of master gel on which the protein spot was identified, the list of gel images associated with the protein entry, as well as other 2-DE specific data, such as the mapping procedure, the spot identifier, the experimental pI and Mr, the MS data, and quantitative data about protein expression (i.e., physiological and pathological levels, polymorphisms or modifications in specific conditions). The database is cross-referenced to Swiss-Prot, and when no identified spot exists in SWISS-2DPAGE for a given Swiss-Prot entry, an image is generated highlighting the theoretical position of the corresponding protein.

Relevant literature references are provided with links to PubMed. SWISS-2DPAGE data are curated following the Swiss-Prot database standards, i.e. experts manually review the information before making it available. They also follow the MIAPE (Minimum Information About a Proteomics Experiment) guidelines for reporting proteomics experiments recommended by the HUPO Proteomics Standards Initiative (PSI) [81]. The MIAPE data exchange model is named by analogy with the MIAME (Minimum Information About a Microarray Experiment) model.

The 2-D database of the Max Planck Institute for Infection Biology (http://www.mpiib-berlin.mpg.de/2D-PAGE/) is also among the most frequently updated proteomic databases, containing over 20 gels of microbial organisms as well as human, mouse and rat [82]. This database is now part of an interconnected proteome system containing information such as MS spectra, ICAT-LC-MS spectra, and textual descriptions of experimental protocols or results of protein identification [83 200]. The whole relational database and querying system is implemented in MySQL and uses other open source software such as R for data analysis and graphics. The local proteomics databases are extensively linked to external public genomic and metabolic databases.

The number of 2-D PAGE databases and related data is slowly but continuously increasing. An up-to-date list can be found in WORLD-2DPAGE (http://www.expasy.org/ch2d/2d-index.html), an index of 2-DE databases and services. More than 25 species are represented in about 300 2-DE maps worldwide. The databases are established in various formats. However, an increasing number follow the principle of federated 2-DE databases [84], according to which the organization of and access to a database must comply with five simple rules. This set of rules was created to homogenize the querying and presentation of such proteomic databases and to help interconnect similar data through a cross-reference system, and consequently to share and distribute 2-DE data more effectively.

Following those guidelines, the Make2D-DB II package was developed to help research groups to create their own 2-DE databases [85]. This free package not only helps non-experts to publish their data on the Internet, but it also provides a graphical interface with query capabilities. It can be obtained at http://www.expasy.org/ch2d/make2ddb.html.

5.3 Mass spectra repositories

Mass spectra databases are still in their early stages. Three public repositories exist so far. The Open Proteomics database (http://bioinformatics.icmb.utexas.edu/OPD/) is a collection that contains approximately 400,000 spectra representing different experiments from E. coli, H. sapiens, S. cerevisiae and Mycobacterium smegmatis [86]. The mzXML Data Repository (http://sashimi.sourceforge.net/index.html) also contains a small number of collections of MS data obtained with different instruments (mainly ThermoFinnigan LCQ and Micromass Q-TOF Ultima) and various mixtures of proteins. The group that maintains this repository also distributes tools for MS analysis and has created the mzXML format for the representation of MS data. The third repository, PeptideAtlas (http://www.peptideatlas.org/), contains a collection of identified peptides from LC-MS/MS experiments. Currently, the experimental results contained in these repositories are not very detailed and data formats are excessively diverse, making their use by other groups difficult.

5.4 Post-translational modification databases

In an era in which more than 100 complete genomes are sequenced per year, understanding proteins and proteomes also depends on understanding protein modifications that cannot be predicted from the nucleic acid sequences. Most proteins indeed contain PTMs and are not functional unless they are modified. While Swiss-Prot, as a universal database, places considerable emphasis on the documentation of post-translational modifications within the sequence records, several specialized databases have been set up in recent years to serve this growing field.

RESID [87] is a general database of protein structure modifications (http://www-nbrf.georgetown.edu/pirwww/dbinfo/resid.html), maintained by the National Biomedical Research Foundation in the USA and the Protein Information Resource group (PIR). The database contains descriptive, chemical, structural and bibliographical information on 339 (Release 34.00, Jun 2003) types of modified amino-acid residues. Apart from text-based searches, RESID can also be queried by molecular weight: an average or mono-isotopic mass can be entered (together with a mass variance) to search for all modified amino-acid residues in the database with masses similar to the input mass. Unimod (http://www.unimod.org/) can be seen as a complementary database to RESID [88]. It is a database for verifiable spectrometric mass values of natural or artificial modifications. It is especially dedicated to mass spectrometric analysis software.
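The principle of a mass-based query with a tolerance can be sketched in a few lines of Python. The sketch below does not reproduce the RESID or Unimod interfaces or their records: the small table of modification mass shifts holds widely cited values included only for illustration (an assumption of this example), and the names `Modification` and `search_by_mass` are hypothetical.

```python
from typing import NamedTuple


class Modification(NamedTuple):
    name: str
    monoisotopic: float  # mass shift in daltons
    average: float       # mass shift in daltons


# Illustrative table of common mass shifts (assumed values, not database records).
MODIFICATIONS = [
    Modification("methylation", 14.015650, 14.0266),
    Modification("oxidation", 15.994915, 15.9994),
    Modification("acetylation", 42.010565, 42.0367),
    Modification("sulfation", 79.956815, 80.0632),
    Modification("phosphorylation", 79.966331, 79.9799),
]


def search_by_mass(mass, tolerance, mass_type="monoisotopic"):
    """Return modifications whose mass lies within +/- tolerance, closest first."""
    if mass_type not in ("monoisotopic", "average"):
        raise ValueError("mass_type must be 'monoisotopic' or 'average'")
    hits = [
        m for m in MODIFICATIONS
        if abs(getattr(m, mass_type) - mass) <= tolerance
    ]
    return sorted(hits, key=lambda m: abs(getattr(m, mass_type) - mass))


# A wide tolerance cannot separate phosphorylation from sulfation.
print([m.name for m in search_by_mass(79.966, 0.02)])
# A narrower tolerance can.
print([m.name for m in search_by_mass(79.966, 0.005)])
```

The example shows why the mass variance entered with the query matters: near-isobaric modifications are only distinguished when the tolerance is tighter than their mass difference.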

Other databases specialize in one particular type of PTM. For example, two databases have so far been devoted to glycosylation. The public O-GLYCBASE v6.00 contains about 240 descriptions of glycoproteins that have been experimentally verified to have an O- or C-glycosylation site [89]. O-GLYCBASE entries show the type of O-linked sugar involved, the species, the sequence, links to the literature and cross-references to sequence and structure databases such as Swiss-Prot and PDB. GlycoSuiteDB (http://www.glycosuite.com/) is an annotated database of glycan structures; access is restricted to accredited users and subject to license fees. The database is provided by Proteome Systems Ltd and contains information about most published O- and N-linked glycans [90]. It is cross-referenced to Swiss-Prot/TrEMBL and can be queried by mass, by attached protein, by oligosaccharide composition or through different modes of textual queries (taxonomy, biological source, etc.).

Protein phosphorylations are currently described in at least three databases that differ in curation level, detail of information and scope of organisms. Phospho.ELM [91] (http://phospho.elm.eu.org/) and the Phosphorylation site database [92] (http://vigen.biochem.vt.edu/xpd/xpd.htm) describe experimentally verified covalent phosphorylations of serine, threonine or tyrosine residues in proteins from eukaryotes (human, mouse, rat and a thousand other organisms) and prokaryotes, respectively.

PhosphoSite™ (http://www.phosphosite.org/) is a curated database dedicated to in vivo phosphorylation sites, particularly in human and mouse proteins [93]. Lipoproteins are the object of DOLOP (http://www.mrc-lmb.cam.ac.uk/genomes/dolop/), which is restricted to bacterial lipoproteins. This server also offers a predictive algorithm that searches unknown prokaryotic sequences for lipoboxes, as well as lists of predicted lipoproteins for multiple completed bacterial genomes [94].

Although the current number of specific PTM databases is still quite small considering the number of known PTMs, it has doubled in the last few years. It is expected that they will multiply because of the increasing amount of data on PTM structures becoming available.

5.5 General considerations on databases

It is clear that the databases described above do not cover all aspects of proteomics. We did not mention databases that use sequence databases to perform calculations and analyses, such as sequence clustering, phylogeny or profile searching, and thus create added-value databases. Other databases report results from functional studies and mutational experiments, or from 3-D structure determination, or describe metabolic pathways. Those were not mentioned here either. It would be impossible to be exhaustive. Some of the databases have already been treated in other chapters. Some are permanently updated, some have only a short existence, and some are not even publicly available. Proteomics databases, as well as data formats, are developed in a dynamic, uncoordinated way. To overcome this issue and to facilitate the exchange, dissemination and analysis of the multitude of proteomics data produced by many laboratories, the Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) has been working on generalized standard representations [81]. Among them are guidelines for reporting proteomics experiments through the Minimum Information About a Proteomics Experiment (MIAPE) data integration model, XML formats for micro-array and mass spectrometry data exchange, a list of proteomics ontologies and guidelines for the comparability of search engine results.

Proteomics information in most databases is accumulated rather than integrated into a systemic view. Biological understanding is incomplete without quantification and chronology. If the goal of accumulating information is to discover or reveal function and the related biochemical mechanisms, the available information has yet to be interconnected, weighed and ordered.

Proteome databases are moving from the stage of simple repositories to interconnected systems with intelligent means of knowledge production.

6. Conclusion

The development of diagnostic and predictive tools as well as successful therapies for complex polygenic diseases including diabetes, cancer and cardiovascular diseases requires the understanding of the fundamental biological mechanisms implicated in these disorders. This can be achieved under defined environmental conditions with strategies that combine genetic and proteomic tools. Proteome analysis has the ability to detect and identify polypeptides that correlate with disease states, and further lead to the discovery of potential molecular markers and therapeutic targets. Furthermore, proteomic technologies can display the pharmacological and toxic effects of candidate drugs on a disease process. There is a close relationship between drug treatment, protein expression and resulting physiological effects.

Most of the time, pharmacological mechanisms entail the secondary regulation or modulation of gene product expression, in much the same way that complex disease processes alter global protein expression. From this, we can assert that the best drug should be the one that restores the global protein expression of a disturbed organism to a normal state. In addition, it is quite unusual for a drug to modulate only the gene products involved in the disorder. Most of the time, it also perturbs the expression of proteins that are not involved in the disease, which leads to side effects. Proteomics and bioinformatics are deeply implicated in the understanding of disease and drug effect mechanisms and in the design of new drug therapies. This chapter has given only a brief overview of their combined capabilities.

7. Acknowledgement

The authors would like to acknowledge Dr Manfredo Quadroni (Protein Analysis Facility, Lausanne University, Switzerland) for the LC-MS data.

8. References

[1] Wilkins, M. R., Williams, K. L., Appel, R. D., Hochstrasser, D., Proteome Research: New Frontiers in Functional Genomics Springer-Verlag, Berlin Heidelberg New York 1997.

[2] Anderson, L., Seilhamer, J., A comparison of selected mRNA and protein abundances in human liver. Electrophoresis 1997, 18, 533-537.

[3] Hoogland, C., Mostaguir, K., Sanchez, J. C., Hochstrasser, D. F., Appel, R. D., SWISS-2DPAGE, ten years later. Proteomics 2004, 4, 2352-2356.

[4] O'Farrell, P. H., High resolution two-dimensional electrophoresis of proteins. J Biol Chem 1975, 250, 4007-4021.

[5] Klose, J., Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A novel approach to testing for induced point mutations in mammals. Humangenetik 1975, 26, 231-243.

[6] Scheele, G. A., Two-dimensional gel analysis of soluble proteins. Characterization of guinea pig exocrine pancreatic proteins. J Biol Chem 1975, 250, 5375-5385.

[7] Gorg, A., Weiss, W., Dunn, M. J., Current two-dimensional electrophoresis technology for proteomics. Proteomics 2004, 4, 3665-3685.

[8] Yan, J. X., Wait, R., Berkelman, T., Harry, R. A., et al., A modified silver staining protocol for visualization of proteins compatible with matrix-assisted laser desorption/ionization and electrospray ionization-mass spectrometry. Electrophoresis 2000, 21, 3666-3672.

[9] Unlu, M., Morgan, M. E., Minden, J. S., Difference gel electrophoresis: a single gel method for detecting changes in protein extracts. Electrophoresis 1997, 18, 2071-2077.

[10] Poole, C. F., Thin-layer chromatography: challenges and opportunities. J Chromatogr A 2003, 1000, 963-984.

[11] Lescuyer, P., Hochstrasser, D. F., Sanchez, J. C., Comprehensive proteome analysis by chromatographic protein prefractionation. Electrophoresis 2004, 25, 1125-1135.

[12] LaCourse, W. R., Column liquid chromatography: equipment and instrumentation. Anal Chem 2002, 74, 2813-2831.

[13] Eiceman, G. A., Gardea-Torresdey, J., Overton, E., Carney, K., Dorman, F., Gas chromatography. Anal Chem 2004, 76, 3387-3394.

[14] Berthod, A., Carda-Broch, S., Determination of liquid-liquid partition coefficients by separation methods. J Chromatogr A 2004, 1037, 3-14.

[15] Nikitas, P., Pappa-Louisi, A., Papachristos, K., Optimisation technique for stepwise gradient elution in reversed-phase liquid chromatography. J Chromatogr A 2004, 1033, 283-289.

[16] Staby, A., Jensen, I. H., Mollerup, I., Comparison of chromatographic ion-exchange resins. I. Strong anion-exchange resins. J Chromatogr A 2000, 897, 99-111.

[17] Barth, H. G., Boyes, B. E., Jackson, C., Size exclusion chromatography. Anal Chem 1994, 66, 595R-620R.

[18] Lee, W. C., Lee, K. H., Applications of affinity chromatography in proteomics. Anal Biochem 2004, 324, 1-10.

[19] Fenn, J. B., Mann, M., Meng, C. K., Wong, S. F., Whitehouse, C. M., Electrospray ionization for mass spectrometry of large biomolecules. Science 1989, 246, 64-71.

[20] Tanaka, K., Waki, H., Ido, Y., Akita, S., et al., Protein and Polymer Analyses up to m/z 100000 by Laser Ionization Time-of-flight Mass Spectrometry. Rapid Communications in Mass Spectrometry 1988, 2, 151-153.

[21] Steen, H., Mann, M., The ABC's (and XYZ's) of peptide sequencing. Nat Rev Mol Cell Biol 2004, 5, 699-711.

[22] Aebersold, R., Mann, M., Mass spectrometry-based proteomics. Nature 2003, 422, 198-207.

[23] Roepstorff, P., Fohlman, J., Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed.Mass Spectrom. 1984, 11, 601.

[24] Cserhati, T., Mass spectrometric detection in chromatography. Trends and perspectives. Biomed Chromatogr 2002, 16, 303-310.

[25] Washburn, M. P., Wolters, D., Yates, J. R., III, Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat.Biotechnol. 2001, 19, 242-247.

[26] Regonesi, M. E., Del, F. M., Basilico, F., Briani, F., et al., Analysis of the Escherichia coli RNA degradosome composition by a proteomic approach. Biochimie 2006, 88, 151-161.

[27] Cagney, G., Park, S., Chung, C., Tong, B., et al., Human Tissue Profiling with Multidimensional Protein Identification Technology. J.Proteome.Res. 2005, 4, 1757-1767.

[28] Zhou, X. W., Kafsack, B. F., Cole, R. N., Beckett, P., et al., The opportunistic pathogen Toxoplasma gondii deploys a diverse legion of invasion and survival proteins. J.Biol.Chem. 2005, 280, 34233-34244.

[29] Poetz, O., Schwenk, J. M., Kramer, S., Stoll, D., et al., Protein microarrays: catching the proteome. Mech Ageing Dev 2005, 126, 161-170.

[30] Gygi, S. P., Rochon, Y., Franza, B. R., Aebersold, R., Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 1999, 19, 1720-1730.

[31] MacBeath, G., Protein microarrays and proteomics. Nat Genet 2002, 32 Suppl, 526-532.

[32] Haab, B. B., Dunham, M. J., Brown, P. O., Protein microarrays for highly parallel
