
CHAPTER II: Proteomics: Beyond cDNA

1. Introduction and principles

The word Proteome was first used in 1994 to describe the PROTEin complement of a genOME [1]. It denotes the ensemble of protein forms expressed in a biological sample at a given point in time and in a given situation. Two years later, the word Proteomics was introduced to designate, in a very simplified way, the study of proteomes. A broader definition holds that proteomics is the science of the global analysis of proteins, which includes their identification, the measurement of their expression levels and their partial characterization. Other definitions state that proteomics is the large-scale study of proteins, in particular their structures, functions and interactions. Whatever the definition adopted, proteomics is in constant evolution and relies on efficient protein separation techniques, mass spectrometry and bioinformatics, as well as gene and protein databases.

There are numerous proteomes for a single genome, and proteomes are much more complex than genomes. DNA chip technology allows the simultaneous analysis of the expression of thousands of genes at the mRNA level and can unravel biological processes. However, the correlation between mRNA expression and protein expression is low [2]. For instance, the dynamic range between transcription factors and albumin can be 10^12 or higher. Moreover, the complete sequences of active proteins can only be partially deduced from their corresponding gene sequences. In fact, during and after the transcription and translation processes leading to an active gene product, alterations often occur, such as alternative splicing, N-terminal truncation and post-translational modification (PTM). In addition, a single gene can be expressed as more than 20 protein forms in a single tissue: for example, at least 22 different protein forms matching alpha-1-antitrypsin have been described in human plasma. Proteomics also involves the description of the events generating the modifications that these proteins carry as functional entities. Finally, a single organism will have radically different protein expression patterns at different stages of its life cycle and in different cells, tissues and fluids.

Therefore, the analysis and the understanding of the suspected half a million to one million proteins in humans, expressed by a number of genes currently estimated to be around 25,000, represent a real challenge for Proteomics from the technological, analytical and bioinformatics points of view. Methods involving high-resolution protein separation, parallelization of sample preparation, automation of experimental processes and of database comparison, as well as powerful and specific visualization tools, had to be developed and integrated.

The development of new diagnostic tests and the discovery of new drug therapies both depend on the capacity to analyze complex systems. Proteomics offers the possibility to identify disease markers, to discover drug targets or to observe the global influence of a drug in a complex mixture of proteins. This will be possible only if one can describe the identity, the occurrence and the interaction of each individual component of this mixture.

Proteomics brought a real revolution to biochemistry and molecular biology, essentially thanks to advances in experimental technology and their combination with bioinformatics. Because of the chemical and physical complexity of proteomes, various methodological approaches have been considered so far. Nevertheless, a representative workflow for a proteome analysis is given in Figure 1. This pathway includes most of the wet-lab (analytical) and dry-lab (bioinformatics) steps required for the complete analysis of a proteome:

• The first crucial step is the choice of sample. It can be a raw biological fluid, a cell extract, a fraction of a sample, etc. Above all, this choice depends strongly on the separation method to be applied, since the sample has to be compatible with the dynamic range that the separation can handle. The proteins contained in this sample then have to be separated. In proteomics, one-dimensional, two-dimensional or capillary electrophoresis (respectively 1-DE, 2-DE, CE), liquid chromatography (LC), or a combination of them, are the preferred methods.

• Once a separation method has been chosen, the next step is the analysis of the result and the selection of the proteins to be identified. In the case of a separation by 2-DE gels, the analysis is carried out with image analysis software. This kind of software makes it possible to visualize images, to compare them, and to perform a number of comparative analyses that track statistically significant changes in protein expression between populations of gels/samples. It helps to highlight the proteins of interest. Computer analysis of proteomics images is discussed in Section 3.

• To proceed further, proteins separated with LC or those selected from a gel analysis are submitted to post-separation analysis. This experimental step determines highly specific protein attributes, such as peptide mass fingerprints or amino-acid sequence information, when preceded by endoproteolytic cleavage. In endoproteolytic cleavage the proteins are typically incubated with an enzyme that recognizes particular amino acids and cuts the polypeptide chains at specific cleavage sites. The reaction produces shorter peptides that are fragments of the so-called digested proteins. Today's standard procedure most often involves a protein digestion step with trypsin and the analysis of the generated fragments with a crucial tool in proteomics, mass spectrometry (MS). Separation and post-separation techniques are briefly described in Section 2.

• The intensive use of liquid chromatography and mass spectrometry in proteomics analysis has opened a new domain in proteomics imaging. Representing liquid chromatography-mass spectrometry datasets as two-dimensional plots highlights redundancy in the data that is not necessarily observable when they are displayed in 1-D. Even though the analysis of this kind of image is still in its infancy, its potential and advantages can already be anticipated. Computer analysis of proteomics images is also discussed in Section 3.

• Once in possession of the protein attributes acquired through the previous experimental steps, we can then move to database search. This search identifies a protein by looking for the best match between experimental data and data obtained by in-silico processing and "digestion" of proteins in a sequence database. The identification (determining the name or sequence of the known proteins) and characterization (obtaining information about their function, cellular localization, post-translational modifications, etc.) procedures using bioinformatics tools are the topic of Section 4.

• Comprehensive sequence databases are a prerequisite for successful protein identification, and data from the identified proteins and samples are in turn used to populate specific proteomics databases. Section 5 browses some of the databases necessary for current proteomics projects.
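The two-dimensional representation of LC-MS data mentioned in the workflow above can be illustrated with a minimal sketch. It assumes that each detected peak is available as a (retention time, m/z, intensity) triple and sums intensities into a coarse grid. The function name `bin_lcms_peaks`, the bin widths and the peak values are hypothetical and chosen only for illustration; real software uses far finer binning and additional processing.

```python
from collections import defaultdict


def bin_lcms_peaks(peaks, rt_bin=1.0, mz_bin=1.0):
    """Sum peak intensities into a sparse 2-D grid.

    Keys are (retention-time bin index, m/z bin index); values are the
    summed intensities of all peaks falling into that cell.
    """
    grid = defaultdict(float)
    for rt, mz, intensity in peaks:
        key = (int(rt // rt_bin), int(mz // mz_bin))
        grid[key] += intensity
    return dict(grid)


# Hypothetical peaks: (retention time in minutes, m/z, intensity)
peaks = [
    (10.2, 500.3, 100.0),
    (10.7, 500.8, 50.0),   # same cell as the first peak
    (11.1, 500.4, 25.0),   # same m/z bin, next retention-time bin
    (10.4, 750.9, 10.0),
]

lcms_map = bin_lcms_peaks(peaks)
for cell in sorted(lcms_map):
    print(cell, lcms_map[cell])
```

Signals of the same species eluting over neighbouring retention-time bins appear as adjacent cells in the same m/z column, which is the kind of redundancy that a 1-D display hides.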

Figure 1: A schematic Proteomics workflow. Digestion of proteins may occur either before the sample separation or before the mass spectrometry.
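The digestion and database-search steps of this workflow can be sketched in a few lines. The example below is a simplified illustration, not the procedure of any particular identification tool. It assumes the commonly used trypsin rule (cleavage after K or R, except before P), no missed cleavages, no modifications, and observed values given as neutral monoisotopic peptide masses. The two-entry "database" and its sequences are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a free peptide


def tryptic_digest(sequence):
    """Cleave after K or R, unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tolerance=0.01):
    """Number of observed masses explained by the in-silico digest."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed_masses
    )


# Hypothetical sequence database and hypothetical observed masses
database = {"protein_A": "AAKRPGGRCC", "protein_B": "GGKAAR"}
observed = [peptide_mass("AAK"), peptide_mass("CC")]

best = max(database, key=lambda name: count_matches(observed, database[name]))
print(tryptic_digest(database["protein_A"]))
print(best)
```

Ranking candidates by a raw match count is the simplest possible scoring; it is used here only to make the principle of peptide mass fingerprinting concrete.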

The proteome analysis pathway given in Figure 1 assembles the different steps required to identify proteins from a crude biological sample. This pathway can, for instance, generate a systematic description of a complete proteome observable in a 2-DE gel. The result of such an analysis can be made concrete by the creation of an annotated database such as SWISS-2DPAGE [3] or other 2-DE databases. Moreover, the information in such databases, e.g. annotated 2-DE gels, can be compared to an unannotated image to correlate positions and intensities of protein spots. Of all the spots in the unannotated image, only those of real interest are further analyzed with identification methods. A widely used method to search for biological markers of specific diseases is the comparison of a statistically significant number of 2-DE images from samples of healthy and diseased patients, or from samples treated and not treated with target drugs. The images are compared, clustered and searched for protein spots that appear to be differentially expressed. These become the spots of interest to be further identified. In this approach most of the effort is concentrated on the generation of the 2-DE gels and on the comparison of the generated images. Section 3 describes the possibilities of performing such an analysis using dedicated software.
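As an illustration of how differentially expressed spots might be flagged, the sketch below ranks spots by the absolute value of Welch's t statistic computed between two groups of gels. This is one simple option among many and is not claimed to be the method used by any specific image analysis package. The spot names and normalized intensities are hypothetical, and a real analysis would also need p-values and a correction for multiple testing.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical normalized spot intensities, one value per gel
spots = {
    "spot_1": {"healthy": [1.0, 1.2, 0.8], "diseased": [3.0, 3.2, 2.8]},
    "spot_2": {"healthy": [2.0, 2.1, 1.9], "diseased": [2.0, 2.2, 1.8]},
}

# Rank spots from the largest to the smallest absolute t statistic
ranked = sorted(
    ((name, welch_t(groups["healthy"], groups["diseased"]))
     for name, groups in spots.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, t_value in ranked:
    print(f"{name}: t = {t_value:.2f}")
```

Spots at the top of such a ranking are candidates for the subsequent identification steps.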

In addition, beyond the identification efforts, it is of the highest interest to describe and understand the modifications carried by the active gene products. This implies the search for splicing variants, amino-acid mutations or PTMs that characterize a protein. Mass spectrometry is often used for this purpose in the so-called MS/MS (tandem) mode: the spectra generated by this process are deciphered to reveal structural information on the amino-acid sequence and to describe the PTMs attached to the studied peptides.
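To make concrete how an MS/MS spectrum carries sequence and PTM information, the following sketch computes the theoretical m/z values of singly charged b and y fragment ions for a short peptide, with and without a modification. It assumes standard monoisotopic masses and a phosphorylation mass shift of +79.96633 Da. The peptide "GASK", the reduced residue table and the function name `fragment_ions` are hypothetical choices made for illustration.

```python
# Monoisotopic residue masses (Da); only the residues used in this example
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496}
WATER = 18.010565
PROTON = 1.007276
PHOSPHO = 79.96633  # mass shift of a phosphorylation (HPO3)


def fragment_ions(peptide, modifications=None):
    """Return (b_ions, y_ions) m/z lists for singly charged fragments.

    `modifications` maps a 0-based residue index to a mass shift in Da.
    """
    modifications = modifications or {}
    masses = [
        RESIDUE_MASS[aa] + modifications.get(i, 0.0)
        for i, aa in enumerate(peptide)
    ]
    n = len(peptide)
    # b ions contain the N-terminal residues, y ions the C-terminal ones
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


b_plain, y_plain = fragment_ions("GASK")
b_phos, y_phos = fragment_ions("GASK", modifications={2: PHOSPHO})

print([round(m, 4) for m in b_plain])
print([round(m, 4) for m in b_phos])
```

Only the fragments that contain the modified serine (here b3, y2 and y3) are shifted by the modification mass, which is what allows a PTM to be localized along the sequence.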