Engineering high-throughput proteomics pipelines

Thesis

VAEZZADEH, Ali Reza

Abstract

Most current proteomics workflows, while highly developed and sophisticated, are usually not compatible with routine proteome analysis because of their heavy technical procedures and lack of reproducibility. Therefore, the development of reliable high-throughput proteomic platforms represents a crucial step for the advancement of proteomics research. In this thesis three examples of such platforms have been investigated. Every step of these pipelines was studied and subjected to further improvement, in order to enhance their efficiency and practicability. Many developments presented in this thesis can be helpful in different fields of proteomics. Engineering high-throughput proteomics pipelines will increase the role of proteomics in life science research and open new avenues for biomarker discovery.

VAEZZADEH, Ali Reza. Engineering high-throughput proteomics pipelines. Thèse de doctorat : Univ. Genève, 2008, no. Sc. 3972

URN : urn:nbn:ch:unige-5346

DOI : 10.13097/archive-ouverte/unige:534

Available at:

http://archive-ouverte.unige.ch/unige:534

Department of Structural Biology and Bioinformatics, FACULTY OF MEDICINE, Professor Denis Hochstrasser

Section of Pharmaceutical Sciences, FACULTY OF SCIENCES, Professor Gérard Hopfgartner

Engineering High-Throughput Proteomics Pipelines

THESIS

Presented to the Faculty of Sciences of the University of Geneva to obtain the degree of Doctor of Sciences (Docteur ès Sciences), interdisciplinary mention

by

Ali Reza Vaezzadeh

from Shiraz (Iran)

Thesis Sc. 3972

GENEVA

Reproduction workshop of the Physics Section, 2008

It's love that holds all eastern alchemy,

A cloud that hides a thousand lightning bolts.

Its glory fills an ocean inside me,

A universe where all creation drowns.

Rumi (13th Century Persian Poet)

Abstract

Engineering high-throughput proteomics pipelines

Most current proteomics workflows, while highly developed and sophisticated, are usually not compatible with routine proteome analysis because of their heavy technical procedures and lack of reproducibility. Therefore, the development of reliable high-throughput proteomic platforms represents a crucial step for the advancement of proteomics research.

In this thesis two examples of such platforms have been investigated. Every step of these pipelines was studied and subjected to further improvement, in order to enhance their efficiency and practicability. First the “Molecular Scanner” is discussed, which is based on high-throughput simultaneous digestion of the proteins in a complex sample followed by their identification and visualization. Applying the enhanced pipeline to a new mass spectrometry approach, “Microscope Imaging”, allowed protein profiles to be generated rapidly. The second high-throughput technology was “Shotgun IPG-IEF”, which is based on the separation of bulk digestion-produced peptides by isoelectric focusing (IEF) on immobilized pH gradient (IPG) gels before the conventional liquid chromatography tandem mass spectrometry (LC-MS/MS) steps of shotgun proteomics. The developed pipeline was used to study the mechanisms of resistance of Staphylococcus aureus to the antibiotic daptomycin.

Finally, a combination of the Molecular Scanner and Shotgun IPG-IEF was presented as Imaging Shotgun IPG-IEF, which made it possible to obtain a global protein expression profile of any proteome in a single day. Many developments presented in this thesis can be helpful in different fields of proteomics. Engineering high-throughput proteomics pipelines will increase the role of proteomics in life science research and open new avenues for biomarker discovery.

Summary (translated from French)

Most current proteomics methods, although extremely developed and sophisticated, are insufficient for in-depth, routine proteome analysis. The development of high-throughput proteome analysis systems therefore represents an essential element in the advancement of this field of research, as described in Chapter 1.

In this thesis, two examples of such systems were studied. Each step was examined and optimized in order to improve their efficiency and make them easier to use.

In Chapter 2, the “Molecular Scanner” process is discussed. This process consists of simultaneously digesting the proteins of a sample (one- or two-dimensional electrophoresis gels, or tissues) and then identifying and visualizing them using bioinformatics tools (Bienvenut et al. 1999). The proteins are transferred by transverse electrophoresis from an SDS-PAGE gel (or a tissue) through a trypsin-loaded membrane, and the resulting peptides are collected on a capture membrane. The latter is coated with matrix and analyzed directly in a MALDI spectrometer. After refinement, this technology was used in conjunction with a new mass spectrometry technique, “MALDI Microscope Imaging”, to rapidly map the spatial distributions of the individual proteins in the sample. Although the Molecular Scanner is a high-throughput methodology and seemed to have a promising future, it lacked the sensitivity required for clinical applications.

In Chapter 3, the “Shotgun IPG-IEF” system (Cargile et al. 2005) was revisited. In this method, proteins are first digested, as in shotgun approaches. In this work, the resulting peptides were separated by IPG-IEF. The IPG strip was then cut, and the peptides were extracted and subjected to the second separation dimension, RPLC-MS/MS. This combination of technologies is extremely powerful, benefiting from the exceptional resolving power of IEF as well as the advantages of the shotgun approach. However, this process involves delicate multi-step handling sequences and cannot easily be automated. After various steps had been improved, it was applied to the study of the mechanisms of resistance of Staphylococcus aureus bacteria to the antibiotic daptomycin.

Chapter 4 describes a new approach based on combining the technologies developed and improved in Chapters 2 and 3. The system, named “Imaging Shotgun IPG-IEF”, is based on the separation of peptides by IPG-IEF but avoids the lengthy LC-MS/MS step by transferring the peptides from the IPG strip onto a capture membrane. Then, as in the Molecular Scanner, the membrane is coated with matrix and analyzed directly in the mass spectrometer. The development of this process, its advantages and drawbacks, and an example application are described in this chapter. This method provides a preview of the protein profile of any proteome in less than one day.

The techniques developed in this thesis may also be applied to other areas of proteomics to improve their performance and efficiency and to make them easier to use, particularly in biomarker research.

Acknowledgments

After four years, there is a long list of people I have to acknowledge. First, I would like to thank my Professor, Denis Hochstrasser, to whom I am greatly indebted for accepting me in his group. He is a true visionary and represents a rare combination of leadership, scientific innovation and humanity from which I continue to learn. He has certainly changed my life in many ways, and no amount of words suffices to thank him for all his support in both the personal and the professional aspects of my PhD years. I would also like to thank Professor Gérard Hopfgartner, the co-director of my thesis, for his guidance and precious practical advice, especially on the mass spectrometry aspects of my work. I am grateful to Professors Manfredo Quadroni and Thierry Rabilloud for agreeing to review this manuscript and for making the journey to Geneva to serve on the jury of my defense.

I have had many mentors during my years of study: Gary Corthals, Carla Pasquarello, Jean-Charles Sanchez, Catherine Zimmermann, Pierre Lescuyer and Professor Jacques Deshusses. They have all brought new and inspiring perspectives to my work. Nevertheless, Jacques has had the greatest influence and the most lasting impact on me. I could not have imagined having a better supervisor for my PhD, and without his logic, broad knowledge, perceptiveness and guidance, I would never have been able to complete my studies. He is a walking dictionary and I have learned a tremendous amount from him. Furthermore, I am thankful to Pierre for critically reviewing this manuscript.

Today's life science technology development is rarely the result of a single effort. This is even more true of interdisciplinary research projects such as this work. Therefore, I would like to express my gratitude to Stefan Luxembourg, Erika Amstalden and Professor Ron Heeren from AMOLF (the Netherlands) for their fruitful collaboration on the application of microscope imaging to the Molecular Scanner. I sincerely acknowledge Patrice Waridel and Manfredo Quadroni from the University of Lausanne for putting their Orbitrap instrument at my disposal and for their dynamic discussions. I appreciate the efforts of Jacques Schrenzel and Patrice François from Geneva University Hospitals in preparing the S. aureus samples. Many thanks to Oscar Vadas and Professor Keith Rose for synthesizing the peptides used to develop fluorescent pI markers. I would like to thank René Demellayer and Philippe Passeraub from Geneva Engineering School for their collaboration on the development of the matrix deposition and strip fractionation robots. My deepest gratitude goes to my former colleague and true friend Joël Di dio for his help with robotics. I am thankful to Lena Hornsten and Bengt Bjellqvis from GE Healthcare for their collaboration on the “Well-former” robot. I learned a new language thanks to my colleagues at the Swiss Institute of Bioinformatics (SIB): Christine Hoogland, Patricia Palagi and Prof. Ron Appel. “Merci beaucoup” to Daniel Walther and Sébastien Catherine for their help with MSight, and to Céline Hernandez for her collaboration on the pICarver software and her patience with my endless demands. I express my sincere appreciation to my friends Pierre-Alain Binz and Alexandre Masselot from GeneBio (Switzerland) for their help with Phenyx. Thanks to Tatiana Rohner and Markus Stoeckli from Novartis for sharing the Molecular Scanner endeavors with me.

I would like to express my gratitude to all my students: Remi Buisse, Alexis Chauvet and Jovan Simicevic. I will always carry Alexis's memory in my heart, and cherish the time I spent with him prior to his untimely passing. I would also like to thank all the members of the ever-growing BPRG family for their day-to-day help and kindness. I extend my utmost gratitude to the department's secretaries for their administrative support: Dany Roiron, Corine Bessmer, Luli Mestre and Marielle Fernandez. Dany has a special place in my heart not only because of her kindness but also because she was the only person who called me “Alireza”.

On the family side, I want to thank Dr. Mehri Michéa for her significant help and support while I was settling into my new life in Geneva, and for her guidance all these years. The most important influences on me have been my parents, and I thank them for their unconditional love and perpetual support. I consider myself extremely lucky to be able to call Farideh Amirebrahimi and Dr. Karim Vaezzadeh Mom and Dad. Last but certainly not least, I want to thank Yasmin, who has brought joy, happiness and love into my life. Thank you darling, for encouraging me and putting up with me during these last three difficult months! My conviction has always been to achieve higher education, return to my country and serve my people. It is to Iran that I dedicate this thesis.

Abbreviations

2-DE two-dimensional gel electrophoresis

3D-IT three dimensional ion trap

µ-Dig microwave-digestion

1h-Dig one hour digestion

AcN Acetonitrile

AP atmospheric pressure

CHCA α-cyano-4-hydroxycinnamic acid

CID collision induced dissociation

Conv-Dig conventional digestion

CV coefficient of variation

Da Dalton

DART direct analysis in real time

DE delayed extraction

DESI desorption electrospray ionization

DTE 1,4-dithioerythritol

ECD electron capture dissociation

ESI electrospray ionization

ETD electron transfer dissociation

FT-ICR Fourier-transform ion cyclotron resonance

GRAVY grand average hydropathy index

HPLC high performance liquid chromatography

Hz Hertz

ICAT isotope-coded affinity tags

IEF isoelectric focusing

IPG immobilized pH gradient

LC liquid chromatography

LIT linear ion trap

m/z mass to charge ratio

MALDI matrix assisted laser desorption/ionization

MRSA methicillin resistant Staphylococcus aureus

MS mass spectrometry

MS/MS tandem mass spectrometry

PFF peptide fragment fingerprinting

pI isoelectric point

PMF peptide mass fingerprinting

PTM post-translational modification

PVDF polyvinylidene difluoride

q Quadrupole

S. aureus Staphylococcus aureus

SDS-PAGE sodium dodecyl sulfate-polyacrylamide gel electrophoresis

TFE 2,2,2-trifluoroethanol

Chapter 1

1. Introduction

The importance of proteins cannot be overstated, as they are responsible for the structure, energy production, communication, movement and division of all cells. The study of the proteins in an organism is called “proteomics”. Mass spectrometry-based proteomics has become an important component of biological and clinical research. However, high-throughput and comprehensive proteome study is challenging due to the complexity of the proteome. Current methods, while highly developed and sophisticated, fall short of routinely analyzing whole proteomes. In this chapter we describe some of the current proteomics tools and discuss the challenges facing this field. The focus of this chapter is on new and emerging technologies, with an emphasis on high-throughput pipelines.

1.1 Proteomics: A new challenge

Since the word “Proteomics” was coined by Marc Wilkins at the Siena conference in 1994, it has been defined in many different ways (Wilkins et al. 1996). It is difficult to come up with a quick buzzword definition that adequately describes the detailed activities within the multitude of platforms that are being developed to tackle protein analysis today.

Proteomics can broadly be defined as a collection of scientific approaches and technology toolboxes to characterize the protein content and protein modifications within cells, tissues, body fluids and whole organisms at a certain stage. Genome sequencing and the introduction of deoxyribonucleic acid (DNA) microarray technologies during the 1990s marked the start of the “-omics” era of research (Schena et al. 1996). In principle, “-omics” technologies are aimed at profiling the entire pattern of information in a single experiment. Proteomics is a logical continuation of the widely used methodology of transcriptional profiling. The ultimate goal of proteomics, to analyze all the proteins, including splice variants and modifications, that participate in various cellular processes, is still a dream. Nevertheless, the discipline of proteomics has developed significantly since its early days, and is showing exponential growth in terms of numbers of publications.

The level of understanding and appreciation of the complexities surrounding protein expression, function and detection has grown with the sophistication in applications of technology and a genuine understanding of the networks of protein pathways present at every level of biology (Aebersold and Mann 2003). This dramatic progress of proteomic research over the last decades has been catalyzed by several, seemingly independent developments.

First, the wealth of genomic sequence information generated by large-scale sequencing projects and the development of computational gene prediction and annotation tools have produced sequence databases that are expected to contain most coding gene regions. Second, technological improvements in peptide/protein separation techniques and mass spectrometry allow rapid and sensitive protein identification from minute amounts of complex biological samples. Third, the development of computational/bioinformatics tools, such as those for the assignment of tandem mass (MS/MS) spectra to peptide sequences and the statistical validation of these assignments, allows the consistent analysis of large datasets with no or minimal human intervention.

Proteomics cannot be perceived as an isolated entity with its own specific needs and technologies. In all biological processes, different levels of information are needed to detect the right responses to particular signals. In cells, genetic information is needed to translate the appropriate responses to the protein level. Proteins (e.g. receptors, enzymes, structural proteins or signaling peptides) are the main players in all cellular pathways. In the various phases involved in cellular responses, proteins can be cut, modified in a number of ways and excreted into particular compartments or into the extracellular space. For this reason, proteomics should be considered an important element of “Systems Biology”. According to Hood, systems biology aims to quantitatively describe the interactions among the individual components of the system under investigation (Hood and Galas 2003). The ultimate aim of systems biology is to develop computational models of these complex systems, using the data obtained from different analytical platforms in differentially perturbed states and the synthesis of these data into a model describing the system (Ideker 2004). Therefore, it is essential that quantitative proteomic experiments can be carried out at high throughput.

Species                      Number of genes
Mycoplasma genitalium        500
Streptococcus pneumoniae     2,300
Escherichia coli             4,400
Saccharomyces cerevisiae     5,800
Drosophila melanogaster      13,700
Caenorhabditis elegans       19,000
Homo sapiens                 20,500 *
Sea urchin                   23,300
Arabidopsis thaliana         25,500
Mus musculus                 29,000
Oryza sativa                 50,000

Table 1.1 Gene content of various organisms. (*) According to Pennisi (2007).

There is a great difference in complexity between proteome analysis and genome analysis. As shown in Table 1.1, the human genome has approximately 20,500 genes (Pennisi 2007), each of which encodes not just one protein but a large number of splice variants (sharing in part the same basic amino acid sequence) carrying varying numbers of post-translational modifications (PTM), which may give rise to 100 to 300 times as many protein variants as genes.
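As a rough illustration of the scale implied by these figures, the short sketch below multiplies the gene count by the quoted fold range. The function name `variant_range` is hypothetical, and the calculation assumes the fold factor applies uniformly to every gene, which is a simplification made only for this example.

```python
# Back-of-the-envelope estimate using only the figures quoted in the text.
GENES = 20_500           # approximate human gene count (Pennisi 2007, Table 1.1)
FOLD_RANGE = (100, 300)  # protein variants per gene, as quoted in the text


def variant_range(genes, fold_range):
    """Return the (low, high) number of protein variants.

    Assumes, for illustration only, that every gene yields the same
    number of variants.
    """
    low_fold, high_fold = fold_range
    return genes * low_fold, genes * high_fold


low, high = variant_range(GENES, FOLD_RANGE)
print(f"{low:,} to {high:,} protein variants")  # prints "2,050,000 to 6,150,000 protein variants"
```

Under this assumption the human proteome would comprise on the order of two to six million distinct protein forms.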

Furthermore, the dynamic nature of the proteome of a cell or a tissue provides ample justification for studying gene expression in disease directly at the proteomic level. But capturing this dynamic state represents a technological challenge. Undoubtedly, to tackle the numerous facets of disease, proteomics requires the implementation of multiple strategies and technology platforms. The dynamic range of the mammalian proteome spans several orders of magnitude (Anderson and Anderson 2002). This is of course highly species-dependent.

According to Patterson, the biggest challenge for large-scale proteomics is being able to analyze across a high dynamic range (Patterson 2003). It is necessary to include PTMs in an abundance equation to generate a cell-protein-index-number (CPIN). For instance, if one considers that there are 30 types of phosphorylation variants of a single phosphoprotein (~1200 kinases have been identified), as well as a hundred possible forms of glycosylation of a single protein, the CPIN could vary from a few million to several hundred million different protein forms within a cell. Studies calculating the dynamic range of protein expression within cells estimate a difference of 8 to 10 orders of magnitude between the least and most abundant proteins (Anderson and Anderson 2002). To this should be added the proteome variations due to external factors such as environment, food and drugs.

As illustrated in Figure 1.1, despite the phenomenal impact of mass spectrometry and protein/peptide separation techniques on proteomics, the identification and quantitation of all of the proteins in a biological system is still an unmet technical challenge. For protein quantitation, a significantly smaller number of proteins is taken into account, because the data quality, in terms of information content, required for quantitation by far exceeds that required for protein identification. Still, out of the many thousands of proteomic studies published to date, only a small minority have attempted to provide a comprehensive quantitative description of the biological system under investigation.

[Figure 1.1: schematic plot relating protein concentration and number of proteins, with regions labelled “Proteins in sample”, “Proteins identified” and “Proteins quantified”.]

Figure 1.1 Schematic representation of the fraction of a proteome that can be identified or quantified by mass-spectrometry-based proteomics.

As more proteome-level information is collected, including comparisons between samples collected in disease and healthy states, novel biomarkers can be revealed. A biomarker is a measurable indicator of a specific biological state, particularly one relevant to the risk of contraction, the presence or the stage of disease. The area of biomarker discovery is exploding in proteomics, but it has not been notably successful. Any disease can recruit common “defense mechanisms”, which, depending on the severity of that disease, can generate different secondary effects. Differential expression of whole sets of proteins can relate to malnutrition, pain, coagulopathy, weight loss, liver affection, skeletal muscle catabolism, transport protein counter-regulation, hospitalization, therapy, etc. Proper study setup must take these effects into account to avoid the discovery of non-specific biomarkers, which then enter into lengthy and expensive validation routines. It has been suggested by several groups that each proteomics study, especially for biomarker discovery, should have two distinct steps: discovery and validation (Rifai et al. 2006; Lescuyer et al. 2007).

Despite the substantial enthusiasm and efforts of proteomics researchers in the last decade, proteomics has failed to introduce any new clinically applicable biomarkers. Recently, several guidelines were published advising scientists on robust study design, precise sample handling, minimal technical variation and stringent statistics (Lescuyer et al. 2007; Mischak et al. 2007; Ransohoff 2007). Given the current status of measurement reproducibility and the lack of standardization of calibration, many researchers use proteomics in discovery mode. They view proteomics as a rapid screening tool for generating new hypotheses. Candidate proteins are selected for further evaluation using more traditional, lower-throughput techniques, such as ELISA (Howard et al. 2003). Given the issues above, it is popular to conduct re-analysis or meta-analysis using raw data from other groups. Thus, the desire to share experimental data between research groups has resulted in the adoption of standards, such as MIAME (Brazma et al. 2001), MIAPE (Orchard et al. 2004) and the MCP guidelines for MS data (Carr et al. 2004).

The ultimate accomplishment of proteomics will be the most successful combination of:

• Sample generation (sampling)
• Sample preparation and handling
• Protein/peptide separation
• Protein identification and quantitation
• Protein function annotation

1.2 High-throughput proteomics

Landmark developments in mass spectrometry (MS)-based proteomics have enabled this field to rapidly become a major player in life science research (McCormack et al. 1997; Aebersold and Mann 2003). In combination with powerful bioinformatics tools, proteomics now allows identification, characterization and even quantitation of thousands of proteins directly from a variety of complex biological samples in different states. Current proteomics techniques employ highly developed separation technologies and sophisticated mass spectrometers, yet they have so far reached only a minor role in routine clinical medicine. The main reasons are not only the intrinsic complexity of biological samples but also the labor-intensive and time-consuming workflows, the enormous data flow and poor reproducibility (Cho 2007). Sample complexity has two distinct sides. One side is the design of the study: the selection of the appropriate study population, samples and technologies, and the clinical utility of the study. Study design requires interdisciplinary interaction between clinicians, biologists, biochemists and statisticians/bioinformaticians. Pre-analytical conditions and sample handling are also essential and should be meticulously monitored. The other side of sample complexity is the intrinsic properties of the samples, such as the concentrations of protein, salt and other contaminants. Sample preparation often comprises multiple steps. Sample storage and processing might also cause variations. As for workflow complexity, several purification and fractionation steps are typically needed, which are usually time-consuming and cause irreproducible sample loss.

Most instrumentation employed in current proteomics workflows, especially the mass spectrometers, is extremely costly. Additionally, most “-omics” studies generate data on a scale unprecedented in the traditional domain of biostatistics, which raises many issues such as data storage, processing, validation and interpretation that usually require multidisciplinary expertise. Some of the shortcomings of proteomics in recent years have been the lack of appropriate quality control, statistical control and independent validation. The lack of universal standards has hindered a proper assessment of the quality of published data between and within laboratories. Another deficiency has been the variability of data formats and the fact that the data cannot be openly accessed by the public for evaluation. The poor quality of many early proteomics studies was damaging to the field.

The need for engineering high-throughput proteomics pipelines that integrate proteomics workflows with genomics, bioinformatics and other related fields is keenly felt in the scientific community. The ideal comprehensive proteomics workflow should have the following characteristics: (i) a short procedure and time scale, (ii) rapid and high-quality sample preparation, (iii) a highly automated workflow connected to powerful MS instruments, (iv) facilitated data analysis and validation and (v) established quality assurance and quality control measures. Such a pipeline should be able to routinely analyze the entire protein complement of a sample in a rapid and reproducible manner while preserving the ability to perform quantitative comparisons and to identify protein modifications. As with other high-throughput approaches, the intrinsic complexity of the biology, multiplied by the enormous volumes of data generated, makes the analysis and interpretation of proteome studies extremely difficult. Before it is possible to bridge the gap between proteome data and biological discovery and to accelerate proteomics processes and diagnostics, it should be demonstrated that high-throughput proteomics generates valid, reproducible and reliable results. This can be addressed by the development and use of both computational and experimental standards, such as simple and complex mixtures of known proteins (Hogan et al. 2006).

Such experimental standards will allow not only improvement in the consistency of proteomics results, but also the quantitative evaluation of existing and future analytical, technological and data analysis methods. Recently, the Association of Biomolecular Resource Facilities (ABRF) introduced a mixture of 49 proteins at equal concentration and performed a comparative analysis of the results obtained by multiple laboratories (www.abrf.org/sprg). This mixture represents an important step up in complexity; however, given the much larger number of proteins and the varying concentrations that can be present in biological samples, there is a clear need for known experimental samples of higher complexity in both the number and the concentration range of proteins.

Finally, it is important to note that no “full proteome” technology is available, and for a comprehensive study several technologies should be used in a complementary manner. An important comparative analysis, involving different sample preparation techniques and chromatographic separations, was performed by the Human Proteome Organisation's (HUPO) Plasma Proteome Project (Barnea et al. 2005; Omenn et al. 2005; Zolotarjova et al. 2005). Different laboratories analyzed the same aliquot of a human plasma sample to identify as many proteins as possible. The results from each laboratory were combined to create a master list of identified proteins and incorporated into a publicly available database (Omenn et al. 2005). This massive endeavor demonstrated the utility of different sample preparation and fractionation techniques, instrumentation and database search methodologies. A similar comparative analysis of different database search algorithms and filtering criteria for obtaining positive protein identifications was carried out using LC-MS/MS datasets created from human plasma samples (Kapp et al. 2005).

This thesis discusses two high-throughput technologies and the developments made to enhance their efficiency and practicability.

1.3 Proteins/Peptides Separation

Separation is one of the most challenging steps of proteome analysis because of the high complexity of protein/peptide samples. It can be performed in many different ways, and the choice of technique, methodology and protocol is mostly governed by the biological question asked. Because of the notably different properties of proteins and peptides, two major approaches have evolved for proteome analysis, depending on the stage at which the intact proteins or protein fragments are separated and identified. In the “top-down” approach, fractionation of the proteins recovered from a cell or a tissue occurs at a very early stage (Kelleher 2004). “Bottom-up” proteomics, also called the “shotgun” approach, involves proteolytic digestion of the proteins immediately after their isolation from a cell or a tissue (Link et al. 1999). Protein digestion is performed either enzymatically, using specific proteases such as trypsin, or chemically, using peptide-bond-specific reagents such as cyanogen bromide (Samyn et al. 2006). Chemical and enzymatic digestion may also be applied sequentially (van Montfort et al. 2002). Here, we describe the separation technologies used for both bottom-up and top-down approaches, grouped by the number of dimensions used in the separation. Although mass spectrometry is itself an additional dimension in protein/peptide separation, it is not counted as such in the following sections.
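
The enzymatic, chemical and sequential digestion routes mentioned above can be illustrated in silico. The sketch below is a simplified illustration, not a validated digestion tool: it assumes the usual textbook rules (trypsin cleaves C-terminal to K or R unless the next residue is P; cyanogen bromide cleaves C-terminal to M), ignores missed cleavages and chemical modification of the cleaved residues, and uses a made-up test sequence. All function names are hypothetical.

```python
import re

def trypsin_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    # Zero-width split points: preceded by K/R and not followed by P.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def cnbr_digest(sequence):
    """Cleave C-terminal to methionine (simplified cyanogen bromide rule)."""
    return [p for p in re.split(r"(?<=M)", sequence) if p]

def sequential_digest(sequence):
    """Chemical cleavage first, then enzymatic cleavage of each fragment."""
    peptides = []
    for fragment in cnbr_digest(sequence):
        peptides.extend(trypsin_digest(fragment))
    return peptides

if __name__ == "__main__":
    seq = "AKRPGMKTR"  # hypothetical sequence, for illustration only
    print(trypsin_digest(seq))
    print(cnbr_digest(seq))
    print(sequential_digest(seq))
```

Zero-width splitting with `re.split` requires Python 3.7 or later.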

1.3.1 One-dimensional separation

1.3.1.1 Electrophoretic separation

1.3.1.1.1 SDS-PAGE

Separation exploits the physicochemical properties of proteins and peptides. Separation as a function of molecular volume was introduced in the late 1960s (Shapiro et al. 1967). Sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) is performed in a polyacrylamide gel matrix obtained by co-polymerization of acrylamide with a cross-linker such as bis-acrylamide or piperazine diacrylamide (Hochstrasser and Merril 1988). The concentration of the cross-linker defines the pore size, which can be homogeneous along the gel or form a gradient. Denaturation with SDS gives all proteins a similar charge-to-mass ratio (Pitt-Rivers and Impiombato 1968). The complete procedure was described by Laemmli (Laemmli 1970). When an electrical field is applied, proteins migrate through the gel pores and are separated according to their mass. The gels are then usually stained, revealing bands that contain one or several proteins (Scherl et al. 2002). SDS-PAGE is generally not used for peptides because of their small molecular volumes.

1.3.1.1.2 Isoelectric focusing

Isoelectric focusing (IEF), also known as electrofocusing, separates proteins and peptides according to their isoelectric point (pI). The sample, in its charged state, is loaded into a gel with an established pH gradient. The pH gradient is established before the application of proteins or peptides by co-polymerizing a gradient of monomers with different pK values. The gel pores are made very large so as to eliminate the sieving effect. Under an electrical field the proteins/peptides migrate in the pH gradient until they reach their isoelectric point. At this point their net charge is zero and they stop moving. Their solubility is also sharply reduced, so they might precipitate. Upon termination of electrophoresis the proteins/peptides are separated into stationary isoelectric zones (Figure 1.2 A).
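
The statement that the net charge is zero at the pI can be made concrete with a small calculation. The following is a minimal sketch, assuming a simple Henderson-Hasselbalch model in which each ionizable group titrates independently. The pKa values are illustrative only (published pKa scales differ and give somewhat different pI values), and the function names are hypothetical.

```python
# Illustrative pKa values; published scales differ (assumption).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6

def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (independent-site model)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # free C-terminus
    for aa in sequence:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge

def isoelectric_point(sequence, tol=1e-4):
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Net charge decreases monotonically with pH.
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    for peptide in ("DDDD", "GAGA", "KKKK"):  # hypothetical peptides
        print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection is sufficient here because the modelled net charge falls monotonically as the pH rises, so there is exactly one zero crossing.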

Earlier IEF techniques were based on carrier ampholytes (CA) (Cramer and Svensson 1961), which under an electrical field create a mobile pH gradient in a gel or in a sucrose density gradient. However, CA techniques had certain limitations, such as ‘cathodic drift’ (Hunter 1978) or the ‘plateau phenomenon’ (Mosher 1986), buffering capacity gaps and uncontrolled ionic strength, all resulting in an unstable pH gradient. In the early 1980s, immobilized pH gradients (IPG) were introduced as an alternative to CA slab gels (Bjellqvist et al. 1982). In IPG strips the buffering groups (immobilines) are co-polymerized with the polyacrylamide gel. IPGs have the following advantages: (i) even buffering capacity, (ii) uniform conductivity, (iii) defined ionic strength, (iv) enhanced resolution due to the possibility of generating extremely shallow pH gradients, (v) improved reproducibility and (vi) complete lack of ‘pH drift’. Because IEF is used extensively in the following chapters, it is discussed in further detail here.

Sample solutions for IEF under denaturing conditions usually contain urea. This neutral chaotrope denatures proteins by disrupting hydrogen bonds and other noncovalent interactions between amino acids. Its neutral nature renders it ideal for IEF. It is usually used at a concentration of 5 to 8 M for proteins and 4 M for peptides. However, urea spontaneously degrades to cyanate, resulting in carbamylation of cysteines, especially at slightly acidic pH (Lippincott and Apostol 1999). Other additives such as thiourea, CHAPS, Triton X-100, dithiothreitol (DTT), dithioerythritol (DTE) (Gorg et al. 2000), trifluoroethanol (TFE) (Deshusses et al. 2003) and even sugars such as sorbitol and mannitol (Esteve-Romero et al. 1996) have been used in the IEF buffer for the solubilization of hydrophobic and membrane proteins (Rabilloud 1998; Shaw and Riederer 2003).

Carrier ampholytes (CA) are small amphoteric molecules that help to establish the pH gradient in the IPG strip. The addition of CA to the IEF buffer has several advantages. First, they are useful in inhibiting interactions between hydrophobic proteins and immobilines. They also scavenge cyanate ions and help in the precipitation of nucleic acids (Righetti 1983). They are usually used at a concentration of 0.5-2% for proteins and 0.2% for peptides.

Righetti and co-workers recently used a Rotofor system to investigate the quality of four commercial CAs (Pharmalyte and Ampholine from Pharmacia, Bio-Lyte from Bio-Rad and Servalyt from Serva) (Righetti et al. 2007). They showed that Servalyt contained a grand total of 686 chemical entities and no fewer than 3899 isoforms. The corresponding values for the other products were: Pharmalyte 643 and 2211; Bio-Lyte 255 and 1192; Ampholine 294 and 1182. In terms of M(r) distribution, they reported different upper limits: Pharmalyte had an upper M(r) value of 1179 (in the pH 4-6 range), versus 907 for Servalyt, 835 for Bio-Lyte and 893 for Ampholine. In general, towards the more alkaline pH intervals (e.g. pH 8-10) the molecular mass of the CAs decreased to as low as 491 (Bio-Lyte), indicating that the alkaline species are probably made with shorter oligoamines and are, in general, less substituted. All acidic pH intervals (up to pH 6-8) appeared to consist of a very large proportion of well-focusing species, indicating small values of ∆pK across their pI. Above pH 8, all brands of CAs worsened: the vast majority of species were unable to focus properly and to sustain the pH gradient adequately. Righetti et al. provided general guidelines for the synthesis of new alkaline species to improve the basic pH ranges (Righetti et al. 2007).

Samples can be applied to the IPG strips by cup-loading (Gorg et al. 1988), in-gel rehydration (Rabilloud et al. 1994) or paper-bridge loading (Sabounchi-Schutt et al. 2000). For the two latter methods, sample entry occurs at the beginning of the focusing. However, in-gel rehydration can also be carried out under low voltage (Gorg et al. 1999). IPG-IEF can be simplified by the use of an integrated system such as the IPGphor (Gorg et al. 1999). It is important to remember that the pI of a protein is temperature dependent. Fredriksson et al. showed that in the basic region differences as high as 0.6 pH units can be obtained between protein pI values measured at 4 °C and 25 °C (the higher pI being obtained at the lower temperature) (Fredriksson et al. 1997). Proteins with lower pIs show less variation of pI with temperature, typically -0.005 pH units/°C, whereas strongly basic proteins have variations of typically -0.03 pH units/°C.
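
The size of this temperature effect follows directly from the quoted coefficients. The sketch below assumes, purely for illustration, that the pI shifts linearly with temperature between 4 °C and 25 °C; the reference pI values and the function name are hypothetical.

```python
def pi_at_temperature(pi_ref, temp_ref_c, temp_c, slope_per_c):
    """Linear extrapolation of a pI from one temperature to another.

    slope_per_c is the change in pI per degree Celsius (negative values
    mean that the pI rises when the temperature is lowered).
    """
    return pi_ref + slope_per_c * (temp_c - temp_ref_c)

if __name__ == "__main__":
    # Hypothetical strongly basic protein, pI 9.00 at 25 degrees C.
    basic = pi_at_temperature(9.00, 25.0, 4.0, -0.03)
    # Hypothetical acidic protein, pI 5.00 at 25 degrees C.
    acidic = pi_at_temperature(5.00, 25.0, 4.0, -0.005)
    print(f"basic protein at 4 C:  {basic:.2f}")   # shift of about 0.6 pH units
    print(f"acidic protein at 4 C: {acidic:.3f}")  # shift of about 0.1 pH units
```

Over the 21 °C interval the coefficient of -0.03 pH units/°C gives a shift of about 0.6 pH units, which matches the magnitude reported for the basic region.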

The resolution obtained by analytical IEF is among the highest of current biochemical separation techniques. IEF can readily resolve proteins that differ in pI by as little as 0.01 units (Bjellqvist et al. 1982), which means that proteins differing by one net charge can be separated. The high loading capacity of IPGs (up to 5 mg) makes them the natural choice when high loads are necessary for the detection of low-abundance proteins.

Another isoelectric focusing technology is Off-Gel, which enables the extraction of proteins of a given pI from complex samples while maintaining the proteins in solution, without the need for carrier ampholytes or buffers (Ros et al. 2002). This system is composed of a chamber of which one wall is an immobilized pH gradient gel. If the chamber is thin enough, the proteins in the solution close to the gel are buffered by it. When an electrical field is applied through the chamber, all species that are charged at this pH migrate out of the chamber into the gel. Because there is no fluidic connection between the wells, proteins and peptides are forced to migrate through the IPG gel. Only the proteins with an isoelectric point close to the pH of the gel in contact with the chamber remain in solution. After IEF, the proteins or peptides can be recovered directly from the liquid phase. It was shown that protein IEF by Off-Gel has a resolution of at least 0.3 pH units using a linear pH gradient of 4–7 (Michel et al. 2003). An advantage of this instrument is that it can handle small volumes, typically in the order of 0.1 to 1 mL.
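
As a rough illustration of how a linear pH gradient maps a pI onto a recovered fraction, the sketch below assigns a protein to a well. It assumes an idealized device in which equally wide wells cover the pH 4–7 gradient; the choice of ten wells is hypothetical and is made only so that each well spans the 0.3 pH units quoted above. The function name is also hypothetical.

```python
def offgel_well(pi, ph_min=4.0, ph_max=7.0, n_wells=10):
    """Return the zero-based index of the well expected to retain a protein.

    Returns None when the pI lies outside the gradient, in which case the
    protein is not expected to remain in solution in any well.
    """
    if not ph_min <= pi <= ph_max:
        return None
    index = int((pi - ph_min) / (ph_max - ph_min) * n_wells)
    return min(index, n_wells - 1)  # pI equal to ph_max falls in the last well

if __name__ == "__main__":
    for pi in (4.0, 5.45, 6.9, 8.2):  # hypothetical pI values
        print(pi, offgel_well(pi))
```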

IEF can also take place in non-gel media. One example is capillary electrophoresis (CE), which allows the separation of positively and negatively charged as well as neutral species in capillary columns. Depending on the capillary capping, different separation modes can be obtained: (i) capillary zone electrophoresis (CZE), based on the ratio of net charge to protein/peptide volume (Ding and Vouros 1999); (ii) capillary isoelectric focusing (CIEF), based on pI (Shen et al. 2000; Shen et al. 2001), which has often been used for peptide separation in combination with mass spectrometric detection (Kasicka 2008); (iii) micellar electrokinetic chromatography (MEKC), based on the hydrophobicity and mass-to-charge ratio of proteins in the presence of a denaturant such as SDS (Ishihama et al. 2000); and (iv) affinity capillary electrophoresis (ACE), based on interaction with other molecules (Righetti 2001).

IEF can also be performed in solution. In continuous free-flow electrophoresis (FFE), samples are continuously injected into a carrier ampholine solution flowing as a thin laminar film (0.3-1.0 mm wide) between two plates. By introducing an electric field perpendicular to the direction of flow, proteins (as well as peptides) can be separated by IEF according to their different pI values and subsequently collected for further proteome analysis (Soulet et al. 1998). A recent FFE device designed by BD (Franklin Lakes, NJ, USA) has three main operating modes: (i) IEF: focusing separation of species within a formed pH gradient according to their pI; (ii) zone electrophoresis: non-focusing separation of species in a homogeneous medium according to their net charge density; and (iii) isotachophoresis (ITP): separation of species according to their electrophoretic mobility.

In another system, called isoelectric split-flow thin fractionation (SPLITT), the separation principle is not based on an established pH gradient but on the charge that proteins exhibit, depending on their pI, in buffers of different pH. After the application of the potential to the flow cell, proteins are separated using adequate outlet and/or inlet splitters. This system is not capable of separating complex samples with pI differences of less than 0.1 pH unit (Fuh and Giddings 1995). One of the most common preparative approaches to recycling free-flow electrophoresis is the “Rotofor” apparatus, commercialized by Bio-Rad (Hercules, CA, USA) (Righetti et al. 2003). In this tube-like apparatus, in which compartments are defined by a screening material, the pH gradient is established using special ampholytes, the so-called “Rotolytes”.

Gravity problems in free-flow electrophoresis are overcome by the rotation of the separation compartments. This device has been successfully applied to preparative-scale proteomics. However, the sample volume of the Rotofor cell is quite large (18 mL or higher), which can be problematic in the many biological applications where sample size is limited. Another issue is that the resulting sample will contain CAs, which may impair subsequent HPLC and mass spectrometric analysis. A modification of this approach is the tangential electrophoretic apparatus from Bier (Bier 1998). Here, the different compartments are arranged in such a manner that one array of channels is separated by a single screen from a second, slightly displaced array of channels. An electrical field is applied perpendicularly to the channels, which enables an electrophoretic serpentine pathway through the channels. The pH in the channels is fixed by CAs and recycling is possible with independent inlet and outlet ports at every channel. Another approach based on free-solution IEF was provided by Righetti’s group (Herbert and Righetti 2000). This technique is based on isoelectric membranes and recovers proteins in carrier ampholyte-free solution. The device can be composed of several compartments separated by immobiline gels stabilized by membranes. Fractions are separated because a protein stops migrating in the electrical field between two immobiline membranes, one of which establishes a pH higher than the protein’s pI and the other a pH lower than it.

Another liquid-based IEF prefractionation technology, the Gradiflow system presented by Corthals et al., comprises a re-circulating hydraulic flow of the protein mixture through two shallow separation compartments, with orthogonal electrophoretic transport of the proteins across a single separation membrane between the re-circulating compartments (Corthals et al. 1997). By controlling the pH of the flowing solution with CAs, isoelectric focusing can be run in this instrument. Unlike most other multi-compartment electrokinetic analyzers, the Gradiflow technology allows the electrophoretic separation of proteins based upon both their charge and their molecular shape (size) (Ogle et al. 2003).

1.3.1.2 Chromatographic separation

Liquid chromatography (LC) separates proteins and peptides according to their different physicochemical properties:

• Hydrophobicity: Reversed-phase (RP) LC is carried out with hydrophobic stationary phases (Byrnes 1994; Badock et al. 2001). Ion-pair (IP) RP is based on the addition of hydrophobic acids and bases to the mobile phase (Patthy et al. 1990).

• Ionic charge: Ion-exchange (IEX) LC is based on the electrostatic interaction of analyte molecules with positively or negatively charged groups.

• Affinity: This method enables the isolation of a single protein or peptide type from a complex sample based on its interaction with the stationary phase.

• Size: Size-exclusion chromatography (SEC) separates according to molecular size. The stationary phase usually has pores of a defined size.

Currently, most LC separations in proteomics are done in RP high performance LC (HPLC) mode, because of its compatibility with MS (Shen and Smith 2002). RP-HPLC can also be used to concentrate and/or desalt samples. State-of-the-art RPLC columns provide a peak capacity of several hundreds, depending on their length and the gradient slope. A recent study outlines three strategies for LC peak capacity improvements (Gilar et al. 2004): (i) decreasing the gradient slope (column length is fixed), (ii) increasing the column length with proportional increase in gradient time, and (iii) employing columns packed with smaller sorbent particles. It has been demonstrated that “ultra” high pressure LC (UPLC) can dramatically increase separation speed and hence decrease analysis time (Wu and Clausen 2007). The mobile phase in RPLC normally contains a mixture of water and a water miscible organic solvent such as acetonitrile. Acid (formic, acetic or trifluoroacetic) is added to the mobile phase to render all of the component proteins and peptides positively charged and denatured and to reduce unwanted ionic interactions with the stationary phase. Trifluoroacetic acid concentrations are limited because of its ion suppressive effects.

Monolithic stationary phases have been developed and used in capillary columns as an alternative to granular packed beds. Monolithic columns have attracted a great deal of interest because of their ease of preparation, reliable performance, good permeability and versatile surface chemistry (Barroso et al. 2003).

In comparison with gel-based separation methods, sample handling and preparation are minimal in LC. Proteins or peptides are separated by LC and can be introduced directly into the mass spectrometer for identification and analysis. Although 1D-LC has proven to be an economical and effective way to identify proteins and peptides, its application in proteomics is restricted by the complexity of the samples. Samples in proteomic analyses often contain thousands of proteins (Bodnar et al. 2003; Zhang et al. 2004; Chen et al. 2005). After proteolytic digestion, peptides numbering in the hundreds of thousands must be separated. This exceeds the analytical range of most 1D-LC methods because of insufficient peak capacity. Therefore, multi-dimensional separations are often required.

Another important aspect of future development is the miniaturization of instrumentation, primarily of the separation columns. There are many publications describing miniaturized chips and their use for the separation of proteins and peptides. Yin et al. reported the development of a new chip designed to use existing nano-HPLC hardware and MS (Yin et al. 2005). Fortier et al. reported the use of the nano-HPLC chip for RP and 2D separation of peptides obtained through tryptic digest of rat plasma (Fortier et al. 2005). Reichmuth et al. demonstrated the use of a C18 side chain porous monolithic column with an integrated fluorescence detector for the separation of peptides and proteins (Reichmuth et al. 2005).

1.3.2 Two-dimensional separation

Two- and multi-dimensional protein/peptide separations use two or more independent physical properties of the analytes to separate a mixture into its individual components. When the properties are truly independent, the separation method is considered “orthogonal” and the peak capacity is approximately equal to the product of the individual peak capacities of each dimension.
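
The product rule can be sketched numerically. The example below assumes the commonly used approximation that the peak capacity of a gradient separation is one plus the gradient time divided by the average peak width; this relation, the numerical values and the function names are assumptions introduced for illustration and are not taken from this text.

```python
def gradient_peak_capacity(gradient_time_min, mean_peak_width_min):
    """Approximate 1D gradient peak capacity: 1 + gradient time / peak width."""
    return 1.0 + gradient_time_min / mean_peak_width_min

def orthogonal_2d_peak_capacity(n_first, n_second):
    """Ideal peak capacity of a fully orthogonal 2D separation."""
    return n_first * n_second

if __name__ == "__main__":
    # Hypothetical second dimension: 120 min RP gradient, 0.5 min wide peaks.
    n_rp = gradient_peak_capacity(120.0, 0.5)
    # Hypothetical first dimension resolving 20 fractions.
    n_2d = orthogonal_2d_peak_capacity(20, n_rp)
    print(f"1D peak capacity: {n_rp:.0f}")
    print(f"2D peak capacity: {n_2d:.0f}")
```

In practice the two dimensions are rarely fully independent, so the product should be read as an upper bound rather than an achieved value.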

1.3.2.1 2D-PAGE

Two-dimensional gel electrophoresis (2-DE) is a protein separation technique that combines IPG-IEF (as first dimension) and SDS-PAGE (as second dimension) (Figure 1.2 B) and is one of the most widely used techniques in proteomics (Gorg et al. 1999; Carrette et al. 2006). 2-DE was developed independently in the laboratories of O'Farrell and Klose more than three decades ago (Klose 1975; O'Farrell 1975). Proteins can be visualized in 2-D gels using different detection methods. The more common protein staining methods include Coomassie blue and silver staining, fluorescent dyes (e.g. Cy dyes, LAVAPurple, Sypro dyes), radiolabeling and immunodetection. Using standard-format SDS gels for 2-DE, it is possible to routinely separate more than 2000 protein spots from serum/plasma or tissue extracts, which reflects ~100–300 different proteins, depending on the pH gradient used in the first dimension. Two-dimensional difference gel electrophoresis (2D-DIGE) is a recent improvement of the 2-DE technology (Marouga et al. 2005). It improves gel reproducibility, minimizes alignment issues and allows better quantitative comparison between samples. In 2D-DIGE, proteins from different disease states are separately labeled with different fluorescent dyes, and an internal pooled standard is labeled with another dye. The labeled samples are then combined and subjected to 2-DE, and the gel is scanned at different emission wavelengths, generating multiple images that can be overlaid.
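
The role of the pooled internal standard in quantitative comparison can be illustrated with a minimal sketch. It assumes that each spot has been quantified as a spot volume in both the sample channel and the standard channel of the same gel; the function name and all numbers are hypothetical and do not reproduce the calculation of any particular DIGE software.

```python
import math

def standardized_log_ratio(sample_volume, standard_volume):
    """Log2 ratio of a spot in the sample channel to the same spot in the
    pooled internal standard channel of the same gel."""
    return math.log2(sample_volume / standard_volume)

if __name__ == "__main__":
    # Hypothetical spot volumes for one protein on two gels whose overall
    # intensities differ; the standard channel absorbs the gel-to-gel effect.
    gel_1 = standardized_log_ratio(sample_volume=2000.0, standard_volume=1000.0)
    gel_2 = standardized_log_ratio(sample_volume=500.0, standard_volume=500.0)
    print(f"log2 fold change between states: {gel_1 - gel_2:.2f}")
```

Because each sample is expressed relative to the same pooled standard, spots on different gels become comparable even when the gels differ in overall intensity.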

Figure 1.2 A. Silver-stained SDS-PAGE gel with Staphylococcus aureus N315 proteins. B. Silver-stained 4-7 2D-gel with the same samples.

Despite the longstanding success of 2D-PAGE coupled with mass spectrometry, several fundamental issues with the technology, including the challenges of identifying low-abundance proteins (Gygi et al. 2000), membrane proteins (Santoni et al. 2000), and proteins with extremes in isoelectric point and molecular weight (Corthals et al. 2000; Oh-Ishi et al. 2000), drove researchers to develop alternative approaches for the separation of complex mixtures.

1.3.2.2 2D-LC

Since most current two-dimensional (2D) LC systems are interfaced with mass spectrometers, the first-dimension separation is usually complemented by reversed-phase chromatography in the second dimension, because samples eluted from the RP column are in the most suitable form for injection into the mass spectrometer. Link et al. developed an approach for the direct analysis of large protein complexes (DALPC) (Link et al. 1999). This was the first description of the analysis of a complex peptide mixture using strong cation exchange (SCX) and reversed-phase (RP) chromatography in combination with tandem mass spectrometry (SCX/RP/MS/MS). After digestion, acidified peptides are loaded onto the cation exchange column. A fraction of the peptides is moved onto the reversed-phase column by a salt step. Peptides are retained on this column for desalting and then eluted into the mass spectrometer by an acetonitrile gradient. The reversed-phase column is re-equilibrated, and the process is repeated with the salt concentration on the ion-exchange column increasing at each salt step. 2D-LC separations fall into two main categories, offline and online techniques (Peng et al. 2003). DALPC, or offline, techniques involve the collection of sample peaks at the detector exit of the first column. The samples are treated, if necessary, and then re-injected onto the second column. Online techniques (the so-called multi-dimensional protein identification technology, MudPIT) require the coupling of two columns through a switching valve (Wolters et al. 2001). Both offline and online techniques have advantages and disadvantages.

Off-line techniques allow the separation conditions of both dimensions to be fully and independently optimized. However, off-line techniques involve sample manipulation between dimensions, inevitably with some sample loss. Triphasic columns were therefore developed to reduce sample handling (McDonald and Yates 2002). 2D-LC has been mainly used with electrospray mass spectrometers. However, MALDI has also been coupled in an off-line manner using spotting robots or fraction collectors. The main limitation of 2D-LC techniques is the time required to achieve the separations. Since many individual first-dimension fractions must each be fractionated in long (e.g. 2 h) RPLC gradient elution runs, complete 2D-LC runs may take several days. Another disadvantage is that 2D-LC requires a larger quantity of sample for a single run (>2.5 ml) than 1D-LC (5-100 μl), which can be a difficulty if available sample volumes are small.
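
The time cost of a 2D-LC experiment can be estimated with simple arithmetic. The sketch below assumes that every first-dimension fraction is analyzed in one second-dimension gradient run plus a fixed overhead for loading and re-equilibration; the number of fractions and the overhead are hypothetical, and only the 2 h gradient comes from the text above.

```python
def total_2d_lc_hours(n_fractions, gradient_min=120.0, overhead_min=0.0):
    """Total instrument time when each first-dimension fraction needs one
    second-dimension gradient run plus a fixed per-run overhead."""
    return n_fractions * (gradient_min + overhead_min) / 60.0

if __name__ == "__main__":
    # Hypothetical experiment: 24 fractions, 2 h gradients, no overhead.
    hours = total_2d_lc_hours(24)
    print(f"{hours:.0f} h, i.e. {hours / 24:.1f} days of instrument time")
    # Same experiment with a hypothetical 30 min overhead per run.
    print(f"{total_2d_lc_hours(24, overhead_min=30.0):.0f} h with overhead")
```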

Among other combinations, one can mention affinity chromatography, which has been especially used for the analysis of PTMs. The best known approaches are immobilized metal affinity chromatography (IMAC) (Ficarro et al. 2002; Nuhse et al. 2003) and titanium dioxide (TiO2) chromatography (Pinkse et al. 2004), both used for the enrichment of phosphopeptides. Size-exclusion chromatography (SEC) is occasionally used as a first dimension in 2D-LC separations. SEC has the advantages of high reproducibility, stability and relatively short analysis time, but it suffers from limited loading capacity and low resolving power. A few reports have appeared using a comprehensive 2D SEC–RPLC technique for proteome research (Opiteck and Jorgenson 1997; Opiteck et al. 1998; Liu et al. 2002). In these experiments, peptide fragments were separated by SEC followed by RPLC. The chromatographic separation system was coupled to an ESI-MS for on-line protein identification.

A new 2D-LC platform, proposed by Beckman, is the PF2D® liquid-based separation system. Fractionation of intact proteins in the two-dimensional PF2D® system uses chromatofocusing according to pI (first dimension) and nonporous reversed-phase column chromatography according to hydrophobicity (second dimension), which provides greater throughput potential together with robustness and automation (Chen et al. 2006). One drawback of PF2D® is the ‘position shift’, a gap between the elution buffer pH and the pI of proteins in pH gradient fractions. Among other 2D combinations for separation at the protein level, one can mention hydrophobic interaction chromatography (HIC) (Karlsson et al. 1999), weak anion exchange (WAX) (Kato et al. 2004), strong anion exchange (SAX) (Linke et al. 2004), liquid and gel phase IEF (Herbert and Righetti 2000; Gorg et al. 2002) and SDS-PAGE (Beausoleil et al. 2004; Yang et al. 2007).

Different methods have also been used as the first dimension in place of SCX in 2D-LC for peptide separation. Shotgun IPG-IEF, which is one of the main subjects of this thesis (see Chapter 3), is based on the separation of peptides by IPG-IEF followed by RPLC (Cargile et al.
