Exploring the use of MS/MS spectral libraries to improve protein identification and characterization

Thesis

Reference

AHRNE, Erik

Abstract

Tandem mass spectrometry (MS/MS) is a well-established method to identify and characterize proteins from complex samples. Various tools are designed to map MS/MS spectra to peptides and proteins. The most commonly used software tools are Sequest, Mascot, and Phenyx, often referred to as sequence search tools, as they employ a spectrum identification algorithm in which the experimental spectra are compared to theoretical spectra generated in silico from a protein sequence database. In recent years, a different method for protein identification based on spectral library search has shown promising results. In this approach the sequence database is replaced by a collection of high-quality MS/MS spectra confidently identified in previous analyses. This thesis is made up of four papers contributing to the development of an analysis workflow that combines the two identification strategies to improve protein identification and characterization.

AHRNE, Erik. Exploring the use of MS/MS spectral libraries to improve protein identification and characterization. Thèse de doctorat : Univ. Genève, 2011, no. Sc. 4311

URN : urn:nbn:ch:unige-163835

DOI : 10.13097/archive-ouverte/unige:16383

Available at:

http://archive-ouverte.unige.ch/unige:16383

Disclaimer: layout of this document may differ from the published version.

UNIVERSITE DE GENÈVE FACULTE DES SCIENCES

Département d’informatique, Professeur Ron D. Appel

Institut Suisse de Bioinformatique, Dr. Frédérique Lisacek

Exploring the use of MS/MS Spectral Libraries to Improve Protein Identification and Characterization.

THESIS

presented to the Faculty of Science of the University of Geneva

to obtain the degree of Doctor of Science, specialization in bioinformatics

by Erik Ahrné

of

Uppsala (SWE)

Thesis No. 4311

Geneva

Atelier d'impression ReproMail

2011

Summary (translated from the French)

This thesis is devoted to the design and implementation of a mass spectrometry data analysis platform that enables the detection of post-translational modifications (PTMs) without prior hypothesis.

It is established that the quality of protein identification results obtained with classical mass spectrometry (MS) data analysis methods deteriorates when the search is extended to a growing number of PTMs. Moreover, the tools are slowed down considerably unless heuristics are used to control the combinatorial explosion. The present work falls within the scope of PTM characterization and therefore addresses a real need for a solution that guarantees the quality of the results while requiring less computation time. It is shown that the use of spectral libraries provides both time savings and reliable detection. QuickMod, the resulting platform, can efficiently process MS/MS datasets (a particular and very widespread category of MS data) to analyze and visualize all detected PTMs.

The manuscript contains nine sections, numerous tables and figures illustrating the text, and a list of bibliographic references. Sections 1 and 2 are introductory. The first briefly presents the context of proteomics, emphasizing the importance of mass spectrometry data and the role of bioinformatics in their analysis to identify and characterize proteins. The second is an overview that puts the work described in the manuscript into perspective, more specifically its motivation to take up the challenge posed by the detection of PTMs in mass spectrometry data.

The next section describes a methodology that introduces the use of spectral libraries created from identifications obtained in a preprocessing step by classical methods. It is shown that the assignment of spectra to peptides and proteins is increased by this chaining of methods. The fourth section deals with the definition and construction of random spectral libraries. The definition of the notion of a random spectrum follows a detailed study of mass distributions in annotated datasets obtained with different fragmentation techniques. The method for constructing a random library relies on this definition and on spectrum similarity criteria that take into account the presence or absence of annotation. The fifth section describes the range of strategies for detecting PTMs without prior hypothesis using computational tools that analyze mass spectra. It serves to justify the choices made in the following section. The sixth section presents in detail the method implemented in the QuickMod platform and explains the criteria chosen to evaluate spectrum alignment and their comparative performance. In addition, an improvement in assigning the position of a modified amino acid within a peptide is proposed. The seventh section illustrates the application of QuickMod in a search for PTMs in blood samples from patients undergoing chemotherapy. The thesis concludes in the eighth section with the prospects for the development of QuickMod in the shorter and longer term. Finally, the ninth section is dedicated to implementation details, in particular the links between QuickMod and the Java program library for proteomics, and the development of the user interface.

Abstract

Tandem mass spectrometry (MS/MS) is a well-established method to identify and characterize proteins from complex samples. Various tools are designed to map MS/MS spectra to peptides and proteins. The most commonly used software tools are Sequest, Mascot, and Phenyx, often referred to as sequence search tools, as they employ a spectrum identification algorithm in which the experimental spectra are compared to theoretical spectra generated in silico from a protein sequence database.

In recent years, a different method for protein identification based on spectral library search has shown promising results. In this approach the sequence database is replaced by a collection of high-quality MS/MS spectra confidently identified in previous analyses. Each spectral library entry represents a peptide sequence, the peptide ion charge state and mass-to-charge ratio (m/z), and includes a fragment peak list compiled from one or more experimental spectra. Typically, the m/z and intensity pairs of the peak list are labeled with one or more ion types that match the peak within a given mass tolerance.

Spectral library searching is a natural way of incorporating previous knowledge about observed peptides and their respective fragmentation patterns into a new search, and several spectrum library search tools have been developed for this purpose. Because library spectra are rich in reproducible information describing peptide fragmentation, compared to a theoretical spectrum modeled by a sequence search tool, spectral library based analysis has proven to perform favorably in terms of speed and sensitivity.

So far, spectral library searches are mostly applied to detect peptides as they are present in the library. However, they also allow finding modified variants of the library peptides if the search is done with a large precursor mass window and an adapted Spectrum-Spectrum Match (SSM) scoring algorithm.

This thesis is made up of four articles, two of which are published and two of which are under revision, contributing to the development of an analysis workflow designed for protein identification and characterization. We start off by presenting an MS/MS identification pipeline where sequence search based data analysis is coupled with a subsequent spectral library search and demonstrate how such a set-up allows for a substantial increase in the number of identified MS/MS spectra.

In any type of data analysis that involves multiple testing, the control of the number of false positive identifications expected in the final result list by means of the false discovery rate (FDR) is of crucial importance. In a classical sequence search the FDR is estimated by searching randomized or decoy sequence databases. Our spectral library based analysis workflow includes a software tool, DeLiberator, for the creation of randomized spectral libraries. The implementation and testing of our solution to compile decoy MS/MS spectra is described in the second article (under revision) presented in this thesis.

In preparation for extending the capability of our spectrum library based analysis workflow to identify modified variants of the peptides listed in the spectrum library, we performed an extensive literature study of software solutions designed for exhaustive identification of peptides carrying post-translational modifications (PTMs). In the third article of this thesis we provide a review of this work.

Finally, we present an improved spectral library search tool, QuickMod, designed to identify peptides carrying PTMs. QuickMod is a so-called Open Modification Search (OMS) tool, meaning that it does not require prior input from the user estimating the modifications present in the sample. The article (under revision) describing our modification tolerant spectral library search tool includes a preliminary study showing that library spectra of unmodified peptides are better reference spectra than theoretical peptide spectra when trying to identify query spectra of modified peptides. We investigate how SSMs can be efficiently scored in OMS mode, and show how spectra from peptides carrying distinct modification types have different scoring characteristics. Furthermore, we benchmark a rapid algorithm for positioning a modification on the peptide sequence. Lastly, QuickMod is compared to other software for PTM discovery, and we show how the high speed and accuracy of spectrum library searches still hold when searching for modified variants of the library peptides.

Publications

Article 1

A simple workflow to increase MS2 identification rate by subsequent spectral library search.

Erik Ahrné, Alexandre Masselot, Pierre-Alain Binz, Markus Müller, Frederique Lisacek

Proteomics, 20 FEB 2009, DOI: 10.1002/pmic.200800410

Manuscript 1

How to create good MS2 decoy spectra?

Erik Ahrné, Yuki Ohta, Frederic Nikitin, Alexander Scherl, Frederique Lisacek, Markus Müller

Submitted to Proteomics, under revision.

Article 2

Unrestricted identification of modified proteins using MS/MS.

Erik Ahrné, Markus Müller, Frederique Lisacek

Proteomics, 22 DEC 2009, DOI: 10.1002/pmic.200900502

Manuscript 2

QuickMod – a tool for open modification spectrum library searches.

Erik Ahrné, Frederic Nikitin, Frederique Lisacek, Markus Müller

Accepted as a conference paper for the RECOMB Satellite Conference on Computational Proteomics 2011, March 11-13, 2011, La Jolla, California

Revised manuscript submitted for publication in Journal of Proteome Research

Chapter 1

Introduction

1. Introduction

Since the beginning of modern biology, researchers have sought ways to better understand the relationships between protein sequence, structure and function. The central dogma of biology states that the DNA sequence encodes the protein sequence, which in turn determines the three-dimensional structure of the protein and hence its function. Much attention has been given to the sequencing efforts of genome projects, among which the completion of the Human Genome Project represents a major achievement in modern science. However, the end goal of these projects is to determine how the genome builds life through proteins. The proteome, the entire set of proteins expressed by a genome, is more difficult to map out than the genome for several reasons. While the genome is relatively constant, the proteome is seemingly boundless, as each protein may be present in different forms, in different amounts, in different tissues and at different times. DNA and RNA research is greatly facilitated by the polymerase chain reaction (PCR), which provides the ability to make nearly unlimited copies of a target nucleic acid sequence. No analogous technique is available for the study of proteins, which therefore requires the development of sensitive and accurate techniques to analyze relatively small numbers of molecules. As suggested in a recent Nature Methods Commentary: “Sequencing the human genome was perhaps the easy part, and now making sense of the constantly moving and changing picture of the proteome will require a lot of time, effort and creativity” (Nilsson et al. 2010).

1.1 Proteomics

The ambitious challenge of large-scale determination of gene and cellular function directly at the protein level, the research field of proteomics, has four main objectives: (1) identification of all proteins from a proteome, creating a catalog of information; (2) analysis of differential protein expression associated with disease, different cell states, sample treatments and drug targets; (3) characterization of proteins by discovering their function, cellular localization, PTMs, etc.; and (4) description and understanding of protein interaction networks (Palagi et al. 2006). Although proteomics is often associated with a collection of various technical disciplines, at the time when Professor Marc Wilkins first coined the term, he was primarily referring to the functional study of proteins using Mass Spectrometry (MS), which remains the central

Two major factors strongly contributed to the usefulness of the MS-based study of proteins during the 1990s. The first was the rapid increase in the number of available protein sequences in public databases, which were extensively fed by high-throughput DNA sequencing projects. The second was the development of two soft ionization methods for mass spectrometry, recognized by the 2002 Nobel Prize in Chemistry: electrospray ionization (ESI) (Fenn et al. 1989) and matrix-assisted laser desorption/ionization (MALDI) (Karas & Hillenkamp 1988, Tanaka et al. 1988). These two methods allowed the gentle ionization of large non-volatile biomolecules, and consequently extended detection limits and mass ranges in MS (Henzel et al. 2003).

1.2 Mass spectrometry based proteomics

As its name implies, MS is about measuring molecules based on mass. In 1993, five groups published a method for protein identification using mass spectrometry, referred to as Peptide Mass Fingerprinting (PMF) (Henzel et al. 1993, James et al. 1993, Mann et al. 1993, Pappin et al. 1993, Yates et al. 1993). Here proteins are identified by matching a list of experimentally observed peptide masses, a mass spectrum, with theoretical peptide masses computed from protein sequences stored in databases. Commonly, proteins are digested into peptides using an enzyme such as trypsin. Trypsin cleaves a protein after lysine or arginine residues, producing peptides typically ranging from three to 20 amino acids in length. The mass spectrometer determines the mass of each peptide, and the entire set of masses is most often unique to a specific protein. This protein identification technology relies heavily on the fast and accurate mapping of mass spectra to protein database entries, and gave birth to a bioinformatics discipline focusing on the development of software tools for the processing of MS data. Several software tools based on PMF identification were developed in the following years (Pappin et al. 1993, Perkins et al. 1999, Clauser et al. 1995, Binz et al. 1999, Zhang et al. 2000).
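As a rough illustration of the PMF principle described above, the following Python sketch digests a protein sequence in silico after every lysine (K) or arginine (R) and counts how many observed masses fall within a tolerance of a theoretical peptide mass. The function names, the 0.5 Da tolerance and the simplified cleavage rule (no missed cleavages, no proline exception) are assumptions made for this example; it does not reproduce any of the published PMF tools.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of an intact peptide


def tryptic_digest(protein):
    """Cleave after every K or R (simplified rule, assumption of this sketch)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_matching_masses(observed_masses, protein, tolerance=0.5):
    """Number of observed masses explained by a tryptic peptide of the protein."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(observed - mass) <= tolerance for mass in theoretical)
        for observed in observed_masses
    )
```

Ranking every database protein by such a count is the simplest conceivable fingerprint score; real tools weight matches statistically rather than counting them.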

Soon after, the first publication on another MS-based approach for identifying proteins emerged: tandem MS, or MS/MS (Eng et al. 1994). In this setup the mass spectrometer performs sequential MS analyses. In a first stage, intact peptides are analyzed and a mass spectrum is produced. Next, a selection of these peptides is fragmented at the peptide bond by colliding them with inert atoms, a method referred to as collision-induced dissociation (CID) (see Figure 1). The mass and intensity of each peptide fragment is detected, producing an MS/MS spectrum. Each MS/MS spectrum can be mapped to its original peptide using an approach similar to PMF, in which the set of peptide masses making up the mass spectrum is screened against the theoretical peptide masses of a given protein.

Figure 1. Collision Induced Dissociation (CID)

A) Possible peptide fragment ions produced during CID MS/MS.

B) An MS/MS spectrum of the peptide GISHVIVDEIHER, dominated by b- and y-type ions.

(Figures were originally published in Hernandez et al. 2005)
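To make the b- and y-type ions mentioned in the caption concrete, the sketch below computes the singly charged b- and y-ion m/z values for the peptide of Figure 1B. It is a minimal example under stated assumptions: only the residues occurring in GISHVIVDEIHER are tabulated, all fragments carry a single proton, and neutral losses and other ion types are ignored. The function name is hypothetical.

```python
# Monoisotopic residue masses (Da); only the residues of GISHVIVDEIHER are listed.
RESIDUE_MASS = {
    "G": 57.02146, "I": 113.08406, "S": 87.03203, "H": 137.05891,
    "V": 99.06841, "D": 115.02694, "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b, y): singly charged m/z values, b[0] is b1 and y[0] is y1."""
    masses = [RESIDUE_MASS[residue] for residue in peptide]
    n = len(masses)
    # b ions contain the first i residues (N-terminal fragments).
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y ions contain the last i residues plus water (C-terminal fragments).
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b_ions, y_ions = fragment_ions("GISHVIVDEIHER")
for index, (b_mz, y_mz) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{index}: {b_mz:9.4f}   y{index}: {y_mz:9.4f}")
```

A sequence search tool builds its theoretical spectrum from a list of this kind, whereas a library spectrum additionally records the intensities actually observed for each fragment.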

Protein identification using MS/MS has several advantages over PMF. Firstly, MS/MS-based analysis, known as Peptide Fragment Fingerprinting (PFF), can identify a protein based on a few peptide signals. In addition to the peptide mass, the peak pattern in a CID spectrum also provides information on the peptide sequence. MS/MS can therefore be used to analyze complex protein mixtures and to search homologous databases. Secondly, provided the peptide coverage is sufficient, detailed information about the peptide sequence, including possible modifications and mutations, can be obtained.

The technological advancements of tandem mass spectrometers in combination with advanced computational tools for analyzing their output have made MS/MS a widespread identification method in proteomic studies.

1.3 MS/MS data analysis

High performance liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS), often referred to as shotgun proteomics, is the key experimental method for modern large-scale protein identification (see Figure 2).

Figure 2. Overview of a typical LC-MS/MS experiment

1) Isolation of proteins from a cell lysate or tissue by biochemical fractionation or affinity selection.

2) Enzymatic digestion of proteins into peptides, usually using trypsin.

3) Peptide separation by one or more steps of high-pressure liquid chromatography.

4) Interpretation of the MS1 spectrum and generation of a prioritized list of peptides for subsequent CID fragmentation.

5) Acquisition and storing of MS/MS spectra followed by computer assisted data analysis using protein identification search tools such as Sequest or Mascot.

(Figure was originally published in Aebersold et al. 2003)

A large number of MS/MS-based protein identification tools have been developed, with different algorithmic solutions to find the optimal Peptide Spectrum Match (PSM) (see Figure 3).

Mann et al. presented the “peptide sequence tag” algorithm where short unambiguous amino acid sequences are extracted from the spectrum peak pattern by linking peaks with a mass difference corresponding to the mass of an amino acid (Mann et al. 1994). Sequence tags in combination with the spectrum precursor mass information can be used as a specific probe to identify the peptide origin of the MS/MS spectrum.
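The peak-linking idea behind sequence tags can be sketched in a few lines of Python. This is an illustrative sketch, not the algorithm of Mann et al.: the function name, the 0.02 Da tolerance and the fixed tag length are assumptions, and the isobaric residues leucine and isoleucine are represented by L only because they cannot be distinguished by mass.

```python
# Monoisotopic residue masses (Da); I is omitted because it is isobaric with L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extract_tags(peak_mzs, tag_length=3, tolerance=0.02):
    """Find residue strings spelled by consecutive peak-to-peak mass differences."""
    peaks = sorted(peak_mzs)
    # links[i] lists (j, residue) where peaks[j] - peaks[i] matches a residue mass.
    links = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            difference = peaks[j] - peaks[i]
            for residue, mass in RESIDUE_MASS.items():
                if abs(difference - mass) <= tolerance:
                    links[i].append((j, residue))

    tags = set()

    def extend(peak_index, tag):
        # Depth-first walk along linked peaks until the tag is long enough.
        if len(tag) == tag_length:
            tags.add(tag)
            return
        for next_index, residue in links[peak_index]:
            extend(next_index, tag + residue)

    for start in range(len(peaks)):
        extend(start, "")
    return tags


# A four-peak ladder whose successive gaps correspond to S, E and V.
print(extract_tags([58.02874, 145.06077, 274.10336, 373.17177]))
```

Whether a tag reads in the N-to-C or C-to-N direction depends on the ion series the ladder belongs to, which this sketch does not attempt to decide.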

In theory, the full peptide sequence can be revealed using an algorithm similar to tag extraction by finding a tag spanning the full mass range of the spectrum, a technique known as de novo sequencing (Dancik et al. 1999, Fernandez-de-Cossio et al. 2000, Ma et al. 2003, Johnson et al. 2002, Frank et al. 2005, Searle et al. 2004, Savitski et al. 2005). However, the accuracy of such algorithms hinges on very high quality spectral data.

The most widely used MS/MS identification algorithms such as Sequest, Mascot and Phenyx, often referred to as sequence search tools, screen the experimental MS/MS data against a user-selected protein database (Eng et al. 1994, Perkins et al. 1999, Colinge et al. 2003). The protein sequences are digested into peptides in silico in accordance with the cleavage rules of the protease used in the sample preparation step of the experimental workflow. For each

with the experimentally observed peaks. A peptide match score is derived which reflects the similarity between the MS/MS spectrum and the theoretical peptide spectrum. A number of different scoring functions have been described in the literature including simple spectral correlation functions such as the dot-product (Liu et al. 2007), more advanced cross correlation functions, scoring functions based on empirically observed rules or statistically derived fragmentation frequencies. The score reported for a given PSM can be on some arbitrary scale or converted to a statistical measure such as a p-value or an expectation value, E-value. The final list of identified peptides is compiled into a protein 'hit list', which is the output of a typical proteomics experiment (Chapter 5 includes a more in-depth description of different sequence search based PSM scoring algorithms).
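The simplest of the scoring functions mentioned above, the dot product, can be sketched as follows. The sketch assumes that spectra are given as lists of (m/z, intensity) pairs and that peaks are grouped into fixed-width m/z bins of 1.0; both choices, like the function names, are assumptions of this example rather than the scheme of any particular search tool.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) pairs falling into the same m/z bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def normalized_dot_product(spectrum_a, spectrum_b, bin_width=1.0):
    """Cosine similarity of two binned spectra: 1.0 identical, 0.0 no shared bins."""
    a = bin_spectrum(spectrum_a, bin_width)
    b = bin_spectrum(spectrum_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


experimental = [(175.12, 40.0), (288.20, 100.0), (401.29, 65.0)]
reference = [(175.12, 35.0), (288.20, 90.0), (530.33, 20.0)]
print(round(normalized_dot_product(experimental, reference), 3))
```

The same function applies whether the reference is a theoretical spectrum or a library spectrum; in the latter case the reference intensities are empirical rather than modeled.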

As tandem mass spectrometry has become a mature technology in proteomics, several already analyzed MS/MS datasets are stored in public repositories. The work presented in this thesis is centered on yet another protein identification strategy, known as spectrum library searching. Here, experimental spectra are matched against a carefully compiled library of previously identified high-quality MS/MS spectra.

Figure 3. Overview of different peptide identification strategies

Sequence DB Search: Acquired experimental MS/MS spectra are matched to theoretical spectra predicted for each peptide contained in a protein sequence database.

Spectral Library Search: Acquired experimental MS/MS spectra are matched to annotated experimental spectra from previous experiments listed in a spectrum library.

De Novo Sequencing: Peptide sequences are extracted directly from the MS/MS spectrum by linking peaks with a mass difference corresponding to the mass of an amino acid.

Sequence Tag-Assisted search: Peptide sub-sequences are extracted from the MS/MS spectrum followed by a filtering of a protein sequence database retaining only those peptides matching the precursor mass of the MS/MS spectrum and containing the amino acid tags.

(Figure originally published in Nesvizhskii 2010)

Chapter 2

Thesis Overview

2. Thesis Overview

A lot has happened in the field of proteomics in the decade and a half since the first publication on tandem mass spectrometry as a method for protein identification, and proteomics remains a fast-growing field. The last few years have been particularly exciting, with the introduction of new MS instrumentation, alternative fragmentation mechanisms and advanced data acquisition strategies dramatically improving the throughput and depth of proteomic analysis. Even though CID remains the most common fragmentation mode, several alternative mechanisms, such as Higher-energy Collisional Dissociation (HCD) and Electron Transfer Dissociation (ETD), are routinely used for specialized applications such as sequencing of peptides with post-translational modifications (Olsen et al. 2007, Coon et al. 2005, Molina et al. 2007, Chi et al. 2007). Some instruments can be operated in multi-stage mode with automated data-dependent triggering of MS3 acquisition (Bodenmiller et al. 2007, Steen et al. 2006, White et al. 2008). High mass accuracy instruments such as the LTQ-Orbitrap, which measure peptide ion m/z values with errors as low as a few parts per million (ppm), are commonly available in proteomics labs; by comparison, the instruments used in the early proteomics studies had a mass accuracy of around 500 ppm. Innovative instrumentation setups such as Selected Reaction Monitoring (SRM) on triple-quadrupole mass spectrometers have dramatically improved the sensitivity of protein quantitation.

To deal with the large data volumes and differing data types produced in today’s proteomics experiments the field is strongly dependent on bioinformatics. The commercial identification tools Mascot and Sequest are the software of choice in the majority of proteomics labs, while a multitude of excellent open source tools have been developed and adapted to meet the throughput and spectral quality of modern mass spectrometers. Several data repositories exist for storing and sharing raw data and identification search results (see Table 1).

Table 1. A partial list of publicly available tools for MS/MS-based proteomics (adapted from Nesvizhskii 2010)

Program www Availability

De novo sequencing tools

Lutefisk www.hairyfatguy.com/Lutefisk **

PepNovo proteomics.ucsd.edu/Software/PepNovo.html *,**

PEAKS www.bioinformaticssolutions.com

Sequit www.sequit.org/

Database Search Tools

SEQUEST thermo.com

MASCOT matrixscience.com *

ProteinProspector prospector.ucsf.edu *

ProbID tools.proteomecenter.org/wiki/index.php?title=Software:ProbID **

X!Tandem www.thegpm.org *,**

SpectrumMill www.chem.agilent.com/

Phenyx www.genebio.com/products/phenyx/

OMSSA pubchem.ncbi.nlm.nih.gov/omssa/ *,**

VEMS3.0 yass.sdu.dk/ **

ProteinPilot www.absciex.com

MyriMatch fenchurch.mc.vanderbilt.edu/software.php **

PepSplice www.ti.inf.ethz.ch/pw/software/pepsplice/ **

RAId_DbS www.ncbi.nlm.nih.gov/CBBresearch/qmbp/raid_dbs/ *,**

MassMatrix www.massmatrix.net/mm-cgi/home.py

Sequence tag/hybrid approaches

InsPecT proteomics.ucsd.edu/Software/Inspect.html *,**

Popitam www.expasy.org/tools/popitam/ **

TagRecon fenchurch.mc.vanderbilt.edu/software.php **

ByOnic www.parc.com/work/focus-area/mass-spectra-analysis/

SpectralNetworks proteomics.ucsd.edu/Software/SpectralNetworks.html **

MODi www.massmatrix.net/mm-cgi/home.py

Spectral Matching/Library Search Tools

SpectraST www.peptideatlas.org/spectrast/ *,**

X!P3 p3.thegpm.org/tandem/ppp.html *

Bibliospec proteome.gs.washington.edu/software/bibliospec/

Databases for storing and mining of MS data

PeptideAtlas www.peptideatlas.org

Proteios www.proteios.org

SBEAMS sbeams.org

CPAS www.labkey.org/

PRIDE www.ebi.ac.uk/pride/

Peptidome www.ncbi.nlm.nih.gov/peptidome/

MASPECTRAS2 genome.tugraz.at/maspectras

Data sharing

Tranche www.proteomecommons.org/dev/dfs/

* Free access via the web interface (functionality might be limited).

** Free software distribution.

2.1 Identification of Protein Post-Translational modifications (PTM): A great proteomics challenge.

The identification of protein modifications is of paramount importance to understand the regulation and dynamics of a proteome. There is a tremendous interest in a variety of biologically significant PTMs such as phosphorylation, acetylation, glycosylation and methylation. Numerous targeted MS-based PTM studies have shown promising results. For instance, protein phosphorylation, which plays a major role in signaling networks, was extensively mapped in large-scale MS studies (Ficarro et al. 2002, Steen et al. 2002, Beausoleil et al. 2004). Similarly, the role of glycosylation as a functional modulation of secreted or membrane proteins has been investigated using MS/MS (Tissot et al. 2009).

However, exhaustive identification of PTMs in high-throughput MS experiments has proven to be difficult for a number of reasons, including the large number of possible PTMs, the sub-stoichiometric amounts of modified proteins, and the fact that peptides carrying certain PTMs display MS2 fragmentation patterns which can be difficult to interpret (Creasy et al. 2004). Therefore, successful protein modification studies rely on carefully designed experimental setups applying extensive sample fractionation/enrichment protocols for the detection of low-abundance protein species. MS data need to be accurate and contain information-rich fragmentation patterns. Furthermore, sensitive MS2 data analysis tools capable of exploring large data volumes for modifications within a reasonable computational time need to be available to the proteomics community.

2.2 Combining identification strategies to identify modified proteins

When analyzing the same CID dataset, the overlap between different sequence search protein identification tools is typically in the range of 70%-80%, suggesting that it would be meaningful to combine multiple search tools to increase the overall identification rate (Keller et al. 2002, Lopez-Ferrer et al. 2004, Anderson et al. 2003, Kislinger et al. 2003, Ulintz et al. 2006). Multi-stage analysis strategies where separate identification tools are executed in parallel and/or in series have been proposed to not only increase the number of identified MS/MS spectra, but also to improve identification rates of PTMs (Hernandez et al. 2006, Quandt et al. 2009, Nesvizhskii et al. 2010).

This thesis is made up of four articles, two of which are published and two of which are under revision, contributing to the development of an analysis workflow designed for protein identification and characterization (see Figure 1).

In the first article, “A simple workflow to increase MS2 identification rate by subsequent spectral library search” (Erik Ahrné, Alexandre Masselot, Pierre-Alain Binz, Markus Müller, Frederique Lisacek), presented in chapter 3, we demonstrate how coupling a classical sequence search with Phenyx, Genebio SA, (Colinge et al. 2003) to a rapid spectrum library search using SpectraST (Lam et al. 2007) may lead to a dramatic increase in the number of confidently identified experimental MS/MS spectra. Furthermore, we demonstrate how a simple adjustment to this analysis pipeline allows for unsupervised identification of modified peptides.

Without robust statistical and computational methods, proteomic studies suffer from large False Positive Rates (Nesvizhskii 2004, Russell et al. 2004, Baldwin et al. 2004). In order to promote the use of new data analysis tools such as spectral library search analysis, it is essential to develop sound methods for validating the results they output. In chapter 4 of this thesis we present the manuscript entitled “How to create good MS2 decoy spectra?” (Erik Ahrné, Yuki Ohta, Frederic Nikitin, Alexander Scherl, Frederique Lisacek, Markus Müller). Here, we describe the development and testing of a software tool, DeLiberator, for the creation of randomized spectral libraries. Screening experimental data against such decoy spectral libraries allows for the validation of spectral library search based peptide identification results, in accordance with the guidelines for data publication enforced by leading proteomics journals.
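How a decoy library supports validation can be illustrated with a small false discovery rate (FDR) calculation. The sketch below assumes that each match is a (score, is_decoy) pair, that higher scores are better, and that the target and decoy libraries are of equal size, so that the number of decoy matches above a threshold estimates the number of false target matches. These conventions and the function names are assumptions of the example and are not taken from DeLiberator.

```python
def estimate_fdr(matches, threshold):
    """Estimated FDR among target matches scoring at or above the threshold."""
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(matches, max_fdr=0.01):
    """Lowest observed score whose estimated FDR does not exceed max_fdr."""
    for threshold in sorted({score for score, _ in matches}):
        if estimate_fdr(matches, threshold) <= max_fdr:
            return threshold
    return None  # no threshold satisfies the requested FDR


matches = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
print(estimate_fdr(matches, 0.6))       # one decoy for three targets
print(threshold_for_fdr(matches, 0.0))  # strictest cut-off without any decoy
```

The estimate is only as good as the decoys: if decoy spectra are easier or harder to match than genuinely wrong target spectra, the computed rate is biased, which is why the construction of the decoy library matters.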

We performed an extensive literature study of software solutions designed for exhaustive identification of PTMs in preparation to improve the capability of our spectrum library based analysis workflow to identify modified variants of the peptides listed in the spectrum library.

The article entitled “Unrestricted identification of modified proteins using MS/MS” (Erik Ahrné, Markus Müller, Frederique Lisacek) (Chapter 5) presents an overview of state of the art open-modification search tools at the date of publication.

The second manuscript presented in this thesis, “QuickMod – a tool for open modification spectrum library searches” (Erik Ahrné, Frederic Nikitin, Frederique Lisacek, Markus Müller) (Chapter 6), describes the development and testing of the open modification spectral library search tool QuickMod. The spectrum-spectrum matching algorithm of QuickMod has been adapted to the fragmentation of modified peptides, and the search tool includes a flexible similarity scoring scheme, easily adjustable to different MS/MS data types.
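The candidate selection step of an open modification search, in which a large precursor mass window replaces the usual narrow one, can be sketched as follows. This is a generic illustration and not QuickMod's implementation: the `LibraryEntry` class, the function names and the 200 Da window are all hypothetical. The mass difference between query and library peptide is kept, since it is the putative modification mass.

```python
from dataclasses import dataclass

PROTON = 1.00728


@dataclass
class LibraryEntry:
    """Hypothetical minimal library entry; a real entry also stores a peak list."""
    peptide: str
    charge: int
    precursor_mz: float


def neutral_mass(mz, charge):
    """Convert a precursor m/z and charge state to a neutral mass."""
    return (mz - PROTON) * charge


def open_modification_candidates(query_mz, query_charge, library, window=200.0):
    """Library entries within a wide mass window, each with its mass shift."""
    query_mass = neutral_mass(query_mz, query_charge)
    candidates = []
    for entry in library:
        delta = query_mass - neutral_mass(entry.precursor_mz, entry.charge)
        if abs(delta) <= window:
            candidates.append((entry, delta))
    return candidates


library = [
    LibraryEntry("PEPTIDEA", 2, 500.0),
    LibraryEntry("PEPTIDEB", 2, 900.0),
]
for entry, delta in open_modification_candidates(539.98317, 2, library):
    print(entry.peptide, round(delta, 4))
```

In this toy run only the first entry is retained, with a shift of about 79.966 Da, the mass a phosphorylation would add. Each retained candidate would then have to be scored with a spectrum-spectrum match that tolerates fragment peaks shifted by the same amount.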

The final chapter briefly describes a collaborative proteomics project where the QuickMod based MS/MS data analysis workflow was employed to handle part of the data analysis.

Figure 1. Open Modification Spectral Library Search Workflow

1) The experimental dataset is analyzed with one or multiple sequence search tools, allowing for no or only a few variable peptide modifications.

2) Valid peptide identifications are used to compile a spectrum library. The DeLiberator software tool is used to create a complementary decoy spectral library.

3) The MS/MS spectra not confidently identified in the initial sequence search step are extracted and submitted to QuickMod for identification of modified peptides.


Chapter 3

A simple workflow to increase MS2 identification rate by subsequent spectral library search.


3. Article 1, Extended abstract: A simple workflow to increase MS2 identification rate by subsequent spectral library search.

Different peptides have different fragmentation propensities and for any given protein certain peptides are more easily detectable and confidently identified in a LC-MS/MS workflow.

Certain model organisms and sub-proteomes have been subject to extensive investigation in proteomics. These studies show that some peptides are detected all the time while others are never seen (Lam et al. 2007). The repeated rediscovery of the same identifiable peptides using sequence search based data analysis tools is often time-consuming and error-prone. In targeted proteomics studies, where the goal is systematic and repeated investigation of a certain predefined set of peptides or proteins rather than the discovery of previously unobserved peptides, identification results from prior experiments stored in publicly available databases may greatly facilitate the data analysis (see Table 1, Chapter 2).

Spectral library searching is a natural way of incorporating previous knowledge about observed peptides and their respective fragmentation patterns into a new search. The National Institute of Standards and Technology (NIST, http://peptide.nist.gov/) and the Institute for Systems Biology (ISB, http://www.peptideatlas.org/speclib/) have made huge efforts to compile high quality, publicly available spectrum libraries for different organisms and machine types, including some specialized spectrum libraries rich in modified peptides. To enable the exploration of these spectral libraries when analyzing new experimental data, several spectral library tools have been developed, such as SpectraST (Lam et al. 2007), NIST MSPepSearch et al. 2006), X!Hunter (Craig et al. 2006), ProMEX (Hummel et al. 2007), HMMatch (Wu et al. 2007) and MSDash (Wu et al. 2008).

Lam et al. showed that a spectral library matching approach outperforms conventional sequence tools in terms of speed, error rates and sensitivity of peptide identification (Lam et al. 2007). Spectral library search tools provide rapid peptide identification as the experimental data is screened against small size databases, while time-consuming modeling of theoretical spectra is not needed. As a library spectrum is rich in information (including fragmentation intensities of a wide range of fragment ions) relative to a theoretical peptide spectrum, library search tools typically employ simple and fast scoring algorithms and are expected to yield more discriminative match scores than sequence search tools.


The limited protein coverage of currently available spectrum libraries is an obvious drawback of spectral library based data analysis. Even though MS studies have reached substantial depth and coverage in the case of some model organisms (King et al. 2006, Brunner et al. 2007), existing libraries are far from complete, especially with respect to peptides from low abundance proteins and modified peptides. Publicly available MS/MS data from human samples suffer from low proteome coverage. About 35% of all predicted human proteins have yet to be observed reliably by MS (Nilsson et al. 2010).

However, the amount of data available to the research community grows steadily, thanks largely to the development of proteomics repositories such as PeptideAtlas, Pride, Peptidome, and the Tranche data exchange system (Vizcaino et al. 2010, see Table 1). There is hope that all peptides that are detectable by MS, at least for the most frequently studied organisms, will eventually be discovered and annotated in spectral libraries.

Another drawback of the spectral library search approach arises from the varying peptide fragmentation patterns observed on different mass spectrometers and experimental conditions such as the collision energy. Ideally different laboratories should be able to build up customized spectral libraries, as an optimal spectrum library would include fragmentation spectra produced on the very same machine as the experimental data submitted to the library search tool.

The following article, entitled “A simple workflow to increase MS2 identification rate by subsequent spectral library search”, describes the development of a data processing tool allowing individual laboratories to build their own spectrum libraries. Furthermore, we demonstrate how a workflow combining a sequence search and a spectrum library search of in-house compiled spectrum libraries can improve identification coverage of the experimental data dramatically. Here a spectrum library is created on-the-fly from the results output of a sequence search analysis. In a next step we make use of the precision and speed of a spectrum library search to identify spectra that the sequence search tool failed to annotate.


Technical Brief

A simple workflow to increase MS2 identification rate by subsequent spectral library search

Erik Ahrné¹, Alexandre Masselot², Pierre-Alain Binz¹,², Markus Müller¹, Frederique Lisacek¹

¹Swiss Institute of Bioinformatics, Proteome Informatics Group, Michel-Servet 1, CH-1211 Geneva, Switzerland; ²Geneva Bioinformatics (Genebio) SA, Av de Champel 25, CH-1206 Geneva, Switzerland

Abstract

Searching a spectral library for the identification of protein MS/MS data has proven to be a fast and accurate method, while yielding a high identification rate. We investigated the potential to increase peptide discovery rate, with little increase in computational time, by constructing a workflow based on a sequence search with Phenyx followed by a library search with SpectraST. Searching a consensus library compiled from the search results of the prior Phenyx search increased the number of confidently matched spectra by up to 156%.

The spectra additionally matched by SpectraST included noisy spectra, spectra representing peptides with missed cleavages, as well as spectra from posttranslationally modified peptides.

Keywords:

Mass Spectrometry / Protein Identification / Spectral Library Search / False Discovery Rate

Correspondence

Mr. Erik Ahrné, Swiss Institute of Bioinformatics, 1, rue Michel Servet, CH-1211 Genève 4, Switzerland

E-mail: Erik.Ahrne@isb-sib.ch Fax: (+41 22) 379 58 58


The use of tandem mass spectrometry (MS/MS) is a well established method to identify and characterize proteins from complex samples. Various tools are designed to analyse MS/MS data. Popular software such as Sequest [1], Mascot [2] and Phenyx [3] employ a Peptide Fragment Fingerprinting (PFF) algorithm where experimental spectra are compared to theoretical spectra generated in silico from a protein sequence database.

Recently a different method for protein identification and characterisation based on spectral library search has shown promising results, in terms of computational time, identification rate and accuracy, when compared to traditional identification techniques based on sequence search [4-7]. In this approach experimental spectra are scored against a carefully compiled database of previously identified experimental spectra. Matching experimental spectra against experimentally generated spectra, and not in silico predicted ones, tends to lead to a higher sensitivity. One reason is that the actual intensities of all fragment types present in the library spectrum are considered, including neutral loss and various uncommon or even unknown fragments. Further, high precision and speed are achieved as the search space of a spectral library is significantly smaller than that of a sequence database, since many putative peptides considered in a sequence search are supposedly not detectable by a mass spectrometer, and will therefore not be represented in a spectral library. The main weakness of this method resides in two aspects involving the universality of the library. Firstly, the fragmentation pattern of a peptide depends on the type of mass spectrometer used and the experimental conditions such as the collision energy, suggesting that different laboratories may need customised spectral libraries. Secondly, it is evident that only peptides stored in the library will have a chance to be matched in a search. Several different spectral library search tools have been developed: SpectraST [4], X!Hunter [5], BiblioSpec [6], Libquest [7].

Here we present a method that aims at maximizing the number of interpreted spectra by combining the exhaustiveness of a sequence search approach with the sensitivity of a spectrum library search. We constructed a workflow sequentially combining a Phenyx sequence search with a spectral library search using SpectraST, where a spectral library is created from spectra confidently matched by Phenyx in an initial search (Fig. 1). The spectra are searched against this dedicated library and the identifications from the initial sequence search and the library search are combined. Given the small size of a spectral library generated from the sequence search results of a single dataset, more spectral matches can be made, in an unrestricted search, with little extra cost in computational time. The additional matches may include spectra for which the sequence search on its own fails to assign a confident score, spectra where the wrong precursor ion isotope has been selected as parent mass and spectra of modified peptides. The workflow was tested on two datasets of different complexity [8, 9].

Figure 1. Schema of the analysis workflow: The dataset of unidentified experimental spectra is searched by Phenyx. A consensus spectral library is constructed from the search results. Next, a library search using SpectraST is performed on the full dataset. The search results of the two identification tools are merged.
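The final merging step of the workflow can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function name `merge_identifications`, the dictionary layout (spectrum identifier mapped to peptide sequence) and the scan names are hypothetical.

```python
def merge_identifications(sequence_hits, library_hits):
    """Combine confident hits from the sequence search and the library search.

    Both arguments map a spectrum identifier to a peptide sequence. When both
    tools identified the same spectrum, the sequence-search hit is kept
    (an arbitrary choice made for this sketch).
    """
    merged = dict(library_hits)
    merged.update(sequence_hits)  # sequence-search hits take precedence
    # Spectra explained only by the library search.
    additional = sorted(set(library_hits) - set(sequence_hits))
    return merged, additional


# Hypothetical toy input: spectrum id -> peptide
phenyx = {"scan1": "LFTGHPETLEK", "scan2": "TGPNLHGLFGR"}
spectrast = {"scan2": "TGPNLHGLFGR", "scan3": "LFTGHPETLEK"}
merged, additional = merge_identifications(phenyx, spectrast)
```

Here `additional` lists the spectra that only the library search explained, which is the quantity the comparisons in this brief focus on.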

The first test was performed on three replicate LTQ-FT (Thermo Fisher Scientific Inc.) datasets [8] consisting of two-fold dilution series spanning a dynamic range from 25 to 800 fmol of six non-human proteins spiked into a complex sample background of human proteins. The six proteins are: horse myoglobin (Swiss-Prot accession number (AC) P68083, identifier MYG_HORSE), carbonic anhydrase (Swiss-Prot AC P00921, identifier CAH2_BOVIN), horse cytochrome c (Swiss-Prot AC P00004, identifier CYC_HORSE), chicken lysozyme (Swiss-Prot AC P00698, identifier LYSC_CHICK), yeast alcohol dehydrogenase (Swiss-Prot AC P00330, identifier ADH1_YEAST), rabbit aldolase A (Swiss-Prot AC P00883, identifier ALDOA_RABIT). The search parameters with Phenyx (Geneva Bioinformatics SA, Geneva, Switzerland) were set to carboxyamidomethylation of cysteine residues and parent mass tolerance of 50 ppm, allowing for two missed tryptic cleavages. A combined target-decoy database search was performed to estimate the false discovery rate (FDR). The target database used was human IPI (v.3.15) containing the protein sequences of the six standard proteins.

The decoy database was generated by applying a third order Markov chain [10]. For each of the six dilution runs, matches with a z-score higher than what corresponds to an approximate 5% FDR were clustered together and a consensus spectrum was generated. For each cluster all peaks were ordered by mass, and peaks that group together within a 0.2 Da mass window were merged, where the consensus peak mass was set to the centroided mass and the intensity was set to the average intensity. Only peaks appearing in more than 20% of the spectra were included.
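The consensus step described above can be illustrated with a short sketch. It is not the tool used in the study, and several details are assumptions made for the example: the function name `build_consensus`, the greedy grouping rule (a group spans at most 0.2 Da from its first peak), the intensity-weighted centroid, and averaging the intensity over the contributing peaks.

```python
def build_consensus(spectra, window=0.2, min_fraction=0.2):
    """spectra: list of spectra, each a list of (mz, intensity) tuples."""
    # Pool all peaks, remembering which spectrum each one came from.
    peaks = sorted(
        (mz, intensity, idx)
        for idx, spectrum in enumerate(spectra)
        for mz, intensity in spectrum
    )
    groups, current = [], []
    for peak in peaks:
        # Assumption: a group spans at most `window` Da from its first peak.
        if current and peak[0] - current[0][0] > window:
            groups.append(current)
            current = []
        current.append(peak)
    if current:
        groups.append(current)

    consensus = []
    for group in groups:
        n_spectra = len({idx for _, _, idx in group})
        if n_spectra / len(spectra) <= min_fraction:
            continue  # keep only peaks seen in more than 20% of the spectra
        total = sum(i for _, i, _ in group)
        centroid = sum(mz * i for mz, i, _ in group) / total  # weighted mass
        mean_intensity = total / len(group)
        consensus.append((centroid, mean_intensity))
    return consensus
```

The sketch assumes strictly positive intensities; other grouping rules, such as linking peaks to their nearest neighbour, would be equally compatible with the description in the text.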

Next, the full dataset was searched with SpectraST. SpectraST employs a weighted dot product scoring algorithm for spectral comparison and was developed at the Institute for Systems Biology as an integrated tool of the Trans-Proteomic Pipeline. Considering the small library size, the precursor m/z tolerance has little influence on the overall search time; we therefore tested the workflow with the precursor mass tolerance of the SpectraST search set to 100 Da. The SpectraST results were subjected to automated validation and probability assignment by PeptideProphet [11]. Peptide matches with scores passing an FDR threshold of 5% were registered and the results of the Phenyx search and the SpectraST library search were compared. Spectra confidently identified with SpectraST but not with Phenyx were manually examined using an in-house spectral visualisation tool.

SpectraST was installed on a regular desktop PC. The spectral library of the LTQ-FT dataset of 303 spectra was built from the Phenyx search results exported to pepXML format in approximately 90 seconds. The library search time per spectrum was approximately 0.003 seconds.

In total, for the six dilution runs of the first replicate dataset, Phenyx identified 1485 spectra. SpectraST identified 3805 spectra at the same FDR. To explain the additionally identified spectra we examined the spectra identified to the six non-human proteins known to be present in the samples. Phenyx matched 362 spectra to these proteins. SpectraST made 639 matches at the same FDR.
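As a quick check of the arithmetic, the relative gains implied by these counts can be computed directly. The helper name `percent_increase` is ours; the counts are those reported above.

```python
def percent_increase(before, after):
    """Relative increase of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


all_spectra = percent_increase(1485, 3805)    # all identified spectra
spiked_proteins = percent_increase(362, 639)  # six non-human proteins
print(f"{all_spectra:.0f}% and {spiked_proteins:.1f}%")  # prints "156% and 76.5%"
```

The first value corresponds to the increase of up to 156% quoted in the abstract.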

SpectraST failed to confidently identify 3 of the Phenyx identifications. Upon further inspection, these spectra were found to be identified with the same peptides as those proposed by Phenyx but with PeptideProphet probabilities falling below the threshold. Out of the 639 matches made by SpectraST, 280 were not confidently matched by Phenyx. After manual examination, these spectra could be grouped into 6 categories (Table 1). Approximately 40% of these additionally matched spectra were found to include deamidated asparagine (N) and/or glutamine (Q) residues. One third were noisy spectra that Phenyx failed to confidently match. About 15% of the spectra represented peptides where the precursor ion had lost a molecule of water. Among the remaining additional identifications were spectra where the wrong isotope had been picked as parent mass, spectra representing N-terminally carboxyamidomethylated peptides and noisy spectra of peptides with missed cleavages. Note that the sequence search parameter settings allowed for up to two missed cleavages. 13 out of the 280 additional SpectraST matches were found to be incorrect identifications. For further verification of the SpectraST identifications, a subset of 100 spectra matched to human proteins, picked at random, was manually inspected. Among these spectra 6 appeared to be identified to the wrong peptide. The results described above were reproduced in the analysis of the second and third replicate samples.

In Figure 2a, a noisy experimental spectrum confidently identified by SpectraST is displayed with the candidate library spectrum. The match is assigned a PeptideProphet probability score of 0.99. Phenyx identified this spectrum to the same peptide as SpectraST but with a z-score of 5.74, falling below the z-score threshold of 6.1 which corresponds to an approximate FDR of 5%.
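How a score threshold such as 6.1 can be derived from a target-decoy search is sketched below. This is an illustration only: the function name is hypothetical, and the FDR estimate used (decoy hits divided by target hits at or above the cutoff) is one common convention that is assumed here rather than taken from the Phenyx documentation.

```python
def score_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.05):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    The FDR at a cutoff is estimated as
    (#decoy hits >= cutoff) / (#target hits >= cutoff).
    Returns None if no cutoff qualifies.
    """
    best = None
    for cutoff in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= cutoff for s in target_scores)
        n_decoy = sum(s >= cutoff for s in decoy_scores)
        if n_decoy / n_target <= max_fdr:
            best = cutoff  # scanning downwards, so the last hit is the lowest
    return best
```

The quadratic scan keeps the sketch short; sorting both score lists once would be preferable for real datasets.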

Figure 2b shows the library spectrum representing the unmodified doubly charged peptide LFTGHPETLEK matched to an experimental spectrum representing a doubly charged N-terminally carboxyamidomethylated variant of the same peptide. Peaks of the C-terminal ion series align in both spectra, leading to a high scoring match, while those of the N-terminal ion series are shifted by 58 Da. This spectrum was not identified by Phenyx as the search parameters were not set to include N-terminal carboxyamidomethylation.

A large percentage of the spectra additionally matched after the library search are from deamidated peptides. Deamidation is a common post-translational modification resulting in the conversion of an asparagine residue to a mixture of isoaspartate and aspartate. Deamidation of glutamine residues also occurs but does so at a lower rate. It can also happen artefactually under acidic catalysis. When these spectra are manually compared with the matched library spectrum it is apparent that deamidation often causes important changes in the fragmentation pattern.

Figure 2c shows the non-modified and asparagine deamidated spectra of the RHGLDNYR peptide, from chicken lysozyme. The high intensity of b6 and y6 ions in the deamidated spectra is possibly explained by the enhancement of fragmentation after acidic amino acids when no mobile proton is present. Even though the intensity distribution of the fragment ions of the two peptides is rather different, the match is assigned a PeptideProphet probability score of 0.99. A likely explanation for this is that, in addition to the simple dot product of the square root of the peak intensities of the compared spectra, the SpectraST fval score incorporates two sub-scores, the delta dot score (∆D) and the dot bias score (DB). The ∆D score is the difference in dot product between the top hit and the second highest scoring candidate match. The DB score is a measure of how biased the dot product score is to a few matching peaks. Given the small size of our spectral library, the probability of getting a high scoring runner-up by chance is small. Furthermore, the contribution to the dot product is evenly distributed among the peaks in the deamidated spectrum, thus the ∆D and DB scores will promote a high fval score.
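The three quantities discussed above can be sketched as follows. This is not SpectraST's implementation: the function names are ours, the spectra are assumed to be already binned on a common m/z grid, and the DB formula (square root of the summed squared per-bin products, divided by the dot product) is one published formulation that should be treated as an assumption here. How the sub-scores are weighted into fval is deliberately left out.

```python
import math


def normalize(intensities):
    """Square-root transform, then scale to unit length."""
    rooted = [math.sqrt(x) for x in intensities]
    norm = math.sqrt(sum(x * x for x in rooted))
    return [x / norm for x in rooted] if norm else rooted


def dot_and_bias(query, library):
    """Return (D, DB) for two spectra binned on the same m/z grid."""
    q, lib = normalize(query), normalize(library)
    products = [a * b for a, b in zip(q, lib)]
    dot = sum(products)
    # DB approaches 1 when a single bin dominates the dot product.
    bias = math.sqrt(sum(p * p for p in products)) / dot if dot else 0.0
    return dot, bias


def delta_dot(dots):
    """Difference between the best and second-best dot product."""
    ranked = sorted(dots, reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
```

For two identical spectra with four equally intense peaks this gives D = 1 and DB = 0.5, whereas a match carried by a single peak gives DB = 1, which illustrates why an evenly distributed contribution is favourable.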

For a more detailed discussion of the additional matches found by SpectraST in the LTQ-FT data, see Figure S4 in the supplementary material.

Figure 2. Spectra confidently matched with SpectraST (bottom), but with no confident Phenyx identification, displayed with the matched consensus library spectrum (top). (a) A noisy experimental spectrum confidently matched by SpectraST to a consensus library spectrum. (b) A spectrum representing an N-terminally carboxyamidomethylated peptide matched to an unmodified library spectrum of the same peptide. (c) A spectrum representing a deamidated peptide matched to the unmodified library spectrum.


Note that we also studied the accuracy of spectral count as a quantification method based on the Phenyx search results alone and when taking into account the additionally matched spectra found in the library search. Liu et al, 2004 [12] and Old et al, 2005 [13] suggested that there is a linear correlation between spectral count, being the sum of all identified spectra for a given protein including spectra redundant for ion charge states, and protein abundance. The results of this comparison are presented in the supplemental material (see figures S1-S3).

Table 1. Spectra additionally identified by SpectraST

Category                        Dataset I   Dataset II
Deamidation                     103         4
No precursor mass difference    89          122
Water loss                      37          2
Wrong precursor mass isotope    16          5
Pyro-Glu                        0           12
Oxidation of methionine         0           20
Carboxyamidomethylation         5           -
Other*                          29          -
Wrong id                        13          -

Categorisation of spectra with no confident Phenyx hit, but confidently matched by SpectraST. For dataset I (LTQ-FT data) the categorisation covers the manually validated matches of the first replicate run identified to one of the six non-human proteins. For dataset II (QqTOF data) the categorisation covers the full dataset and is based on the manual identifications in Chalkley et al.

* Bad mass precision (20), mis-cleavage (8), co-elution (1).

The second test involved a more complex dataset of 3269 spectra, from a sample containing over 200 proteins, acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight (QqTOF) geometry instrument [9]. The Phenyx search parameters were set to a parent mass tolerance of 200 ppm and to include the following PTMs as variable modifications: oxidation of methionine, N-terminal acetylation, deamidation, methylation of lysine and pyroglutamate formation from N-terminal glutamine residues. The database searched was UniProtKB/Swiss-Prot v.54.6. The library was built from non-modified confident Phenyx identifications. False discovery rates were estimated and the library was created using the same protocol as for the LTQ-FT dataset. The QqTOF library contained 1017 spectra and was built in approximately 200 seconds. The search time per spectrum was on average 0.009 seconds.

In Chalkley et al. 2005 [9], 2479 of the yeast spectra were manually annotated. We used these

should be used with some caution, but we see no reason to believe that potential wrongly assigned spectra would affect the overall comparison of the Phenyx and SpectraST results. When using the manual identifications to calculate the FDR for the Phenyx and SpectraST searches at given score thresholds, we found them to be approximately one percent higher for both search tools, compared to the FDR estimated from the decoy search and the PeptideProphet analysis. This could be explained by spectra matched to peptides with very similar sequences to the manual identifications. One example is the spectrum at 31.21 min from cation exchange fraction 1, which was manually annotated with the peptide sequence STEIIR but assigned the sequence TSEIIR by Phenyx and SpectraST. Another is the spectrum at 19.76 min from cation exchange fraction 6, manually annotated TITFHR but identified to the peptide DVTFHR by Phenyx and SpectraST.

At an FDR of 5% Phenyx alone identified 1084 non-modified peptides identical to the manual identifications of these spectra. Given the content of the spectral library, SpectraST could at best match 1304 (20% more) spectra to the same peptides as the manual identifications. SpectraST made 1249 (15% more) “correct” identifications. Out of the 55 spectra that SpectraST failed to recover, 38 were identified to the right peptide but with scores below the set threshold. These were primarily modified spectra and spectra that Phenyx also failed to confidently identify. The remaining 17 spectra, identified to peptides different from the manual identifications, were mainly cases where SpectraST matched a peptide with a highly similar sequence to the manual identification, and cases where Phenyx had made a high scoring match, different from the manual annotation, and consequently the same peptide was matched by SpectraST.

The additional matches made by SpectraST are categorized in Table 1. Out of the 122 non-modified spectra confidently matched by SpectraST but with no confident Phenyx identification, 52 were matched by Phenyx to the same peptide as SpectraST with z-scores between 5 and 6, where a z-score of 6 corresponds to an approximate FDR of 5% and a z-score of 5 corresponds to an approximate FDR of 15%.

In total Phenyx correctly identified 47 posttranslationally modified peptides, including oxidation of methionine residues (33), deamidation (8), N-terminal acetylation (4), and pyroglutamate formation from N-terminal glutamine residues (2). SpectraST confidently identified 43 modified peptides to the unmodified variant in the spectral library (see Table 1). Note that the acetylated peptides could not be identified by SpectraST as their unmodified variant was not present in the library. Together Phenyx and SpectraST confidently matched 69 modified peptides. Summing up the confident matches made by SpectraST and Phenyx, 1271 identifications are made in agreement with the manual annotations (12.5% more spectra than the non-modified and modified spectra found in the Phenyx search alone).

The results presented here indicate that a sequence search followed by a spectral library search, where the spectrum library is built from the results of the preceding analysis, increases, quickly and substantially, the degree of explanation of tandem mass spectrometry data.

Identifying more spectra with the same peptide and protein not only leaves a smaller fraction of the experimental data unexplained but also helps validating the proteins found in the sequence search.

The additionally identified spectra include noisy spectra for which the sequence search fails to assign a confident score as well as related variants of the spectra in the library.

A confidently matched spectrum with a differing precursor mass from the candidate library spectrum can be explained by water loss, trypsin miscleavage, PTMs (known or unsuspected) and fragmentation of the wrongly detected monoisotopic precursor ion m/z. Particularly interesting is the identification of PTMs. In the datasets we analysed, spectra of deamidated, oxidized, dehydrated, carboxyamidomethylated and pyroglutamated peptides were confidently matched to their unmodified variants in the library. It is evident that SpectraST is less likely to find peptides carrying more than one modification, and the ability to make a confident identification also depends on the location of the modification in the peptide sequence. However, since the search for modified variants of the library spectra is largely unrestricted, the library search gives a global view of the modifications present in the sample and reveals if the PTM search parameters for the sequence search were appropriately set.
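Interpreting a precursor mass difference between a query spectrum and its library match can be sketched as a lookup against known mass shifts. The sketch below is illustrative only: the function name, the selection of shifts and the default tolerance are assumptions, while the shift values are standard monoisotopic masses.

```python
# Monoisotopic mass shifts in Da (standard values; the selection is illustrative).
KNOWN_SHIFTS = {
    "deamidation (N/Q)": 0.98402,
    "13C isotope error": 1.00335,
    "oxidation (M)": 15.99491,
    "water loss": -18.01056,
    "pyro-Glu from N-terminal Q": -17.02655,
}


def explain_mass_difference(delta, tolerance=0.01):
    """Return the known shift closest to `delta`, or None if none is within tolerance.

    `delta` is the query precursor mass minus the library precursor mass (Da).
    """
    name, shift = min(KNOWN_SHIFTS.items(), key=lambda item: abs(item[1] - delta))
    return name if abs(shift - delta) <= tolerance else None
```

Note that deamidation and a one-isotope error differ by only about 0.019 Da, so at lower precursor mass accuracy the two cannot be separated on precursor mass alone and the fragment ions have to be consulted.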

A key factor in the proposed workflow is the quality of the spectral library. For testing purposes we considered all Phenyx matches passing a z-score threshold corresponding to a 5% FDR for the library creation. We did this to allow for the library search tool to recover all sequence search identifications. However it makes sense to keep a higher threshold on the scores of identified spectra submitted to the library creation to avoid a propagation of false positive identifications in the sequence search onto the library search.

We tested the workflow on two datasets of differing complexity. It is clear that the number of additionally matched spectra in the subsequent library search is dependent on the ratio of the total number of spectra to the number of unique peptides in the dataset. An unmatched spectrum in the sequence search can only be recovered in the library search given that a spectrum of the same peptide or a variant of that peptide was confidently matched in the sequence search and thus included in the spectral library. The potential increase in the number of additional matches made when employing the Phenyx-SpectraST workflow, compared to a sequence search alone, is largely determined by the complexity of the dataset.

On a low complexity LTQ-FT dataset identification rates were dramatically increased when including a library search. However, even when analyzing the high complexity QqTOF dataset containing few redundant spectra of each peptide, the simple workflow significantly increases the number of confidently identified spectra, while matching additional posttranslationally modified peptides.

The authors would like to thank Henry Lam for technical support regarding the use of SpectraST. This work is preliminary to a collaborative project with Microsoft Research.


References

[1] Eng, J. K., McCormack, A. L., Yates, J. R. III, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom, 1994, 5, 976-989

[2] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S., Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 1999, 20, 3551-3567

[3] Colinge, J., Masselot, A., Giron, M., Dessingy, T., Magnin, J., OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics, 2003, 3, 1454-1463

[4] Lam, H., Deutsch, E. W., Eddes, J. S., Eng, J. K. et al., Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics, 2007, 7, 655-667

[5] Craig, R., Cortens, J. C., Fenyo, D., Beavis, R. C., Using annotated peptide mass spectrum libraries for protein identification. J Proteome Res, 2006, 5, 1843-1849

[6] Frewen, B. E., Merrihew, G. E., Wu, C. C., Noble, W. S., MacCoss, M. J., Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Anal Chem, 2006, 78, 5678-5684

[7] Yates, J. R., Morgan, S. F., Gatlin, C. L., Griffin, P. R., Eng, J. K., Method to compare collision-induced dissociation spectra of peptides: potential for library searching and subtractive analysis. Anal Chem, 1998, 70, 3557-3565

[8] Mueller, L. N., Rinner, O., Schmidt, A., Letarte, S. et al., SuperHirn - a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics, 2007, 7, 3470-3480

[9] Chalkley, R. J., Baker, P. R., Hansen, K. C., Medzihradszky, K. F. et al., Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight mass spectrometer: I. How much of the data is theoretically interpretable by search engines? Mol Cell Proteomics, 2005, 4, 1189-1193

[10] Haas, W., Faherty, B. K., Gerber, S. A., Elias, J. E. et al., Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Mol Cell Proteomics, 2006, 5, 1326-1337

[11] Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R., Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem, 2002, 74, 5383-5392

[12] Liu, H., Sadygov, R. G., Yates, J. R., A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem, 2004, 76, 4193-4201

[13] Old, W. M., Meyer-Arendt, K., Aveline-Wolf, L., Pierce, K. G. et al., Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol Cell Proteomics, 2005, 4, 1487-1502


Supplemental Material


Supplemental Figure S1-S3

For each of the six non-human proteins, the number of identified spectra at each dilution was counted and normalised to the total number of spectrum matches for that protein. The normalised spectral count was used as a measure of the relative protein abundance and compared to the known relative abundances. Figures S1-S3 show the estimated normalised abundance profiles of the six non-human proteins based on spectral count quantification for the Phenyx search results and for two SpectraST searches, with the precursor mass tolerance set to 2 and 100 Da respectively. All three searches show similar results, with good estimates of the relative abundance for four out of the six proteins. Larger deviations from the reference curve representing the known protein abundances were observed for proteins Swiss-Prot AC P00330 and AC P00883. Thus, spectral count quantification was not improved when including the additional matches found by SpectraST; on the contrary, a slight suppression of the estimates of high abundance dilutions was observed. This results from a higher relative increase in spectral counts for low abundance dilutions. The spectral count quantification results are to be compared with the MS-1 based quantification of the same dataset, using the software tool SuperHirn, presented in Mueller et al. 2007. There the dilution profiles are estimated with good accuracy for all six non-human proteins.
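The normalisation described above amounts to dividing each dilution's spectral count by the protein's total. A minimal sketch follows; the function name and the example counts are hypothetical, and only the dilution amounts follow the two-fold series from 25 to 800 fmol described in the article.

```python
def normalised_spectral_counts(counts_per_dilution):
    """Normalise a protein's spectral counts to its total across dilutions."""
    total = sum(counts_per_dilution.values())
    if total == 0:
        return {amount: 0.0 for amount in counts_per_dilution}
    return {amount: n / total for amount, n in counts_per_dilution.items()}


# Hypothetical counts for one protein across the dilution series (fmol -> count).
counts = {25: 2, 50: 4, 100: 8, 200: 16, 400: 32, 800: 64}
observed = normalised_spectral_counts(counts)
# Reference profile under an ideal linear response to the spiked amount.
expected = normalised_spectral_counts({amount: amount for amount in counts})
```

Comparing `observed` with `expected` per dilution gives the kind of deviation from the reference curve discussed for Figures S1-S3.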

Supplemental Figure S4

To further investigate the spectra additionally found by SpectraST, these spectra were re-scored against the peptide sequences proposed by SpectraST, using the Phenyx re-scoring module. For each spectrum we re-scored all possible scenarios in accordance with the precursor mass difference between the query spectrum and the library spectrum. Figure S4a shows the re-scoring results of the deamidated peptide TGPNLHGLFGR. The z-score for each scenario is listed in the third column. The best scoring match is found when the peaks of the theoretical spectrum are shifted in accordance with a 1 Da modification of asparagine. The last row gives the score of the query spectrum matched to the theoretical spectrum with no shifted peaks.
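To make the notion of a shifted theoretical spectrum concrete, the sketch below computes singly charged b and y ion m/z values for TGPNLHGLFGR with and without a deamidation shift on asparagine. It is not the Phenyx re-scoring module: the function name is hypothetical, only b and y ions are modelled, and the residue table covers just the residues needed for this peptide (standard monoisotopic masses).

```python
PROTON, WATER = 1.007276, 18.010565
# Monoisotopic residue masses (Da) for the residues used in the example only.
RESIDUE = {"G": 57.02146, "P": 97.05276, "T": 101.04768, "L": 113.08406,
           "N": 114.04293, "H": 137.05891, "F": 147.06841, "R": 156.10111}


def fragment_mz(peptide, shift_position=None, shift=0.0):
    """Singly charged b and y ion m/z values, with an optional mass shift
    applied to the residue at 0-based index `shift_position`."""
    masses = [RESIDUE[aa] for aa in peptide]
    if shift_position is not None:
        masses[shift_position] += shift
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y


unmodified = fragment_mz("TGPNLHGLFGR")
# Scenario: deamidation (+0.98402 Da) on the asparagine at index 3.
deamidated = fragment_mz("TGPNLHGLFGR", shift_position=3, shift=0.98402)
```

In this scenario b4 and higher b ions, and y8 and higher y ions, are shifted by about 1 Da, while the remaining fragments are unchanged. Each alternative placement of the mass difference yields a different set of shifted peaks to score against the query spectrum.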

The results of the re-scoring are displayed in the scatter plot (Figure S4b). Matches with a z-score (x-axis) and a best-scenario z-score (y-axis) below 6.1 (threshold for 5% FDR) are found by SpectraST as a result of its higher precision compared to Phenyx. The spectra with either z-score higher than 6.1 would have been correctly matched in the initial Phenyx search if the
