5.5 Article 2, Extended Abstract: Unrestricted identification of modified proteins using MS/MS

The commercial identification tools Mascot and Sequest are the software of choice in the majority of proteomics labs for analyzing MS/MS data. These tools are primarily protein identification tools and offer only limited support for the identification of PTMs. MS/MS spectra of modified peptides can be more complex than those of their unmodified counterparts, and PTM-tolerant data analysis is often associated with long computation times and high false discovery rates.

For these reasons, classical sequence search tools restrict the user to screening the experimental data for a limited list of predefined protein modifications.
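The growth of the search space with each additional predefined modification can be illustrated with a small calculation. The sketch below is only an illustration, not how Mascot or Sequest enumerate candidates: it assumes that every residue may independently carry at most one of the modifications allowed for it, and the function name `count_peptide_forms` and the example peptide are hypothetical.

```python
from math import prod
from typing import Mapping


def count_peptide_forms(peptide: str, mods_per_residue: Mapping[str, int]) -> int:
    """Count the modification states of `peptide`, unmodified form included.

    Assumption (for illustration only): each residue independently carries
    either no modification or exactly one of the modifications allowed for
    that residue type.
    """
    # A residue with k allowed modifications has k + 1 possible states.
    return prod(1 + mods_per_residue.get(residue, 0) for residue in peptide)


# Hypothetical peptide and settings: one modification each on M, S, T and Y.
restricted = {"M": 1, "S": 1, "T": 1, "Y": 1}
# Same settings plus two alternative modifications on K.
extended = {**restricted, "K": 2}

print(count_peptide_forms("MSTYK", restricted))  # 2 * 2 * 2 * 2 * 1 = 16
print(count_peptide_forms("MSTYK", extended))    # 2 * 2 * 2 * 2 * 3 = 48
```

Under this assumption the number of candidate forms multiplies with every modifiable residue, which is consistent with the long computation times and higher false discovery rates mentioned above.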

Given the biological importance of post-translational modification of the proteome, considerable research effort has gone into developing more exhaustive data analysis software. In recent years, several MS/MS search tools capable of performing so-called Open Modification Searches (OMS; also referred to as blind searches) have been described in the literature.

OMS tools do not rely on prior input from the user specifying the modification types potentially present in the sample. Instead, they analyze the MS/MS data in a more or less unsupervised manner, allowing either for all known modifications listed in a PTM database such as UNIMOD, or for all possible modification masses up to a user-defined maximum value.
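The two modes described above can be sketched at the level of precursor masses. This is a minimal illustration under stated assumptions, not the algorithm of any particular OMS tool: the candidate peptides and their masses are made-up numbers, the three mass shifts are a small hand-typed subset of common monoisotopic values rather than a parsed UNIMOD file, and the function names and tolerances are hypothetical. Real tools additionally score the fragment spectrum and localize the modification.

```python
from typing import Dict, List, Tuple

# Small illustrative subset of monoisotopic mass shifts in Da
# (typed by hand for this sketch, not read from a UNIMOD file).
KNOWN_MODS: Dict[str, float] = {
    "Oxidation": 15.99491,
    "Acetyl": 42.01057,
    "Phospho": 79.96633,
}

# Hypothetical candidate peptides with made-up unmodified masses (Da).
CANDIDATES: Dict[str, float] = {
    "PEPTIDEA": 1000.5000,
    "PEPTIDEB": 1050.0000,
    "PEPTIDEC": 1300.0000,
}


def match_known(observed: float, candidates: Dict[str, float],
                known_mods: Dict[str, float], tol: float = 0.01) -> List[Tuple[str, str]]:
    """Database mode: keep candidates whose mass difference equals a known shift."""
    hits = []
    for peptide, mass in candidates.items():
        delta = observed - mass
        for name, shift in known_mods.items():
            if abs(delta - shift) <= tol:
                hits.append((peptide, name))
    return hits


def match_open(observed: float, candidates: Dict[str, float],
               max_delta: float = 200.0) -> List[Tuple[str, float]]:
    """Open mode: keep any candidate whose mass difference is within max_delta."""
    return [
        (peptide, round(observed - mass, 4))
        for peptide, mass in candidates.items()
        if abs(observed - mass) <= max_delta
    ]


observed_mass = 1080.4663  # made-up observed precursor mass (Da)
print(match_known(observed_mass, CANDIDATES, KNOWN_MODS))
print(match_open(observed_mass, CANDIDATES))
```

In this toy example the database mode retains only the candidate whose mass difference matches a listed shift, whereas the open mode also retains a candidate with an unexplained mass difference, which is how previously unknown modifications can surface.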

To prepare for extending our spectrum-library-based analysis workflow so that it can identify modified variants of the peptides listed in the spectrum library, we performed an extensive literature study of existing algorithmic solutions for comprehensive protein PTM analysis.

The following article, entitled “Unrestricted identification of modified proteins using MS/MS”, presents an overview of the state-of-the-art open modification search tools at the date of publication.

Review

Unrestricted identification of modified proteins using MS/MS

Erik Ahrné1, Markus Müller1, Frederique Lisacek1

1Swiss Institute of Bioinformatics, Proteome Informatics Group, Michel-Servet 1, CH-1211 Geneva, Switzerland

Abstract

Proteins undergo post-translational modification which modulates their structure and regulates their function. Estimates of the PTM occurrence vary but it is safe to assume that there is an important gap between what is currently known and what remains to be discovered. The highest throughput and most comprehensive efforts to catalogue protein mixtures have so far been using mass spectrometry based shotgun proteomics. The standard approach to analyse MS/MS data is to use Peptide Fragment Fingerprinting tools such as Sequest, Mascot or Phenyx. These tools commonly identify 5-30% of the spectra in an MS/MS dataset while only a limited list of predefined protein modifications can be screened. An important part of the unidentified spectra is likely to be spectra of peptides carrying modifications not considered in the search.

Bioinformatics for PTM discovery is an active area of research. In this review we focus on software solutions developed for unrestricted identification of modifications in MS/MS data, here referred to as Open Modification Search tools (OMS). We give an overview of the conceptually different algorithmic solutions to evaluate the large number of candidate peptides per spectrum when accounting for modifications of unrestricted size and demonstrate the value of results of large-scale OMS studies. Efficient and easy-to-use tools for protein modification discovery should prove valuable in the quest for mapping the dynamics of proteomes.

Keywords: Tandem Mass Spectrometry / Protein Identification / Post-Translational Modification

Abbreviations: FDR, False Discovery Rate; OMS, Open Modification Search; PFF, Peptide Fragment Fingerprinting; PSM, Peptide Spectrum Match; PTM, Post-Translational Modification

Introduction

Proteins undergo post-translational modification (PTM), which modulates their structure and regulates their function. The identification of protein modifications is of paramount importance to understand the regulation and dynamics of a proteome. A range of methodologies have been designed for the discovery of PTMs in the past decades. The detection of a single PTM has hinged on structural methods like X-ray or NMR and on chemical methods involving labeling and separation techniques (e.g., liquid chromatography).

Besides, PTM annotation in protein sequences can also be produced with algorithms that attempt to predict the presence of certain modifications based on sequence patterns (see http://www.expasy.org/tools/#ptm, for a comprehensive list of such tools).

Today mass spectrometry (MS) is a central technology for the identification of PTMs [1-4].

The highest-throughput and most comprehensive efforts to catalog protein mixtures, including the identification of post-translational modifications, have so far been based on shotgun proteomics [5]. For instance, protein phosphorylation, which plays a major role in signaling networks, has been extensively mapped in large-scale MS studies [6-8]. Likewise, the role of glycosylation in the functional modulation of secreted or membrane proteins has been investigated using tandem mass spectrometry [9].

In a study by MacCoss et al. [10], it was estimated that proteins carry 3 PTMs on average. In another paper [11], the number of modified variants in proteomic samples was predicted to be as many as 8-12 per unmodified peptide, although most of these modified species are presumed to be present at very low concentrations. On the other hand, less than 1% of all proteins in UniProtKB/Swiss-Prot are annotated with a PTM [12]. The protein modification databases Unimod [13] and RESID [14] contain approximately 500 different modification entries each. Estimates of PTM occurrence vary, but it is safe to assume that there is an important gap between what is currently known and what remains to be discovered.

The analysis of high-throughput data depends heavily on bioinformatics. The standard approach to analyse MS/MS data is to use a Peptide Fragment Fingerprinting (PFF) tool such as Sequest, Mascot, Phenyx, X!Tandem, Sonar, or OMSSA [15-20]. These tools all have the limitation that the user has to define potential modifications prior to the search, and they therefore often fail to identify an important fraction of the MS/MS dataset.

In this review we focus on software solutions developed for the unrestricted identification of modifications, here referred to as Open Modification Search (OMS) tools, where the user need not make any a priori assumptions about the modification state of the sample. These tools are designed to identify already known modification types annotated in databases as well as previously unknown post-translational and chemically induced modifications.

We will first raise issues relating to experimental set-ups for PTM detection as well as to conventional identification methods. We will then detail the various OMS strategies defined by different authors to discover PTMs in high throughput data. Finally we will discuss the results of some large scale OMS studies.