

6.3 Results

6.3.5 Benchmarking

The spectrum-spectrum matching and scoring algorithm of QuickMod has been optimized to identify modified variants of the spectral library entries. To evaluate the relevance of this effort, we compared the performance of QuickMod to that of SpectraST, a standard spectrum library search tool that was not developed for OMS. The search parameters of SpectraST were set to allow the matching of library and query spectra with a large precursor mass difference. We also compared QuickMod to InsPecT combined with the PTMFinder (11) post-processing tool, a popular sequence-search-based open modification data analysis pipeline.
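The idea of matching a modified query spectrum against an unmodified library spectrum can be illustrated with a minimal sketch. It is not QuickMod's scoring algorithm: the function name, the fixed tolerance and the assumption of singly charged fragment ions (so that a fragment carrying the modification is shifted by exactly the modification mass) are all illustrative choices.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def count_shifted_matches(library: List[Peak], query: List[Peak],
                          delta_mz: float, tol: float = 0.5) -> Tuple[int, int]:
    """Count library peaks found in the query directly or shifted by delta_mz.

    Assumes singly charged fragments, so a fragment that carries the
    modification appears at library m/z + delta_mz in the query spectrum.
    Returns (direct_matches, shifted_matches).
    """
    query_mz = [mz for mz, _ in query]
    direct = shifted = 0
    for mz, _ in library:
        if any(abs(q - mz) <= tol for q in query_mz):
            direct += 1  # fragment does not contain the modified residue
        elif any(abs(q - (mz + delta_mz)) <= tol for q in query_mz):
            shifted += 1  # fragment contains the modified residue
    return direct, shifted


# Hypothetical spectra: the two heaviest fragments carry a +15.99 Da shift.
library = [(200.0, 1.0), (300.0, 1.0), (400.0, 1.0), (500.0, 1.0)]
query = [(200.0, 1.0), (300.0, 1.0), (416.0, 1.0), (516.0, 1.0)]
print(count_shifted_matches(library, query, delta_mz=15.99))
```

A real scoring scheme would also weight peak intensities and handle multiply charged fragments; the sketch only shows why a large precursor mass difference must be tolerated and propagated to the fragment level.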

The three tools were benchmarked on the Exp_HP dataset, which includes approximately 56000 spectra. In the QuickMod and SpectraST analyses, the query dataset was searched against the spectral library Lib_HP concatenated with a decoy spectral library of the same size.
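A concatenated target-decoy search of this kind is commonly used to derive a score cut-off for a given false discovery rate. The sketch below uses one common estimator (number of decoy hits divided by number of target hits above the cut-off); whether this exact estimator was used in the study is an assumption, and the function name and input layout are hypothetical.

```python
from typing import List, Optional, Tuple


def score_threshold_at_fdr(matches: List[Tuple[float, bool]],
                           max_fdr: float = 0.05) -> Optional[float]:
    """Return the lowest score cut-off whose estimated FDR is <= max_fdr.

    matches: one (score, is_decoy) pair per query spectrum, taken from a
    search against the concatenated target + decoy library.
    The FDR at a cut-off is estimated as decoys / targets above the cut-off.
    Returns None if no cut-off satisfies the requested FDR.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            threshold = score
    return threshold


# Hypothetical scores: three target hits and one decoy hit.
example = [(0.9, False), (0.8, False), (0.7, True), (0.6, False)]
print(score_threshold_at_fdr(example, max_fdr=0.05))
```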

InsPecT was run in blind mode against a peptide sequence database including the same peptide entries as the spectral library (Cys-Cam was specified as a fixed modification). The search parameters of all search tools were configured to allow the matching of PTMs with a maximum mass of 200 Da (see Supporting Information for more details). The QuickMod search was based on the scoring model trained on the test data described in the dataset section.

Figure 4a displays the number of modified peptide spectra identified per modification mass bin (bins of 1 Da) for all three search tools. SSMs with a delta mass lower than -20 Da were discarded, as modifications corresponding to the loss of C- or N-terminal amino acids are reported differently by the three tools. For QuickMod and SpectraST an FDR cut-off of 0.05 was employed, and the InsPecT-PTMFinder results were filtered for identifications with a p-value < 0.05 (an empirical p-value based on the score distributions of false hits, comparable to the FDR). The results show that the three tools largely identify the same modification types, while QuickMod identifies substantially more spectra carrying these modifications. The most abundant modification masses based on spectral count correspond to: carbamidomethylation (57.02 Da, on residues other than cysteine), carbamylation (43.01 Da), sodium adduct (21.98 Da), di-methylation/ethylation (28.03 Da), oxidation (15.99 Da) and acetylation (42.01 Da).
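The binning behind such a histogram can be sketched as follows. Rounding each delta mass to the nearest integer is an assumption about how the 1 Da bins were defined, and the function name is hypothetical; the -20 Da lower limit is taken from the text.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_delta_masses(delta_masses: Iterable[float],
                     min_delta: float = -20.0) -> Dict[int, int]:
    """Count SSMs per integer modification mass (1 Da bins).

    SSMs with a delta mass below min_delta are discarded, mirroring the
    -20 Da filter described above. Binning by rounding to the nearest
    integer is an illustrative assumption.
    """
    counts: Counter = Counter()
    for delta in delta_masses:
        if delta < min_delta:
            continue
        counts[round(delta)] += 1
    return dict(counts)


# Hypothetical delta masses (Da) of five SSMs.
print(bin_delta_masses([57.02, 57.03, 15.99, 21.98, -25.0]))
```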

At the specified validation criteria QuickMod identified a total of 5301 charge state 2+ spectra and 2466 charge state 3+ spectra, SpectraST identified 2225 2+ spectra and 679 3+ spectra, while InsPecT identified 3509 2+ spectra and 1742 3+ spectra (we count the total number of spectra per peptide cluster, as reported in the PTMFinder output file). Figure 4b displays the number of unique peptide identifications, where unique means that a peptide is counted only once for each modification mass shift. We note a very high overlap between the modified peptides identified by SpectraST and QuickMod (almost all peptides identified by SpectraST are also found by QuickMod), while the InsPecT analysis partly includes complementary identifications. The InsPecT-PTMFinder analysis took 35 minutes (InsPecT 30 min, PTMFinder 5 min), while the total search time for QuickMod was 4 minutes. SpectraST took 55 minutes to complete. Note that SpectraST is not designed for large precursor mass tolerance searches, which likely explains the unexpectedly long analysis time. All searches were performed on a laptop with a 2.13 GHz Intel Core 2 Duo processor.
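The counting rule for unique peptide identifications, and the overlap shown in a Venn diagram, can be expressed with sets. This is a minimal sketch: the peptide sequences are invented, and keying an identification on the integer-rounded mass shift is an assumption.

```python
from typing import Iterable, Set, Tuple

Identification = Tuple[str, int]  # (peptide sequence, integer mass shift)


def unique_identifications(ssms: Iterable[Tuple[str, float]]) -> Set[Identification]:
    """Collapse SSMs so each peptide is counted once per modification mass shift."""
    return {(peptide, round(delta)) for peptide, delta in ssms}


# Hypothetical SSM lists (peptide, delta mass in Da) for two tools.
tool_a = unique_identifications([
    ("PEPTIDEK", 15.99), ("PEPTIDEK", 16.01),  # same peptide, same shift
    ("PEPTIDEK", 42.01), ("LVNELTEFAK", 43.01),
])
tool_b = unique_identifications([("PEPTIDEK", 15.99)])

shared = tool_a & tool_b   # identified by both tools
only_a = tool_a - tool_b   # identified by tool A only
print(len(tool_a), len(shared), len(only_a))
```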

Finally, we investigated another scenario, in which the 5 most abundant modifications are first identified by QuickMod. In a subsequent X!Tandem sequence search with 0.01 Da precursor mass tolerance, these modifications were specified as variable modifications on the amino acids most frequently observed in the QuickMod results for the corresponding mass shift (on average 3 amino acids per modification). The small precursor mass tolerance reduces the search space and could potentially lead to a high number of identified modified peptides. However, at an FDR of 0.05 X!Tandem returned a much lower number of modified PSMs (573 for doubly and 23 for triply charged spectra) than the OMS approaches.
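The selection step of this two-pass scenario can be sketched as below. The input layout (one integer mass shift and one modified residue per identified spectrum) and the function name are hypothetical, and using a fixed number of residues per modification is a simplification of the "on average 3" reported above.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def select_variable_modifications(hits: Iterable[Tuple[int, str]],
                                  n_mods: int = 5,
                                  n_residues: int = 3) -> Dict[int, List[str]]:
    """Pick the most abundant mass shifts and their most frequent residues.

    hits: (integer mass shift, modified residue) per identified spectrum.
    Returns {mass shift: [residues]} for the n_mods most frequent shifts.
    """
    mass_counts: Counter = Counter()
    residue_counts: Dict[int, Counter] = defaultdict(Counter)
    for mass, residue in hits:
        mass_counts[mass] += 1
        residue_counts[mass][residue] += 1
    return {
        mass: [res for res, _ in residue_counts[mass].most_common(n_residues)]
        for mass, _ in mass_counts.most_common(n_mods)
    }


# Hypothetical first-pass results.
hits = [(16, "M"), (16, "M"), (16, "W"), (43, "K"),
        (57, "K"), (57, "H"), (57, "K")]
print(select_variable_modifications(hits, n_mods=2, n_residues=1))
```

The resulting mapping would then be translated into the variable modification settings of the sequence search tool.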

Figure 4a)

Figure 4b)

Figure 4. A) The histogram shows the number of valid SSMs per integer modification mass returned by the three search tools QuickMod, SpectraST and InsPecT when analyzing the dataset Exp_HP. Solid bars show the number of spectra identified by QuickMod (FDR cut-off 0.05), triangles the number of spectra identified by SpectraST (FDR cut-off 0.05), and solid circles the number of spectra identified by InsPecT (p-value cut-off 0.05). B) The Venn diagram displays peptide identifications found by the three search tools. Each peptide is counted only once for each modification mass.


6.4 Concluding Remarks

Several spectrum library search tools have been developed in recent years for targeted analysis of MS2 datasets, while the use of spectrum libraries for the identification of modified variants of library peptides has seen little investigation. To evaluate this approach we compiled a modification-rich test dataset of annotated peptide spectra. We demonstrate that, for the identification of peptide modifications, the selection of peaks present in the experimental spectra of unmodified peptides serves as a better reference than theoretical peptide spectra. We also show that the spectrum libraries of non-modified peptide spectra used in this study allow for modification site localization accuracy comparable to that of full theoretical spectra.

The open modification spectrum library search tool, QuickMod, includes a spectrum-spectrum match scoring algorithm developed after careful analysis of an extensive list of spectral similarity measures. We show how spectra from peptides carrying distinct modification types have different scoring characteristics and evaluate the final scoring scheme per modification type. QuickMod was designed for the analysis of CID fragmentation data but is easily adapted to different MS2 data types. In future work we plan to evaluate our search tool on Electron Transfer Dissociation (ETD) MS2 data. ETD fragmentation has proven to be a successful method when targeting labile PTMs and usually yields a more complete peptide backbone fragmentation. This is good news for those researchers who apply OMS strategies since it should result in a higher similarity between modified and non-modified spectra leading to a higher identification rate of modified peptides and especially to more accurate modification positioning. In this context, it will also be interesting to see how the QuickMod approach can be extended to two or more modifications per peptide.

Finally, we would like to encourage proteomics labs to organize the high-confidence analysis results reported by sequence search tools into spectral libraries, so that future data analyses can benefit from these experiments. Previous studies have shown that spectral library searching provides fast and sensitive identification of peptides that are present in the library. As exemplified in this study, the high speed and accuracy of spectrum library searches also hold when searching for modified variants of library peptides.

Acknowledgement. We acknowledge Microsoft Research for their collaboration and we would also like to thank Alex Scherl, Paola Antinori and Laurant Geiser for providing the human plasma data.


References

(1) Hernandez, P.; Müller, M.; Appel, R., Automated protein identification by tandem mass spectrometry: issues and strategies. Mass Spectrom Rev 2006, 25, (2), 235–254.

(2) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villén, J.; Li, J.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P., Large-scale characterization of hela cell nuclear phosphoproteins. Proc Natl Acad Sci U S A 2004, 101, 12130–12135.

(3) Bodenmiller, B.; Malmstrom, J.; Gerrits, B.; Campbell, D.; Lam, H.; Schmidt, A.; Rinner, O.; Mueller, L. N.; Shannon, P. T.; Pedrioli, P. G.; Panse, P.; Lee, H.-K.; Schlapbach, R.; Aebersold, R., Phosphopep–a phosphoproteome resource for systems biology research in drosophila kc167 cells. Mol Syst Biol 2007, 3, 139.

(4) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.; Mortensen, P.; Mann, M., Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006, 127, 635–648.

(5) Tissot, B.; North, S. J.; Ceroni, A.; Pang, P.-C.; Panico, M.; Rosati, F.; Capone, A.; Haslam, S. M.; Dell, A.; Morris, H. R., Glycoproteomics: past, present and future. FEBS Lett 2009, 583, 1728–1735.

(6) Eng, J. K.; McCormack, A. L.; Yates, J. R., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5, 976–989.

(7) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S., Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.

(8) Craig, R.; Beavis, R. C., Tandem: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466–1467.

(9) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J., Olav: towards high-throughput tandem mass spectrometry data identification. Proteomics 2003, 3, 1454–1463.

(10) Tanner, S.; Shu, H.; Frank, A.; Wang, L.-C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V., Inspect: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem 2005, 77, 4626–4639.

(11) Tanner, S.; Payne, S. H.; Dasari, S.; Shen, Z.; Wilmarth, P. A.; David, L. L.; Loomis, W. F.; Briggs, S. P.; Bafna, V., Accurate annotation of peptide modifications through unrestrictive database search. J Proteome Res 2008, 7, 170–181.

(12) Ahrné, E.; Masselot, A.; Binz, P.-A.; Müller, M.; Lisacek, F., A simple workflow to increase ms2 identification rate by subsequent spectral library search. Proteomics 2009, 9, 1731–1736.

(13) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C., A spectral clustering approach to ms/ms identification of post-translational modifications. J Proteome Res 2008, 7, 4614–4622.

(14) Ye, D.; Fu, Y.; Sun, R.-X.; Wang, H.-P.; Yuan, Z.-F.; Chi, H.; He, S.-M., Open ms/ms spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics 2010, 26, i399–i406.

(15) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R., Development and validation of a spectral library searching method for peptide identification from ms/ms. Proteomics 2007, 7, 655–667.

(16) Desiere, F.; Deutsch, E. W.; King, N. L.; Nesvizhskii, A. I.; Mallick, P.; Eng, J.; Chen, S.; Eddes, J.; Loevenich, S. N.; Aebersold, R., The peptideatlas project. Nucleic Acids Res 2006, 34, D655–D658.

(17) Craig, R.; Cortens, J. C.; Fenyo, D.; Beavis, R. C., Using annotated peptide mass spectrum libraries for protein identification. J Proteome Res 2006, 5, 1843–1849.

(18) Frewen, B. E.; Merrihew, G. E.; Wu, C. C.; Noble, W. S.; MacCoss, M. J., Analysis of peptide ms/ms spectra from large-scale proteomics experiments using spectrum libraries. Anal Chem 2006, 78, 5678–5684.

(19) Lam, H.; Deutsch, E. W.; Aebersold, R., Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics. J Proteome Res 2009.

(20) Cortes, C.; Vapnik, V., Support vector networks. Machine Learning 1995, 20, 273–297.

(21) Creasy, D. M.; Cottrell, J. S., Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4, 1534–1536.

(22) Sadygov, R. G.; Yates, J. R., A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal Chem 2003, 75, 3792–3798.

Supporting Information:

Table s1. A complete version of Table 3, listing the modification site accuracy per modification type.


all %              0.6744   0.5496   0.922

Mod_z3 vs. THEO_z3

oxid       500     315      188      476
phospho    500     272      175      349
carb       500     387      280      487
acet       500     429      412      500
pyro       500     404      388      500
all        2500    1807     1443     2312
all %              0.7228   0.5772   0.9248
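The "all" and "all %" rows of the Mod_z3 vs. THEO_z3 block appear to be the column sums and the column sums divided by the total number of spectra (2500). Because the column headings are not present in this excerpt, the sketch below refers to the columns by position only; the variable names are hypothetical.

```python
# Rows of the Mod_z3 vs. THEO_z3 block: (total, column 2, column 3, column 4).
rows = {
    "oxid": (500, 315, 188, 476),
    "phospho": (500, 272, 175, 349),
    "carb": (500, 387, 280, 487),
    "acet": (500, 429, 412, 500),
    "pyro": (500, 404, 388, 500),
}

# Column sums reproduce the "all" row.
totals = [sum(column) for column in zip(*rows.values())]

# Dividing each remaining column sum by the total reproduces the "all %" row.
fractions = [round(value / totals[0], 4) for value in totals[1:]]

print(totals)
print(fractions)
```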

Figure s1a), Figure s1b)

Figure s1. The number of identified spectra when analyzing dataset Modz2/3 using an SVM with a linear, radial and polynomial kernel function.

Figure s2a), Figure s2b)

Figure s2. A complete version of Figure 3, including a PCA of all 5 modification types contained in dataset Modz2/3.


Chapter 7

An application of the QuickMod protein identification MS/MS data analysis pipeline.