Concluding Remarks - Manuscript 1, Extended Abstract: How to create good MS2 decoy spectra?

4. Manuscript 1, Extended Abstract: How to create good MS2 decoy spectra?

4.4 Concluding Remarks

In summary we have shown that the shuffling and repositioning algorithm implemented in DeLiberator produces realistic decoy spectra. In addition we argue that decoy databases should be evaluated in terms of their similarity to the target database and conclude that strictly controlling the similarity between a target spectrum and its decoy version leads to a less conservative and more accurate estimation of the FDR. This study also emphasizes how sensitive the FDR calculation is with regards to variations in the high score tail of decoy distributions. Already a few decoy spectra, which are too similar to some query spectra, are able to produce high scoring decoy SSMs and to significantly alter the FDR in the regime that is usually of importance for proteomic applications (FDR 1-5%).

Peptide fragmentation is a complex process therefore MS2 spectra are significantly different from their theoretical counterparts modeled by an MS2 sequence search tool. As a result, sophisticated algorithms have to be developed to calculate similarity scores of PSMs.

Contrary to MS2 sequence search engines, relatively little attention has been paid to scoring issues for spectral library search tools. Studying the peak matching patterns of incorrect and correct SSMs, provides guidance on how to develop more discriminant SSM scoring functions. We have seen that intense peaks of specific ion-types are frequently shared among spectra of the same charge state and similar precursor mass. Furthermore our analysis shows that high-scoring SSMs of distinct peptides often share multiple peaks originating from the same fragmentation site. Thus a data-type specific SSM scoring scheme which considers the ion-type and sequence position of the matched peaks may prove advantageous.

As peaks at certain m/z values are frequently shared among spectra of similar precursor mass it is not unlikely that co-eluting peptides share overlapping m/z values. This is of importance when evaluating transitions suitable for a selected reaction monitoring (SRM) approach to quantify peptides [35]. In this mode, proteins are quantified based on a limited list of MS2 fragment ions of proteotypic peptides; characterized by their uniqueness for a single protein

with care in order to avoid interfering signals from co-eluting peptides, which could disturb quantification. The type of analysis presented in this study may be useful in guiding the determination of the best transitions.

Acknowledgements

We would like to thank Microsoft Research for their collaboration.

Availability

Deliberator was written in the Java programming language based on Java Proteomics Library (JPL) base classes. A jar file and sources can be downloaded from sourceforge (http://sourceforge.net/projects/javaprotlib/files/Tools/DeLiberator_0.1.1.jar). We also provide a java visualization tool that creates MS2-MS1 plots similar to those in the paper (http://sourceforge.net/projects/javaprotlib/files/Tools/libviz_0.1.1.jar).

References

[1] Lam, H., Deutsch, E. W., Aebersold, R., Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics., J Proteome Res. 2009.

[2] Desiere, F., Deutsch, E. W., King, N. L., Nesvizhskii, A. I., Mallick, P., et al., The peptideatlas project., Nucleic Acids Res. 2006, 34 D655–D658.

[3] Lam, H., Deutsch, E. W., Eddes, J. S., Eng, J. K., Stein, S. E., et al., Building consensus spectral libraries for peptide identification in proteomics., Nat Methods. 2008, 5 873–875.

[4] Ahrné, E., Masselot, A., Binz, P.-A., Müller, M., Lisacek, F., A simple workflow to increase ms2 identification rate by subsequent spectral library search., Proteomics. 2009, 9 1731–1736.

[5] Craig, R., Cortens, J. C., Fenyo, D., Beavis, R. C., Using annotated peptide mass spectrum libraries for protein identification., J Proteome Res. 2006, 5 1843–1849.

[6] Frewen, B. E., Merrihew, G. E., Wu, C. C., Noble, W. S., MacCoss, M. J., Analysis of peptide ms/ms spectra from large-scale proteomics experiments using spectrum libraries., Anal Chem. 2006, 78 5678–5684.

[7] Eng, J. K., McCormack, A. L., Yates, J. R., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database., J. AM. Soc. Mass Spectrom. 1994, 5 976–989.

[8] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S., Probability-based protein identification by searching sequence databases using mass spectrometry data., Electrophoresis. 1999, 20 3551–3567.

[9] Colinge, J., Masselot, A., Giron, M., Dessingy, T., Magnin, J., Olav: towards high-throughput tandem mass spectrometry data identification., Proteomics. 2003, 3 1454–1463.

[10] Deutsch, E. W., Shteynberg, D., Lam, H., Sun, Z., Eng, J. K., et al., Trans-proteomic pipeline supports and improves analysis of electron transfer dissociation data sets., Proteomics. 2010, 10 1190–1195.

[11] Lam, H., Deutsch, E. W., Eddes, J. S., Eng, J. K., King, N., et al., Development and validation of a spectral library searching method for peptide identification from ms/ms., Proteomics. 2007, 7 655–667.

[12] Ahrné, E., Müller, M., Lisacek, F., Unrestricted identification of modified proteins using ms/ms., Proteomics. 2010, 10 671–686.

[13] Falkner, J. A., Falkner, J. W., Yocum, A. K., Andrews, P. C., A spectral clustering approach to ms/ms identification of post-translational modifications., J Proteome Res. 2008, 7 4614–4622.

[14] Ye, D., Fu, Y., Sun, R.-X., Wang, H.-P., Yuan, Z.-F., et al., Open ms/ms spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate., Bioinformatics. 2010, 26 i399–i406.

[15] Käll, L., Storey, J. D., MacCoss, M. J., Noble, W. S., Assigning significance to peptides identified by tandem mass spectrometry using decoy databases., J Proteome Res.

2008, 7 29–34.

[16] Wang, G., Wu, W. W., Zhang, Z., Masilamani, S., Shen, R.-F., Decoy methods for assessing false positives and false discovery rates in shotgun proteomics., Anal Chem. 2009, 81 146–159.

[17] Navarro, P., Zquez, J. S. V., A refined method to calculate false discovery rates for peptide identification using decoy databases., J Proteome Res. 2009.

[18] Elias, J. E., Gygi, S. P., Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry., Nat Methods. 2007, 4 207–214.

[19] Wells, J. M., McLuckey, S. A., Collision-induced dissociation (cid) of peptides and proteins., Methods Enzymol. 2005, 402 148–185.

[20] Coon, J. J., Ueberheide, B., Syka, J. E. P., Dryhurst, D. D., Ausio, J., et al., Protein identification using sequential ion/ion reactions and tandem mass spectrometry., Proc Natl Acad Sci U S A. 2005, 102 9463–9468.

[21] Molina, H., Horn, D. M., Tang, N., Mathivanan, S., Pandey, A., Global proteomic profiling of phosphopeptides using electron transfer dissociation tandem mass spectrometry., Proc Natl Acad Sci U S A. 2007, 104 2199–2204.

[22] Chi, A., Huttenhower, C., Geer, L. Y., Coon, J. J., Syka, J. E. P., et al., Analysis of phosphorylation sites on proteins from saccharomyces cerevisiae by electron transfer dissociation (etd) mass spectrometry., Proc Natl Acad Sci U S A. 2007, 104 2193–2198.

[23] Good, D. M., Wenger, C. D., McAlister, G. C., Bai, D. L., Hunt, D. F., et al., Post-acquisition etd spectral processing for increased peptide identifications., J Am Soc Mass Spectrom. 2009, 20 1435–1440.

[24] Maynard, D. M., Masuda, J., Yang, X., Kowalak, J. A., Markey, S. P., Characterizing complex peptide mixtures using a multi-dimensional liquid chromatography-mass spectrometry system: Saccharomyces cerevisiae as a model system., J Chromatogr B Analyt Technol Biomed Life Sci. 2004, 810 69–76.

[25] Omenn, G. S., The human proteome organization plasma proteome project pilot phase:

reference specimens, technology platform comparisons, and standardized data submissions and analyses., Proteomics. 2004, 4 1235–1240.

[26] Chalkley, R. J., Baker, P. R., Hansen, K. C., Medzihradszky, K. F., Allen, N. P., et al., Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight mass spectrometer: I. how much of the data is theoretically interpretable by search engines?, Mol Cell Proteomics. 2005, 4 1189–1193.

[27] Renard, B. Y., Kirchner, M., Monigatti, F., Ivanov, A. R., Rappsilber, J., et al., When less can yield more - computational preprocessing of ms/ms spectra for peptide identification., Proteomics. 2009, 9 4978–4984.

[28] Liu, J., Bell, A. W., Bergeron, J. J. M., Yanofsky, C. M., Carrillo, B., et al., Methods for peptide identification by spectral comparison., Proteome Sci. 2007, 5 3.

[29] Beer, I., Barnea, E., Ziv, T., Admon, A., Improving large-scale proteomics by clustering of mass spectrometry data., Proteomics. 2004, 4 950–960.

[30] Tabb, D. L., Thompson, M. R., Khalsa-Moyers, G., VerBerkmoes, N. C., McDonald, W. H., Ms2grouper: group assessment and synthetic replacement of duplicate proteomic tandem mass spectra., J Am Soc Mass Spectrom. 2005, 16 1250–1261.

[31] Good, D. M., Wenger, C. D., Coon, J. J., The effect of interfering ions on search algorithm performance for electron-transfer dissociation data., Proteomics. 2010, 10 164–167.

[32] Beausoleil, S. A., Villén, J., Gerber, S. A., Rush, J., Gygi, S. P., A probability-based approach for high-throughput protein phosphorylation analysis and site localization., Nat Biotechnol. 2006, 24 1285–1292.

[33] Everley, P. A., Bakalarski, C. E., Elias, J. E., Waghorne, C. G., Beausoleil, S. A., et al., Enhanced analysis of metastatic prostate cancer using stable isotopes and high mass accuracy instrumentation., J Proteome Res. 2006, 5 1224–1231.

[34] Sherman, J., McKay, M. J., Ashman, K., Molloy, M. P., How specific is my srm?: The issue of precursor and product ion redundancy., Proteomics. 2009, 9 1120–1123.

[35] Lange, V., Picotti, P., Domon, B., Aebersold, R., Selected reaction monitoring for quantitative proteomics: a tutorial., Mol Syst Biol. 2008, 4 222.

Supplementary Material

Creation of sampled and shifted libraries

Sampled decoy spectral libraries: Spectral peaks are randomly sampled from the target spectrum library. In order to preserve a realistic spectrum intensity distribution the overall distribution of peaks are grouped by precursor charge state and sampling intervals of 100 m/z units. For every peptide in the target spectrum library a sampled version is generated containing an equal number of peaks and the same peak annotations. Precursor ions are kept at their original m/z values.

Shifted decoy spectral libraries: All peaks in the target spectra are shifted by 3 m/z units (precursor ion peaks are not shifted).

S1 Evaluating simple decoy creation algorithms

The dot-product score distribution of top-ranked SSMs produced when searching an experimental dataset against a DSL can be used as a reference to evaluate decoy spectral libraries. When separately searching an experimental dataset against a DSL and its decoy version the two SSM score distributions should be similar, if the decoy library contains sufficiently realistic spectra. We started out by testing simple sampled and shifted decoy spectral libraries.

Searching an experimental dataset against a DSL consistently produces higher scoring SSMs than when screening the same dataset against simple sampled and shifted decoy versions of the DSL. We can thereby conclude that spectra created by the simple decoy creation

algorithms are not sufficiently realistic, and such decoy libraries are not expected to accurately model the score distribution of incorrect matches to a target library. Figure S1 also shows that the difference in dot-product score between DSL and decoy SSMs is substantially higher than the difference in number of matched peaks. This indicates that real spectra of distinct peptides share some structural similarity, as higher intensity spectrum features are matched in the DSL.

Figure s1)

Supplementary Figure 1

The total number of matched peaks (cumulative Shared Peak Count, SPC) and the cumulative dot-product score of top-ranking SSM’s when searching an experimental spectrum against a DSL. These two properties were also calculated when screening the same experimental dataset against sampled and shifted decoy libraries. The box-plot shows the ratio of decoy to DSL values of these two properties (Decoy cumul. spc / DSL cumul. spc, Decoy cumul. dp / DSL cumul. dp). A ratio of approximately 1, for both properties, would indicate that the simple decoy libraries contain sufficiently realistic spectra. The datasets used for this analysis

were; (IT) ExpH-CID-IT searched against LibY-CID-IT, (QTOF) ExpH-CID-QTOF searched against LibY-CID-QTOF and (ETD) ExpH-ETD searched against LibY-ETD.

Figure s2 a)

Figure s2 b)

Supplementary Figure 2

MS2-MS1 plots for spectra of the ETD library (LibY-ETD). Gray-color coding indicates the quintile of the ion intensity in the spectrum (0-20%, 20-40%, 40-60%, 60-80%, 80-100%).

Light gray correspond to low intensity and dark gray to high intensity. Only annotations with a mass difference less than 0.2 Da were considered. A) Zoom into intermediate m/z range.

Blue dots indicate c-ion and red dots z-ions (all intensities). B) Red dots represent all high intensity (80-100%) c (z) ions that have a complementary high intensity z (c) ion.

Figure s3 a)

Figure s3 b)

Figure s3 c)

Supplementary Figure 3

MS2-MS1 plots for spectra of the library (LibY-CID-IT). Gray-color coding indicates the quintile of the ion intensity in the spectrum (0-20%, 20-40%, 40-60%, 60-80%, 80-100%).

Light gray correspond to low intensity and dark gray to high intensity. Only annotations with a mass difference less than 0.2 Da were considered. A) Scatterplot of the entire mass range.

Blue dots identify high intensity ions (80-100%) annotated as parent mass peaks and red dots are high intensity non-annotation peaks. B) Zoom into lower m/z region. Blue dots indicate y ions and red dots a- or b ions (all intensities). C) Red dots represent all high intensity (80-100%) b (y) ions that have a complementary high intensity y (b) ion, whereas blue dots show all high intensity (80-100%) b (a) ions that have a complementary high intensity a (b) ion.

Figure s4 a)

Figure s4 b)

Figure s4 d)

Supplementary Figure 4

MS2-MS1 plots for spectra of the QTOF library (LibY-CID-QTOF). Gray-color coding indicates the quintile of the ion intensity in the spectrum (0-20%, 20-40%, 40-60%, 60-80%, 80-100%). Light gray correspond to low intensity and dark gray to high intensity. For the annotations, only annotations with a mass difference less than 0.2 Da were considered. A) Scatterplot of the entire mass range. Blue dots identify high intensity ions (80-100%) annotated as parent mass peaks and red dots are high intensity non-annotation peaks. B) Zoom into lower m/z region. Blue dots indicate y ions and red dots a- or b ions (all intensities). C) Zoom into intermediate m/z range. Blue dots indicate y ions and red dots a- or b ions (all intensities). D) Red dots represent all high intensity (80-100%) b (y) ions that have a complementary high intensity y (b) ion, whereas blue dots show all high intensity (80-100%) b (a) ions that have a complementary high intensity a (b) ion.

Figure s5)

Supplementary Figure 5

The bar-plot describes the ion-type annotations (y, z, c-ions and precursor ions) of matched peaks in top-ranking SSM when screening dataset ExpH-ETD against library LibY-ETD where precursor mass ions have not been discarded from the spectral data. The bars indicate the frequency of the different ion-types in the library spectra representing peptides of length 14 (precursor charge state 3+), and the overlaid points indicate the frequency at which these ion-types were matched in top-ranked SSM. Stars indicate ion-types that are significantly

“over-matched” (binomial pval < 0.05).

Figure s6 a), Figure s6 b)

Supplementary Figure 6

The bar-plot a) describes the ion-type annotations (a, b, y-ions and precursor mass ions) of matched peaks in top-ranking SSM when screening dataset ExpH-CID-QTOF against library LibY-CID-QTOF. The bars indicate the frequency of the different ion-types in the library spectra representing peptides of length 9 (precursor charge state 2+), and the overlaid points indicate the frequency at which these ion-types were matched in top-ranked SSM.

The bar-plot b) describes the ion-type annotations of matched peaks in top-ranking SSM when screening dataset ExpH-CID-IT against library LibY-CID-IT. The bars indicate the frequency of the different ion-types in the library spectra representing peptides of length 11 (precursor charge state 2+), and the overlaid points indicate the frequency at which these ion-types were matched in top-ranked SSM.

Stars indicate ion-types that are significantly “over-matched” (binomial pval < 0.05).

Figure s7 a), Figure s7 b)

Supplementary Figure 7

We compared the frequencies of the simultaneous matching of implicitly linked peak pairs when separately screening the experimental dataset ExpH-CID-IT against the DLS LibY-CID-IT (Ref. library) and a sampled version of this library (Sampled Library) A).

B) shows the analysis results when searching the experimental dataset ExpH-CID-QTOF against the DLS LibY-CID-QTOF (Ref. library) and a sampled version of this library (Sampled Library).

The bar-plot display the matching frequencies of consecutive ions pairs (e.g. y_n and y_n+1, BB,YY ), charge state pairs (e.g. b_n+ and b_n++, C.S.), same sequence fragment pairs (e.g. a_n and b_n, AB), neutral loss pairs (e.g. b_n and b_n-H₂O, N.L. ) and complimentary sequence pairs (e.g b₁ and y_n-1, BY), in top ranked SSM for all doubly charged precursor query spectra.

Figures 8)

Supplementary Figure 8

The upper spectrum represents the doubly charged peptide DFEYEINGNEGK (precursor ion

= 707.8) in the spectrum library LibY-CID-IT. The lower spectrum is a decoy version of this library entry created by the DeLiberator algorithm (decoy peptide EFEDNIGNGYEK).

Matched sequence ions are drawn as solid black lines, unmatched sequence ions drawn as solid gray lines, matched annotated ions as bold dotted lines and unmatched non-annotated ions as dotted lines.

Figure s9 a), Figure s9 b)

Supplementary Figure 9

DPThrs defines the maximum allowed spectral similarity between the sequence ions of a target library spectrum and a decoy library spectrum. If the dot-product spectral similarity between a target and decoy spectrum is greater than DPThrs the decoy peptide sequence is reshuffled.

The boxplot show the results of an evaluation of different DeLiberator target-decoy spectral similarity thresholds searching two well annotated query datasets against a target-decoy (DeLiberator) spectral library.

As the peptide identity of all query spectra was known a reference FDR curve could be calculated. FDR Sqrt Deviation reflects the agreement between FDR estimates calculated from the decoy score distribution and the reference FDR curve: the cumulative square root distance (in the FDR range 0-0.15) between the reference FDR and decoy FDR estimates at fixed dot-product scores (0 means perfect agreement, see Figure 5)

A) ExpD-CID was screened against LibD-CID concatenated to different DeLiberator decoy spectral libraries where ten DPThrs cutoffs were tested (0.1-1).

B) ExpY-CID was screened against LibY-CID concatenated to different DeLiberator decoy spectral libraries where ten DPThrs cutoffs were tested (0.1-1).

Based on the analysis of the two test dataset a DPThrs cut-off of 0.7 seems appropriate when considering a trade off between computational time and FDR Sqrt. Deviation. No significant improvement in FDR Sqrt. Deviation is observed for DPThrs cut-offs lower than 0.7.

Figure s10 a), Figure s10, b) Figure s10 b)

Supplementary Figure 10

Four decoy creation algorithms were evaluated: The DeLiberator algorithm, the decoy creation algorithm of SpectraST and the sampling and shifting algorithms, described in the first paragraph of this documanet. The ratio-axis gives the ratio of the number of DSL SSM to

concatenated to its decoy version. The similarity-axis indicates the mean dot-product score of top-ranked SSM when the spectral libraries were searched against their decoy versions. a) Experimental spectra: ExpH-ETD, DSL: LibY-ETD, b) experimental spectra: ExpH-CID-IT, DSL: LibY-CID-IT, and c) experimental spectra: ExpH-CID-QTOF, DSL: LibY-CID-QTOF.

Figure s11 a), Figure s11 b), Figure s11 c)

Supplementary Figure 11

Dot-product score distributions of first and second rank SSM in a random library search. The histogram shows the combined score distribution of rank 1 and rank 2 SSM dot-product scores and the light gray and dark gray lines show the separate score distributions of rank 1 and rank 2 SSMs. A) The experimental dataset ExpH-ETD was searched against the DSL

LibY-ETD. B) LibY-ETD was searched against its decoy version created by DeLiberator. C) LibY-ETD was searched against its decoy version created by SpectraST.

Note that the search results presented above represent an atypical scenario where the decoy library contains a randomized version of the correct peptide identification of all query spectra.

This is rarely the case as MS2 datasets typically contain a large number of poor quality spectra or spectra of peptides not present in the target library. Nevertheless, these results explain why FDR estimates obtained from a SpectraST decoy search are expected to be conservative.

Figure s12 a), Figure s12b)

Figure s12 c), Figure s12 d)

Supplementary Figure 12

A) SSM sore distribution when ExpD-CID was screened against LibD-CID concatenated to a

B) SSM sore distribution when ExpD-CID was screened against LibD-CID concatenated to a SpectraST decoy spectral library.

C) SSM sore distribution when ExpY-CID was screened against LibY-CID concatenated to a DeLiberator decoy spectral library.

D) SSM sore distribution when ExpY-CID was screened against LibY-CID concatenated to a SpectraST decoy spectral library.

Since all query spectra were annotated with a peptide sequence we can plot the distributions of Correct and Incorrect SSM scores.

Decoy score distributions were scaled to have the same area under curve as the distribution of incorrect target SSMs.

Figure s13 a), Figure s13 b)

Supplementary Figure 13

The number of valid SSM at FDRs in the range 0-0.15 when A) ExpD-CID was screened against LibD-CID concatenated to a DeLiberator decoy spectral library (solid line) and a SpectraST (dot-dash line) decoy spectral library, B) ExpY-CID was screened against LibY-CID concatenated to a DeLiberator decoy spectral library (solid line) and a SpectraST decoy spectral library (solid line). The dotted curve represents the reference FDR. A reference FDR could be calculated as all query spectra were annotated with a peptide sequence ( Ref. FDR = incorrect target SSM / correct target SSM + incorrect target SSM).

DeLiberator algorithm - Pseudo code FOR i = 1 to number of spectra in TargetLibrary

TargetSpectrum = TargetLibrary[i]

PrecursorWindow // set window for the precursor peaks

// precursor window

FOR z = 1 to value of Charge

WindowStart = (MassWeight+z*MassProton-WindowSize)/z WindowEnd = (MassWeight+z*MassProton)/z

PrecursorWindow.setWindow(WindowStart,WindowEnd) // specify the window from start to end END FOR

IF fragmentType of TargetSpectrum is ETD

CoelutionWindow // set window for the precursor peaks of potential coeluted peptide

// consider potential charge state of coeluted peptide FOR ChargeCoelu = Charge-1 to Charge+1

FOR z = 1 to value of ChargeCoelu

WindowStart = (MassWeight+z*MassProton-WindowSize)/z WindowEnd = (MassWeight+z*MassProton)/z

CoelutionWindow.setWindow(WindowStart,WindowEnd) // specify the window from start to end END FOR

END FOR

CoelutionWindow.filterPeaks(TargetSpectrum) // recreate new TargetSpectrum END IF

DotProductScore = 1 Count = 0

DecoySpectrum

WHILE DotProductScore > ScoreThreshold or Count < MaximumShuffling

DecoyCandidateSpectrum

DecoySequence = ShuffleSequence(TargetSequence) FOR j = 1 to number of peaks in TargetSpectrum

TargetPeak = TargetSpectrum[j]

TargetMz = getMz(TargetPeak)

TargetAnnotation = getAnnotation(TargetPeak)

// decoy and target have common annotations DecoyAnnotation = TargetAnnotation

IF TargetAnnotation is fragment ion

// calculate the mz of decoy using the annotation

// ie) if the annotation is "y4", calculate new m/z from decoy sequence DecoyMz = RepositionPeak(DecoyAnnotation,DecoySequence)

ELSE IF TargetAnnotation is precursor or immonium ion DecoyMz = TargetMz

ELSE IF TargetAnnotation is non-annotated

IF TargetMz < 200 or PrecursorWindow.isIncluded(TargetMz)

Dans le document Exploring the use of MS/MS spectral libraries to improve protein identification and characterization (Page 66-92)