6.2.2 The QuickMod scoring algorithm

Several spectrum library search tools have been developed for targeted analysis of MS2 datasets^11-6. Even though the spectrum of a non-modified peptide and the spectrum of a modified variant of the same peptide often share important similarities, these tools were not designed to match such spectral pairs. QuickMod is a modification-tolerant spectrum library search tool in which the query data are searched for both unmodified and modified variants of the peptide entries listed in the spectrum library. Instead of matching the query spectra against theoretical spectra generated from simple peptide fragmentation rules, QuickMod uses the experimentally observed fragmentation pattern of the unmodified peptide spectrum when attempting to identify modified variants of that peptide.

A number of scoring algorithms have been developed to determine the similarity between an experimental spectrum and a theoretical spectrum. In the simplest case for CID MS2 data, the theoretical spectrum contains the calculated b- and y-ion fragments, and the similarity score is based on the shared peak count between the compared spectra. In contrast, many scoring schemes include a multitude of fragment types, including a-ions, immonium ions and neutral loss ions, and extract several features in addition to the shared peak count, such as the ratio of experimental to theoretical b- and y-ions, the length of continuous ion series, or the matching peak intensity. A discriminant score based on the combined measure of these features is then derived.

Spectrum library search tools typically score spectrum-spectrum matches (SSMs) by taking into account the intensity of the matching spectrum peaks. The normalized dot product score (DPS), i.e. the scalar product of normalized spectral vectors, is commonly used for this purpose. SpectraST combines it with a delta score, which measures the difference between the highest DPS of a spectrum and its runner-up, and with a term that prevents the final score from being dominated by a few high-intensity peaks. The SpectraST score is also made more robust with regard to random matches of high-intensity fragments by transforming the raw intensities to their square root values. Bonanza¹³ and pMatch¹⁴ are spectrum library search tools designed for open modification searches (OMS), which assume that the precursor mass difference ∆M can be explained by one modified amino acid. Based on this assumption, query and library spectra are matched, considering that a query peak and a library peak of the same ion type may be separated by the mass ∆M. Bonanza scores the match with a DPS. pMatch combines the DPS with an intensity-independent binomial score, which calculates the probability that the matching peaks occur by chance. In order to control the false discovery rate (FDR) in the final list of SSMs, SpectraST creates decoy spectra by randomly shuffling the peptide sequence and shifting all annotated peaks in accordance with the shuffled sequence¹.

QuickMod data analysis starts with a spectrum pre-processing step. In this study, only peaks that are the highest within a ±5 m/z window around the peak's m/z value are retained. Then, for doubly (triply) charged peaks only the 40 (60) highest peaks are kept. After this filtering step, the intensities of the peaks are divided by the maximal intensity in the spectrum. All the pre-processing parameters are user-defined and can be extended to different charge states.
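
The pre-processing steps described above can be sketched as follows. This is a minimal illustration and not the QuickMod implementation: the function name `preprocess`, the representation of a spectrum as a list of `(mz, intensity)` tuples, and the handling of ties (equally intense neighbours are both kept) are assumptions made for the example. The default `window` and `top_n` values correspond to the ±5 m/z window and the 40-peak limit quoted in the text; `top_n=60` would be passed for the triply charged case.

```python
def preprocess(peaks, window=5.0, top_n=40):
    """Filter and normalize a spectrum given as a list of (mz, intensity) tuples.

    Returns the retained peaks sorted by m/z, with intensities scaled so
    that the most intense retained peak has intensity 1.0.
    """
    # Step 1: keep a peak only if no other peak within +/- window m/z is higher.
    # Assumption: ties are kept (a peak survives if it is >= all its neighbours).
    local_maxima = [
        (mz, intensity)
        for mz, intensity in peaks
        if all(
            intensity >= other_intensity
            for other_mz, other_intensity in peaks
            if abs(other_mz - mz) <= window
        )
    ]

    # Step 2: keep only the top_n most intense of the remaining peaks.
    strongest = sorted(local_maxima, key=lambda p: p[1], reverse=True)[:top_n]
    if not strongest or strongest[0][1] <= 0:
        return []

    # Step 3: divide every intensity by the maximal intensity in the spectrum.
    max_intensity = strongest[0][1]
    return sorted((mz, intensity / max_intensity) for mz, intensity in strongest)


if __name__ == "__main__":
    spectrum = [(100.0, 10.0), (102.0, 50.0), (104.0, 20.0), (200.0, 25.0), (300.0, 5.0)]
    print(preprocess(spectrum, window=5.0, top_n=2))
```

The quadratic neighbour scan keeps the sketch short; for large spectra a sliding window over m/z-sorted peaks would do the same work in a single pass.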

QuickMod matches each query spectrum against each library spectrum with a precursor mass difference ∆M within a user-defined range (usually ±200 Da). The matching algorithm accounts for the peak annotation of each library peak. For a peak with several possible annotations, only the first one is considered, where the ordering is provided by the spectrum library annotation. A library peak of m/z value m_L matches a query peak of m/z value m_Q if |m_L – m_Q| < δ or |m_L – m_Q – ∆M/Z_F| < δ, where δ is the fragment m/z tolerance and Z_F is the charge of the library peak. If a peak in the library spectrum is not annotated, all shifts corresponding to all possible charge states Z_F that are compatible with the m/z value m_F of the fragment ion (i.e. { Z_F | Z_F ∈ {1 : Z_P}, m_F < M_P/Z_F }, where Z_P and M_P are the precursor charge and neutral mass) are taken into account. In cases where a library peak matches several query peaks, the query peak with the closest normalized intensity is matched.
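
The matching rule can be illustrated with a short sketch. The function names, the argument layout and the use of `None` to mark an unannotated library peak are hypothetical choices for this example, not part of QuickMod. The sign convention of ∆M is taken directly from the inequality as printed above, so that a shifted match satisfies m_L – m_Q ≈ ∆M/Z_F.

```python
def candidate_charges(fragment_mz, precursor_charge, precursor_mass):
    """Charge states Z_F in 1..Z_P that satisfy m_F < M_P / Z_F."""
    return [
        z for z in range(1, precursor_charge + 1)
        if fragment_mz < precursor_mass / z
    ]


def match_peak(lib_mz, lib_intensity, lib_charge, query_peaks, delta_m, tol,
               precursor_charge=None, precursor_mass=None):
    """Return the query peak (mz, intensity) matched to one library peak, or None.

    lib_charge is the annotated fragment charge, or None for an unannotated
    peak, in which case every compatible charge state is tried.
    query_peaks is a list of (mz, normalized_intensity) tuples.
    """
    if lib_charge is not None:
        charges = [lib_charge]
    else:
        charges = candidate_charges(lib_mz, precursor_charge, precursor_mass)

    candidates = []
    for query_mz, query_intensity in query_peaks:
        diff = lib_mz - query_mz
        unshifted = abs(diff) < tol                      # |m_L - m_Q| < delta
        shifted = any(abs(diff - delta_m / z) < tol      # |m_L - m_Q - dM/Z_F| < delta
                      for z in charges)
        if unshifted or shifted:
            candidates.append((query_mz, query_intensity))

    if not candidates:
        return None
    # Several query peaks may qualify: keep the one whose normalized
    # intensity is closest to that of the library peak.
    return min(candidates, key=lambda peak: abs(peak[1] - lib_intensity))


if __name__ == "__main__":
    query = [(500.005, 0.9), (580.0, 0.4), (700.0, 1.0)]
    print(match_peak(500.0, 0.5, 1, query, delta_m=-80.0, tol=0.02))
```

In the usage example both the unshifted peak at 500.005 and the shifted peak at 580.0 fall within tolerance, and the intensity criterion selects the latter.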

The SSM scoring algorithm of QuickMod was developed after careful evaluation of a list of features that reflect various properties of the similarity between an experimental spectrum and a library spectrum, such as the intensity of the matched peaks, the number of matched peaks and the peptide sequence coverage. In order to limit the computational time required to evaluate the match between a query spectrum and a library spectrum, we wanted to reduce the list of scoring features included in the QuickMod scoring scheme to those expected to contribute significantly to a higher number of identified peptides. This feature selection was carried out using the annotated modification-rich test datasets described in the dataset section. The features were ranked by their partial area under the ROC curve (AUC). The n highest-ranked features were combined in a support vector machine (SVM)²⁰, for n going from 1 to the number of features. For every number n of features the SVM was evaluated, and the smallest value of n with a performance comparable to the best performance was used for feature selection (see below).
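
The selection procedure can be outlined in code. The sketch below rests on several assumptions not stated in the text: the false positive rate cut-off `max_fpr` for the partial AUC, the `tolerance` that defines "comparable to the best performance", and the `evaluate` callable, which stands in for training and evaluating the SVM on a given feature subset. All function names are hypothetical.

```python
def partial_auc(scores, labels, max_fpr=0.1):
    """Area under the ROC curve restricted to false positive rates <= max_fpr.

    labels: 1 for a correct match, 0 for an incorrect one.
    Higher scores are assumed to indicate correct matches.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both correct and incorrect matches")

    # Build ROC points by lowering the score threshold; tied scores are
    # processed together so that they produce a single ROC point.
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))

    # Trapezoidal integration, interpolating the segment that crosses max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def rank_features(feature_values, labels, max_fpr=0.1):
    """Feature names ordered from highest to lowest partial AUC."""
    return sorted(
        feature_values,
        key=lambda name: partial_auc(feature_values[name], labels, max_fpr),
        reverse=True,
    )


def smallest_comparable_n(ranked, evaluate, tolerance=0.01):
    """Smallest n whose top-n performance is within tolerance of the best.

    evaluate(features) is a placeholder for fitting and scoring the
    classifier (an SVM in the text) on the given feature subset.
    """
    performance = [evaluate(ranked[:n]) for n in range(1, len(ranked) + 1)]
    best = max(performance)
    for n, value in enumerate(performance, start=1):
        if value >= best - tolerance:
            return n
```

A library routine could replace the hand-written partial AUC; it is spelled out here only to keep the example free of dependencies.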

Next we describe all similarity features taken into account. First, we calculate the DPS between the experimental spectrum and the aligned library spectrum using two types of fragment ion peak intensity transformation: a simple square root transformation (dpSqrt) and a rank transformation (dpRank), in which the intensity of each peak is replaced by its intensity rank, ordered from lowest to highest intensity. For both the square root and the rank transformation, the transformed peak intensities are divided by the transformed intensity of the most intense peak. Next, we calculate the DPS of the query and library spectra neglecting shifted peaks (dpSqrtNoAlign and dpRankNoAlign). The hyperProb feature neglects peak intensities and reflects the hypergeometric probability that the fragment peak matches between two spectra occur by chance. The hyperProbNoAlign feature is similar, but does not allow modification shifts in fragment m/z values. Finally, we add the rate of matched b- and y-ions (byIonCoverage) as well as a score giving the difference between the best and the worst modification site score (PosScore).
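
The intensity-based and the intensity-independent features can be illustrated with a small sketch. It is not the QuickMod code, and several details are assumptions: spectra are represented as aligned intensity vectors (position i holds the intensities of a matched peak pair, with 0.0 where a peak has no partner), tied intensities share a rank, and the hypergeometric model treats the m/z range as `n_bins` bins of which `n_library` and `n_query` are occupied. The text does not specify how QuickMod parameterizes the hypergeometric probability.

```python
import math


def sqrt_transform(intensities):
    """Square root of each intensity, divided by that of the most intense peak."""
    top = max(intensities)
    if top <= 0:
        return [0.0] * len(intensities)
    return [math.sqrt(x / top) for x in intensities]


def rank_transform(intensities):
    """Replace each non-zero intensity by its rank (lowest = 1), divided by the top rank.

    Assumption: equal intensities share a rank; absent peaks (0.0) stay 0.0.
    """
    levels = sorted({x for x in intensities if x > 0})
    rank = {x: i + 1 for i, x in enumerate(levels)}
    return [rank[x] / len(levels) if x > 0 else 0.0 for x in intensities]


def dot_product_score(query, library, transform):
    """Normalized dot product of two aligned, transformed intensity vectors."""
    q = transform(query)
    l = transform(library)
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in l))
    if norm == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(q, l)) / norm


def hypergeom_tail(n_bins, n_library, n_query, n_matched):
    """Probability of at least n_matched chance matches under a hypergeometric model."""
    total = math.comb(n_bins, n_query)
    upper = min(n_library, n_query)
    favourable = sum(
        math.comb(n_library, i) * math.comb(n_bins - n_library, n_query - i)
        for i in range(n_matched, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    query = [10.0, 0.0, 30.0, 20.0]
    library = [1.0, 0.0, 100.0, 5.0]
    print("dpSqrt:", dot_product_score(query, library, sqrt_transform))
    print("dpRank:", dot_product_score(query, library, rank_transform))
    print("chance probability:", hypergeom_tail(1000, 40, 40, 10))
```

Under this representation, the NoAlign variants would use the same functions on vectors built without the ∆M-shifted matches. The example spectra share the same intensity ordering, so the rank-based score is insensitive to the large difference in absolute intensities that lowers the square root score.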

As mentioned above, spectrum library search tools give considerable weight to the intensities of matching and non-matching peaks when scoring SSMs. However, the presence of a PTM on the peptide may affect its fragmentation propensities, changing the spectrum intensity envelope of the modified peptide spectrum compared to that of the non-modified peptide. With this in mind, we studied the difference in intensity distribution between modified and non-modified peptide spectra for the five distinct modification types included in our test dataset (Mod_z2) and evaluated the discriminatory power of the final QuickMod scoring scheme for each modification type.

From the document Exploring the use of MS/MS spectral libraries to improve protein identification and characterization (pages 135-138)