
4. Manuscript 1, Extended Abstract: How to create good MS2 decoy spectra?

4.3 Results and Discussion

4.3.2 DeLiberator algorithm to create decoy spectra

The main result presented in the previous sections is that MS2 spectra are not random assemblies of peaks and intensities; they contain strong correlations that are specific to the fragmentation method and protease. A global visualization of a collection of MS2 spectra, as displayed in Figure 1, shows that certain low-mass m/z values and precursor-mass-correlated m/z values are common in spectra of distinct peptides. These frequently recurring m/z values typically correspond to low- and high-order sequence ions as well as precursor mass ions of different charge states. In Figure 3 we see how top-scoring SSMs of distinct peptides commonly share groups of related peaks, such as consecutive peaks of an ion series or multiple peaks originating from the same fragmentation site. Based on these investigations, and on the condition that there is one potential decoy match per library match, we are now able to formulate what is needed for a good decoy library creation algorithm: 1) For each target precursor ion mass there is a decoy precursor ion mass. 2) For each target spectrum a decoy spectrum is created with the same number of peaks and the same intensity sum. 3) Decoy spectrum peaks are positioned on realistic m/z values of a protease-specific peptide sequence. Correlations between fragment intensities and the intensities of their neutral loss or complementary fragment peaks are conserved. 4) A realistic intensity distribution for all m/z values and realistic correlations between ion type and intensity are ensured for the decoy spectrum. 5) Each decoy spectrum is guaranteed to be substantially different from its target version.

The shuffling and repositioning algorithm [1] implemented in the SpectraST software tool considers these requirements. Briefly, in this algorithm the peptide sequence of each target library spectrum is shuffled and all annotated fragment ions are repositioned according to the new sequence, i.e. their annotations and intensities remain (e.g. a y8-ion remains a y8-ion), but their mass is changed according to the shuffled sequence (Supplementary Figure 8). In order to avoid decoy sequences that are too similar to their target sequences, only shuffled sequences with a sequence similarity lower than 70% are accepted. Additionally, SpectraST keeps the sequence positions of all arginine, lysine, and proline residues in order to guarantee the same number of missed cleavage sites after shuffling. The peaks without annotation are left unchanged.
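The shuffling step described above can be sketched as follows. This is a minimal illustration, not SpectraST's actual code: the function names, the retry limit, and in particular the definition of sequence similarity as the fraction of positions carrying the same residue are assumptions made for the example.

```python
import random

# Residues whose positions are kept, as described in the text.
FIXED_RESIDUES = frozenset("KRP")


def shuffle_keep_fixed(sequence, rng):
    """Shuffle all residues except K, R and P, which stay in place."""
    movable_idx = [i for i, aa in enumerate(sequence) if aa not in FIXED_RESIDUES]
    movable = [sequence[i] for i in movable_idx]
    rng.shuffle(movable)
    out = list(sequence)
    for i, aa in zip(movable_idx, movable):
        out[i] = aa
    return "".join(out)


def positional_identity(a, b):
    """Fraction of positions with the same residue (assumed similarity measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffled_decoy_sequence(sequence, max_similarity=0.7, max_tries=100, seed=0):
    """Return a shuffled sequence below the similarity cutoff, or None.

    max_tries and seed are hypothetical parameters for this sketch.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = shuffle_keep_fixed(sequence, rng)
        if positional_identity(sequence, candidate) < max_similarity:
            return candidate
    return None
```

Calling `shuffled_decoy_sequence("YENGEPPMEVYEVLR")` yields a permutation in which both prolines and the C-terminal arginine stay at their original positions. Note that the fixed residues always count as matches under this similarity definition, which is one reason short peptides leave few acceptable permutations.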

The DeLiberator algorithm is similar, but implements some important differences that mainly affect the similarity between original and decoy spectra (condition 5). First, when shuffling the peptide sequence, the C-terminal amino acid is left untouched while the other amino acids are reordered. We do not keep internal R, K, and P residues in place since, especially for short peptides, this reduces the number of possible permutations and could produce decoy sequences that are too similar to their target sequences. Second, the similarity between original and shuffled peptide is evaluated on the fragment mass level and not on the sequence level.

To this end, we calculate the normalized dot-product of the annotated peaks of the target spectrum and its decoy spectrum version. If this dot-product score, which ranges from 0 to 1 (0: no similarity, 1: identical spectra), is larger than the similarity threshold of 0.7, the peptide sequence is iteratively re-shuffled until a decoy variant with a spectral similarity to the target library spectrum lower than the set threshold is found. If no reorganization of the peptide sequence produces a dot-product score lower than the threshold, the spectrum version with the lowest similarity score is selected. Supplementary Figure 9a,b shows an evaluation of different similarity thresholds using a set of well-annotated query spectra, justifying the choice of 0.7 as a threshold value.
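The first two differences can be sketched as below. This is an illustration under stated assumptions rather than the DeLiberator implementation: only singly charged b- and y-ions are considered, peaks are compared after binning to 1 m/z unit, and the number of reshuffling attempts is capped at `max_tries`. The function names, the bin width, and the cap are all hypothetical; the residue masses are standard monoisotopic values.

```python
import math
import random

# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_mz(sequence, ion_type, index):
    """m/z of a singly charged b- or y-ion of the given sequence."""
    if ion_type == "b":
        return sum(MONO[aa] for aa in sequence[:index]) + PROTON
    return sum(MONO[aa] for aa in sequence[-index:]) + WATER + PROTON


def binned(sequence, annotated, bin_width=1.0):
    """Place annotated intensities at the fragment positions of `sequence`."""
    vec = {}
    for (ion_type, index), intensity in annotated.items():
        key = int(fragment_mz(sequence, ion_type, index) / bin_width)
        vec[key] = vec.get(key, 0.0) + intensity
    return vec


def normalized_dot(u, v):
    """Normalized dot-product of two sparse intensity vectors (0 to 1)."""
    num = sum(u[k] * v[k] for k in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values()) * sum(x * x for x in v.values()))
    return num / den if den else 0.0


def deliberator_like_decoy(sequence, annotated, threshold=0.7, max_tries=50, seed=0):
    """Reshuffle (C-terminus fixed) until the spectral similarity is below threshold.

    If no attempt gets below the threshold, the least similar candidate is returned.
    """
    rng = random.Random(seed)
    target_vec = binned(sequence, annotated)
    best_seq, best_score = None, float("inf")
    for _ in range(max_tries):
        head = list(sequence[:-1])
        rng.shuffle(head)
        candidate = "".join(head) + sequence[-1]
        score = normalized_dot(target_vec, binned(candidate, annotated))
        if score < best_score:
            best_seq, best_score = candidate, score
        if score < threshold:
            break
    return best_seq, best_score
```

Here `annotated` maps an annotation such as `("y", 8)` to its intensity, so repositioning a peak amounts to recomputing its m/z from the shuffled sequence while annotation and intensity stay the same. Because the C-terminal residue is fixed, y1 always coincides between target and decoy, so the score is rarely exactly zero.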

Third, as seen in Table 1, non-annotated peaks make up a non-negligible fraction of the spectrum library peaks. Keeping all non-annotated peaks at the same m/z values in both target and decoy spectra, as proposed by the SpectraST decoy creation algorithm, may produce spectral pairs with too high a similarity in cases where non-annotated peaks make up a large fraction of the spectrum peak intensity. The DeLiberator algorithm only keeps non-annotated peaks below 200 m/z units in the decoy version. In addition, non-annotated peaks within a window of 60 Da below each possible precursor ion charge state are shared between a target spectrum and its decoy version (a spectrum of a triply charged precursor includes three such windows, below the singly, doubly, and triply charged precursor ion). The decoy m/z values of all other non-annotated peaks are sampled from the overall distribution of non-annotated library peaks, grouped by precursor charge state and m/z bins of 100 m/z units.
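A possible reading of this rule is sketched below. Several details are assumptions made for the example and are not stated in the text: the 60 Da window is applied as 60 m/z units below each precursor m/z, a peak is resampled from the background pool of its own 100 m/z bin, and a peak is left in place when that pool is empty. The names `precursor_windows`, `reposition_unannotated`, and the `background` structure are hypothetical.

```python
import random

PROTON = 1.00728


def precursor_windows(precursor_mz, charge, width=60.0):
    """Windows (low, high) below the precursor ion at charge states 1..charge."""
    neutral_mass = (precursor_mz - PROTON) * charge
    windows = []
    for z in range(1, charge + 1):
        mz = neutral_mass / z + PROTON
        windows.append((mz - width, mz))
    return windows


def reposition_unannotated(peaks, precursor_mz, charge, background, rng,
                           low_mz_cutoff=200.0, bin_width=100.0):
    """Return decoy (m/z, intensity) pairs for the non-annotated peaks.

    peaks:      list of (m/z, intensity) for non-annotated target peaks
    background: dict mapping (charge, bin index) to a list of m/z values of
                non-annotated library peaks (assumed, precomputed structure)
    """
    windows = precursor_windows(precursor_mz, charge)
    decoy = []
    for mz, intensity in peaks:
        keep = mz < low_mz_cutoff or any(lo <= mz <= hi for lo, hi in windows)
        if not keep:
            pool = background.get((charge, int(mz // bin_width)), [])
            if pool:  # assumption: leave the peak in place if no pool exists
                mz = rng.choice(pool)
        decoy.append((mz, intensity))
    return decoy
```

Intensities are carried over unchanged, so the decoy keeps the same number of peaks and the same intensity sum as the target, in line with condition 2.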

The reasoning behind this treatment of non-annotated peaks is based on the fact that most spectrum libraries are made up of consensus spectra, meaning that each peptide entry is represented by an averaged spectrum compiled from multiple spectra identified to the same peptide and precursor charge state. A number of different algorithms have been developed to create a peptide-representative consensus spectrum from a spectrum cluster [28, 29, 30]. Typically, the resulting spectrum generated by any of these algorithms emphasizes peptide-specific peaks detected in multiple spectra and discards or penalizes non-reproducible peaks. Consequently, a significant portion of non-annotated consensus peaks are likely to be peptide-specific ions, such as internal fragments or sequence ions of an ion type or neutral loss not considered by the library creation tool. Ideally, when generating a decoy version of a target library spectrum, the sequence-related portion of the non-annotated peaks should be positioned in accordance with the new decoy peptide sequence, other peaks (such as immonium ions and precursor ions) should be kept where they are, and peaks not related to the sequence should be randomized.

In order to illustrate the importance of the spectrum-based similarity measure and the handling of non-annotated peaks, we display a library spectrum (LibY-CID-IT) and a decoy version created by SpectraST that would be rejected by DeLiberator (Figure 4). Even though the target and decoy peptides have a sequence similarity lower than 70%, the library entries have a very high spectral similarity (dot-product score 0.94), due to shared intense sequence ions and non-annotated peaks not repositioned by the decoy algorithm. The decoy spectrum in Figure 4 is not an isolated case; we could find many similar examples.

Figure 4. The upper spectrum represents the doubly charged peptide YENGEPPMEVYEVLR in the spectrum library LibY-CID-IT. The lower spectrum is a decoy version of this library entry created by the SpectraST algorithm (decoy peptide EYNGEPPYEMVELVR). Even though the target and decoy peptides have a sequence similarity lower than 70% (the similarity threshold implemented in the SpectraST algorithm), the target and decoy library entries have a very high spectral similarity (dot-product score 0.94), resulting from shared high-intensity sequence ions (solid black lines) and non-annotated ions (dotted lines). Solid gray lines show the non-matching peaks.