IndiMax: fast and reliable filtering methodology for Tandem Mass Spectrometry data

Thesis

KUZYAKIV, Rostyslav

Abstract

Many algorithms for protein/peptide identification from Tandem Mass Spectrometry (MS/MS) data have been developed. The majority of such methods are based on sequence or spectral database search: experimental spectra are matched against spectra characterizing known peptide sequences. Hits define candidate peptides. In this context, reliable and fast candidate peptide/spectrum filtering or ranking techniques would enhance the pre-processing of large-scale datasets. The present work introduces IndiMax, a multi-stage, fast and accurate peptide/spectrum candidate filtering and ranking methodology for peptide MS/MS data. Based on a low-complexity hierarchical search approach, IndiMax can be applied to pre-process multiply charged (up to +5), de-isotoped experimental MS/MS spectra of large quantity and variable quality. The developed methodology can be integrated into general workflows to increase the identification rate and the consistency of the results. The potential user thus gains an opportunity to optimize a previously established data analysis strategy.

KUZYAKIV, Rostyslav. IndiMax: fast and reliable filtering methodology for Tandem Mass Spectrometry data. Thèse de doctorat : Univ. Genève, 2013, no. Sc. 4596

URN : urn:nbn:ch:unige-394980

DOI : 10.13097/archive-ouverte/unige:39498

Available at:

http://archive-ouverte.unige.ch/unige:39498

UNIVERSITÉ DE GENÈVE, FACULTÉ DES SCIENCES, Département d’Bioinformatique: Prof. Ron Appel, thesis supervisor

Institut Suisse de Bioinformatique: Dr. Frédérique Lisacek, thesis co-supervisor; Prof. Sviatoslav Voloshynovskiy, thesis co-supervisor

IndiMax: Fast and Reliable Filtering Methodology for Tandem Mass Spectrometry Data

THESIS

submitted to the Faculté des sciences of the Université de Genève for the degree of Docteur ès sciences, specialization in bioinformatics,

by Rostyslav Kuzyakiv, of Toronto (Canada)

Thesis no. 4596

GENÈVE

Université de Genève

2013

Table of tables

Table of figures

Abstract

French summary

1 Introduction

1.1 Proteomics

1.2 Protein Identification

1.3 Mass Spectrometry

1.3.1 Mass Spectrum (MS1)

1.3.2 Tandem Mass Spectrum (MS2)

1.3.2.1 Peptide Fragmentation

1.3.2.2 Most common ions observed

1.4 Protein Identification by Mass Spectrometry

1.4.1 Peptide Mass Fingerprinting (PMF)

1.4.2 Protein Identification by Tandem Mass Spectrometry (MS/MS)

1.5 Thesis Outline

1.5 Main Contribution

2 Tandem Mass Spectra Analysis and Protein Identification

2.1 Sequence Database Search

2.1.1 1st Generation Sequence Database Searching Algorithms

2.1.2 2nd Generation Sequence Database Searching Algorithms

2.2 De Novo Sequencing

2.3 Sequence Tag Search

2.4 Spectral Library Search

2.4.1 Major Spectral Library Searching Algorithms

2.5 Estimation of the Accuracy of Peptide Identifications Made by Database Search

2.5.1 The Single Spectrum Confidence Score

2.5.2 Posterior Probabilities and False Discovery Rate (FDR)

2.5.3 Mixture Model to Compute Posterior Probabilities and FDR

2.6 Identification of Modified Proteins by MS/MS

2.6.1 Unrestricted (Open) Modification Search (OMS)

2.6.2 OMS Algorithms

2.7 Filtering and Clustering of Tandem Mass Spectrometry Data

2.7.1 Problem Formulation

2.7.2 Existing Solutions

2.7.2.1 Filtering of MS/MS Spectra of PTM Proteins/Peptides

2.7.3 Limitations of Existing MS/MS Spectra Clustering/Filtering Solutions

2.7.4 Analysis, Generalization and Restrictions of Existing PSM/SSM Scoring Strategies

2.7.5 Major Requirements to the Filtering Methodology for MS/MS data

3 Peptide/Spectrum Candidate Filtering Methodology for MS/MS data

3.1 General Description of the Proposed Methodology

3.2 Detailed Description of the Proposed Methodology

3.2.1 Structure of a Query Dataset, a Sequence Database and a Spectral Library

3.2.2 Pre-processing of the Query MS/MS Spectrum

3.2.2.1 Peak Filtering of the Query MS/MS Spectrum

3.2.2.2 Peak Sorting and Peak Extraction

3.2.3 Sequence Database/Spectral Library Filtering

3.2.4 PSM/SSM Scoring

3.2.5 Output of Filtering

3.2.6 Clustering Redundant Experimental MS/MS Spectra

3.2.7 Details of Implementation

3.2.7.1 Generation of Theoretical Fragmentation MS/MS Spectra

3.2.7.2 Generation of Decoy Sequence Database/Spectral Library

3.3 Statistical Validations of the Proposed Methodology (Sequence Database)

3.3.1 Statistical Validation of MS/MS Spectrum Filtering Strategy

3.3.2 NPM Distance Calculations

3.3.3 Peak Distance Calculations

3.3.4 PSM Score Configuration and Statistical Validations

3.3.5 Summary of Statistical Validations

4 Results

4.1 Sequence Database Filtering

4.1.1 Benchmark Performance Analysis

4.1.1.1 Decoy Database Search

4.1.2 Multiply Charged MS/MS Spectra. General Performance Analysis

4.1.3 Unlimited vs. Limited Final List of Candidate. Performance Analysis

4.1.4 Database Filtering Rate. Performance Analysis

4.1.5 Extended Sequence Database. Performance Analysis

4.1.6 Summary of Sequence Database Filtering

4.2 Spectral Library Filtering

4.2.1 Clustering Redundant Experimental MS/MS Spectra

4.2.2 General Spectral Library Filtering

4.2.3 Filtering for MS/MS Spectra of Post-Translationally Modified Peptides

4.2.4 Summary of Spectral Library Filtering

5 Conclusion

Appendix

References

Table of Tables

Table 1. Applications of proteomics

Table 2. 1st generation sequence database searching algorithms for MS/MS data

Table 3. 2nd generation sequence database searching algorithms for MS/MS data

Table 4. Additional de novo sequencing algorithms for MS/MS data

Table 5. Additional sequence tagging algorithms for MS/MS data

Table 6. Limitations of existing solutions

Table 7. Percentage of well positioned peaks filtered with various filtering strategies

Table 8. Score: A) num. of matched peaks; B) with intensity (2nd training dataset)

Table 9. Sensitivity vs. Specificity (2nd training dataset)

Table 10. Testing datasets for sequence database filtering

Table 11. Target sequence database. NPM tolerance: 1.5 GT and IM; 0.026 IM

Table 12. Decoy sequence database. NPM tolerance: 1.5 GT and IM; 0.026 IM

Table 13. Sensitivity vs. FDR. NPM tolerance: 1.5 GT and IM; 0.026 IM

Table 14. Target sequence database (3rd, 4th datasets)

Table 15. Decoy sequence database (3rd, 4th datasets)

Table 16. Sensitivity vs. FDR (3rd, 4th datasets)

Table 17. Target sequence database (5th dataset)

Table 18. Decoy sequence database (5th dataset)

Table 19. Sensitivity vs. FDR (5th dataset)

Table 20. Database filtering rate vs. false discovery rate (FDR) (3rd, 4th, 5th datasets)

Table 21. Extended sequence database (1st, 3rd datasets)

Table 22. Extended decoy database (1st, 3rd datasets)

Table 23. Sensitivity vs. FDR (1st, 3rd datasets). Extended and original DB

Table 24. Testing datasets for spectral library search

Table 25. Target spectral library (5th, 6th datasets)

Table 26. Decoy spectral library (5th, 6th datasets)

Table 27. Sensitivity vs. FDR (5th, 6th datasets)

Table 28. Target spectral library (7th dataset)

Table 29. Decoy spectral library (7th dataset)

Table 30. Sensitivity vs. FDR (7th dataset)

Table 31. Testing dataset

Table 32. Spectral library results (8th dataset). A: SL 18470; B: SL 344000

Table 33. Sensitivity vs. final list size. SL 18470 MS/MS; SL 344000 MS/MS

Table 34. General comparison of ion detectors

Table of Figures

Figure 1. Liquid chromatogram of 5 separated proteins

Figure 2. General components of a modern mass spectrometer

Figure 3. A mass spectrum (MS1). General view

Figure 4. Fragmentation of a peptide with the generation of corresponding ions

Figure 5. An MS/MS spectrum of the peptide, dominated by b- and y-ions

Figure 6. Protein identification by peptide mass fingerprint

Figure 7. Protein identification by tandem mass spectrometry

Figure 8. Strategies to interpret MS/MS data

Figure 9. De novo sequencing approach

Figure 10. Single spectrum confidence score. General workflow

Figure 11. Single spectrum confidence score. P-value computation

Figure 12. Target-decoy strategy for FDR estimation. General workflow

Figure 13. Target-decoy strategy for FDR estimation

Figure 14. Target-decoy strategy for FDR estimation. FDR based score threshold

Figure 15. Clustering of redundant MS/MS spectra

Figure 16. Consensus MS/MS spectrum construction

Figure 17. Sequence tag filtering approach

Figure 18. Fundamentals of Locality Sensitive Hashing

Figure 19. Correlative matrix and examples of correlative window

Figure 20. The observation model

Figure 21. The maximum a posteriori probability rule

Figure 22. The spectral contrast angle

Figure 23. Noisy observation of the experimental MS/MS

Figure 24. “Local” displacement of the peak of the experimental MS/MS spectrum

Figure 25. Proposed methodology

Figure 26. MGF file

Figure 27. General structure of: query dataset, sequence database and spectral library

Figure 28. Local Maximum Filter: General Structure

Figure 29. Local Maximum Filter used for MS/MS spectrum filtering

Figure 30. Sorting and extracting the peaks from the filtered query MS/MS spectrum

Figure 31. Workflow of the SD/SL filtering. Generation of 1st list of matches

Figure 32. Workflow of the SD/SL filtering. Generation of the 2nd list of PSMs/SSMs

Figure 33. Workflow of simplified candidate filtering

Figure 34. Workflow of advanced candidate filtering

Figure 35. Probability of Error for an experimental peak

Figure 36. NPM distance distribution (average, 1st and 2nd training datasets)

Figure 37. List sizes distributions (NPM based, training datasets)

Figure 38. Peak distance distribution. Correct, False (NPM pre-filtered)

Figure 39. PSM scores distributions (2nd training dataset)

Figure 40. Sizes of the final list (2nd training dataset)

Figure 41. Position of the correct PSM on the final list (2nd training dataset)

Figure 42. Sensitivity vs. Specificity (2nd training dataset)

Figure 43. Filtered PSM scores distributions (norm.): a) 1st dataset; b) 2nd dataset

Figure 44. The generation of the decoy sequence database

Figure 45. Sensitivity vs. FDR: a) 1st dataset; b) 2nd dataset

Figure 46. Filtered PSM scores distributions (3rd, 4th datasets)

Figure 47. Sensitivity vs. FDR (3rd, 4th datasets)

Figure 48. Filtered PSM scores distributions (5th dataset)

Figure 49. Filtered PSM scores distributions: a) 1st dataset; b) 3rd dataset

Figure 50. Sensitivity vs. FDR: a) 1st dataset; b) 3rd dataset

Figure 51. Filtered SSM scores distributions (5th, 6th datasets)

Figure 52. Sensitivity vs. FDR (5th, 6th datasets)

Figure 53. Filtered SSM scores distribution (7th dataset)

Figure 54. Sensitivity vs. FDR (7th dataset)

Figure 55. Filtered SSM scores distribution (8th dataset)

Figure 56. Sizes of the final list (8th dataset)

Figure 57. Positions of the correct SSM on the list (8th dataset)

Figure 58. Ionization methodologies

Figure 59. Mass analysers

Figure 60. Fourier transformation used in FT-ICR mass analyser

Abstract

Tandem mass spectrometry (MS/MS) has become a widely used technique for protein identification in complex mixtures. The methodology is applied in:

• biomarker discovery and validation;

• drug discovery and validation;

• clinical diagnosis and treatment validation;

• food quality control.

The amount of information generated by MS/MS in different applications is growing at a remarkable rate. The datasets of tandem spectra generated by different research groups already contain hundreds of thousands of samples. These datasets are generated by different mass spectrometers, in various formats and by different users. In some cases a large amount of the generated data remains unlabelled and partially unexplored because previously developed search engines cannot efficiently manage the exponentially increasing amount of data. Together, these factors complicate the interaction between research groups and considerably slow down research and development. There is therefore great interest in systems that manage these collections of experimental data accurately and quickly.

To manage the growth of datasets and databases, scientists have adopted new approaches that cluster redundant MS/MS spectra (Beer et al., 2004; Frank et al., 2008) or filter/index the experimental MS/MS data (Tanner et al., 2005; Tabb et al., 2003; Frank and Pevzner 2005; You Li et al., 2010) prior to advanced data analysis with commonly used search engines. However, like other algorithms, these methodologies have limitations that affect their applicability to large-scale datasets. The drawbacks are: restriction to MS/MS spectra of high quality; restriction to experimental MS/MS spectra of a certain charge (e.g., the most common +2 charged MS/MS spectra); accelerated sequence database search without further reduction of the final list of candidate peptides; and extensive use of random projection and binarization for filtering MS/MS data, which may reduce the data dimensionality but considerably degrades the filtering/indexing accuracy. Under such conditions, the development of a simple, powerful and discriminative candidate filtering and scoring methodology for MS/MS data becomes an essential task.

Similar problems exist in the domain of multimedia management, where people look for similar images, videos or text documents in large data collections, as in Google Image Search or YouTube (Deng et al. 2009, Hays and Efros 2007). An equally important similar problem exists in the domain of human biometrics, where people are identified by their fingerprint, iris, etc. (Ignatenko and Willems, 2009, Kalker et al., 2010).

Despite the different domains and origins, all these problems share the need to find the best matches to a given query according to a defined measure of similarity. The result of the search can be given either as the single best match or as a list of matches of fixed or variable size. This problem is known to be computationally hard due to the curse of dimensionality (Beyer et al., 1999, Böhm and Keim 2001).

Therefore, current state-of-the-art techniques try to overcome this issue by performing approximate matching (Datar et al., 2004, Gionis et al., 1999, Muja and Lowe 2009). The key idea shared by these algorithms is to find the best match with a high probability of 1 − ε, where ε is a small positive value, instead of the exact match with probability 1. One of the first techniques of approximate matching is locality sensitive hashing (LSH) (Datar et al., 2004, Jegou et al., 2011). However, for many applications based on real data, LSH is outperformed by heuristic methods (Muja and Lowe 2009).

Moreover, in peptide/protein identification by MS/MS, the situation is complicated by the fact that the underlying distortions (missed cleavages, contaminations, ion or water loss, etc.) can hardly be fitted to the Euclidean metric, which is the basis of LSH. In general, therefore, peptide identification systems face an accuracy-complexity trade-off, partly because of incorrect underlying assumptions about the model of observations.
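To make the approximate matching idea concrete, the following is a minimal Python sketch of a Euclidean LSH scheme of the kind based on p-stable distributions, where each hash has the form h(v) = floor((a·v + b) / w). It is an illustration only and not part of IndiMax; the function names, the bucket width `bucket_width`, the number of hashes and the seed are assumptions chosen for the example.

```python
import math
import random


def make_hash(dim, bucket_width, rng):
    """Return one hash function h(v) = floor((a.v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian projection vector
    b = rng.uniform(0.0, bucket_width)             # random offset in [0, w)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / bucket_width)

    return h


def make_signature(dim, n_hashes, bucket_width, seed=0):
    """Concatenate several hash functions into one bucket key."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, bucket_width, rng) for _ in range(n_hashes)]
    return lambda v: tuple(h(v) for h in hashes)


def build_index(vectors, signature):
    """Group vector ids by bucket key."""
    index = {}
    for i, v in enumerate(vectors):
        index.setdefault(signature(v), []).append(i)
    return index


def query(index, signature, q):
    """Return the ids stored in the query's bucket (may be empty)."""
    return index.get(signature(q), [])
```

Vectors that are close in Euclidean distance tend to fall into the same bucket, so a query inspects only one bucket instead of the whole collection. A true neighbour can still land in a different bucket, which is why the match is found only with probability below 1.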

In this thesis, we have analysed the limitations encountered by other MS/MS data clustering and filtering/indexing algorithms. Based on that analysis, we have introduced IndiMax, a multi-stage, fast and accurate peptide/spectrum candidate filtering and ranking methodology for peptide tandem mass spectrometry data. Based on a low-complexity hierarchical search approach (Kuzyakiv et al., 2010), IndiMax can be applied to pre-process multiply charged (up to +5), de-isotoped experimental MS/MS spectra of large quantity and variable quality. The developed methodology could be integrated into general workflows to increase the identification rate and the consistency of the results.

Thus, the potential user would get an opportunity to optimize the previously established data analysis strategy.

The proposed methodology is based on a stochastic model of observations. It eliminates noise from the query MS/MS spectra and exploits statistical information on the intensities and m/z values of the retained experimental peaks. Depending on the chosen parameters, it filters either a sequence database (SD) or a spectral library (SL) and generates, for each query spectrum, a short list of ranked peptide-spectrum matches (PSMs) or spectrum-spectrum matches (SSMs), respectively. The score applied to the PSMs/SSMs suits both sequence database and spectral library filtering and is more robust to peak misalignment than cross-correlation based methods. The robustness of the proposed spectral matching procedure rests on a comprehensive analysis of peak position deviations performed on training datasets. Our investigation showed the potential applicability of the proposed filtering scheme to pre-processing experimental data before advanced data analysis.
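The stages described above can be illustrated with a short Python sketch: noise removal by keeping local intensity maxima, a cheap precursor-mass pre-filter, and ranking of the surviving candidates by the number of matched peaks. This is a simplified sketch under stated assumptions, not the IndiMax implementation. The dictionary keys (`mass`, `peaks`, `mzs`, `id`), the window width and the tolerances are hypothetical, and the actual score described in the thesis may also use peak intensities.

```python
def local_max_filter(peaks, window):
    """Keep the most intense (mz, intensity) peak in each m/z window."""
    best = {}
    for mz, intensity in peaks:
        key = int(mz // window)
        if key not in best or intensity > best[key][1]:
            best[key] = (mz, intensity)
    return sorted(best.values())


def top_peaks(peaks, n):
    """Return the m/z values of the n most intense peaks."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [mz for mz, _ in ranked[:n]]


def count_matches(query_mzs, candidate_mzs, tol):
    """Count query peaks that have a candidate peak within +/- tol."""
    return sum(
        any(abs(q - c) <= tol for c in candidate_mzs) for q in query_mzs
    )


def filter_candidates(query, candidates, mass_tol, peak_tol,
                      n_peaks, window, list_size):
    """Stage 1: precursor-mass window. Stage 2: rank by matched peaks."""
    kept = local_max_filter(query["peaks"], window)
    query_mzs = top_peaks(kept, n_peaks)
    scored = []
    for cand in candidates:
        if abs(cand["mass"] - query["mass"]) > mass_tol:
            continue  # cheap rejection before any peak comparison
        score = count_matches(query_mzs, cand["mzs"], peak_tol)
        scored.append((score, cand["id"]))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, id breaks ties
    return scored[:list_size]
```

The same routine can serve a sequence database or a spectral library, provided each entry supplies a precursor mass and a list of fragment m/z values (theoretical for the former, measured for the latter). Ordering the stages from cheapest to most expensive keeps the number of peak comparisons small.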

We used 8 experimental MS/MS datasets of modified and unmodified human and yeast peptides to validate the general performance of the proposed candidate filtering methodology in the sequence database and spectral library set-ups. The validation was based on a series of calculations. For sequence database filtering, the validation included benchmark performance analysis and filtering of MS/MS data of unmodified, multiply charged peptides. For spectral library filtering, the validation consisted of filtering MS/MS data of unmodified, multiply charged peptides as well as post-translationally modified (PTM) peptides. The results of the calculations were used to compute the sensitivity, the false discovery rate (FDR), the average size of the list of matches, the average position of the correct match on the final list and the database filtering rate.
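As an illustration of how such quantities can be computed from ranked candidate lists, here is a minimal Python sketch. The definitions are assumptions made for the example and may differ from those used in the thesis: sensitivity is taken as the fraction of queries whose correct match appears in the final list, and the filtering rate as the fraction of the database discarded per query. The function and key names are hypothetical.

```python
def evaluate(ranked_lists, truth):
    """Summarize filtering output.

    ranked_lists: {query_id: [candidate ids, best first]}
    truth:        {query_id: id of the correct candidate}
    """
    hits = 0
    ranks = []
    sizes = []
    for query_id, candidates in ranked_lists.items():
        sizes.append(len(candidates))
        correct = truth.get(query_id)
        if correct in candidates:
            hits += 1
            ranks.append(candidates.index(correct) + 1)  # 1-based position
    n = len(ranked_lists)
    return {
        "sensitivity": hits / n if n else 0.0,
        "avg_list_size": sum(sizes) / n if n else 0.0,
        "avg_correct_rank": sum(ranks) / len(ranks) if ranks else None,
    }


def filtering_rate(database_size, avg_list_size):
    """Fraction of the database discarded per query (assumed definition)."""
    return 1.0 - avg_list_size / database_size
```

A short list with the correct match near the top gives high sensitivity together with a high filtering rate, which is the combination the validation looks for.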

For the chosen experimental datasets, under the presented experimental conditions, the methodology quickly and accurately produced short lists of PSMs/SSMs. The output was characterized by high sensitivity coupled with top positions of the correct PSMs/SSMs in the final lists. The benchmark performance analysis involved GutenTag (Tabb, Saraf and Yates, 2003), a sequence tagging algorithm for peptide identification from MS/MS data. The benchmark calculations confirmed the speed and accuracy advantage of IndiMax over GutenTag in two experimental set-ups.

In general, the obtained results (for the given conditions) confirmed the usefulness of the information on m/z values and intensities of the experimental peaks for accurate peptide/spectrum candidate filtering. The low average list size of PSMs/SSMs and the high sensitivity at FDR <= 0.05 demonstrate the soundness of using probabilistic calculations of the precursor ion mass and of the m/z positions of high-intensity experimental peaks for fast and reliable filtering of sequence databases and spectral libraries.
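The FDR criterion can be illustrated with a small Python sketch of a target-decoy estimate. It assumes separate searches against a target and a decoy database of equal size and estimates the FDR at a score threshold as the number of decoy matches divided by the number of target matches at or above that threshold. This estimator and the function names are assumptions for the example; the thesis may use a different formulation.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoy matches) / (#target matches) above threshold."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


def lowest_threshold(target_scores, decoy_scores, max_fdr=0.05):
    """Return the smallest observed target score whose FDR is <= max_fdr."""
    for t in sorted(set(target_scores)):
        if fdr_at_threshold(target_scores, decoy_scores, t) <= max_fdr:
            return t
    return None  # no threshold satisfies the requested FDR
```

Sensitivity at FDR <= 0.05 is then the fraction of queries whose correct match scores at or above the threshold returned by `lowest_threshold`.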

A future extension of the algorithm could entail evaluating its performance on various MS/MS datasets of modified and unmodified peptides. In that case, the quality of the results could be correlated with improvements to the proposed PSM/SSM scoring method. The advancements should be based on a careful evaluation of a list of features that reflect various properties of the similarity between a query MS/MS spectrum and a sequence database/spectral library spectrum. For PTM identification, the improvements should account not only for peak distortions due to noise, ion loss or missed cleavages but also for those resulting from the presence of a certain modification. To this end, a better understanding of the effects of modifications on peptide fragmentation patterns could contribute to a better theoretical model of modified spectra. The developed model could be integrated into the existing PSM/SSM scoring method, thus enhancing the general performance of filtering.

Implementing IndiMax in C++, Python or Java would speed up the filtering, thereby enhancing the general performance and usability of the methodology for large-scale MS/MS datasets of modified and unmodified peptides. It can be integrated into an MS/MS peptide identification platform that performs multiple stages of the identification workflow. Alternatively, it can stand as an independent module participating only in some stages of the peptide identification process.

Implemented in Java, IndiMax could join the Java Proteomics Library (JPL) (http://javaprotlib.sourceforge.net/), an in-house Java proteomics class library. It could be made available under an open source license and downloaded from the JPL web site to assist with:

• pre-processing of experimental MS/MS data of unmodified peptides under sequence database/spectral library set-ups;

• clustering of redundant experimental MS/MS spectra;

• compiling spectral libraries;

• identifying PTM peptides (Open Modification Search approach).

French Summary

Tandem mass spectrometry (MS/MS) has become a widely used technique for identifying proteins in complex mixtures. The methodology is applied in:

• biomarker discovery and validation;

• drug discovery and validation;

• validation of clinical diagnoses and treatments;

• food quality control.

The amount of information generated by MS/MS across applications is growing remarkably fast. The tandem spectra datasets produced by different research groups already contain hundreds of thousands of samples. These data are generated by different mass spectrometers, in different formats and by different users. In some cases, a large share of the data remains uncatalogued and partially unexplored because previously developed search engines cannot cope with an exponential increase in data. All of this complicates interaction between research groups and considerably slows down research and development. There is therefore great interest in systems that manage these collections of experimental data accurately and quickly.

To cope with growing datasets and database sizes, scientists have adopted new approaches that cluster redundant MS/MS spectra (Beer et al., 2004; Frank et al., 2008) or filter/index experimental MS/MS data (Tanner et al., 2005; Tabb et al., 2003; Frank and Pevzner 2005; You Li et al., 2010) before advanced data analysis with commonly used search engines. However, like other algorithms, these methodologies have their own limitations, which affect their ability to handle very large collections. The drawbacks are: restriction to high-quality MS/MS spectra; restriction to experimental MS/MS spectra of a certain charge (e.g., the most common +2 charged MS/MS spectra); accelerated sequence database search without further reduction of the final list of candidate peptides; and extensive use of random projection and binarization for filtering MS/MS data, which can reduce the data dimensionality but considerably degrades filtering/indexing accuracy. Under such conditions, developing a simple, powerful and discriminative candidate filtering and scoring methodology for MS/MS data becomes an essential task.

The same problem exists in multimedia management, where people search for similar content among images, videos or text documents in large data collections, as in applications such as Google Image Search or YouTube (Deng et al., 2009, Hays and Efros 2007). No less important is the similarity problem in human biometrics, where people are identified by their fingerprints, iris, etc. (Ignatenko and Willems, 2009, Kalker et al., 2010).

Despite the differences in domain and origin, these problems share the need to find the best matches to a query according to a defined similarity measure. The search result can be returned either as the single best match or as a list of matches of fixed or variable size. This problem is known to be computationally hard because of the curse of dimensionality (Beyer et al., 1999, Böhm and Keim 2001). Current techniques therefore try to work around it by performing approximate matching (Datar et al., 2004, Gionis et al., 1999, Muja and Lowe 2009). The key idea shared by these algorithms is to find the best match with a probability of 1 − ε, where ε is a small positive value, instead of the exact match with a probability of 1. One of the first approximate matching techniques is Locality Sensitive Hashing (LSH) (Datar et al., 2004, Jegou et al., 2011). However, for many applications based on real data, LSH is outperformed by heuristic methods (Muja and Lowe 2009). Moreover, in peptide/protein identification by MS/MS, the situation is complicated by the fact that the underlying distortions (missed cleavages, contaminations, ion or water loss, etc.) are hard to capture with the Euclidean distance, which is precisely the basis of LSH. In general, therefore, peptide identification systems face an accuracy-complexity trade-off, partly because of incorrect underlying assumptions about the observation model.

In this thesis, we analysed the limitations encountered by other MS/MS data clustering and filtering/indexing algorithms. Based on this analysis, we introduced IndiMax, a multi-stage, fast and accurate methodology for filtering and ranking candidate peptides/spectra for tandem mass spectrometry data. Based on a low-complexity hierarchical search approach (Kuzyakiv et al., 2010), IndiMax can be applied to pre-process multiply charged (up to +5), de-isotoped experimental MS/MS spectra of variable quantity and quality. The developed methodology can be integrated into a more general workflow to increase the identification rate and the consistency of the results. The potential user thus gains the opportunity to optimize a previously established data analysis strategy.

The proposed methodology is based on a stochastic observation model; it removes noise from the query MS/MS spectrum and exploits statistical information on the intensities and m/z values of the retained experimental peaks. Depending on the chosen parameters, it filters either a sequence database (SD) or a spectral library (SL) and generates, for each query spectrum, a short ranked list of peptide-spectrum matches (PSMs) or spectrum-spectrum matches (SSMs), respectively. The score applied to the PSMs/SSMs suits both sequence database and spectral library filtering and is more robust to peak misalignment than cross-correlation based methods. The robustness of the proposed spectral matching procedure rests on a comprehensive analysis of peak position deviations carried out on the training datasets. Our investigation shows the potential applicability of the proposed filtering scheme to pre-processing experimental data before advanced data analysis.

We used 8 experimental MS/MS datasets of modified and unmodified human and yeast peptides to validate the general performance of the proposed candidate filtering methodology in the sequence database and spectral library set-ups. The validation was based on a series of calculations. For sequence database filtering, the validation included benchmark performance analysis and filtering of MS/MS data of unmodified, multiply charged peptides. For spectral library filtering, the validation consisted of filtering MS/MS data of unmodified, multiply charged peptides as well as post-translationally modified (PTM) peptides. The results were used to compute the sensitivity, the false discovery rate (FDR), the average size of the list of matches, the average position of the correct match in the final list and the database filtering rate.

For the chosen experimental datasets, under the experimental conditions presented above, the methodology quickly and accurately produced short lists of PSMs/SSMs. The output was characterized by high sensitivity coupled with top positions of the correct PSMs/SSMs in the final list. The benchmark performance analysis involved GutenTag (Tabb, Saraf and Yates, 2003), a sequence tagging algorithm for peptide identification from MS/MS data. The benchmark calculations confirmed the speed and accuracy advantage of IndiMax over GutenTag in both experimental set-ups.

In general, the results obtained (for the given conditions) confirmed the usefulness of information on the m/z values and intensities of experimental peaks for accurate filtering of candidate peptides/spectra. The low average list size of PSMs/SSMs and the high sensitivity at FDR <= 0.05 demonstrate the soundness of using probabilistic calculations of the precursor ion mass and of the m/z positions of high-intensity experimental peaks for fast and accurate filtering of sequence databases and spectral libraries.

L'évolution de l'algorithme dans le futur devrait se concentrer sur l'évaluation des performances avec des collections MS/MS variées de peptides modifiés et non-modifiés.

Dans un tel cas, la qualité des résultats pourrait être corrélée avec l'amélioration de la méthode de classement PSM/SSM proposée. Les avancées seront basées sur une évaluation

(17)

attentive d'une liste de caractéristiques qui reflète diverses propriétés de similarité entre une requête de spectre MS/MS et un spectre de base de données de séquences ou de librairie de spectres. Dans le cas d'une identification de PTM, les améliorations ne doivent pas seulement inclure les distorsions de pics dues au bruit, à la perte de ion ou aux divisions manquées, mais aussi à celles résultants de la présence de certaines modifications. Pour cela, une meilleure compréhension des effets des modifications de la méthode de fragmentation des peptides pourrait contribuer au développement d'un meilleure modèle théorique du spectre modifié. Le modèle développé pourrait être intégré dans une méthode de classement PSM/SSM existante et ainsi améliorer la performance générale de filtrage.

L'implémentation d'IndiMax en C++, Python ou Java pourrait accélérer le filtrage et ainsi améliorer la performance générale et l'usage de la méthodologie à travers une grande palette de données MS/MS de peptides modifiés et non-modifiés. Cela pourrait être intégré dans une plateforme d'identification de peptides MS/MS incluant un processus en plusieurs étapes. D'un autre côté, cela pourrait rester un module indépendant participant seulement à certaines étapes du processus d'identification des peptides.

Implémenté en Java, IndiMax pourrait rejoindre la "Java Proteomics Library" (JPL) (http://javaprotlib.sourceforge.net/), librairie de classes java de protéomique interne. Cela pourrait être disponible sous licence open source et pourrait être téléchargé depuis le site web JPL pour aider à :

 pré-traitement de données expérimentales MS/MS de peptides non-modifiés sous configurations séquence de base de données / librairie spectrale;

 regroupement de spectres MS/MS expérimentaux redondants;

 compilation de librairies spectrales ;

 identification de peptides PTM (« Open Modification Search Approach »).


1. Introduction

1.1 Proteomics

Proteomics is the systematic study of the proteome, the entire set of proteins present at a given time and under certain conditions (Wilkins et al., 1996). After the completion of the Human Genome Sequencing Project (http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml), the discipline has been considered the next step towards a better understanding of biological processes. Taking into account the high complexity of proteins compared to genes, research in proteomics has been summarized into four major objectives:

1. Detection and quantification of proteins;

2. Detection and quantification of protein modifications;

3. Detection and quantification of subcellular protein localization;

4. Detection and quantification of protein interactions.

Each objective represents a combination of different research fields aiming to acquire accurate and comprehensive information about certain proteins in cells, organs or species.

The most important research areas of Proteomics are:

 Protein Separation – development of methodologies for separation of complex mixtures of proteins with further processing of the proteins of interest.

 Protein Identification - application of mass spectrometry based techniques coupled with computational methods for protein detection and identification.

 Protein Modification Study - combination of chromatographic separation techniques and mass spectrometry to analyse various modifications of the proteins.

 Sequence Analysis - comparative analysis of polypeptide sequences across the various species to predict protein function and investigate the intracellular interrelations among proteins.

 Protein Structure Analysis - X-ray crystallography and nuclear magnetic resonance spectroscopy to investigate the three-dimensional structure of proteins.

 Protein/Protein Interaction Studies - application of biochemical (co-immunoprecipitation, affinity electrophoresis, tandem affinity purification, etc.) and biophysical (bio-layer interferometry, dynamic light scattering, surface plasmon resonance, etc.) methodologies to investigate protein interactions at the atomic, molecular and cellular levels.

 Quantitative Protein Analysis - combination of immunoassays, gel electrophoresis, label-free and stable isotope tagging techniques and mass spectrometry to obtain quantitative information on a protein, with possible biomarker identification.

Common applications of proteomics are shown in Table 1.


Tab. 1. Applications of proteomics.

Study | Applicability of Proteomics

Fundamental Biological Processes | Analysis of the correlation among genome, transcriptome and proteome; Study of model organisms and parasites.

Molecular Mechanism of Cellular Processes | Physiological adaptations; Correlation of composition and functions of organelles; Study of signal transduction events.

Protein Structure and Function Analysis | Study of the associations of proteins; Analysis of posttranslational modifications; Analysis of the functionality of proteins.

Product Analysis | Detection of food contamination; Analysis of seeds; Optimization of products.

Comparison of strains and species | Evolutionary studies; Breeding; Rapid detection of bacteria.

Biomarker Discovery | Disease diagnostic and prognostic biomarkers; Biomarkers for organ functionality; Drug response biomarkers.

System Analysis | Drug development, toxicity; Development of drug targets; Personalized medicine.

1.2 Protein Identification

Often regarded as a collection of various technical disciplines, proteomics is nowadays mainly based on liquid chromatography (LC) coupled with mass spectrometry (MS). The LC technique is used to separate a mixture of compounds (Fig. 1) so that they can be identified and quantified by mass spectrometry.

Fig. 1. Liquid chromatogram of 5 separated proteins. Each peak represents a single protein, which can be transferred to a mass spectrometer for detailed analysis.


1.3 Mass Spectrometry

Modern mass spectrometers combine two distinct analytical functions: ionization and mass analysis. The main components of a modern mass spectrometer include (Fig. 2) (Lane, 2005):

1. Sample inlet;

2. Ion source (ESI or MALDI);

3. Mass analyser, to separate the obtained ions according to their m/z ratio;

4. Detector, to register the number of ions coming from the analyser;

5. Computer (data processing, generation of mass spectra, machine maintenance).

Fig. 2. General components of a modern mass spectrometer (C.S. Lane 2005).

During a regular mass spectrometry experiment, sample molecules are introduced into the machine through a sample inlet and are then converted to ions by the ionization source. After being pushed into the mass analyser, the ions are separated according to their m/z ratio. Finally, the detector converts the energy produced by the ions into electrical signals. The obtained signals are transferred to a computer and mass spectra are generated.

Functional details of mass spectrometers are given in the Appendix.

1.3.1 Mass Spectrum (MS1)

A mass spectrum (MS1) is a plot representing the distribution of ions by mass in a sample, describing the chemical composition of its compounds. It can represent different information depending on the type of mass spectrometer used and the type of experiment conducted; however, all plots have two major characteristics: mass-to-charge (m/z) ratio (x-axis) and intensity (y-axis) (Fig. 3). The m/z value is measured in Daltons (Da), while the intensity is expressed as a percentage.
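As a minimal sketch of this representation, a spectrum can be held as a list of (m/z, intensity) pairs whose intensities are rescaled to a percentage of the most intense (base) peak. The `Peak` type, the `to_relative_intensity` function and the peak values are illustrative assumptions, not part of any tool described in this work.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    mz: float          # mass-to-charge ratio (x-axis)
    intensity: float   # abundance (y-axis)


def to_relative_intensity(peaks: List[Peak]) -> List[Peak]:
    """Rescale intensities so that the most intense (base) peak is 100 %."""
    if not peaks:
        return []
    base = max(p.intensity for p in peaks)
    if base <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [Peak(p.mz, 100.0 * p.intensity / base) for p in peaks]


# Hypothetical raw peak list (arbitrary intensity units).
raw = [Peak(300.2, 1500.0), Peak(450.7, 6000.0), Peak(612.3, 3000.0)]
relative = to_relative_intensity(raw)
```

Expressing intensities relative to the base peak makes spectra acquired at different absolute signal levels directly comparable.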


Fig. 3. A mass spectrum (MS1). General view.

1.3.2 Tandem mass spectrum (MS2)

1.3.2.1 Peptide Fragmentation

A tandem mass spectrometer is capable of multiple rounds of mass spectrometry. The rounds are usually separated by some form of molecule fragmentation. There are various methods for fragmenting molecules for tandem mass spectrometry. Depending on the protein sequence, the amount of internal energy used for fragmentation and the charge state, the fragmentation of a peptide can generate several types of ions (Fig. 4). The accepted nomenclature for fragment ions was first proposed by Roepstorff and Fohlman (Roepstorff and Fohlman, 1984) and later modified by Johnson (Johnson et al., 1987).

Fig. 4. Fragmentation of a peptide with the generation of corresponding ions.

The most commonly used peptide fragmentation techniques are collision-induced dissociation, electron-capture dissociation and electron-transfer dissociation.


Collision-induced dissociation (CID) is a methodology to fragment molecular ions in the gas phase (Wells and McLuckey, 2005). In the CID workflow, the molecular ions are accelerated by an electrical potential to high kinetic energy. The accelerated ions are then made to collide with neutral molecules such as helium, nitrogen or argon. During the collision, the kinetic energy is converted into internal energy. This conversion results in bond breakage and the fragmentation of the molecular ion into smaller fragments. CID most often leads to the generation of b and y type ions (Fig. 4).

Electron-capture dissociation (ECD) is a method of fragmenting molecular ions by the direct introduction of low energy electrons to trapped gas phase ions (Zubarev et al., 1998). During ECD, the parent ions capture single electrons and fragment simultaneously. In such a case, fragmentation also occurs at positions that are not energetically favoured, providing more complete ion series and fragmentation in long peptides/proteins. ECD most often leads to the generation of a, c and z type ions (Fig. 4).

Electron-transfer dissociation (ETD) is another technique to fragment molecular ions in the gas phase (Mikesh et al., 2006). Similar to ECD, ETD induces fragmentation of peptides/proteins by transferring electrons to them. A radical anion attacks the positively charged peptide/protein, resulting in the transfer of an electron. ETD cleaves randomly along the peptide backbone while side chains are left intact. The technique works well for higher charge state ions (>+2) generating c and z type ions (Fig. 4). ETD is advantageous for the fragmentation of longer peptides or even entire proteins.

1.3.2.2 Most common ions observed

The most common ions observed in the spectrum are a, b, and y (Fig. 4). The b ions extend from the N-terminus while y ions extend from the C-terminus. The a ions occur at a lower frequency and abundance compared to b and y ions. The a-b pairs are often observed in fragment MS/MS spectra. The presence of such pairs is often used for the detection of b ions.

Other types of ions observed in the spectrum are c, x, and z ions. In case of a high energy collision, additional fragment ions may also be created including internal fragments or side chain specific ions such as d, v and w (not shown).

A tandem (MS/MS) mass spectrum (Fig. 5) is composed of a precursor peptide ion and a combination of fragment peaks produced by the fragmentation of the peptide sequence. The number of peaks may vary from ten to several hundred. Each peak of an MS/MS spectrum is characterized by two values: its mass-to-charge ratio (m/z) and its intensity (abundance).
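To illustrate how b and y ions relate to the peptide sequence, the following sketch computes the m/z positions of singly charged b and y fragments from standard monoisotopic residue masses. The function names and the example peptide are assumptions made for illustration; neutral losses, higher charge states and modifications are deliberately ignored.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def b_ions(peptide: str) -> list:
    """m/z of singly charged b ions b1..b(n-1), read from the N-terminus."""
    masses, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total + PROTON)
    return masses


def y_ions(peptide: str) -> list:
    """m/z of singly charged y ions y1..y(n-1), read from the C-terminus."""
    masses, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total + WATER + PROTON)
    return masses


# Hypothetical example peptide.
b = b_ions("PEPTIDE")
y = y_ions("PEPTIDE")
```

Complementary fragments are linked through the peptide mass: for a peptide of neutral mass M, each b ion and its complementary y ion sum to M plus two proton masses, which is one way of cross-checking peak assignments.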


Fig. 5. An MS/MS spectrum of a peptide, dominated by b- and y-ions.

1.4 Protein Identification by Mass Spectrometry

1.4.1 Peptide Mass Fingerprint (PMF)

Protein identification by mass spectrometry was first described as Peptide Mass Fingerprint (PMF) in 1993 (Pappin et al., 1993, Henzel et al., 1993, Mann et al., 1993). The methodology consists of several stages (Fig. 6):

 separation and isolation of proteins from a cell line, tissue or organism;

 acquisition of protein structural information (‘fingerprint’) through the utilization of a mass spectrometer;

 theoretical generation of peptide masses of protein sequences in databases;

 protein identification based on comparing the experimentally determined peptide masses with theoretical ones. Search engines: Mascot (Perkins et al., 1999), SEQUEST (Eng et al., 1994), Protein Prospector (http://prospector.ucsf.edu);

 scoring of the candidate proteins selected from the list of best matches.

Two-dimensional gel electrophoresis is commonly used to separate and isolate proteins. The two dimensions along which proteins are separated with this technique are typically isoelectric point and protein mass. Extracted from the gel, the unknown proteins of interest are cleaved/digested into smaller peptides using a site-specific proteolytic enzyme (Shevchenko et al., 1996). Currently, trypsin is the preferred enzyme for protein digestion. It has a reliable specificity, cleaving peptide bonds on the C-terminal side of every arginine (R) and lysine (K). The masses of the resulting peptides can be accurately measured by a mass spectrometer such as MALDI-TOF or ESI-TOF.

Evaluated by the mass spectrometer, the masses of the analysed peptides form an experimental spectrum (see section 1.3.1). Since each protein has a different sequence, the peptides and their masses obtained for each protein represent a unique “fingerprint” of the original protein. Protein identification is performed by comparing the experimental “fingerprints” with theoretical ones generated in silico by enzymatic digestion of the proteins of the database. The database protein whose theoretical spectrum correlates best with the experimental one is considered a match. Scoring systems are used to rank matching proteins.
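The in silico comparison described above can be sketched as follows. This is a simplified illustration under stated assumptions: trypsin is modelled as cleaving after every K or R with no missed cleavages, the score is a plain count of matched masses rather than the scoring scheme of any real PMF tool, and the protein sequences, names and tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # monoisotopic mass of H2O (Da)


def tryptic_digest(protein: str) -> list:
    """Cleave after every K or R (simplified rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def count_matches(observed, protein, tolerance=0.2):
    """Number of observed masses explained by the theoretical fingerprint."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical)
               for m in observed)


# Hypothetical two-protein database and observed peptide masses.
database = {"protA": "GASKVTLRPEPTIDE", "protB": "MHFKWYNQRDE"}
observed = [peptide_mass("GASK"), peptide_mass("VTLR")]
ranking = sorted(database,
                 key=lambda name: count_matches(observed, database[name]),
                 reverse=True)
```

A bare match count is easily inflated by random hits in large databases, which is why practical tools replace it with the more robust scoring systems discussed in this section.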

Candidate proteins can be selected from the list of the best matches on the basis of the highest score generated by PMF tools. Consequently, the scoring system of PMF tools such as Mascot (Perkins et al., 1999), SEQUEST (Eng et al., 1994) or X!Tandem (Craig and Beavis, 2004) plays a crucial role. In order to produce a robust score, the mentioned algorithms have to take into consideration multiple factors such as calibration error, noise, extra or missing peaks, post-translational modifications, etc.

Fig. 6. Protein identification by peptide mass fingerprint (Graves and Haystead, Microbiol. Mol. Biol. Rev. 2002, 66, 1, 39).

Although it used to be considered a rapid and efficient approach, PMF has many limitations. The major drawbacks that make extensive use of PMF challenging are:

 high number of randomly matched peptide masses due to large sizes of the sequence databases or high complexity of the spectra;

 high probability of false negatives due to the presence of PTMs, non-annotated alternative splicing and other protein processing events;

 unidentified proteins as a result of incompleteness of the protein sequence databases.


1.4.2 Protein Identification by Tandem Mass Spectrometry (MS/MS)

Due to the limitations of PMF, tandem (MS/MS) mass spectrometry has been introduced to measure the masses of proteins and peptides. The Peptide Fragment Fingerprinting (PFF) technique has become the most common application of MS/MS in proteomics, especially for the analysis of complex protein mixtures.

The PFF strategy for protein samples involves several runs of a mass spectrometer, with further fragmentation of peptide precursor ions and generation of tandem mass spectra.

A typical PFF-based proteomics experiment consists of several stages (Fig. 7) (Aebersold and Mann, 2003):

 proteins of interest are isolated from a cell line, tissue or organism;

 obtained proteins are digested enzymatically (trypsin) to peptides;

 resulting peptides are separated by high-pressure liquid chromatography and eluted into a mass spectrometer;

 peptides are measured by the mass spectrometer and MS1 mass spectra are generated;

 the MS1 spectrum is interpreted and a list of peptides is generated for subsequent fragmentation;

 isolated peptide ion of interest is fragmented and a tandem mass spectrum is generated;

 obtained tandem spectra are stored and analysed using:

a) sequence database search engines such as Mascot (Perkins et al., 1999), X!Tandem (Craig and Beavis, 2004), Sequest (Eng et al., 1994) or Phenyx (Colinge et al., 2003) – generate the list of peptide-spectrum matches (PSMs);

b) spectral library search engines such as X! Hunter (Craig et al., 2006), SpectraST (Lam et al., 2007) - generate the list of spectrum-spectrum matches (SSMs).


Fig. 7. Protein identification by tandem mass spectrometry (Graves and Haystead, Microbiol. Mol. Biol. Rev. 2002, 66, 1, 39).

1.5 Thesis Outline

The objective of this thesis was the development of a fast and reliable peptide/spectrum filtering methodology for peptide tandem mass spectrometry data.

The manuscript covers the following:

A. General overview of existing algorithms and methodologies for MS/MS data analysis;

B. Overview of the filtering approach for MS/MS data;

C. Theoretical study and development of fast and reliable filtering methodology for MS/MS data under Gaussian assumption;

D. Statistical validation of the developed methodology (training datasets);

E. Validation of the performance of the proposed methodology for unique and list based filtering of MS/MS data in sequence database and spectral library set-ups.


The manuscript is organized in 4 parts.

The first part introduces the subject and covers the state of the art in bioinformatics for the analysis of tandem mass spectrometry data.

The second part describes the theoretical study and the development of a fast and reliable filtering methodology for MS/MS data under Gaussian assumption. To this end, we:

 Developed and established the major steps of the filtering methodology for MS/MS data for sequence database and spectral library set ups;

 Investigated the theoretical performance of the filtering methodology for Gaussian data and Gaussian channel;

 Established the theoretical limits of errorless, unique and list based filtering of MS/MS data in sequence database and spectral library set ups.

The third part covers the statistical validation of the developed methodology (training datasets). To this end, we:

 Investigated various MS/MS spectrum filtering strategies and analysed their impact on the general performance of the developed methodology;

 Investigated the impact of Precursor Ion Mass shifts and tandem mass spectrum peak distortions on the general performance of the developed methodology;

 Validated the general performance of the proposed peptide-spectrum match (PSM)/spectrum-spectrum match (SSM) scoring strategy applied along with the proposed filtering methodology.

The last part focuses on the validation of the performance of the proposed methodology for unique and list based filtering of MS/MS data in sequence database and spectral library set ups. To this end, we:

 Benchmarked the developed methodology with existing systems using common datasets and sequence databases;

 Validated the performance of the proposed filtering methodology while working with different experimental MS/MS datasets of multiply charged unmodified peptides;

 Investigated the impact of restricting the final list of PSMs/SSMs on the general performance of the proposed methodology (sequence database);

 Investigated the impact of increasing the combinatorial complexity of the sequence database on the general performance of the methodology;

 Validated the methodology performance under consensus spectrum creation set up;

 Validated the general performance of the methodology for MS/MS data of multiply charged, modified (PTM) peptides (spectral libraries).


1.6 Main Contributions

The main contributions of this thesis can be summarized as follows:

 a generalized model of fast and reliable peptide/spectrum filtering methodology for MS/MS data based on stochastic framework;

 statistical model of precursor ion mass estimation and corresponding matching score;

 the statistical model of the local peak misplacement of the MS/MS spectrum;

 statistical matching criteria based on multiple peak misplacement of the MS/MS spectrum and corresponding scoring metrics;

 general PSM/SSM score based on precursor ion mass, spectral matching (number of matched peaks) and intensity score calculated for matched experimental peaks of the query MS/MS spectrum;

 practical implementation of the developed filtering methodology for the sequence database and spectral library search;

 experimental validation of the developed methodology on existing datasets of MS/MS spectra of unmodified and modified peptides.

Our research shows that the developed filtering methodology is able to improve the general MS/MS data analysis workflow in terms of both data filtering accuracy and data filtering time complexity.


2 Tandem Mass Spectra Analysis and Protein Identification

Scientists currently use several methodologies to analyse MS/MS data. The major strategies (Fig. 8) are based on:

 Sequence database Search: experimental tandem mass spectra are matched to theoretical ones generated for each peptide of the protein sequence database.

 Sequence Tag Search: peptide sub-sequences (tags) are extracted from tandem mass spectrum and matched against the peptides of a protein sequence database.

 Spectral Library Search: experimental tandem mass spectra are matched to annotated experimental spectra of a spectral library.

 De novo sequencing: direct extraction of the composition of a peptide sequence from a tandem mass spectrum.

Fig. 8. Strategies to analyse MS/MS data (Nesvizhskii et al., 2010).

2.1 Sequence Database Search

The majority of the algorithms for MS/MS data analysis employ sequence database search to identify spectra. The approach is based on matching experimental spectra against theoretical ones characterizing a known peptide sequence. The theoretical spectra are generated for peptides of the protein database using theoretical cleavage rules.


Searching a protein sequence database with MS/MS spectra follows these steps for each protein entry of the database:

1. Computational generation of possible peptides considering a specific type of enzyme and accounting for post-translational modifications and other database annotations.

2. Calculation of the precursor ion masses for generated peptides.

3. Matching the experimental precursor ion masses to theoretical ones within a given tolerance window.

4. The generation of the list of potential candidate peptides.

5. Calculation of theoretical MS/MS spectra for the obtained candidates.

6. Matching the experimental MS/MS spectrum to theoretical spectrum of the candidate peptide. Scoring and ranking of the candidate peptides.

7. The candidate with the highest score is considered as a correct identification.
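The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any particular search engine: digestion is a simplified trypsin rule without modifications, theoretical spectra contain only singly charged b and y ions, the score is a shared peak count, and all names, sequences and tolerances are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.007276, 18.010565


def digest(protein):
    """Step 1: cleave after K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def neutral_mass(peptide):
    """Step 2: theoretical neutral mass of the peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def theoretical_spectrum(peptide):
    """Step 5: m/z of singly charged b and y fragment ions."""
    total = sum(RESIDUE_MASS[r] for r in peptide)
    prefix, mzs = 0.0, []
    for residue in peptide[:-1]:
        prefix += RESIDUE_MASS[residue]
        mzs.append(prefix + PROTON)                   # b ion
        mzs.append(total - prefix + WATER + PROTON)   # complementary y ion
    return mzs


def search(query_mass, query_peaks, proteins,
           precursor_tol=0.5, fragment_tol=0.5):
    """Steps 3-7: precursor mass filter, shared peak count, ranking."""
    candidates = {p for protein in proteins for p in digest(protein)
                  if abs(neutral_mass(p) - query_mass) <= precursor_tol}
    scored = []
    for peptide in candidates:
        theo = theoretical_spectrum(peptide)
        shared = sum(any(abs(mz - t) <= fragment_tol for t in theo)
                     for mz in query_peaks)
        scored.append((shared, peptide))
    return sorted(scored, reverse=True)  # best candidate first


# Hypothetical database; VTLR and TVLR have identical precursor masses.
proteins = ["GASKVTLRPEPTIDE", "MHFKTVLRWYNQR"]
query_peaks = theoretical_spectrum("VTLR")[:4]
hits = search(neutral_mass("VTLR"), query_peaks, proteins)
```

In this toy example both VTLR and TVLR pass the precursor mass filter because they share the same composition, and only the fragment-level comparison separates them, which shows why the spectrum matching step is needed after precursor filtering.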

Currently, most database search algorithms match MS/MS spectra to the database on the basis of the m/z component of the spectrum. Intensities are mostly used to distinguish candidate ion peaks from noise. However, expected values of ion intensities in each spectrum are not used to perform the assignment. Such m/z-only based methodologies are often termed first generation (1^st) (Shadforth et al., 2005).

Predictable patterns of intensities observed in the spectrum of fragmented ions can confirm the identity of the matched peptide. Such patterns are influenced to a large degree by the energies used in the collision cells and controlled by the composition of the peptide. This information is used by the second generation (2^nd) peptide identification tools (Shadforth et al., 2005).

Additionally, sequence database searching methods can be characterised as heuristic or probabilistic (Kapp et al., 2003; Sadygov et al., 2004). The difference between the two types can be described as follows:

 Heuristic Algorithm correlates the experimental MS/MS spectrum with the theoretical one and calculates a score based on the similarity between the two.

 Probabilistic Algorithm models the peptide fragmentation pattern and calculates the probability that a particular peptide sequence produces the observed spectrum by chance.

2.1.1 The 1^st Generation Sequence Database Search Algorithms

Recall that database search tools of the 1^st generation apply only the m/z values of the spectrum and skip the intensities while performing spectra matching and candidate estimation.


Here, we present and comment on the most commonly used tools to perform these searches.

MASCOT

The most widely used tool, MASCOT, is based on a probabilistic approach for MS/MS identification. It relies on the MOWSE algorithm (Pappin et al., 1993), first introduced for PMF. The tool efficiently utilizes probabilistic scoring with some heuristics for intensity and sequential ion ladders to perform MS/MS identification. The software has been commercialized, which limits the information available on the implemented methodology. In the published paper, the authors only partially describe the probabilistic scoring technique. It is based on the observed match between the experimental dataset and each database entry. While running the experimental data against a theoretical database, Mascot calculates the probability that the observed match is a random event. The real match, which is not a random event, is ranked with the lowest probability score (Perkins et al., 1999). The calculated probabilities are converted into scores by taking logs, so that a good match has a high score.

Due to commercial licensing, the information regarding the estimation of false discovery rate remains limited.
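The log conversion of a random-match probability into a score can be illustrated with a short sketch. It assumes the commonly reported convention of -10·log10(P) for Mascot scores; the function name and the example probabilities are illustrative, and the way Mascot derives P itself is not reproduced here.

```python
import math


def probability_to_score(p_random: float) -> float:
    """Convert the probability that a match is random into a log score.

    Assumes the -10 * log10(P) convention; a lower probability of a
    random match yields a higher score.
    """
    if not 0.0 < p_random <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p_random)


# Hypothetical random-match probabilities, from weak to strong matches.
scores = [probability_to_score(p) for p in (0.05, 1e-3, 1e-5)]
```

Under this convention, every tenfold decrease in the probability of a random match adds 10 to the score.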

X! Tandem

Based on the heuristic approach (Craig and Beavis, 2004), X!Tandem is one of the few freely available tools for MS/MS data analysis. It can be used as an online application or deployed locally using precompiled binaries and FASTA formatted files. The tool has been developed to work with large scale MS/MS datasets, thus enhancing the speed and accuracy of peptide/protein identification. Like other 1^st generation sequence database search engines, X!Tandem matches each query MS/MS spectrum against theoretical MS/MS spectra created from each peptide in the sequence database. The X!Tandem workflow includes several stages:

1. Calculation of the precursor ion mass of the query MS/MS spectrum;

2. Matching the obtained precursor mass to the precursor masses of peptides of the sequence database within a user pre-defined mass tolerance. The stage generates the list of precursor mass pre-filtered candidates;

3. Generation of the theoretical MS/MS spectra for the precursor mass pre-filtered candidates. The spectra are synthesized using known intensity patterns for particular residue combinations. They contain b and y type ions;

4. Matching the theoretical spectra with the query MS/MS spectrum. To rank the match, the software applies a preliminary score, which is the sum of intensities of the matched b and y ions of the query and theoretical MS/MS spectra (called the dot product score). Then, X!Tandem modifies the preliminary score by multiplying its value by the factorials of the numbers of b and y ions assigned. The score modified by these factorial terms is called the hyper-score;


5. The software makes a histogram of all the hyper-scores for all the peptides in the database that might match the query MS/MS spectrum. It accepts the peptides with the highest hyper-scores as correct ones, and considers all others as incorrect;

6. The algorithm calculates E-values to evaluate the results of the search. The E-value indicates how far the top-scoring match is from the rest of the matches (see sections 2.6 and 2.6.1 for more information on the E-value).
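The scoring idea in stage 4 can be sketched as follows. This is an illustration only: it assumes the factorial term is applied separately to the counts of matched b and matched y ions, and the function name and intensity values are hypothetical rather than taken from the X!Tandem code.

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """Dot-product score multiplied by the factorials of the ion counts.

    matched_b, matched_y: intensities of the query peaks matched to
    theoretical b and y ions, respectively.
    """
    dot_product = sum(matched_b) + sum(matched_y)
    return dot_product * factorial(len(matched_b)) * factorial(len(matched_y))


# Same summed intensity (60), but different numbers of matched ions.
few = hyperscore([10.0, 20.0], [30.0])                 # 60 * 2! * 1!
many = hyperscore([10.0, 10.0, 10.0], [15.0, 15.0])    # 60 * 3! * 2!
```

The factorial terms reward candidates that explain many fragment ions, so two candidates with the same summed intensity can receive very different hyper-scores.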

According to Craig and Beavis, besides the advantages of speed and accuracy, the software is able to search automatically for modified peptides. However, this scenario is only possible with already identified proteins.

SEQUEST

SEQUEST (Eng et al., 1994) is one of the leading programs for MS/MS spectra analysis and peptide identification from protein sequence databases. To search for a candidate, it correlates tandem mass spectra of experimental peptides against theoretical MS/MS spectra of a sequence database.

Based on the heuristic model, SEQUEST follows several steps to identify a proper candidate from MS/MS data. The steps include:

1. Query MS/MS spectrum de-noising. To eliminate noise from the spectrum and to reduce the number of ions to be considered, all but the 200 most abundant ions are removed and the remaining ions are renormalized to 100.

2. Determination of the precursor ion mass of the query MS/MS spectrum;

3. Matching the obtained precursor mass to the precursor masses of peptides of the sequence database. The tool applies a preliminary score to filter through all peptides in the sequence database. The preliminary score is calculated by summing the number of matched experimental peaks (within a given m/z tolerance) and their intensities. The step generates a list of 500 pre-ranked candidates.

4. A theoretical MS/MS spectrum is generated for each of the 500 pre-ranked candidates following the preferred digestion rules (Colinge and Bennett, 2007). The theoretical spectrum contains values for the predicted mass-to-charge ratio of fragment ions of the given amino acid sequence as well as a magnitude component. A magnitude component is assigned to the predicted mass-to-charge ratio values of the fragment ions by using empirical knowledge of the appearance of tandem mass spectra for peptides and the constraints of correlation analysis. The relative abundances of b- and y-type ions cannot be predicted, but we know from experience that sequential losses from these ions are generally less facile at the collision energies used to generate low energy tandem mass spectra of peptides. All values that represent the m/z ratios of b- and y-type fragment ions are assigned a magnitude of 50.0. A magnitude of 25.0 is assigned to m/z ratios within ±1 of the b or y ion values. The other types of ions are assigned a magnitude of 10.0.


THESIS presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in bioinformatics, by Rostyslav Kuzyakiv of Toronto (Canada). Thesis no. 4596. Geneva, University of Geneva, 2013.
