Assigning molecules to MS/MS spectra - Software Engineering and Bioinformatics Challenges

Chapter 1 Introduction

1.3. Software Engineering and Bioinformatics Challenges

1.3.3. Assigning molecules to MS/MS spectra

The bioinformatics challenge when developing algorithms to identify post-translational modifications (PTMs) in LC-MS/MS data is to assign the correct molecule to each spectrum. For open modification PTM proteomics, this means assigning modified peptides to spectra; in glycomics, it means assigning glycan structures to spectra.

The first algorithm to assign molecules to spectra was developed in 1966⁴². Since then, more sophisticated algorithms have become available⁴³⁻⁵², but further improvements can be made by refining existing algorithms or developing new ones.

Conceptually, assigning a molecule to a spectrum can be split into two steps. The first is to generate a candidate set of potential molecules, and the second is to score the candidates so that the best molecule can be selected from the candidate set.

Generating Potential Candidates. The challenge when generating the set of candidates is to keep the candidate set as small as possible without discarding the correct structure. Keeping the candidate set small makes it easier for the scoring function to find the correct candidate.

Two methods have been used to generate candidates: deriving the candidates from the information contained in the query spectrum, and selecting the candidates from a set of molecules that are known to occur in the sample.

Using the information in the query spectrum to generate the molecule is called a de novo search.

De novo searches have been used in glycomics⁵³ to assign structures and in proteomics⁵⁴ to find unmodified peptides. The advantage of de novo searches is that any peptide or glycan can be generated. The disadvantage is that very high-quality spectra are required, which makes the de novo approach unsuitable for most spectra obtained in shotgun proteomics or glycomics experiments.

Selecting the candidates from a set of molecules that are known to occur in the sample is the approach used by two related methods: spectral library searches and database searches. For both methods, the candidate set is obtained by selecting all molecules whose precursor ion is within tolerance of the query spectrum's precursor. The biggest difference between the two is that spectral libraries store a validated, high-quality experimental MS/MS spectrum for each molecule, whereas databases store a theoretical MS/MS spectrum, generated by in silico fragmentation, for each molecule.
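The precursor-based candidate selection described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular tool: the function name `select_candidates`, the ppm tolerance and the example entries are all assumptions made for the sketch.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Hypothetical library or database index: (precursor m/z, molecule name),
# kept sorted by precursor m/z so that a window can be found by binary search.
Entry = Tuple[float, str]


def select_candidates(entries: List[Entry], query_mz: float, tol_ppm: float) -> List[str]:
    """Return all molecules whose precursor m/z is within tol_ppm of query_mz."""
    delta = query_mz * tol_ppm / 1e6          # tolerance window in m/z units
    keys = [mz for mz, _ in entries]
    lo = bisect_left(keys, query_mz - delta)  # first entry inside the window
    hi = bisect_right(keys, query_mz + delta) # one past the last entry inside
    return [name for _, name in entries[lo:hi]]


# Illustrative entries only; the names and values are made up.
library = sorted([
    (500.2500, "peptide_A"),
    (500.2540, "peptide_B"),
    (500.3000, "peptide_C"),
    (742.3800, "peptide_D"),
])

print(select_candidates(library, 500.2520, tol_ppm=10.0))
```

The same selection step applies to both search types; only the kind of spectrum stored alongside each entry (experimental or theoretical) differs.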

Spectral libraries contain experimental spectra and are created by collecting and validating high-quality spectra from previous experiments. The advantage of this is that all molecules in the spectral library are known to be detectable on a mass spectrometer, allowing the smallest possible candidate set to be selected. The disadvantage is that only molecules that were previously identified using MS/MS can be automatically detected. Spectral libraries were first used in the 1980s for small molecule identification⁵⁵. In 1998 Yates et al.⁵⁶ showed that spectral libraries can be used to identify peptides in shotgun proteomics. Spectral library searches were first used for glycomics in 2005 by Kameyama et al.⁵⁷.

The advantage of using database searches is that the database does not need to be created from MS/MS data. Consequently, database searches can be used to identify molecules that have not previously been detected using MS/MS. Database searches work best if it is possible to build a database that contains all the molecules that are likely to occur in a given sample. This is possible for proteomics, where the genome or transcriptome can be used as a template to generate all possible proteins. The proteins are then digested in silico and the resulting peptides are stored in the database along with the theoretical spectrum that is generated by in silico fragmentation.
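As a rough sketch of the in silico digestion step, the example below splits a protein sequence into peptides and computes each peptide's monoisotopic mass. The text does not name an enzyme, so the cleavage rule (after K or R, but not before P, a common simplification of trypsin specificity) is an assumption, as are the function names and the example sequence.

```python
import re
from typing import List

# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def digest(protein: str, missed_cleavages: int = 0) -> List[str]:
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional neighbouring pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


# Made-up sequence used only to exercise the cleavage rule.
for pep in digest("MKTAYRPEKGLSR"):
    print(pep, round(peptide_mass(pep), 4))
```

Each resulting peptide, together with its precursor mass, would then be stored in the database as one searchable entry.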

Database searches have been successfully implemented in proteomics by tools such as Sequest⁵⁸, Mascot⁵⁹ and X!Tandem⁶⁰ and are widely used.

Creating a database for glycomics is more challenging because there is no direct template from which all the glycans can be derived. An alternative, but not exhaustive, source of glycan structures is a database of previously identified glycans, such as GlycomeDB⁶¹. Database searches are available for glycomics (e.g. GlycosidIQ⁶²). However, because the database does not contain all glycans, database searches have not been widely used in glycomics.

Both spectral library and database searches can be used to generate candidates that take PTMs into account. For open modification searches, the range of allowed precursor mass differences is widened from the spectrometer's tolerance to the maximum allowed modification mass. For example, if a singly charged query precursor has an m/z of 800 and the maximum allowed modification mass is 200 Da, any peptide with a precursor m/z between 600 and 800 will be selected. For targeted modifications, the peptide sequence is also checked to make sure it contains at least one amino acid that can carry the modification, and the mass of the modification is taken into account when the query and peptide precursors are compared. For example, when looking for phosphorylation, only peptides that contain an S, T or Y are considered, and the query precursor must be within tolerance of the combined peptide and modification mass.
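The two candidate filters described above can be sketched as follows. To avoid charge-state bookkeeping, the sketch works on neutral monoisotopic masses rather than m/z, which is an assumption of the example; the function names, the 0.01 Da tolerance and the small peptide list are likewise illustrative.

```python
from typing import List, Tuple

PHOSPHO = 79.96633  # monoisotopic mass added by phosphorylation (HPO3), in Da

Peptide = Tuple[str, float]  # (sequence, neutral monoisotopic mass in Da)


def open_candidates(peptides: List[Peptide], query_mass: float,
                    max_mod: float, tol: float) -> List[str]:
    """Open search: keep peptides up to max_mod lighter than the query."""
    low, high = query_mass - max_mod - tol, query_mass + tol
    return [seq for seq, mass in peptides if low <= mass <= high]


def targeted_candidates(peptides: List[Peptide], query_mass: float,
                        mod_mass: float, sites: str, tol: float) -> List[str]:
    """Targeted search: the peptide must contain a modifiable residue and
    peptide mass + modification mass must match the query within tol."""
    return [
        seq for seq, mass in peptides
        if any(aa in sites for aa in seq)
        and abs(mass + mod_mass - query_mass) <= tol
    ]


# Unmodified peptide masses computed from standard monoisotopic residue masses.
peptides = [
    ("GASK", 361.19613),
    ("GAVK", 373.23251),
    ("TAGK", 375.21178),
    ("AVLK", 429.29511),
    ("WWWW", 762.32781),
]
query = 361.19613 + PHOSPHO  # mass of phosphorylated GASK

print(open_candidates(peptides, query, max_mod=200.0, tol=0.01))
print(targeted_candidates(peptides, query, PHOSPHO, sites="STY", tol=0.01))
```

The open search keeps every peptide in the widened window, whereas the targeted search rejects GAVK and AVLK for lacking S, T or Y and rejects TAGK because its modified mass does not match the query.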

Scoring. Once the candidate molecules have been selected, a scoring algorithm is used to calculate the quality of the match between the query spectrum and the library or database spectrum. If the candidate with the highest score exceeds the threshold for a reliable match, it is assigned to the query spectrum.

The scoring algorithm that is used depends on the search type. Spectral library searches score the similarity between two experimental spectra, database searches score the match between an experimental and a theoretical spectrum, and open modification searches need to account for the mass shift due to the modification when calculating the score. For spectral library searches, the similarity score is typically calculated using the normalized dot product (ndp).

However, several other scoring functions, such as shared peak count, weighted normalized dot product or Pearson's correlation coefficient, have also been investigated. When calculating the similarity for open modification searches using spectral libraries, the peaks first need to be aligned to account for the modification mass, as illustrated in Figure 8, and the ndp is then calculated⁴⁵,⁶³. The open modification search tools that implement the spectral library approach include QuickMod⁵⁰, Bonanza⁴⁹, pMatch⁵¹ and Tier-Wise⁵².

Figure 8. Aligning the unmodified spectrum of TMY with the spectrum of TMY with an oxidised M (TM{O}Y).
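The effect of aligning peaks before calculating the ndp can be illustrated with a small sketch based on the TMY example. It assumes singly charged b and y ions only, a simple greedy peak matching within a fixed fragment tolerance, and made-up intensities; it is not the alignment procedure of any of the cited tools.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def ndp(query: List[Peak], library: List[Peak],
        tol: float = 0.02, shift: float = 0.0) -> float:
    """Normalized dot product between two peak lists.

    A library peak matches a query peak at the same m/z or, when shift is
    non-zero, at m/z + shift (the modification mass). Matching is greedy and
    each query peak is used at most once.
    """
    used = set()
    dot = 0.0
    for mz, inten in library:
        best = None
        for i, (qmz, qint) in enumerate(query):
            if i in used:
                continue
            direct = abs(qmz - mz) <= tol
            shifted = shift != 0.0 and abs(qmz - (mz + shift)) <= tol
            if (direct or shifted) and (best is None or qint > query[best][1]):
                best = i
        if best is not None:
            used.add(best)
            dot += inten * query[best][1]
    norm = math.sqrt(sum(i * i for _, i in query) * sum(i * i for _, i in library))
    return dot / norm if norm else 0.0


OXIDATION = 15.9949  # monoisotopic mass of one oxygen atom, in Da

# b1, y1, b2, y2 of TMY and of TM{O}Y; intensities are illustrative.
unmodified = [(102.0550, 20.0), (182.0812, 100.0), (233.0954, 60.0), (313.1217, 80.0)]
oxidised = [(102.0550, 20.0), (182.0812, 100.0), (249.0904, 60.0), (329.1166, 80.0)]

print(round(ndp(oxidised, unmodified), 3))                   # without alignment
print(round(ndp(oxidised, unmodified, shift=OXIDATION), 3))  # with alignment
```

Without alignment only the fragments that do not contain the oxidised M (b1 and y1) contribute to the score; with alignment the shifted b2 and y2 peaks are matched as well.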

Database searches differ from spectral library searches in that the score is calculated from an experimental spectrum and a theoretical spectrum generated by in silico fragmentation. Because the fragmentation chemistry is complex, it is currently not possible to generate realistic theoretical spectra. The theoretical spectra that are generated contain only backbone ion peaks, and all peaks of the same ion type have the same intensity.
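A theoretical spectrum of the kind described above can be sketched in a few lines. The example assumes singly charged b and y ions as the backbone ions and one fixed intensity per ion type; the function name and the default intensities are hypothetical.

```python
from typing import List, Tuple

# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

Ion = Tuple[float, float, str]  # (m/z, intensity, label)


def theoretical_spectrum(peptide: str, b_intensity: float = 1.0,
                         y_intensity: float = 1.0) -> List[Ion]:
    """Singly charged b and y ions, one uniform intensity per ion type."""
    total = sum(MONO[aa] for aa in peptide)
    n = len(peptide)
    prefix = 0.0
    ions = []
    for i in range(1, n):
        prefix += MONO[peptide[i - 1]]  # mass of the first i residues
        ions.append((prefix + PROTON, b_intensity, f"b{i}"))
        ions.append((total - prefix + WATER + PROTON, y_intensity, f"y{n - i}"))
    return sorted(ions)  # ordered by m/z


for mz, intensity, label in theoretical_spectrum("TMY"):
    print(f"{label}\t{mz:.4f}\t{intensity}")
```

Because every peak of an ion type carries the same intensity, such a spectrum encodes only where fragments are expected, not how abundant they will be.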

The scoring functions that are used by database searches take into account this difference between experimental and theoretical spectra. Commonly used database search algorithms are Sequest⁵⁸, Mascot⁵⁹ and X!Tandem⁶⁰.

The proteomic database search algorithms can also be used for targeted PTM searches. This is done by adding the modification to the peptide and generating a new theoretical spectrum. The scoring is then the same as described above. Like open modification searches that use spectral libraries, the database searches also first align the peaks before scoring the match between the experimental and theoretical spectra. Tools that can be used for finding PTMs using database searches include MS-alignment⁴⁵, MODi⁴⁶, ModifiComb⁴⁸, MODa⁴⁷ and MSFragger⁶⁴.
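Adding a modification to a peptide changes only those theoretical fragment ions that contain the modified residue. The sketch below shows this for singly charged b and y ions; the function name `apply_modification`, the label format and the rounded TMY ion values are assumptions of the example.

```python
from typing import Dict


def apply_modification(ions: Dict[str, float], length: int,
                       position: int, delta: float) -> Dict[str, float]:
    """Shift the m/z of every singly charged b/y ion that contains the
    modified residue.

    ions     -- labels such as "b2" or "y1" mapped to m/z
    length   -- number of residues in the peptide
    position -- 1-based index of the modified residue
    delta    -- modification mass in Da
    """
    shifted = {}
    for label, mz in ions.items():
        kind, n = label[0], int(label[1:])
        if kind == "b":
            contains = n >= position               # b_n covers residues 1..n
        else:
            contains = n >= length - position + 1  # y_n covers the last n residues
        shifted[label] = mz + delta if contains else mz
    return shifted


OXIDATION = 15.9949  # monoisotopic mass of one oxygen atom, in Da

# Theoretical singly charged ions of TMY, rounded to four decimals.
tmy = {"b1": 102.0550, "b2": 233.0954, "y1": 182.0812, "y2": 313.1217}

# Oxidise the M at position 2 to obtain the ions of TM{O}Y.
for label, mz in apply_modification(tmy, length=3, position=2, delta=OXIDATION).items():
    print(label, round(mz, 4))
```

The resulting ion list plays the role of the new theoretical spectrum, which is then scored against the experimental spectrum in the usual way.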

From the document: Developing algorithms to automate the identification of post translational modification in LC-MS/MS data (pages 24-28)